Saturday, April 24, 2010

Java vs. Python (vs. AWK)

Development time, effort and ease of use:
Python wins

Speed and performance:
Java wins

This issue arose just recently at work. We needed a simple command-line tool to read from standard input, apply some simple logic and write the output to a specified file. Developing this in Python was a breeze. Perhaps all of 20 lines of code:


import sys

def hashCode(string):
  ''' Port of Java's String.hashCode(); Python ints do not wrap at 32 bits, but h % 2 is unaffected '''
  h = 0
  for ch in string:
    h = 31 * h + ord(ch)
  return h

try:
  files = {}
  files[0] = open('my_output_file_A.txt', 'w')
  files[1] = open('my_output_file_B.txt', 'w')
  cnt = 0
  for l in sys.stdin: ## read from stdin
    key = l.split('\t')[0] ## Split on TAB and get first column
    ext = hashCode(key.strip()) % 2 ## hash key; 0 = file A, 1 = file B
    files[ext].write(l)
    cnt += 1
  for f in files.values(): ## flush and close both output files
    f.close()
  print('Total lines : %d' % cnt)
except KeyboardInterrupt:
  pass ## exit quietly on Ctrl-C

Python performance on 10 million lines of input: 58 sec.

Not bad, really. I was happy, until I happened to be doing something very similar with only basic Unix commands (i.e. AWK).

Time for similar processing on the same 10 million lines: ~10 sec.

What's up with that? Perhaps it was the hashCode function? Profiling showed that about half the time was spent on I/O and only about a quarter in the function. That leaves roughly a quarter of the 58 seconds, about 14.5 seconds, for everything else. So even with zero I/O and no hashCode method, we're still slower than a comparable command-line script?
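A breakdown like that can be gathered with Python's built-in cProfile module. What follows is only a sketch, not the profiling run from the post: the sample input is made up, the names hash_code and split_lines are chosen here for illustration, and in-memory buffers stand in for the real output files, so the disk I/O share will not show up in its numbers.

```python
import cProfile
import io
import pstats


def hash_code(string):
    """Same loop as the script's hashCode()."""
    h = 0
    for ch in string:
        h = 31 * h + ord(ch)
    return h


def split_lines(lines, outputs):
    """Write each line to outputs[0] or outputs[1] based on the hash of its first column."""
    count = 0
    for line in lines:
        key = line.split('\t')[0]
        outputs[hash_code(key.strip()) % 2].write(line)
        count += 1
    return count


# Hypothetical sample input; the run in the post used 10 million real lines.
sample = ['key%d\tvalue\n' % i for i in range(20000)]
outputs = [io.StringIO(), io.StringIO()]

# Profile a single call and keep its return value.
profiler = cProfile.Profile()
count = profiler.runcall(split_lines, sample, outputs)

# Render the ten most expensive entries, sorted by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats('cumulative').print_stats(10)
print(report.getvalue())
```

The cumulative column shows how much of the run sits inside hash_code compared with split, strip and write. Profiling adds its own overhead per call, so the ratios are more trustworthy than the absolute times.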

Hmm, let me see how this would perform in a compiled language instead of an interpreted one. Porting this code over to Java took a bit longer than the Python counterpart. Long story short:
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
String line;
int cnt = 0, idx, mod;
while ((line = br.readLine()) != null) {
  cnt++;
  idx = line.indexOf('\t');
  if (idx < 0) idx = line.length(); /* no TAB: use the whole line as the key */
  mod = line.substring(0, idx).hashCode() % 2;
  if (mod < 0) mod = mod + 2; /* Java's % can be negative; match Python's modulus */
  files[mod].write(line + '\n'); /* files is a BufferedWriter[2]; setup code omitted for space */
}
br.close();
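
The modulus correction in the Java loop is needed because Java's int arithmetic wraps at 32 bits, so String.hashCode() can be negative, and Java's % takes the sign of the dividend. Python's % with a positive divisor never returns a negative result. Here is a small Python sketch of both behaviours, assuming ASCII keys; the helpers java_hash_code, java_mod and unbounded_hash_code are hypothetical names introduced for illustration only.

```python
def java_hash_code(s):
    """Emulate Java's String.hashCode() for ASCII text (illustrative helper)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep only the low 32 bits, like a Java int
    # Reinterpret the unsigned 32-bit value as a signed int.
    return h - 0x100000000 if h >= 0x80000000 else h


def java_mod(a, b):
    """Emulate Java's % operator: the result takes the sign of the dividend."""
    r = abs(a) % abs(b)
    return -r if a < 0 else r


def unbounded_hash_code(s):
    """The same loop without 32-bit wrapping, as in the Python script."""
    h = 0
    for ch in s:
        h = 31 * h + ord(ch)
    return h


for key in ["hello", "python3"]:
    h = java_hash_code(key)
    raw = java_mod(h, 2)                  # what Java's % 2 would give
    bucket = raw + 2 if raw < 0 else raw  # the correction from the Java loop
    print(key, h, raw, bucket, unbounded_hash_code(key) % 2)
```

Wrapping subtracts a multiple of 2^32, which is even, so the corrected Java bucket and the Python bucket agree even though the hash values themselves differ.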
