.toString(): Java vs. Python ( vs. AWK)

Development time, effort and ease of use :

Python wins

Speed and performance :

Java wins

This issue came up recently at work. We needed a simple command-line tool to read from standard input, apply some simple logic and write the output to a specified file. Developing this in Python was a breeze, perhaps all of 20 lines of code:

import sys

def hashCode(string):
    ''' Port of Java's String.hashCode() (no 32-bit overflow; same parity) '''
    h = 0
    for ch in string:
        h = 31 * h + ord(ch)
    return h

files = {}
files[0] = open('my_output_file_A.txt', 'w')
files[1] = open('my_output_file_B.txt', 'w')
cnt = 0

try:
    for l in sys.stdin:  ## read from stdin
        key = l.split('\t')[0]  ## Split on TAB and get first column
        ext = hashCode(key.strip()) % 2  ## hash key; 0 = file A. 1 = file B
        files[ext].write(l)
        cnt += 1
except KeyboardInterrupt:
    pass  ## Ctrl-C: stop reading, then report and close as usual
finally:
    for f in files.values():
        f.close()

print('Total lines : %d' % cnt)
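A side note on the hashCode port: Java's String.hashCode() does its arithmetic in a 32-bit signed int, which wraps around on overflow, while Python integers grow without bound. For longer keys the two values will differ, even though their parity (all this script needs) stays the same. Here is a minimal sketch of a version that emulates the wraparound. The names java_hash_code and unbounded_hash are mine, not part of the original script, and the sketch assumes keys stay within the Basic Multilingual Plane so that ord() matches Java's UTF-16 code units.

```python
def java_hash_code(text):
    """Emulate Java's String.hashCode() with 32-bit signed wraparound."""
    h = 0
    for ch in text:
        # Keep only the low 32 bits, as a Java int would after overflow.
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as a signed Java int.
    return h - 0x100000000 if h >= 0x80000000 else h


def unbounded_hash(text):
    """The same loop without masking, as in the script above."""
    h = 0
    for ch in text:
        h = 31 * h + ord(ch)
    return h


key = "some_fairly_long_key_value"  # made-up key, for illustration only
print(java_hash_code(key), unbounded_hash(key))
# The low bit survives the wraparound, so both pick the same output file.
print(java_hash_code(key) % 2 == unbounded_hash(key) % 2)
```

If the tool ever split across more files than a power of two, the masked version would be the safer one to keep in step with Java.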

Python performance on 10 Million lines of input : 58 sec.

Not bad really. I was happy. Until I happened to be doing something very similar with only basic Unix commands (i.e. AWK).

Time for a similar job on the same 10 Million lines : ~10 secs.

What's up with that? Perhaps it was the hashCode function? Profiling showed about 1/2 the time was spent on I/O and only 1/4 in the function. So even with zero I/O and no hashCode function we'd still be slower than a comparable command-line script?
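Taking those profiling fractions at face value (they are rounded, so this is only a back-of-the-envelope sketch), the arithmetic works out as follows. The variable names are just for illustration.

```python
total_sec = 58.0   # measured Python run on 10 million lines
awk_sec = 10.0     # approximate AWK run on the same input

io_sec = total_sec / 2     # roughly half the time in I/O
hash_sec = total_sec / 4   # roughly a quarter in hashCode
remaining_sec = total_sec - io_sec - hash_sec

print("I/O: %.1f s, hashCode: %.1f s, everything else: %.1f s"
      % (io_sec, hash_sec, remaining_sec))
print("Still slower than AWK with I/O and hashing removed: %s"
      % (remaining_sec > awk_sec))
```

So the leftover interpreter overhead alone, about 14.5 seconds, is already more than the whole AWK run.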

Hmm, let me see how this would perform in a compiled language instead of an interpreted one. Porting this code over to Java took a bit longer than the Python counterpart. Long story short:

BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
String line;
int cnt = 0;

while ((line = br.readLine()) != null) {
    cnt++;
    int idx = line.indexOf('\t');
    /* No TAB on the line: hash the whole line, as the Python split does */
    String key = (idx < 0) ? line : line.substring(0, idx);
    int mod = key.hashCode() % 2;
    if (mod < 0) mod = mod + 2; /* Python/corrected modulus */
    files[mod].write(line + '\n'); /* BufferedWriter[2] Code omitted for space */
}
br.close();
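The "corrected modulus" line is there because Java's % takes the sign of the dividend, so a negative hashCode() can give -1, while Python's % with a positive divisor never goes negative. A small sketch of the difference, written in Python: java_remainder is a hypothetical helper that mimics Java's behaviour, not something from either program above.

```python
def java_remainder(a, b):
    """Mimic Java's % for ints: the result takes the sign of the dividend a."""
    r = abs(a) % abs(b)
    return -r if a < 0 else r


def corrected_mod2(h):
    """Apply the same fix-up as the Java loop: move a negative remainder into 0..1."""
    mod = java_remainder(h, 2)
    if mod < 0:
        mod += 2
    return mod


# Columns: value, Java-style remainder, corrected remainder, Python's own %
for h in (-7, -4, 0, 5):
    print(h, java_remainder(h, 2), corrected_mod2(h), h % 2)
```

With the fix-up applied, both programs should route a given key to the same file index.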


Saturday, April 24, 2010
