Saturday, April 24, 2010

CDB: wicked fast lookups

CDB is D.J. Bernstein's constant database implementation: a very fast on-disk key/value store that, thanks to memory mapping, behaves like an in-memory index. Wicked fast, in fact.


If you're on Ubuntu/Debian install it with the command:
sudo apt-get install tinycdb
CDB works as a two-level hash: you get constant-time lookups at the cost of some extra space. It uses a 2K header region at the front of the file as the first-level index, with variable-length second-level hash tables following it. So given a key, CDB needs just two constant-time hash lookups to figure out where in the file the key/value pair lives, and then at most one or two disk reads to get your value.
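For the curious, the hash arithmetic behind those two lookups can be sketched in a few lines. This follows the cdb specification's hash function; `locate` is my own illustrative helper, not a tinycdb API:

```python
def cdb_hash(key):
    """cdb's hash: start at 5381, then h = ((h << 5) + h) ^ byte, mod 2**32."""
    h = 5381
    for c in key:
        h = (((h << 5) + h) ^ c) & 0xFFFFFFFF
    return h

def locate(key):
    """First-level slot for a key: one of the 256 header entries."""
    h = cdb_hash(key)
    return h, h & 255   # the header entry holds (table position, slot count);
                        # the probe inside that table starts at (h >> 8) % nslots
```

Both lookups are pure arithmetic on the hash value, which is why they take constant time regardless of how many records the file holds.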

The reason the initial header is 2K is due to how inodes in most operating systems work. It's so small it will always be cached in memory, so the OS will not need to go to disk to retrieve it on each lookup. The second-level hash is right after, so you won't be doing much reading before you can start using the CDB. That means very fast startup times as well.

CDB uses a very simple layout for its keys and values. The format for reading and writing a cdb is:

+klen,vlen:key->val\n
That's 2 integers for the lengths of the key and value, followed by the key and value themselves. Obviously none of those other characters are stored internally; only 8 bytes are added before the bytes of the key/value.
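If you're generating that input programmatically, the record line is trivial to emit. Here's a throwaway helper of mine (not part of tinycdb) that produces cdbmake's textual input format:

```python
def cdb_record(key, val):
    """One key/value pair in cdbmake's textual input format."""
    return "+%d,%d:%s->%s\n" % (len(key), len(val), key, val)
```

A complete input file is just a series of these lines followed by one empty line.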

The tinycdb implementation has a -m option that allows the input to be even simpler:
key value \n
That's simply a key followed by any whitespace character and then the value on each line. This makes command line usage of CDB a breeze:

$ echo -e "key1 foo\nkey2 bar" | cdb -cm lookups.cdb ### -c to create a cdb
$ cdb -qm lookups.cdb key1 ### -q to query a cdb
foo
$ cdb -qm lookups.cdb key2
bar

On top of the 4+4 bytes for the integer lengths, each record costs 16 more bytes of hash-table slots. That's a total of 24 bytes added on top of the size of the keys and values, plus the 2K initial header region.
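That makes a back-of-the-envelope size estimate easy. This is my own helper, assuming the 24-byte-per-record overhead above:

```python
def cdb_size_estimate(pairs):
    """2K header + 24 bytes of overhead per record (8 for the two lengths,
    16 for the hash-table slots) + the raw key and value bytes."""
    return 2048 + sum(24 + len(k) + len(v) for k, v in pairs)
```

For the two-record lookups.cdb above this predicts 2048 + (24+4+3) + (24+4+3) = 2110 bytes.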

CDB files are limited to 4GB in the default C implementation. That's due to the maximum value of a 32-bit unsigned integer, which is used for file offsets. If you want more, there are two options: a 64-bit implementation and an L-shift implementation.

The 64-bit version can address an ungodly amount of space - which is also its disadvantage: that's a lot of hot-air unused bytes per offset. The L-shift version works by shifting stored offsets left by, say, 1, 2 or 3 bits, giving you 33-, 34- or 35-bit addresses - effectively up to 8 times the space: 8 * 4GB = 32GB of referenceable space. Both L-shift and 64-bit implementations are more than enough for common uses, but both use a different file format than the default CDB implementation, so you lose portability.
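The shift trick itself is plain integer arithmetic. A sketch of the idea (illustrative only, not the actual implementation):

```python
BITS = 3  # 3 extra address bits: 32 + 3 = 35 bits of file offset

def pack_offset(real_offset):
    # only works if records are aligned on (1 << BITS)-byte boundaries
    assert real_offset % (1 << BITS) == 0
    return real_offset >> BITS       # now fits in a 32-bit field

def unpack_offset(stored):
    return stored << BITS

# the largest offset a 32-bit field can now reference: just under 32GB
max_offset = ((1 << 32) - 1) << BITS
```

The catch is the alignment requirement: every record must start on an 8-byte boundary, or the low bits of the real offset would be lost in the shift.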

CDB has bindings in Python, Ruby, Perl and Java, to name a few. Unfortunately, the Java implementation on the website, while interesting, is abysmally slow. But we'll fix that in a later post...

One last secret ingredient CDB uses for its excellent performance: mmap().

mmap() is a kernel system call that asks the OS to map an entire file, or part of one, to a region of memory. Think virtual memory and swap space. Operating systems have been doing this for a long while now and are particularly good at managing memory - it is their job, after all.

Basically, from the developer's point of view the file is presented as a simple contiguous array of bytes. But in reality the OS is swapping chunks of the file between memory and disk on the fly, on an as-needed basis. That's a lot of complexity pushed away from you, which makes development much simpler.
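In Python the same idea takes a few lines. A toy demonstration on a temp file (standing in for a cdb file), not cdb itself:

```python
import mmap, os, tempfile

# write a small file to stand in for a cdb file
fd, path = tempfile.mkstemp()
os.write(fd, b"hello mmap world")
os.close(fd)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    chunk = mm[6:10]   # looks like slicing a byte array; the OS pages it in lazily
    mm.close()
os.remove(path)
```

The slice `mm[6:10]` reads straight out of the mapped region - no explicit seek() or read() calls anywhere.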

You're not gonna get much faster than this when working with very large files on disk.

Java vs. Python (vs. AWK)

Development time, effort and ease of use:
Python wins


Speed and performance:
Java wins

This issue arose just recently at work. We needed a simple command line tool to read from standard input, do some simple logic and write output to a specified file. Developing this in Python was a breeze. Perhaps all of 20 lines of code:


import sys

def hashCode(string):
  ''' Java's String.hashCode() algorithm (sans Java's 32-bit
      overflow, which doesn't matter for a % 2 split) '''
  h = 0
  for ch in string:
    h = 31 * h + ord(ch)
  return h

try:
  files = {}
  files[0] = open('my_output_file_A.txt', 'w')
  files[1] = open('my_output_file_B.txt', 'w')
  cnt = 0
  for l in sys.stdin:                  ## read from stdin
    key = l.split('\t')[0]             ## split on TAB, take first column
    ext = hashCode(key.strip()) % 2    ## hash key; 0 = file A, 1 = file B
    files[ext].write(l)
    cnt += 1
  print 'Total lines :', cnt
except KeyboardInterrupt:
  pass

Python performance on 10 Million lines of input : 58 sec.

Not bad really. I was happy. Until I happened to be doing something very similar with only basic unix commands (i.e. AWK).

Time for similar adjustments on same 10 Million lines : ~10 secs.

What's up with that? Perhaps it was the hashCode function? Profiling showed half the time was spent on I/O and only a quarter in that function. So even with zero I/O and no hashCode call we'd still be slower than a comparable command-line script?

Hmm, let me see how this would perform in a compiled language instead of an interpreted one. Porting the code over to Java took a bit longer than the Python counterpart. Long story short:
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
String line;
long cnt = 0;
while ((line = br.readLine()) != null) {
    cnt++;
    int idx = line.indexOf('\t');
    int mod = line.substring(0, idx).hashCode() % 2;
    if (mod < 0) mod += 2;          /* Java's % can be negative; match Python */
    files[mod].write(line + '\n');  /* BufferedWriter[2] setup omitted for space */
}
br.close();

First Post

Wow. First post.

On the internet for over a decade and only now I'm bloggin'. Hope this doesn't become addictive.

So, as you might have guessed, this blog will lean on the technology side.
The main focus will be whatever I'm working on - primarily Java, Python, Unix and approaches to specific problems - and hopefully lots of feedback.

To be more precise, my goal is to bring up a problem, describe a solution and why it is a good one. Then hopefully either I or someone out there will describe a better solution still. And then we face off and have a developer flame war ;)

Emphasis will be on overall approach. In particular - simplicity. Simple wins. As a friend of mine once said,
"...sure it's sexy to build a giant complex application - but if you can do the same with duct tape you win."