Saturday, April 24, 2010

CDB: wicked fast lookups

CDB is D.J. Bernstein's constant database implementation. It is basically a very fast, effectively in-memory index over a single file. Wicked fast, in fact.


If you're on Ubuntu/Debian, install it with:
sudo apt-get install tinycdb
CDB works as a two-level hash: you get constant-time lookups at the cost of a little extra space. The file starts with a 2K header region that acts as the first-level index; it points at 256 variable-length hash tables that make up the second level. So when given a key, CDB needs just two constant-time hash lookups to figure out where in the file the key/value pair lives, and then at most one or two disk reads to fetch your value.

The initial header is 2K because it holds 256 pointers to the second-level tables, each an 8-byte (position, length) pair. It's so small that it will essentially always be cached in memory, so the OS will not need to go to disk to retrieve it on each lookup. And since nothing has to be parsed or loaded up front, start-up times are very fast as well.
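To make that concrete, here is a rough sketch in C of how a lookup resolves against an mmap'ed cdb image, based on my reading of djb's published format description. The names (u32, cdb_lookup) are mine, not the library's, and the real code is more careful about bounds and errors:

#include <stdint.h>
#include <string.h>

/* djb's cdb hash: start at 5381, then h = (h * 33) ^ byte for every byte. */
static uint32_t cdb_hash(const unsigned char *p, uint32_t len) {
    uint32_t h = 5381;
    while (len--)
        h = ((h << 5) + h) ^ *p++;
    return h;
}

/* cdb stores every integer as 32-bit little-endian. */
static uint32_t u32(const unsigned char *p) {
    return p[0] | (p[1] << 8) | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Find key in a memory-mapped cdb image; returns the value (and its length
   through *vlen) or NULL if the key isn't there. */
static const unsigned char *
cdb_lookup(const unsigned char *db, const void *key, uint32_t klen, uint32_t *vlen) {
    uint32_t h = cdb_hash(key, klen);
    /* Level 1: header entry (h % 256) holds that table's position and length. */
    const unsigned char *hdr = db + 8 * (h & 255);
    uint32_t tpos = u32(hdr), tlen = u32(hdr + 4);
    if (tlen == 0)
        return NULL;
    /* Level 2: probe the table, starting at slot (h / 256) % tlen. */
    for (uint32_t i = 0; i < tlen; i++) {
        const unsigned char *slot = db + tpos + 8 * (((h >> 8) + i) % tlen);
        uint32_t slot_hash = u32(slot), rpos = u32(slot + 4);
        if (rpos == 0)
            return NULL;                      /* empty slot: key not present */
        if (slot_hash != h)
            continue;
        const unsigned char *rec = db + rpos; /* record: klen, vlen, key, value */
        if (u32(rec) == klen && memcmp(rec + 8, key, klen) == 0) {
            *vlen = u32(rec + 4);
            return rec + 8 + klen;
        }
    }
    return NULL;
}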

CDB uses a very simple layout for its keys and values. The format for reading and writing a cdb is:

+klen,vlen:key->val\n
That's two integers for the lengths of the key and the value, followed by the key and value themselves. None of the other characters are stored internally; only 8 bytes (the two 32-bit lengths) are added before the bytes of each key/value pair.
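For instance (my own toy records, the same ones used in the example further down), a cdbmake-style input with two entries would look like this, with a blank line marking the end of the input:

+4,3:key1->foo
+4,3:key2->bar
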

The tinycdb implementation has a -m option that allows the input to be even simpler:
key value \n
That's simply a key followed by any whitespace character and then the value on each line. This makes command line usage of CDB a breeze:

$ echo -e "key1 foo\nkey2 bar" | cdb -cm lookups.cdb ### -c to create a cdb
$ cdb -qm lookups.cdb key1 ### -q to query a cdb
foo
$ cdb -qm lookups.cdb key2
bar
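
The same query can be done from C through tinycdb's library API. Here's a minimal sketch, assuming the tinycdb headers are installed and you link against its library (usually -lcdb); error handling is mostly omitted:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <cdb.h>

int main(void) {
    int fd = open("lookups.cdb", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct cdb db;
    cdb_init(&db, fd);                   /* mmap()s the file */

    const char *key = "key1";
    if (cdb_find(&db, key, strlen(key)) > 0) {
        unsigned vlen = cdb_datalen(&db);
        char *val = malloc(vlen + 1);
        cdb_read(&db, val, vlen, cdb_datapos(&db));
        val[vlen] = '\0';
        printf("%s\n", val);             /* prints: foo */
        free(val);
    }

    cdb_free(&db);                       /* releases the mapping */
    close(fd);
    return 0;
}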

On top of the 4+4 bytes for the integer lengths, each record costs another 16 bytes in the hash tables (each table keeps twice as many 8-byte slots as it has entries). That's a total of 24 bytes of overhead on top of each key and value, plus the 2K for the initial header region.
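A quick back-of-the-envelope example (my numbers, just to show the overhead in practice): a database of one million records with 10-byte keys and 50-byte values comes out to roughly

1,000,000 * (10 + 50) = 60,000,000 bytes of keys and values
1,000,000 * 24        = 24,000,000 bytes of per-record overhead
plus the 2,048-byte header
----------------------------------------------------------
~ 84 MB total, i.e. about 40% on top of the raw data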

CDB files are limited to 4GB in the default C implementation. That's due to the maximum value of the 32-bit unsigned integers used for offsets. If you want more, there are two options: a 64-bit implementation and an L-shift implementation.

The 64-bit variant can address an ungodly amount of space, which is also its disadvantage: that's a lot of hot-air, unused bytes per offset. The L-shift variant instead adds, say, 1, 2 or 3 extra bits to the stored offsets by shifting them, giving 33-, 34- or 35-bit addresses and up to 8 times the space: 8 * 4GB = 32GB of referenceable data. Both are more than enough for common uses, but both use a different file format than the default CDB implementation, so you lose portability.
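As I understand it (I haven't seen the patch itself), the trick is just a shift applied to every stored 32-bit offset:

real offset = stored 32-bit offset << n
n = 1  ->  33-bit addresses,  8GB maximum file size
n = 2  ->  34-bit addresses, 16GB maximum file size
n = 3  ->  35-bit addresses, 32GB maximum file size

The presumable cost is that records then have to start on 2^n-byte boundaries, wasting a few padding bytes per entry.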

CDB has bindings for Python, Ruby, Perl and Java, to name a few. Unfortunately, the Java implementation on the website, while of interest, is abysmally slow. But we'll fix that in a later post...

One last secret ingredient CDB uses for its excellent performance: mmap().

mmap() is a kernel system call that asks the OS to map an entire file, or part of one, into a region of memory. Think virtual memory and swap space. Operating systems have been doing this for a long while now, and they are particularly good at managing memory; it is their job, after all.

Basically, from the developer's point of view the file is presented to you as a simple contiguous array of bytes. But in reality the OS is swapping chunks of the file between memory and disk on the fly, on an as-needed basis. That's a lot of complexity pushed away from you, which makes development much simpler.
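For the curious, here's a bare-bones sketch of what that looks like from C, independent of anything CDB-specific:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("lookups.cdb", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    /* Ask the kernel to map the whole file, read-only, into our address space. */
    unsigned char *data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* From here on the file is just a contiguous array of bytes; the kernel
       pages pieces of it in and out behind our back as we touch them. */
    printf("%lld bytes mapped, first byte: 0x%02x\n",
           (long long)st.st_size, data[0]);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}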

You're not gonna get much faster than this when working with very large files on disk.

6 comments:

  1. Moe - wtf man???? you have no better thing to do???

  2. LOL.

    I know, right? But sometimes you're working with something so much you wish you could share the info with others.

    Got a problem with learning?

  3. this post made me cry and it completely changed my life... i heard botond is no longer a virgin after reading this post and filed an application to be a Buddhist monk..

  4. Somehow, I'm thinking the comments will be more interesting than the blog itself :P Where's the Starcraft 2 strategies post you promised!? I'm eagerly awaiting that entry!

  5. Do you have a patch for the lshift implementation to increase the size?

  6. I have no patch nor any access to the L-Shift version of this code.

    I did something even better: split your 4GB+ data over multiple files. Use a simple hash function over the keys to deterministically distribute them across 2, 4 or 8 files, and create a separate CDB for each file. When you need to do a lookup, apply the same hashCode() % 2/4/8 to the key to figure out which CDB to use and voila!

    This way you can get 8/16/32GB of mmap'ed data.

    Be careful, as mmap'ing will still eventually cause swapping once you fill up physical memory. At least that's automatic...
