Live Compression of a GDBM Database

Submitted by Bill St. Clair on Mon, 04 Feb 2008 02:02:10 GMT

A GDBM database file is an easy way to persistently store key/value pairs. Loom uses one for its backing store. One problem with these databases is that they become fragmented over time, accumulating lots of empty, unreclaimed space: holes. They need to be periodically compressed. Patrick Chkoreff and I came up with a neat mechanism to do live compression. He has seen the Loom database file get 20 times bigger than it needed to be. He plans to integrate something similar to this mechanism into the Loom code.

The basic idea mirrors Lisp two-space incremental garbage collectors, but is a little simpler. Call the original GDBM database the "old" database. We open a second database, called the "new" database, and change the database access code so that, over time, it copies the old database to the new one, leaving the new one with a compressed version of the data in the old one, plus any changes the user code has made during the copy operation.
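To make that concrete, here's a minimal sketch of the two-database setup, using PHP's dba extension with the gdbm handler. The file names and the $old/$new handles are my own illustration, not the code in GDBM.php:

  // Illustrative setup: two GDBM files opened through PHP's dba extension.
  // "loom.db" is the existing, fragmented database; "loom.db.new" is
  // created fresh, so it starts with no holes.
  $oldpath = 'loom.db';
  $newpath = 'loom.db.new';
  $old = dba_open($oldpath, 'w', 'gdbm');  // read/write on the old data
  $new = dba_open($newpath, 'c', 'gdbm');  // create the compressed copy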

There's a working example in GDBM.php, in my Loom folder. You can copy and paste from that page, or download it along with my other Loom-related code from billstclair.com/loom.

The idea is simple. There are three operations on a database: read, write, and delete (implemented in the code by get() and put() functions, where a put() of a blank value means delete). A sketch of all three in PHP follows the lists below.

read($key):


  1. If $key has a value in the new database, return it.

  2. If $key has a value in the old database, return it.

  3. Otherwise, return false: the key is not in the database.

write($key, $value):


  1. Write the value to the new database.

  2. Delete the key from the old database.

delete($key):


  1. Delete the key from the old database.

  2. Delete the key from the new database.
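
Here is what those three operations might look like over the $old and $new handles from the setup sketch above. This is my paraphrase of the scheme in terms of PHP's dba functions, not the actual GDBM.php code; db_get and db_put are hypothetical names:

  // read: prefer the new database, fall back to the old one.
  function db_get($old, $new, $key) {
      $value = dba_fetch($key, $new);
      if ($value !== false) return $value;
      return dba_fetch($key, $old);     // false if in neither database
  }

  // write: a blank value means delete, per the get()/put() convention.
  function db_put($old, $new, $key, $value) {
      if ($value === '') {
          // delete: remove the key from both databases
          if (dba_exists($key, $old)) dba_delete($key, $old);
          if (dba_exists($key, $new)) dba_delete($key, $new);
      } else {
          // write: the new database gets the value; the old copy is stale
          dba_replace($key, $value, $new);
          if (dba_exists($key, $old)) dba_delete($key, $old);
      }
  }

Note the invariant: once a key has been written or copied into the new database, that value always shadows whatever may remain in the old one.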

To each of the operations, I also add a call to the copysome() function, which copies some values from the old database to the new one (sketched in code after the steps below). After all values have been copied, we close both databases, delete the old one, which is now empty, rename new to old, and reopen the new "old" database. When the user requests another copy, we open another new database and start again.

copysome():


  1. Fetch the first key/value pair in the old database.

  2. If the key does not yet have a value in the new database, insert the key/value pair there.

  3. Delete the key from the old database.

  4. Repeat for a configurable number of copies, or until the old database is empty.
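
Here's a sketch of copysome() and the final swap, again using PHP's dba functions and illustrative names. Because step 3 deletes each key as it goes, dba_firstkey() always yields the next uncopied pair, and the old database shrinks to empty:

  // Copy up to $count pairs from old to new; return true when old is empty.
  function copysome($old, $new, $count = 10) {
      for ($i = 0; $i < $count; $i++) {
          $key = dba_firstkey($old);
          if ($key === false) return true;   // old database is empty: done
          if (!dba_exists($key, $new)) {
              // don't clobber a value the user wrote during the copy
              dba_insert($key, dba_fetch($key, $old), $new);
          }
          dba_delete($key, $old);
      }
      return false;                          // more pairs remain
  }

  // When copysome() returns true: close both, delete old, rename, reopen.
  function finish_copy(&$old, &$new, $oldpath, $newpath) {
      dba_close($old);
      dba_close($new);
      unlink($oldpath);                // the old database is empty now
      rename($newpath, $oldpath);      // new becomes the new "old"
      $old = dba_open($oldpath, 'w', 'gdbm');
      $new = null;   // opened again when the next copy is requested
  }

The $count batch size is the tuning knob: larger batches finish the compression sooner, smaller batches spread the cost more thinly across user operations.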

At no time is it necessary to close the database for off-line compression. All you need is enough disk space to store two copies of the data, and a willingness to tolerate a slight slow-down in access speed during the copy.


Comments (1):

db expansion

Submitted by Patrick on Tue, 05 Feb 2008 19:52:40 GMT

"He has seen the Loom database file get 20 times bigger than it needed to be."

Only under extreme repeated cycles of deletion and insertion. Also, this did not happen in the Loom database itself, but in a database on another server where we run a hedge fund accounting business. There we have a particular application which deletes a massive recursive structure and re-creates it from scratch. Repeated runs of that application would cause "holes" in the database file which apparently were difficult to re-use in a timely manner -- hence the expansion of the database file.

The GDBM database is one of the fastest, simplest, and most reliable pieces of code I've seen. The only drawback I have seen is the fragmentation under extreme circumstances, and that drawback is neatly avoided with the incremental copy technique.
