
Thursday, October 24, 2013

Multi-column (SQL-like) sorting in Redis

Recently, I received an email from a wayward Redis user asking about using Redis Sets and Sorted Sets to sort multiple columns of data, with as close as possible to the semantics of a traditional SQL-style "order by" clause. It is possible, with limitations; keep reading to find out how.

What is Redis?


For those people who don't quite know what Redis is already, the TLDR version is: an in-memory data structure server that maps from string keys to one of 5 different data structures, providing high-speed remote access to shared data, and optional on-disk persistence. In a lot of ways, you can think of Redis like a version of Memcached where your data doesn't disappear if your machine restarts, and which supports a wider array of commands to store, retrieve, and manipulate data in different ways.

The setup


With that out of the way, our intrepid Redis user had come to me with a pretty reasonable problem to have; he needed to build an application to display a listing of businesses, sorted by several criteria. In his case, he had "price", "distance[1]", and "rating". As we have all seen in recent years with retailer searches, never mind restaurant searches on Yelp and similar applications, when searching for something in the physical world there are a few criteria you care about most. These usually break down, in order of preference, as lowest distance, lowest price, highest rating. In a relational database/SQL world, these fields would all be columns in a table (or spread out over several tables, or calculated in real-time), so we are going to be referring to them as "sort columns" from here on.

Now, depending on preferences, both the column priorities and the ascending/descending directions can change, which is why we need to build a system that supports reordering columns *and* switching the direction of each individual column. Say that we really want the highest rating, then the lowest distance, then the lowest price? We need to support that too, and we can.

The concept


Because we are dealing with sort orders, we have two options. We can either use the Redis SORT command, or we can use sorted sets. There are ways of building this using the SORT command, but it is much more complicated and requires quite a bit of precomputation, so we'll instead use sorted sets.

We will start by making sure that every business has an entry in each of 3 different sorted sets representing price, distance, and rating. If a business has an "id" of 5, has a price of 20, distance of 4, and a rating of 8, then at some point the commands "ZADD price 20 5", "ZADD distance 4 5", and "ZADD rating 8 5" will have been called.
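In redis-py terms, the indexing step might look like the following sketch (this assumes the modern client, where ZADD takes a mapping of members to scores; `index_business` is a name of my choosing):

```python
def index_business(conn, business_id, price, distance, rating):
    # one ZADD per sort column: the member is the business id, and the
    # score is that column's value for this business
    pipe = conn.pipeline(True)
    pipe.zadd('price', {business_id: price})
    pipe.zadd('distance', {business_id: distance})
    pipe.zadd('rating', {business_id: rating})
    pipe.execute()
```

Calling `index_business(conn, '5', 20, 4, 8)` issues the same three ZADD commands as above, in a single pipelined round-trip.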

Once all of our data is in Redis, we then need to determine the maximum value of each of our sort columns. If you have ranges that you know are fixed, like say that you know that all of your prices and distance will all be 0 to 100, and your rating will always be 0 to 10, then you can save yourself a round-trip. We'll go ahead and build this assuming that you don't know your ranges in advance.

We are trying to gather our data range information in advance in order to carve up the 53 bits of precision [2] available in the floating-point doubles that are available in the sorted set scores. If we know our data ranges, then we know how many "bits" to dedicate to each column, and we know whether we can actually sort our data exactly, without losing precision.

If you remember our price, distance, and rating information, you can imagine (borrowing our earlier data) that if we have price=20, distance=4, rating=8, and we want to sort by price, distance, -rating, we want to construct a "score" that will sort the same as the "tuple" comparison (20, 4, -8). By gathering range information, we could (for example) translate that tuple into a score of "20042", which you can see is basically the concatenation of "20", "04", and 10-8 (we subtract from 10 here because the rating column is reversed, and it helps to understand how we got the values).

Note: because of our construction, scores that are not whole numbers may not produce completely correct sorts.
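To make that intuition concrete, here is a small sketch (all names are mine, and it uses max value + 1 as the base for each column rather than string concatenation) showing that a packed score orders items exactly like the tuple comparison:

```python
def pack_score(price, distance, rating,
               max_price=100, max_distance=100, max_rating=10):
    # shift each column into its own "digit" position, highest-priority
    # column first; rating is reversed so higher ratings sort earlier
    score = price
    score = score * (max_distance + 1) + distance
    score = score * (max_rating + 1) + (max_rating - rating)
    return score

businesses = [(20, 4, 8), (20, 4, 9), (20, 3, 1), (19, 99, 0)]
by_tuple = sorted(businesses, key=lambda b: (b[0], b[1], -b[2]))
by_score = sorted(businesses, key=lambda b: pack_score(*b))
# by_tuple and by_score produce the same ordering
```

Since every column fits inside its own base, the packed integer comparison and the lexicographic tuple comparison can never disagree, as long as the totals stay inside the 53 bits of score precision.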

The code


Stepping away from the abstract and into actual code, we are going to perform computationally what I just did above with some string manipulation.  We are going to numerically shift our data into columns, accounting for the magnitude of the data, as well as negative values in the columns (which won't affect our results). As a bonus, this method will even tell you if it believes that you could have a lower-quality sort because your data range is too wide[3].

import math
import warnings

def sort_zset_cols(conn, result_key, sort=('dist', 'price', '-score')):
    current_multiplier = 1
    keys = {result_key: 0}
    # Process columns from lowest to highest priority
    sort = list(reversed(sort))

    # Gets the max/min values in each sort column; a leading '-' only
    # marks a descending sort, it is not part of the key name
    pipe = conn.pipeline(True)
    for sort_col in sort:
        sort_col = sort_col.lstrip('-')
        pipe.zrange(sort_col, 0, 0, withscores=True)
        pipe.zrange(sort_col, -1, -1, withscores=True)
    ranges = pipe.execute()

    for i, sort_col in enumerate(sort):
        # Auto-scaling for negative values; ZRANGE ... WITHSCORES returns
        # (member, score) pairs, so pull the score out of the first pair
        low, high = ranges[i*2][0][1], ranges[i*2+1][0][1]
        maxv = int(math.ceil(max(abs(low), abs(high))))

        # Adjusts the weights based on the magnitude and sort order of the
        # column
        old_multiplier = current_multiplier
        desc = sort_col.startswith('-')
        sort_col = sort_col.lstrip('-')
        current_multiplier *= maxv

        # Assign the sort key a weight based on all of the lower-priority
        # sort columns
        keys[sort_col] = -old_multiplier if desc else old_multiplier

    if current_multiplier >= 2**53:
        warnings.warn("The total range of your values is outside the "
            "available score precision, and your sort may not be precise")

    # The sort results are available in the passed result_key
    return conn.zinterstore(result_key, keys)

If you prefer to check the code out at Github, here is the gist. Two notes about this code:
  • If the maximum or minimum values in any of the indexed columns become more extreme between the data range check and the actual query execution, some entries may have incorrect ordering (this can be fixed by translating the above to Lua, using the Lua scripting support available in Redis 2.6 and later)
  • If any of your data is missing in any of the indexes, then that entry will not appear in the results
Within the next few weeks, I'll be adding this functionality to rom, my Python Redis object mapper.

Interested in more tips and tricks with Redis? My book, Redis in Action (Amazon link), has dozens of other examples for new and seasoned users alike.


[1] For most applications, the distance criteria is something that would need to be computed on a per-query basis, and our questioning developer already built that part, so we'll assume that is available already.
[2] Except for certain types of extreme-valued doubles, you get 52 bits of actual precision, and 1 bit of implied precision. We'll operate under the assumption that we'll always be within the standard range, so we'll always get the full 53 bits.
[3] There are ways of adjusting the precision of certain columns of the data (by scaling values), but that can (and very likely would) result in scores with fractional components, which may break our sort order (as mentioned in the notes).

Edit on 2015-08-09: Changed the "maxv =" assignment above to be correct in all cases, and made sure the reversed sort (for calculating ranges) is a list for repeated iteration.

Thursday, January 12, 2012

Creating a Lock with Redis

This post is a partially rewritten excerpt from Redis in Action, the book that I am writing for Manning Publications, which does not currently have a release date. I hope to include other excerpts from the book as time allows.

Redis Transactions

In the world of concurrent and parallel programming, there will be some point where you will write something that requires access to a resource without any other thread or process accessing it. In Redis this is actually very common; multiple readers, multiple writers, all dealing with the same data structures. Redis includes a combination of five commands for handling optimistic data locking. Those commands are WATCH, UNWATCH, MULTI, EXEC, and DISCARD.

The typical path for code that intends to modify data in Redis that requires checking values before updating will have a 4-step process. First you WATCH data that you want to modify/check on. Next you check your data. If it isn't what you need to continue, you UNWATCH and return. If it was what you were looking for, you start a MULTI transaction, send your commands, then EXEC them. Below is a simple example that transfers money between two accounts, making sure to prevent overdrafts.

import redis

def transfer_funds(conn, sender, recipient, amount):
    pipe = conn.pipeline(True)
    while 1:
        try:
            pipe.watch(sender)
            # while WATCHing, the pipeline executes commands immediately
            if int(pipe.hget(sender, 'money') or 0) < amount:
                pipe.unwatch()
                return False

            pipe.multi()
            pipe.hincrby(sender, 'money', -amount)
            pipe.hincrby(recipient, 'money', amount)
            pipe.execute()
            return True
        except redis.exceptions.WatchError:
            pass
The only command not used above is the DISCARD command, which will discard any commands passed after MULTI. As you can see, we will keep retrying the money transfer until it either fails due to insufficient funds or succeeds. This is fine, but if you have some account that is updated often, or if you have some shared data structure that is constantly being updated, you can fall into the except clause and have to retry the transaction repeatedly.

Locking

In other software, rather than using WATCH/MULTI/EXEC, there is a primitive called a Lock, which allows us to gain exclusive access to a resource. You acquire the lock, perform your operation, then release the lock. Because of the variety of commands in Redis, you can build a lock using the available tools. Unfortunately, using those tools correctly is not easy. In my research about locks in Redis, I haven't found a single implementation available online that is 100% correct (until now). Some of the problems with locks that I have seen are as follows:
  1. A process acquired a lock, operated on data, but took too long and the lock was automatically released. The process doesn't know that it lost the lock, or may even release the lock that some other process has since acquired.
  2. A process acquired a lock for an operation that takes a long time, and crashed. Other processes that want the lock don't know which process held it, so they cannot detect that the holder failed, and waste time waiting for the lock to be released.
  3. One process had a lock, but it timed out. Other processes try to acquire the lock simultaneously, and multiple processes are able to get the lock.
  4. Because of a combination of #1 and #3, many processes now hold the believed exclusive lock, leading to data corruption or other incorrect behavior.
Even if each of these problems had a 1 in a million chance of occurring, Redis' high performance (typically 100,000 to 225,000 operations/second) can cause those problems to occur under heavy load surprisingly often (up to a few times per minute), so it is important to get locking right.

Building a mostly correct lock in Redis is pretty easy. Building a completely correct lock in Redis isn't actually much more difficult, but it requires being extra careful about the operations we use to build it (in the book, we first build a simple lock to show the basics, then add in the full functionality, here we will jump to the fully-featured lock).

The first part of making sure that no other code can run is to "acquire" the lock. The natural building block to use for acquiring a lock is the SETNX command, which will only set a value if the value doesn't already exist. We will set the value to be a unique identifier to ensure that no other process can get the lock. If we were able to set the value (we have acquired the lock), then we immediately set the expiration of the key to ensure that if we take too long with our operation, the lock is eventually released. But if our client happens to crash (and the worst place for it to crash for us is between SETNX and EXPIRE), we still want the lock to eventually time out. To handle that situation, any time a client fails to get the lock, the client will check the expiration on the lock, and if it's not set, set it. Because clients are going to be checking and setting timeouts if they fail to get a lock, the lock will always have a timeout, and will eventually expire, letting other clients get a timed out lock.

But what if multiple clients set the expiration time at the same time? That is fine, they will be running essentially at the same time, so the expiration will be set for the same time.

import time

def acquire_lock(conn, lockname, identifier, atime=10, ltime=10):
    end = time.time() + atime
    while end > time.time():
        if conn.setnx(lockname, identifier):
            conn.expire(lockname, ltime)
            return identifier
        else:
            ttl = conn.ttl(lockname)
            # no expiration on the lock means its holder crashed between
            # SETNX and EXPIRE; set the expiration on the holder's behalf
            if ttl is None or ttl < 0:
                conn.expire(lockname, ltime)

        time.sleep(.001)

    return False
To release the lock, we have to be at least as careful as when acquiring the lock. Between the time when we acquired the lock and when we are trying to release it, someone may have done bad things to the lock. To release the lock, we actually need to WATCH the lock key, then check to make sure that the value is still the same as what we set it to before we delete it. This also prevents us from releasing a lock multiple times.
def release_lock(conn, lockname, identifier):
    pipe = conn.pipeline(True)

    while True:
        try:
            pipe.watch(lockname)
            # assumes a client where GET returns str (e.g. a connection
            # created with decode_responses=True), so it compares equal
            # to the identifier we originally SET
            if pipe.get(lockname) == identifier:
                pipe.multi()
                pipe.delete(lockname)
                pipe.execute()
                return True

            pipe.unwatch()
            break

        except redis.exceptions.WatchError:
            pass

    # we lost the lock
    return False
And as a bonus, a Python context manager with the updated transfer_funds() code using it...
import contextlib, uuid

class LockException(Exception):
    pass

@contextlib.contextmanager
def redis_lock(conn, lockname, atime=10, ltime=10):
    identifier = str(uuid.uuid4())
    if acquire_lock(**locals()) != identifier:
        raise LockException("could not acquire lock")
    try:
        yield identifier
    finally:
        if not release_lock(conn, lockname, identifier):
            raise LockException("lock was lost")

def transfer_funds(conn, sender, recipient, amount):
    with redis_lock(conn, 'lock:' + sender):
        if int(conn.hget(sender, 'money') or 0) < amount:
            return False

        pipe = conn.pipeline(True)
        pipe.hincrby(sender, 'money', -amount)
        pipe.hincrby(recipient, 'money', amount)
        pipe.execute()
        return True
If you generated your identifier correctly (UUID like we did, or IP address + process id + thread id, etc.), this lock is correct. Not almost correct, not mostly correct, not a little bit wrong. Completely correct. Other locks that I've seen for Redis have one of a couple different mistakes, usually either accidentally resetting the timeout when you shouldn't, or deleting the lock when you shouldn't. Those kinds of errors lead to the 4 problems I listed earlier.
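As a sketch of the two identifier schemes mentioned above (the function name is mine; any scheme works as long as two lock holders can never produce the same identifier):

```python
import os
import socket
import threading
import uuid

def lock_identifier(use_uuid=True):
    # a random UUID is the simplest collision-free choice
    if use_uuid:
        return str(uuid.uuid4())
    # host + process id + thread id also uniquely names a lock holder
    return '%s:%s:%s' % (socket.gethostname(), os.getpid(),
                         threading.current_thread().ident)
```

Whichever form you pick, generate a fresh identifier per acquisition attempt and pass the same value to both acquire_lock() and release_lock().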

Simplifying Locking

After reading this, you may think that this is a little more work than should be necessary to build a basic lock with timeouts. You are not alone. Two requests have been made to try to help the situation. The first is allowing SETNX to take an optional 3rd argument that is the expiration time if the item was set. That reduces the Redis commands in acquire_lock() to one command that is looped over (the 10 lines above turn into 4 lines). The second is a new command DELC <key> <value>, which will only delete the given key if the current value is the same as the passed value. This reduces the commands in release_lock() to one command that is executed once in the body of the function (the 15 lines above turn into 2 lines). You can read and follow the discussion on the Redis mailing list.


Thank You

If you liked this article about Redis, the section on locks that will be included in the book expands the discussion and includes more examples. If Redis changes to incorporate the new commands and features to make locking easier, you can look forward to another article revisiting classic Redis locking behavior. If you want to see more posts like this, you can buy my book, Redis in Action from Manning Publications today!

Tuesday, February 15, 2011

Some Redis Use-cases

About 6 months ago, the organizer of the LA NoSQL meetup was looking for presenters. Since my coworkers and I had been using Redis fairly heavily for a few months, I offered to do a presentation on Redis. Sadly, that presentation never happened, as the event was delayed and then cancelled for one reason or another. Because I've had the slides, and because I think the information is still useful, I thought I would reformat and rewrite it as a blog post.

What is Redis? From redis.io:
Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.
For those of you who are familiar with memcached, it's sort of like that, but more. That is, while you can only store strings/integers with memcached, you can store a few pre-defined data structures with Redis, which are optionally persisted to disk.

Let's start with the basic strings/integers. You can use strings as a serialized cached database row, maybe session storage, a resettable counter (incr with getset), or using setnx for distributed locks. Really, anything that you might have used memcached for, you can use Redis for.
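For example, the resettable counter mentioned above might look like this (a sketch assuming redis-py; the function names are mine):

```python
def hit(conn, key):
    # INCR: count an event; the key is created at 0 on first use
    return conn.incr(key)

def fetch_and_reset(conn, key):
    # GETSET: atomically read the current count while resetting it to 0,
    # so no hits are lost between the read and the reset
    return int(conn.getset(key, 0) or 0)
```

GETSET is what makes the reset safe: a plain GET followed by a SET could lose hits that arrive between the two commands.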

Many people have found Redis lists to be useful as a simple FIFO work queue (with the ability to insert/pop from either end, move items from one list to another atomically, limit list length, etc.). Lists can also be the source of a SORT call (and are always the destination when using the STORE option), which can return the list items themselves, or automatically pull results from related string keys or hashes.

Simple 0/1 queue:
def add_work(item):
    rconn.lpush('work-queue', item)

def get_work():
    return rconn.rpop('work-queue')

def retry(item):
    rconn.rpush('work-queue', item)

There is also the set datatype, which has all of the common union, intersection, and difference operations available across set keys. Common use-cases include de-duplicating items for work queues, keeping 'classes' of items (rather than keeping a user:id:sex -> m, you can use user:sex:m -> {id1, id2, ...}), or even as a set of documents for a Redis-backed search engine.
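For instance, the work-queue de-duplication mentioned above can be sketched by pairing a set with the earlier list-based queue (assuming redis-py; the key and function names are mine):

```python
def add_work_once(conn, item):
    # SADD returns 1 only the first time an item is added to the set,
    # so each unique item is pushed onto the queue at most once
    if conn.sadd('known-work', item):
        conn.lpush('work-queue', item)
```

A worker that finishes an item would SREM it from 'known-work' so the item can be queued again later.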

More complex 1+ queue:
def add_work(item):
    rconn.lpush('work-queue', item)

def get_work():
    return rconn.rpoplpush('work-queue', 'in-progress')

def done(item):
    rconn.sadd('done', item)

def retry_failed():
    while rconn.llen('in-progress'):
        check = rconn.lindex('in-progress', -1)
        if not rconn.sismember('done', check):
            rconn.rpoplpush('in-progress', 'work-queue')
        else:
            rconn.srem('done', rconn.rpop('in-progress'))

Another very useful datatype in Redis is the hash. In Redis, hashes map string keys to string or integer values. Useful as a way of gathering similar kinds of data together, a hash can store a row from a database table with each column as an entry, which allows for sorting and retrieval via SORT and the various hash access methods. Add in the ability to increment/decrement columns in hashes, pull full hashes, etc., plus the existence of an object/model mapper, and Redis can easily replace many uses of a traditional SQL database. Throw in Jak Sprat's Alchemy Database, which adds a SQL layer with Lua scripting inside Redis, and for small data sets, a Redis solution may be all you need.

Ad-hoc data sorts:
def insert(data):
    rconn.sadd('known-ids', data['id'])
    rconn.hmset('data:%s'%(data['id'],), data)

def sort_fetch(column, desc=True, num=10):
    results = rconn.sort('known-ids', start=0, num=num, desc=desc, by='data:*->%s'%(column,))
    p = rconn.pipeline(False)
    for id in results:
        p.hgetall('data:' + id)
    return p.execute()

For those use-cases where having a sortable score over unique items is useful, Redis has the zset or sorted set data type, where each member in the set also has an associated float/double score, which produces an ordering over all keys in the sorted set, and which you can query by member, score, or rank. Some common use cases include priority queues, tag clouds, timeouts, rate limiting, and Redis-backed scored search engine.

Rate limiting:
def can_use(key, count, limit, timeout):
    pipe = rconn.pipeline(True)
    if rconn.zrank('reset', key) is None:
        pipe.zadd('reset', key, time.time() + timeout)
    pipe.hincrby('counts', key, count)
    return pipe.execute()[-1] <= limit

def reset_counters():
    default = ((None, None),)
    key, ts = (rconn.zrange('reset', 0, 0, withscores=True) or default)[0]
    while ts is not None and ts <= time.time():
        pipe = rconn.pipeline(True)
        pipe.zrem('reset', key)
        pipe.hdel('counts', key)
        pipe.zrange('reset', 0, 0, withscores=True)
        key, ts = (pipe.execute()[-1] or default)[0]
Redis does support key expiration, but in pre-2.2 versions, key expiration can have confusing behavior. Use the following to expire keys manually instead.

Manual key expiration:
def set_expire(key, timeout):
    rconn.zadd('expire', key, time.time()+timeout)

def expire_keys():
    p = rconn.pipeline(True)
    for key in rconn.zrangebyscore('expire', 0, time.time()-1):
        p.delete(key)
        p.zrem('expire', key)
    p.execute()
With these simple ideas and structures, even more complex behavior can be defined. Things like per-user prioritized queues, counting semaphores (for limiting worker counts in this case), per-page/site recent viewer lists, finding jobs that a user has the skills to perform (aka the pizza topping problem), navigation trees, and many more.
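As one example, the per-page recent viewer list can be sketched with a capped list (a sketch assuming redis-py; the key names and limit are my own choices):

```python
def record_view(conn, page, viewer, limit=100):
    # newest viewer at the head of the list, capped at `limit` entries
    pipe = conn.pipeline(True)
    pipe.lpush('recent:' + page, viewer)
    pipe.ltrim('recent:' + page, 0, limit - 1)
    pipe.execute()

def recent_viewers(conn, page):
    return conn.lrange('recent:' + page, 0, -1)
```

LTRIM after every LPUSH keeps the list bounded without any periodic cleanup job; a sorted set keyed by timestamp would additionally de-duplicate repeat viewers.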

If you want to see more posts like this, you can buy my book, Redis in Action from Manning Publications today!

Sunday, October 10, 2010

YogaTable as a Database Server

As promised in my last update, YogaTable is no longer an embedded database. Included in the source is a new server component, which listens for requests on a configurable host and port, defaulting to localhost:8765.

I have included a client for Python, which has everything necessary for basic and advanced YogaTable use. The protocol is basically JSON over HTTP GET/POST, which makes it straightforward for interacting with using just about any language. I am in the process of documenting what is necessary to write new clients, and will be writing a client for Javascript, as well as a more advanced Python client library. Some simple benchmarks with Apache Bench tell me that YogaTable can perform 60 single inserts/second, and around 2500 bulk inserts/second, but that's in mostly ideal conditions.

One of the features that I am most excited about is being able to script the modification of multiple rows in the database with Lisp. I've taken a merged version of Peter Norvig's lis.py and lispy.py, improved the performance, removed some unnecessary features (some of which were unnecessary for database updates), added some other features, and ... Well, let's just see what it looks like. The following is an example from the YogaTable's tests. It shows how you can transactionally update two rows in the database at the same time, and more specifically, how one could implement transferring money from one account to another.

First, let's set up our rows in the database.
d1 = {'value':decimal.Decimal('200.00')}
d2 = {'value':decimal.Decimal('0.00')}
ids = zip(*self.table.insert([d1, d2]))[0]
d1['_id'] = ids[0]
d2['_id'] = ids[1]

Now, let's set up our shared data, and prepare for the output of our test.
shared = {'transfer':decimal.Decimal('45.23')}
d1['value'] -= shared['transfer']
d2['value'] += shared['transfer']

Let's actually perform the conditional update...
out = self.table.update([
    {'_id':ids[0],
     '__ops':'''
        (load types)
        (define zero (decimal `0.00))
        (define balance (getv `doc `value zero))
        (define transfer (getv `shared `transfer zero))
        (if (>= balance transfer)
            (begin
                (setv `doc `value (- balance transfer))
                (setv `shared `transferred #t)))
        '''},
    {'_id':ids[1],
     '__ops':'''
        (load types)
        (define zero (decimal `0.00))
        (define balance (getv `doc `value zero))
        (define transfer (getv `shared `transfer zero))
        (if (getv `shared `transferred #f)
            (setv `doc `value (+ balance transfer)))
        (delv `shared `transferred)
        (delv `shared `transfer)
        '''}], shared=shared)

The Lisp in here may look a little strange, as some of it is nonstandard. The first few lines of the operations for the rows load the 'types' module, which offers access to the Python decimal.Decimal datatype (among others), pull some balance information, and determine how much money is supposed to be transferred. The last few lines in the first operation verify that there is enough money in the account, then deduct the money, and set the shared variable 'transferred' to True.

The second operation checks to see if 'transferred' is True, and if so, adds the transferred balance to the second row. The two 'delv' lines in the second operation are merely there to remove the known shared variables so that if someone were to accidentally include a third row, then it wouldn't have access to this data.

And that's it. Money transfers in YogaTable. No need for 2-stage commits.


At this point, you are probably wondering where YogaTable is going as a piece of software. When Google first released AppEngine, one of the things that I was most intrigued by was its Datastore. Some features I'd never seen before (indexes on all of the values in a list, in particular), and I wished that it was available outside of AppEngine. I'd been meaning to write an AppEngine Datastore-like backend for a long time, and some early versions of YogaTable were actually meant to allow people to take the Google AppEngine SDK and plug my backend into it. It was meant as a way of scaling the SDK beyond trivial applications, and really, to allow for the full set of features and functionality offered by AppEngine's Datastore to people who didn't want to run in Google's datacenters. That is not where YogaTable is going.

After having used MongoDB in production, I realized that the current software offerings for databases were missing something. Something that wasn't tied down to schemas like classic relational databases. Something that wasn't limited if you happened to *only* have a 32 bit machine. Something that could offer enough power for building a moderately-used web site (one million hits/day), but was flexible enough to not get in your way while you were developing it.

And thus, YogaTable was born. Aside from the design requirement of never performing table scans, and its current lack of built-in replication/clustering, YogaTable today offers sufficient features to get almost any idea from concept to a million hits/day. And with the introduction of a Lisp interpreter, YogaTable is able to offer functionality that is otherwise very difficult in other systems (the simple multi-row update shown above requires a tricky 2-stage commit using AppEngine's Datastore).


There is still work to be done on YogaTable. Mostly, I need to document everything. From there, next steps include replication, clients in a few different languages, support for read-only replicas, automatic master/slave failover, clustering... But all in good time. Documentation first, features next.

I hope everyone stays interested, I know that I'm having fun.

Tuesday, September 14, 2010

YogaTable Part 2: An Embedded NoSQL Database

It's been far too long since my previous post about YogaTable, but in the process of writing tests, testing, and cleaning up some of my original code, I had discovered a few bugs with how some searches were being performed. Throw in some weekend work on Binary Space Partitions, and you have a recipe for a delayed post.

This version introduces an "embedded" interface to YogaTable.  Similar to how SQLite operates, you specify where you would like your tables to be stored, and you receive a Database instance: >>> import embedded; db = embedded.Database(path).  That Database instance has implicitly-defined tables, which can be accessed via db.table_name.insert/update/delete/search/add_index/drop_index/etc.

Why an embedded database?  Well, for starters, it's a good stepping stone from a storage engine to a full database server.  Once we have an embedded database (especially one with a straightforward interface like YogaTable has), a database serving daemon is just a protocol definition away.  And if you implement your embedded database correctly (like thread safety, etc.), then many of the hard parts relating to multiple clients and routing responses are already solved.

In order to handle index building for new indexes on old data, or index deletion for deleted indexes, and to handle threaded users of the embedded database, we chose to push the processing for each table off into its own process via Python's convenient multiprocessing library.  Commands are forwarded to each of these processors via multiprocessing.Queue instances (one per table), with all responses for all tables coming back via a single queue.  Threads that make requests against a table don't all wait on the same queue; each waits on its own standard Queue.  Responses are routed to the appropriate caller via a special routing thread, which also handles cleanup for threads that have exited.  You can see how routing, process startup, etc., happens in embedded.py.
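A minimal sketch of that routing arrangement (greatly simplified from the real embedded.py; the names are mine, and plain threads with queue.Queue stand in for the multiprocessing.Process and multiprocessing.Queue pieces so the sketch stays self-contained):

```python
import queue
import threading

def table_processor(commands, responses):
    # per-table worker: handle commands strictly in the order received
    # until a None sentinel arrives, tagging each response with the id
    # of the caller that submitted the command
    for caller_id, command, args in iter(commands.get, None):
        result = ('ok', command, args)  # stand-in for real query processing
        responses.put((caller_id, result))

def route_responses(responses, per_caller):
    # the routing thread: hand each response back to its caller's queue
    for caller_id, result in iter(responses.get, None):
        per_caller[caller_id].put(result)
```

Each calling thread registers its own queue in per_caller, sends (caller_id, command, args) down the table's command queue, and blocks only on its own response queue, so replies for different callers never interleave.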

By pushing all processing into secondary processes, we are also able to leverage multiple processors (with multiple tables or databases) and multiple disks (with multiple databases), which should hopefully reduce latency and increase throughput for heavy users.  We do gain some latency with processes thanks to a serialization and deserialization step for each direction, but the convenience of not needing to write a multi-table load-balancer cannot be understated.  Remember, part of this project's purpose is to build on top of and re-use known-good components whenever we can, and letting the OS handle table-level processor and disk scheduling is the right thing to do.  To see how query processing occurs and how we balance queries with index creation and deletion, check out lib/processor.py.


On the other hand, SQLite has well-known limitations with regards to multiple readers and/or writers.  As in: don't do it (without thinking really hard about it).  So we're not.  Each table processor is single threaded (aside from the effectively transparent communication threads), and has a query-driven event loop.  All requests are processed in-order as they come in to the processor.  When idle, the query processor picks up any outstanding index creation or deletion requests.  The next set of changes will include configuration options to allow for the balancing of request processing vs. indexing vs. cleanup.

If you aren't in the mood for yet another embedded database, don't worry.  YogaTable is embedded-only just until my next post, whose update will include a RESTful interface for easy multi-language clients, runtime status information, and if I'm feeling frisky, a simple console for querying the database.


On a more personal note, I very much enjoyed building the processing and request routing pieces.  I'd wanted to build a request router for a Linda-like Tuple Space that I had implemented in the fall of 2005 (blog discussion: 1 2 3), but that project never made its way into the real world.  One of the reasons it never made it off my hard drive was the multiprocessing package, which I saw as a better way of handling the tasks that Linda-like systems were designed for.

Wednesday, August 18, 2010

Introducing YogaTable, the flexible NoSQL database

In my last post, I talked about building a new NoSQL database from scratch, and how I would describe the decisions I made during its construction, as well as post code along the way.  This is the first post of the series, introducing YogaTable.  Strangely enough, the toughest part of all of it was coming up with a name.  At the current revision, insertion, updating, and deletion are all supported, and there are tests.  Querying is not yet in the repository, though I have written the query builder and tests.  Those subjects will be in the next post.


The first design decision I have made with this system is that I will not be building a durable data store; that is, one that offers write-or-fail semantics given modern hardware, handling of system crashes, etc.  It's a hard problem.  Systems which take lazier approaches to durability (like mmap in the case of MongoDB) *will* fail at some point due to that laziness, even ignoring bad disks.  Fixing this issue with replication is great, but it requires more hardware, and in the case of sane underlying systems (real hardware, and/or good virtualization solutions like vmware, xen, etc.), doing it right the first time allows us to not have to band-aid over it with replication.  I will instead use an existing system that does its absolute best to be durable and transactional by design.

I have also decided that I will not be building a B-Tree indexing system from scratch.  Like the durable data store issue just mentioned, B-Trees are terribly difficult to get right.  So to reduce the amount of time it will take to go from no code to fully-working system, I'm just not going to write a new one.  I will instead use a system that already includes support for B-Tree indexes.

Those of you who know me will already guess that YogaTable will be written in Python, primarily because I know Python better than I know any other language, and secondarily because Python includes all of the tools necessary to build YogaTable out of the box on a modern machine.  In fact, Python includes two transactional data stores with B-Tree indexes as part of the standard library: bsddb.btree and sqlite3.

Because I am not without a sense of irony, and because I have had bsddb.btree bite me with data corruption in the past (when not using the transactional semantics, which the standard library leaves undocumented), I'm going to use Python's interface to SQLite 3, which has been included with the core Python distribution since 2.5.  As such, YogaTable will be "NoSQL" for the external interface to the database, rather than the actual underlying data store.  Also, because of this choice, it will be fairly straightforward for someone to replace SQLite with another SQL database to offer functionality that I might not have gotten around to adding quite yet (like read-only slaves via MySQL, etc).  Once YogaTable is feature complete (according to my earlier requirements and desires), it is my intent to use this in a small-medium scale production environment for my own projects, fixing bugs as they crop up (or as others report them).

I'm sure someone is going to think and/or point out how backwards it is to use a SQL database to store data for a NoSQL database.  And that's reasonable.  But this world is filled with unsolved difficult programming problems.  I could spend literally months rewriting either the durable data store or the B-Tree index.  On the other hand, I have the utmost respect for those who have already built said systems, and have happily used them in dozens of projects.  Two secrets to software and systems engineering: pick your battles, and stand on the shoulders of giants.  I'm going to do both, at the price of a little SQL.


Now that we have a place to store data and indexes, we've got to decide how it's going to be laid out.  For the sake of simplicity with regards to backup, etc., and because I know a little something about how SQLite 3 works, I'm going to lean towards simplicity.  Each table, information about its indexes, and the indexes themselves will be stored in a single sqlite3 database file.  This file will have three tables: the data table, the index metadata table, and a single table that holds the index data.  Each table will have various B-Tree indexes over them to ensure that our access patterns are fast.  The general 3-table layout and the db abstraction guts are available in lib/om.py.  Also, as a "new to me" bug, I learned that Python's sqlite3 library doesn't handle list/tuple arguments for 'IN' queries, requiring some str(tuple(data)) shenanigans.
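To make the layout concrete, here is a hedged approximation of that single-file, three-table scheme, including one way around the 'IN' limitation. The table and column names are illustrative guesses, not the actual schema from lib/om.py, and the placeholder-generation trick is an alternative to the str(tuple(data)) approach mentioned above.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE data (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,
        doc  TEXT NOT NULL                -- the JSON-encoded document
    );
    CREATE TABLE index_meta (
        index_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        columns   TEXT NOT NULL,          -- which document paths this index covers
        status    TEXT NOT NULL           -- e.g. 'building' / 'ready' / 'deleting'
    );
    CREATE TABLE index_data (
        index_id  INTEGER NOT NULL,       -- prefix: which index owns this row
        key       TEXT NOT NULL,          -- packed index key (see lib/pack.py)
        row_id    INTEGER NOT NULL REFERENCES data(id)
    );
    -- B-Tree index so lookups and range scans over one index stay fast.
    CREATE INDEX idx_lookup ON index_data(index_id, key);
''')

conn.executemany('INSERT INTO data (doc) VALUES (?)',
                 [('{"a": 1}',), ('{"a": 2}',), ('{"a": 3}',)])

# sqlite3 won't expand a list parameter for IN (...) on its own; generating
# one '?' placeholder per value keeps the values properly parameterized.
ids = [1, 3]
sql = 'SELECT doc FROM data WHERE id IN (%s)' % ','.join('?' * len(ids))
docs = [row[0] for row in conn.execute(sql, ids)]
```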

For the index table, we will be prefixing the index data with an index id, which will ensure that we are searching the proper subset of index rows when we search.  Now, we could have placed each index in a separate file and then used SQLite's 'ATTACH DATABASE' command, but then we would have had to query all index tables/databases whenever we performed an update/delete, and that's a pain in the ass (never mind a performance killer whenever we have more than one or two indexes).  We do miss being able to determine the size of each index individually, but that wasn't one of our requirements (though we could keep track of this manually).  For details about how we generate index rows, check out lib/pack.py.
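The idea of id-prefixed index rows can be sketched as below. To be clear, this is a toy stand-in, not what lib/pack.py actually does: the type tags, the fixed-width number encoding, and the separator byte are all invented here, and the real packing handles negative numbers, multiple values per column, and type ordering far more carefully.

```python
def index_rows(index_id, columns, doc):
    """Toy sketch: produce (index_id, packed_key) rows for one document."""
    parts = []
    for col in columns:
        value = doc.get(col)
        # Tag each value with a type byte so different types can't collide.
        if value is None:
            parts.append('\x00')
        elif isinstance(value, bool) or not isinstance(value, (int, float, str)):
            raise TypeError('unsupported value for %r: %r' % (col, value))
        elif isinstance(value, (int, float)):
            # Fixed width so string comparison matches numeric order
            # (non-negative values only; negatives would need more care).
            parts.append('n%020.6f' % value)
        else:
            parts.append('s' + value)
    # The index_id prefix is what confines a search to this index's rows.
    return [(index_id, '\x1f'.join(parts))]
```

Because every key carries its index id as the leading column of the (index_id, key) B-Tree, a range scan over one index never touches rows belonging to another, which is exactly the property the single shared index_data table needs.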

To offer greater flexibility, etc., we will not be storing any Python-centric data in our database.  Data will be stored as JSON, and will be automatically converted to/from JSON by sqlite3 adapters/converters.  Also, for the sake of everyone's sanity, we've included support for dates, datetimes, times, and decimals.  Documentation is forthcoming for future non-Python users to easily convert to/from these formats.  For the impatient, check out lib/adapt.py.
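One way to wire that up with sqlite3 adapters/converters is sketched below. The tagged-wrapper encoding (`__datetime__`, `__decimal__`, etc.) and the `Document` marker class are assumptions for illustration; YogaTable's actual format is whatever lib/adapt.py defines.

```python
import datetime
import decimal
import json
import sqlite3

def _default(obj):
    # Encode non-JSON types as tagged one-key objects (format illustrative).
    if isinstance(obj, datetime.datetime):
        return {'__datetime__': obj.isoformat()}
    if isinstance(obj, datetime.date):
        return {'__date__': obj.isoformat()}
    if isinstance(obj, datetime.time):
        return {'__time__': obj.isoformat()}
    if isinstance(obj, decimal.Decimal):
        return {'__decimal__': str(obj)}
    raise TypeError(repr(obj))

def _object_hook(d):
    # Reverse the tagging on the way out of the database.
    if '__datetime__' in d:
        return datetime.datetime.fromisoformat(d['__datetime__'])
    if '__date__' in d:
        return datetime.date.fromisoformat(d['__date__'])
    if '__time__' in d:
        return datetime.time.fromisoformat(d['__time__'])
    if '__decimal__' in d:
        return decimal.Decimal(d['__decimal__'])
    return d

class Document(dict):
    """Marker type so sqlite3 knows which values to adapt to JSON."""

sqlite3.register_adapter(Document, lambda doc: json.dumps(doc, default=_default))
sqlite3.register_converter(
    'DOCUMENT', lambda blob: json.loads(blob.decode('utf-8'),
                                        object_hook=_object_hook))

# Round trip: the DOCUMENT column type triggers the converter on read.
conn = sqlite3.connect(':memory:', detect_types=sqlite3.PARSE_DECLTYPES)
conn.execute('CREATE TABLE docs (doc DOCUMENT)')
conn.execute('INSERT INTO docs VALUES (?)',
             (Document({'price': decimal.Decimal('9.99')}),))
restored = conn.execute('SELECT doc FROM docs').fetchone()[0]
```

Since the stored value is plain JSON plus a documented tagging convention, a non-Python client only needs a JSON library and the tag rules to read and write the same data.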


Stay tuned for the next post where I will be discussing the sql query generator, and automatic background index construction.

ETA: updated links to reflect some moved files.

Monday, August 9, 2010

Building a New NoSQL Database from Scratch

Over the course of the last few months, I've been writing a new NoSQL database.  Most readers will laugh and say one of a few different things.  Maybe something like, "NoSQL is a fad", "there are already so many NoSQL options, creating a new one is pointless", or even "unless you bring something new to the table, your adventure is going nowhere".  To sum up my response: yes, it is offering something new.

Every existing SQL/NoSQL solution has tradeoffs.  From the MyISAM storage backend for MySQL (no ACID compliance), to Postgres 8.4 and prior's lack of replication tools (I know I am looking forward to Postgres 9), to secondary index setup/creation in Cassandra, to MongoDB's use of memory-mapped files without a transaction log, ... each database trades different advantages for other disadvantages.  For the existing nontrivial NoSQL solutions that I have examined (Cassandra, MongoDB, CouchDB, S3, Voldemort, SimpleDB, etc.), I have found that many of my required features are just not available.  In the rare case where my required features were available (Google's AppEngine datastore), it's not available where I need it (Amazon AWS, Slicehost, etc.).

What do I require?

  • scales reasonably on small systems (MongoDB fails here)
  • document/row store (S3 and Voldemort fail here)
  • secondary indexes (SimpleDB fails here)
  • the ability to add/remove indexes during runtime (Cassandra fails here)
  • no surprise queries (aka no table scans, MongoDB, Cassandra, and CouchDB fail here)
  • no corruption on system restart (AWS and other cloud hosting providers sometimes lose your instance for a bit, MongoDB fails here)

Additional points for:

  • no schema (MongoDB, CouchDB, SimpleDB, ...)
  • multiple values per column (MongoDB and AppEngine's list columns)
  • sub-documents (like MongoDB's {'sub-doc':{'attr1':'val1', ...},} )
  • the ability to perform queries against a secondary index while it is being built
  • 10k+ rows inserted/second on modest hardware (MongoDB, ...)
  • replication / sharding / clustering (MongoDB Replica Sets would be optimal)

Astute observers will note that MongoDB offers all of these things, except for scaling well on small systems, and 'no surprise queries'.  To clarify what I mean by both of these: when you are using a 32 bit platform (small Amazon AWS hosts, some smaller VPS machines on other hosts), you are limited to 2 gigs of data and indexes with MongoDB.  This is because they use memory-mapped files as an interface to on-disk files, which limits you to the architecture's address space.  As such, MongoDB really only makes sense for relatively small systems, or when you have a 64 bit system.  Further, if you forget to explain your queries in advance (like during ad-hoc queries), to ensure that you are using an index, you can end up scanning your multi-gig database for a simple count query.  On other databases (PostgreSQL and MySQL in the SQL world being the two I am most familiar with), a table scan will slow down other queries, but it won't destroy overall performance.  With MongoDB, that table scan will destroy write throughput for any other operation.  I have run into this myself.

To solve the table scan issue, we need to require an index for every query we perform (this is what Google's AppEngine datastore does).  This somewhat limits ad-hoc queries, but with the ability to add/remove secondary indexes during runtime, and with the ability to make queries against indexes while they are being built (and drop the index during its creation), we can start an index creation, and immediately perform our query.  If there isn't enough data, we can wait and perform the query again.  Yes, the index building operation does perform a table scan to create the index, but it can be designed to balance itself with incoming queries so that performance doesn't suffer greatly.
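The "require an index for every query" rule amounts to a planning step that either finds a covering index or refuses the query outright. A minimal sketch, with invented names and a simple prefix-match rule standing in for whatever the real planner does:

```python
def find_index(query_columns, indexes):
    """Return the id of an index that can serve the query, or refuse.

    `indexes` maps index_id -> ordered list of indexed column names; a
    query is servable here if its columns form a prefix of some index's
    columns (a deliberate simplification of real index selection).
    """
    cols = list(query_columns)
    for index_id, indexed in indexes.items():
        if indexed[:len(cols)] == cols:
            return index_id
    # No covering index: fail loudly instead of falling back to a table
    # scan -- this is the 'no surprise queries' guarantee.
    raise ValueError('no index covers %r; create one first' % (cols,))
```

An ad-hoc query against unindexed columns then fails immediately with an instruction to create the index, rather than silently scanning the table and starving every other operation.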


Over the course of the coming weeks, I will be documenting the process of building this new NoSQL database from the ground up.  Yes, I said weeks.  Part of the design and implementation will include the use of a readily-available, fast, durable data store (which bypasses many of the nasty implementation details), wrapped with an interface to offer all of the requirements and the first five pluses I list above.  I do have a design for replication/sharding/clustering, but it's the only part of the system I've not built yet.  We'll see how it goes.  I will be posting source on Github as I discuss the design of the system, offering code to fill in the nitty-gritty details of the higher-level design I describe.