Tidbits | May 22, 2018

Today I Learned – Caching UUIDs for the win!

by Flavio Curella

More and more often, we see schema designs that use UUIDs as primary keys. That's a valid choice if you're concerned about sharding and partitioning your database, but it has its own drawbacks, sometimes in unexpected places.

If you're working on a system big enough to need UUIDs, chances are that at some point you'll have to turn those UUIDs into strings, perhaps to pass them across different systems.

In order to serialize a UUID into a string, Python has to format the UUID's integer value into 32 hex digits and insert hyphens, as RFC 4122 requires. And in order to do that, it has to slice that hex string 5 times:

    def __str__(self):
        hex = '%032x' % self.int
        return '%s-%s-%s-%s-%s' % (
            hex[:8], hex[8:12], hex[12:16], hex[16:20], hex[20:])

This is the fastest way to do it, and it isn't usually a bottleneck per se, as the overhead is minimal.
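To see the cost for yourself, here's a quick micro-benchmark sketch; the absolute numbers are machine-dependent, but a single `str(uuid)` call typically lands in the microsecond range:

```python
import timeit
import uuid

u = uuid.uuid4()

# Time the canonical serialization in isolation. The exact figure varies
# by machine, but it shows the per-call cost is tiny on its own.
per_call = timeit.timeit(lambda: str(u), number=100_000) / 100_000
print(f"str(uuid) ~ {per_call * 1e6:.2f} µs per call")
```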

However, it's common for UUIDs to be used in scenarios with a lot of data and computation, and in some situations you might find yourself serializing a homogeneous set of UUIDs over and over. This might happen when you have a 'heap' of hotter model instances that are processed more often than the rest.

import random
import uuid


uuids = [uuid.uuid4() for _ in range(99)]
# random.choice draws uniformly, so each of the 99 UUIDs recurs often
hot_picks = [random.choice(uuids) for _ in range(9999)]
for picked in hot_picks:  # don't shadow the uuid module in the loop
    for _ in range(100):
        str(picked)

In this situation, the overhead adds up, and could become a bottleneck you'd never think about.

You could use the UUID's hex attribute instead of the serialized value, but that would break the RFC and, more pragmatically, it would make the data not really portable across systems or languages.
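For comparison, here's what the two representations look like. Python's own uuid.UUID constructor happens to accept both forms, but that's an implementation courtesy; parsers in other languages often insist on the hyphenated RFC 4122 layout:

```python
import uuid

u = uuid.uuid4()

print(str(u))  # hyphenated RFC 4122 form, 36 characters
print(u.hex)   # the same 32 hex digits with the hyphens dropped

# Round-tripping works either way in Python, but not every
# consumer of your data will be as forgiving.
assert uuid.UUID(u.hex) == uuid.UUID(str(u)) == u
```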

Your best option is going to be the lru_cache decorator from functools.

First, abstract the logic that's calling the serialization into a function that you can decorate:

def some_logic():
    for instance in instances:
        requests.post(url, {'uuid': str(instance.pk)})
from functools import lru_cache

@lru_cache(maxsize=None)
def post_instance(instance):
    requests.post(url, {'uuid': str(instance.pk)})


def some_logic():
    for instance in instances:
        post_instance(instance)

Note: You might be tempted to abstract the str call itself:

@lru_cache(maxsize=None)
def uuid_to_str(uuid):
    return str(uuid)


def some_logic():
    for instance in instances:
        requests.post(url, {'uuid': uuid_to_str(instance.pk)})

That's not really a good idea: you'll end up caching too much. The cache will fill up with entries that are rarely reused, causing the more common calls to incur cache misses.
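A small sketch of that failure mode (the function name and cache size here are illustrative): with a bounded cache, a stream of one-off UUIDs keeps evicting the hot entry, so even the value that repeats constantly never gets a hit:

```python
import uuid
from functools import lru_cache

@lru_cache(maxsize=4)
def uuid_to_str(u):
    return str(u)

hot = uuid.uuid4()
for _ in range(8):
    uuid_to_str(hot)
    # Enough one-off UUIDs to flush the whole 4-entry cache each round,
    # so the hot UUID is always evicted before it comes around again.
    for _ in range(4):
        uuid_to_str(uuid.uuid4())

print(uuid_to_str.cache_info())  # hits=0 despite the hot UUID repeating
```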

Next, add logging to get some insight. This will tell us whether our caching is actually effective, and give us some indication of how big the cache size should be:

import logging


logger = logging.getLogger(__name__)


def some_logic():
    for instance in instances:
        post_instance(instance)
    logger.debug(post_instance.cache_info())

Briefly run the code against production data, and you'll get something like this in your logs:

CacheInfo(hits=126, misses=8, maxsize=32, currsize=8)

If hits is really low, then you know lru_cache won't help you with this data. Your dataset of UUIDs is too heterogeneous to take advantage of caching.

currsize gives you an idea of how big of a cache you might need and what you should set lru_cache's maxsize argument to.
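Since cache_info() returns a named tuple, you can also compute the hit rate directly instead of eyeballing the log line. A minimal sketch, using a stand-in cached function:

```python
from functools import lru_cache

@lru_cache(maxsize=32)
def expensive(n):  # stand-in for the real cached call
    return n * n

for n in [1, 2, 1, 3, 1, 2]:
    expensive(n)

info = expensive.cache_info()
hit_rate = info.hits / (info.hits + info.misses)
print(f"hit rate: {hit_rate:.0%}, currsize: {info.currsize}")
# prints "hit rate: 50%, currsize: 3"
```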

Keep in mind that higher values will use more RAM, and we're not really interested in perfectly caching every single value that's used more than once. We just want to cut down on the most common ones most of the time.

As a rule of thumb, if currsize is smaller than 512, I would set maxsize to that value plus some room (I like to go to the next power of 2). Otherwise, I would cap maxsize at 128. In our example, I would use maxsize=16:

@lru_cache(maxsize=16)
def post_instance(instance):
    requests.post(url, {'uuid': str(instance.pk)})

