When designing the model for any piece of software, the need to assign a unique identifier to objects is one of the first things to come up. There are many different strategies in which one could go about this and today we're looking at some considerations you might want to make before deciding which kind of IDs you want to use.
We'll draw a lot of examples from internet-accessible applications backed by relational databases since that is what I'm most familiar with, but a lot of these principles apply to other software architectures and deployment environments.
This is probably the single most important factor to consider and applies to all applications right from the start. Basically, if you are using IDs generated by a sequence (1, 2, 3, ...) you might be vulnerable to enumeration attacks. These attacks exploit the pattern in resource identifiers like URLs. For instance, if you have a URL route for a thank you page which follows the format
/orders/<int:id>/thank-you/ an attacker could easily try all possible values for the integer ID and potentially access sensitive information of your customers.
Of course, this is not an issue if your application does not expose IDs directly to users, but it's something that you should always keep in mind if your are using IDs which are vulnerable to this attack.
I've found this problem quite common while working with some clients, mainly because it is the default choice of ID recommended by many frameworks and very easy to implement when working with relational databases like MySQL and PostgreSQL which provide the
AUTO INCREMENT modifier for primary key fields.
The format in which you store your IDs and the format in which you expose them to users doesn't have to be the same. For instance, RFC 4122 UUIDs contain 16 bytes or 128 bits of information, but they are generally represented in base16 with some delimiters like this
There are two main considerations when choosing a representation format:
- It should be separator-free (no hyphens, slashes, underscores, etc.) so that it can be properly tokenized by search engines, works well with URLs and can be easily copied and pasted.
- If it's intended to be read by humans at all, it should be as short as possible and avoid ambiguous characters like
The standard representation of UUIDs is not ideal because it is unnecessarily long and includes separators. On the other hand, integer primary keys are relatively straightforward to encode but they have a minor drawback: they are variable length (the first 10 IDs you generate will be single digits but soon enough you'll start to have longer IDs varying numbers of digits).
There are some encoding formats which attempt to address these problems. The Wikipedia page on binary to text encoding has a good overview of available formats but I'd like to mention some of my favorite here:
- Base 62 is like base 64 but without the
_characters (or without
/if compared to the URL-unsafe encoding). Essentially, it eliminates delimiters to address the first point above. Efficient implementations are available for many programming languages.
- Base 58 goes a step further and eliminates some ambiguous characters like
0, O, land
Iin addition to the delimiters. It's used for Bitcoin wallet addresses (which are an example of an ID that might, in some circumstances, have to be read or typed by a human).
- Crockford's base 32 goes a step further and makes IDs case insensitive.
Aside from encoding length one might also want to consider how much space IDs take in a database. Traditional primary keys take 4 or 8 bytes while UUIDs take 16 bytes in database systems which properly support them (like PostgreSQL).
The extent to which this aspect applies is limited: logs, URLs and even some databases1 will not store the raw ID bytes but rather an encoded version. If that's the case for your database or you are storing logs for a long time you should focus on the length of the representation instead.
From here on we start to dive into more specific issues that start to arise when you work at scale. If IDs are generated for objects that are created over time, a nice to have property is that the values of IDs are somewhat increasing with time. This comes handy when you want to partition data by the time it was created, or when you need to sort them by time but can't afford to store the creation date separately for space considerations.
A solution to this is to embed the timestamp in an ID by combining it with some randomness. If the timestamp is encoded using big-endian and put before the random part, those IDs are sortable. This idea was made popular by Twitter's Snowflake IDs. They suffer from some encoding inefficiencies (they only use integers to be compatible with integer fields in databases) so some other formats like ULIDs have been proposed to improve them.
Use in distributed systems
When you need to generate hundreds or thousands of IDs per second and want to avoid waiting for a single counter or need to introduce randomness to prevent enumeration attacks, thinks can get a bit more complicated.
A lot of the IDs we've seen so far can be generated by your database system but that start's to become impractical when you have lot's of nodes generating IDs at the same time.
Thankfully there's an easy solution to this, just make sure the random part of your ID is large enough to basically guarantee there will be no collisions. That's what ULIDs and KSUIDs essentially do. If you require more space efficiency, you can introduce another field which will partition IDs by the instance that generated them. This is what some versions of the RFC UUID and Snowflake IDs do.
These are some of the most important characteristics that can help you determine which ID format specification is most suitable for your project. Some might not be important at all for your project at early stages like the ability to generate thousands of IDs per second while others apply from day one.
I've personally found that going with something like KSUID is a good go-to option. They check all of the boxes above and are available in a number of languages. The only downside I can think of is that they are perhaps too large, at 160 bits which encode into 27 ASCII characters in base 62.
At least some versions of MySQL store the UUID representation instead of the raw bytes, which means they used 36-bytes per ID instead of the required 16-bytes. ↩