
IDs in databases: dangers of their automatic generation

Let's talk about the dangers of automatic ID generation in databases. In this article I will share my experiences and knowledge gained over the years working on multiple database-intensive projects. I have had the opportunity to evaluate different database strategies and learn about their advantages and disadvantages.

How do we identify the entities?

When I hear this question, my default answer tends to be an auto-incremental ID, which is the quickest and easiest option. To be honest, this often ends up being a mistake, since the right thing to do would be to stop and think about the following:

«It depends... What will this entity be used for? Will it have a bounded number of records (countries), or can they be theoretically unlimited (users)? Will it store sensitive information? Will I insert a record every two days or several per second?»

Once we answer these questions, we will know whether we can leave identification in the hands of our friend the auto-incremental ID, or whether it is worth spending time and resources on a tailor-made solution.

I need a customized solution, what are my options?

Are you sure? Let me first try to sell you on my beloved auto-incremental ID.

Auto-incremental IDs

Each time a new record is added, it is assigned a consecutive number as its ID (e.g., 1, 2, 3...). This number increases by 1 for each new record, which ensures that it is unique within the table. What's more, you don't have to do anything to implement it. Sounds good, doesn't it? Unfortunately, all that glitters is not gold...

Replicating the DB (having several instances) becomes much more complicated.
The previous and next IDs are easy to guess, which is a potential security hole.
Inserting many records simultaneously is a potential bottleneck, because each insert has to wait for its consecutive ID, and with multiple DB instances two entities may end up being inserted with the same ID.
The same ID can exist in several different tables, so the ID alone does not identify the entity.
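
To make these drawbacks concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (users, orders), the sample values and the helper functions are my own assumptions for illustration, and two separate in-memory databases stand in for two unsynchronised DB instances.

```python
import sqlite3

def new_instance():
    """Create an independent in-memory DB standing in for one DB instance."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)")
    db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY AUTOINCREMENT, item TEXT)")
    return db

def insert(db, table, column, value):
    """Insert one row and return the ID the database assigned to it."""
    # table and column are hard-coded by the calls below, never user input.
    cur = db.execute(f"INSERT INTO {table} ({column}) VALUES (?)", (value,))
    return cur.lastrowid

a = new_instance()
user_ids = [insert(a, "users", "name", n) for n in ("ana", "bob", "eve")]
order_ids = [insert(a, "orders", "item", i) for i in ("book", "lamp")]
print(user_ids)   # [1, 2, 3]
print(order_ids)  # [1, 2]

# Knowing one ID is enough to guess its neighbours.
known = user_ids[1]
guessed_neighbours = [known - 1, known + 1]          # [1, 3]

# The same number identifies rows in both tables.
shared_ids = sorted(set(user_ids) & set(order_ids))  # [1, 2]

# A second, unsynchronised instance hands out the same IDs again.
b = new_instance()
clash = insert(b, "users", "name", "zoe") in user_ids  # True
```

Real replication setups have ways to work around the clash (for example, giving each instance its own ID range), but that is exactly the extra complexity the first point refers to.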

UUID (Universally Unique Identifier)

Each new record is identified by a randomly generated 128-bit number (e.g., 550e8400-e29b-41d4-a716-446655440000), which virtually ensures that no two will be alike. Looks like we've hit the nail on the head, doesn't it?
This type of identifier is not perfect either...

They convey no information about the entity they represent. If several entities use UUIDs as IDs, or if we find an ID in a log, we will not know which entity it belongs to without searching the DB.
The range of possible values (2^128) is probably far beyond our needs.
The IDs themselves do not indicate any order, so the entity needs other fields to record it.
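
A short sketch with Python's standard uuid module illustrates the points above. The records list and its seq field are hypothetical, added only to show that the insertion order has to live in a separate field.

```python
import uuid

# Randomly generated (version 4) UUIDs, one per new record.
first = uuid.uuid4()
second = uuid.uuid4()
print(first)   # different on every run
print(second)

# The example ID from the text parses as a version 4 UUID.
example = uuid.UUID("550e8400-e29b-41d4-a716-446655440000")
print(example.version)  # 4

# Nothing in the value says whether it is a user, an order or anything else,
# and sorting by ID does not recover insertion order: we need an extra field.
records = [{"id": uuid.uuid4(), "seq": n} for n in range(5)]
by_id = sorted(records, key=lambda r: r["id"])    # effectively random order
by_seq = sorted(records, key=lambda r: r["seq"])  # insertion order, via "seq"
print([r["seq"] for r in by_id])
print([r["seq"] for r in by_seq])
```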

Custom IDs

We can also generate the IDs with our own algorithm, playing with letters and numbers according to our needs. For example, we have the entity user, and we decide that its identifier should start with a U followed by 10 random digits (U1234567890). Third time's the charm! Sorry, but not this time...

It is very likely that our algorithm is not as random as we think, so we will have to create mechanisms to check whether an ID already exists and generate another one if it is already in use.
We may have fallen short and underestimated our needs, so collisions will be much more common than we expect, or we will have to modify the algorithm and/or the ID format.
We may over-engineer our solution while trying to avoid the above problems and end up creating something much more complex than we need.
Depending on the design of the algorithm, the IDs themselves may not indicate any order, so the entity needs other fields to record it.
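
As a rough sketch of the U + 10 digits example, the following Python code generates an ID, retries when it is already taken, and estimates how likely a collision is. The function names, the max_attempts parameter and the in-memory set (standing in for a uniqueness check against the DB) are assumptions of mine, and the probability uses the usual birthday-problem approximation.

```python
import math
import secrets

def new_user_id(existing, max_attempts=10):
    """Return an unused ID of the form 'U' + 10 random digits."""
    for _ in range(max_attempts):
        candidate = "U" + "".join(secrets.choice("0123456789") for _ in range(10))
        if candidate not in existing:   # in practice: a unique index in the DB
            existing.add(candidate)
            return candidate
    raise RuntimeError("no free ID found; the ID space may be too small")

def collision_probability(n_ids, space=10**10):
    """Approximate chance that n_ids random IDs contain at least one duplicate."""
    return 1.0 - math.exp(-n_ids * (n_ids - 1) / (2 * space))

taken = set()
uid = new_user_id(taken)
print(uid)  # e.g. U followed by 10 digits, different on every run

# With 10^10 possible IDs, 100,000 users already give a sizeable chance
# that at least two of them drew the same ID.
print(round(collision_probability(100_000), 2))  # 0.39
```

Ten digits sound like plenty, yet the estimate shows how quickly "we may have fallen short" becomes real, which is why the existence check and the retry are not optional.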

So, what solution should we apply?

As with almost everything in life (I already dropped a small spoiler in the second paragraph), it really depends... There is no perfect solution. In my opinion, we must know the tools at our disposal along with their strengths and weaknesses.

If you need an MVP (with a deadline in two weeks), I would probably use auto-incremental IDs and deal with the technical debt later if it arises. If we are working on a more mature solution and foresee that the entity is going to see heavy use, I would probably use a custom ID. I also have to admit that I prefer not to have more than one entity represented by UUIDs: I like to have an indication of the entity I am working with as soon as I see the ID.

For me, the ideal solution is to use several different ID types. This shows that we have given each entity the care it needs and made sure it is correctly structured according to its requirements and use cases.