> A relation should be identified by a natural key that reflects the entity’s essential, domain-defined identity — not by arbitrary or surrogate values.
I fairly strongly disagree with this. Database identifiers have to serve a lot of purposes, and a natural key almost certainly isn’t ideal for all of them. Off the top of my head, IDs can be used for:
- Joins, lookups, indexes. Here data type can matter regarding performance and resource use.
- Idempotency. Allowing a client to generate IDs can be a big help here (ie UUIDs)
- Sharing. You may want to share a URL to something that requires the key, but not expose domain data (a URL to a user’s profile image shouldn’t expose their national ID).
There is not one solution that handles all of these well. But using natural keys is one of the least good options.
Also, we all know that stakeholders will absolutely swear that there will never be two people with the same national ID. Oh, except unless someone died, then we may reuse their ID. Oh, and sometimes this remote territory has duplicate IDs with the mainland. Oh, and for people born during that revolution 50 years ago, we just kinda had to make stuff up for them.
So ideally I’d put a unique index on the national ID column. But realistically, there would be no unique constraint, and instead form validation plus a warning shown anytime someone opened a screen for a user with a non-unique ID.
Then maybe a BIGINT for the database ID, and a UUID4/7 for exposing to the world (sketched below).
EDIT: Actually, the article is proposing a new principle. And so perhaps this could indeed be a viable one. And my comment above would describe situations where it is valid to break the principle. But I also suspect that this is so rarely a good idea that it shouldn’t be the default choice.
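A minimal PostgreSQL-flavored sketch of the layered scheme above (all names illustrative; gen_random_uuid() assumes PostgreSQL 13+):

    -- Internal BIGINT key for joins, random UUID for the outside world,
    -- national_id deliberately NOT unique (real-world data has duplicates).
    CREATE TABLE person (
        id          BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        public_id   UUID NOT NULL DEFAULT gen_random_uuid() UNIQUE,
        national_id TEXT NOT NULL,
        full_name   TEXT NOT NULL
    );

    -- A non-unique index keeps the duplicate-ID warning query cheap.
    CREATE INDEX person_national_id_idx ON person (national_id);

    -- The query behind the "this national ID is not unique" warning.
    SELECT national_id FROM person GROUP BY national_id HAVING count(*) > 1;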
This isn’t a new principle, it was part of database design courses in the early 2000s at least. However from a couple of decades of bitter experience I say external keys should never be your primary keys. They’ll always change for some reason.
Yes you can create your tables with ON UPDATE CASCADE foreign keys, but are those really the only places the ID is used?
Sometimes your own data does present a natural key though, so if it’s fully within your control then it’s a decent idea to use.
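For reference, a sketch of the ON UPDATE CASCADE arrangement mentioned above (names invented). The cascade only reaches foreign keys inside this database; IDs captured in URLs, logs, or other systems stay stale, which is the parent comment's point:

    CREATE TABLE product (
        sku  TEXT PRIMARY KEY,   -- natural key as PK
        name TEXT NOT NULL
    );

    CREATE TABLE order_line (
        order_id BIGINT  NOT NULL,
        qty      INTEGER NOT NULL,
        sku      TEXT    NOT NULL REFERENCES product (sku) ON UPDATE CASCADE
    );

    -- Renaming the key cascades into order_line, but nowhere else.
    UPDATE product SET sku = 'NEW-123' WHERE sku = 'OLD-123';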
Even internal keys.
For example, suppose you have an information management system where the user can define their own logical fields or attributes. Naturally those names should uniquely identify a field. That makes them an easy candidate for a natural key. But I would still use a surrogate key.
I've worked in two systems that had this feature. In one the field name was a primary key, and in the other they used a surrogate key and a separate uniqueness constraint. In both a requirement to let users rename fields was added later. In the one that used a surrogate key, this was an easy and straightforward change.
In the one that used field name as a natural key, the team decided that implementing the feature wasn't feasible. The name change operation, with its cascade of values needing to be updated for all the foreign key relationships, would have been way too big a transaction to be doing on the fly, in a live database, during high usage times of day. So instead they added a separate "display name" field and updated all the queries to coalesce that with the "official" name. Which is just ugly, and also horribly confusing. Especially in oddball cases where users did something like swapping the names of two fields.
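A sketch of the surrogate-key variant of that design (invented names), where the rename that sank the other team is a single-row update:

    CREATE TABLE field_def (
        field_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        name     TEXT NOT NULL UNIQUE   -- natural key demoted to a constraint
    );

    CREATE TABLE field_value (
        record_id BIGINT NOT NULL,
        field_id  BIGINT NOT NULL REFERENCES field_def (field_id),
        value     TEXT
    );

    -- Rename touches exactly one row; field_value is untouched.
    UPDATE field_def SET name = 'customer_email' WHERE name = 'email';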
Advice used to be that "natural keys can only be primary if they don't change", but there was always an exception to the rule after a few months.
On the other hand, I remember a CEO wanting to be #1 in the database, so even non-natural primary keys can change... haha.
> but are those really the only places the ID is used?
I'm curious, where else would they be used?
Use both. For each table, I use an internal id that is auto generated, and an external uuid.
> Allowing a client to generate IDs can be a big help here (ie UUIDs)
Trusting the client to generate a high-quality ID has a long history of being a bad idea in practice. It requires the client to not be misconfigured, to not be hacked, to not be malicious, to not have hardware bugs, etc. A single server can generate hundreds of millions of IDs per second and provides a single point of monitoring and control.
In context I read that as database client, meaning the application server (which is a client to the database) providing the service to the user. Having that be able to generate IDs could be useful when needing to refer to the same entity, even if there is data that has to exist in some separate database for some reason.
That is indeed what I had in mind, although I did leave it intentionally vague as everyone can assess what’s best for their own situation.
I once attempted to implement a solution where ids are generated by UUIDv5 from a certain owner and the relationship of the new item to the owner; that way, users cannot generate arbitrary ids but can still predict their new ids ahead of time to ease optimistic behaviour.
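That derivation can be sketched with PostgreSQL's uuid-ossp extension (an assumption; the original may have generated these client-side). The ID is a deterministic function of the owner's namespace and the relationship, so a client can predict it but not pick it freely:

    CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

    -- Hypothetical values: the owner's namespace UUID plus a string
    -- describing the new item's relationship to that owner.
    SELECT uuid_generate_v5(
        '6ba7b810-9dad-11d1-80b4-00c04fd430c8'::uuid,
        'owner:42/attachment:7'
    );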
I think if you have (tenant, uuid) or (user, uuid) or whatever as your primary key, that's fine. If that tenant or user or whatever generates rubbish keys, that's their problem.
As someone who currently has to deal with a database that has NO surrogate keys, 100% agree with having surrogate keys always. Compound natural keys are just plain awful to deal with.
Although, it doesn't help that we have to do string manipulation to extract part of those natural keys from other fields that contain a bunch of data as one long string.
We can debate the first usage (small type for storage/cpu optimization) but the other 2 are actually good examples of natural keys!
Note the quotes:
> Every software project needs to represent the reality of the business he is embedded in...
> Such database needs to be designed to properly reflect reality. This can’t be automated, since the semantics of the situation need to be encoded in a way that can be processed by a computer.
Having the business need to share an ID, not give out internal information, make some data opaque, use an 'arbitrary' id to share as a 'textual pointer' to the actual data, etc. are valid natural keys.
What is wrong is adding a surrogate key just because, when in fact a natural key (to the domain, the business, the requirements) is the actual answer.
I discovered this fact when doing sync. The moment I got rid of UUIDs and similar and instead kept data-as-natural, things became MORE clear:
    ulid: 01H6GZ8YQJ  code: A1  name: A
    ulid: 01H6GZ8YQK  code: A1  name: A
    ulid: 01H6GZ8YQJL code: A1  name: A
    // All same, obscured by wrong key

vs

    code: A1  name: A
(ie: The actual data and what I need to model a sync are NOT the same. It's similar to how in 'git' your code is one thing that is clearly distinct from the internal git structures that encode the repo.)
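The effect is easy to reproduce (a throwaway sketch): rows that are distinct under the surrogate key collapse to one fact under the natural key.

    CREATE TABLE item (ulid TEXT PRIMARY KEY, code TEXT, name TEXT);

    INSERT INTO item VALUES
        ('01H6GZ8YQJ',  'A1', 'A'),
        ('01H6GZ8YQK',  'A1', 'A'),
        ('01H6GZ8YQJL', 'A1', 'A');

    SELECT count(*) FROM item;            -- 3 "different" rows
    SELECT DISTINCT code, name FROM item; -- one fact: (A1, A)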
I agree - in many cases a surrogate key is better than a natural key.
However, there is a problem when people start creating surrogate keys by default and stop thinking about the natural candidate key at all, never mind not putting on the appropriate constraints.
Especially in the case where the natural key values are within your control. Also, not using them for the key (and FK) can create the need for additional joins just to go from surrogate to natural key value.
The writing style was already obtuse and off-putting. Adding this 'new' principle (claimed as an innovation) leads me to believe it's not worth my time to read. Additionally, the principles are great starting points, but real-world uses may require deviation; any thought that absolutes should prevail indicates a lack of experience.
> Joins, lookups, indexes.
You want to control these values within your database engine, at least so that they are actually unique within the domain, and there is no real reason for them to be user-controlled anyway, as they are used referentially.
> Idempotency.
User-supplied tokens for idempotency are mostly useful within the broader context of an application sitting on top of the database; otherwise they become subject to the same requirements as internally generated ones, but without the control, which is a recipe for disaster.
> Sharing
Those are the same idempotency tokens from previous points with you as the supplier. In some cases you want to share them across prod/stage/dev environments, in some cases you may want to explicitly avoid duplicates, in some cases you don't care.
All these use cases are solved with mapping tables / classifiers.
Example: in an asset management system you need to expose an identifier of a user-registered computer that is built using components procured from different suppliers, with their own identification schemas, e.g. exposing a keyboard/mouse combo as one component with two subcomponents or as two (possibly linked) components.
This requires you to use all those listed identifier types in different parts of the system. You can bake those into database/schema design, or employ some normalization and use "native" identifiers and mapping tables.
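A sketch of the mapping-table approach (all names invented): one internal identifier per component, with each supplier's identification schema mapped onto it.

    CREATE TABLE component (
        component_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        kind         TEXT NOT NULL            -- e.g. 'keyboard', 'mouse'
    );

    -- Each supplier's own schema maps onto the native identifier.
    CREATE TABLE component_external_id (
        component_id BIGINT NOT NULL REFERENCES component (component_id),
        supplier     TEXT   NOT NULL,
        external_id  TEXT   NOT NULL,
        PRIMARY KEY (supplier, external_id)
    );

    -- Optional links, e.g. a keyboard/mouse combo as two linked components.
    CREATE TABLE component_link (
        parent_id BIGINT NOT NULL REFERENCES component (component_id),
        child_id  BIGINT NOT NULL REFERENCES component (component_id),
        PRIMARY KEY (parent_id, child_id)
    );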
I'm with you. I've used natural keys in the past, and they've always been a problem eventually.
On the other hand I've used surrogate keys for 20 years, and never encountered an issue that wasn't simple to resolve.
I get there are different camps here, and yes your context matters, but "I'm not really interested in why natural keys worked for you." They don't work for me. So arguments for natural keys are kinda meh.
I guess they work for some folk (shrug).
> will absolutely swear that there will never be two people with the same national ID...
I suddenly got flash-backs.
There are duplicate ISBN numbers for books, despite the system being carefully designed to avoid this.
There are ISBN numbers that have invalid checksums, but are valid ISBNs with the invalid number in the barcode and everything. Either the calculation was incorrectly done, or it was simply a mis-print.
The same book can have hundreds of ISBNs.
There is no sane way to determine if two such ISBNs are truly the same (page numbers and everything), or a reprint that has renumbered pages or even subtly different content with corrected typos, missing or added illustrations, etc...
Our federal government publishes a master database of "job id" numbers for each profession one could have. This is critical for legislation related to skilled migrants, collective workplace agreements, etc...
The states decided to add one digit to these numbers to further subdivide them. They did it differently, of course, and some didn't subdivide at all. Some of them have typos with "O" in place of "0" in a few places. Some states dropped the leading zeroes, and then added a suffix digit, which is fun.
On and on and on...
The real world is messy.
Any ID you don't generate yourself is fraught with risk. Even then there are issues such as what happens if the database is rolled back to a backup and then IDs are generated again for the missed data!
> The states decided to add one digit to these numbers to further subdivide them. They did it differently, of course, and some didn't subdivide at all. Some of them have typos with "O" in place of "0" in a few places. Some states dropped the leading zeroes, and then added a suffix digit, which is fun.
Any identifier that is comprised of digits but is not a number will have a hilariously large amount of mistakes and alterations like you describe.
In my own work I see this all the time with FIPS codes and parcel identifiers -- mostly because someone has round-tripped their data through Excel which will autocast the identifiers to numeric types.
Federal GEOIDs are particularly tough because the number of digits defines the GEOID type and there are valid types for 10, 11 and 12-digit numbers, so dropping a leading zero wreaks havoc on any automated processing.
There's a lot of ways to create the garbage in GIGO.
- Author makes an ontological statement
- Somehow someone feels it's appropriate to talk about mechanisms
Why have both a database ID and UUIDv7, versus just a UUIDv7?
Actually, it should be a database ID and an encrypted database ID, which doesn’t require storing a second ID. Even better, you can make that key unique per session so that users can’t share keys. For security reasons, it is a bad idea to leak private state, which UUIDv7 does.
A single AES encryption block is the same size as a UUID and cheap to compute.
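A sketch of the trick using PostgreSQL's pgcrypto (my choice for illustration; any AES library at the application boundary does the same). The BIGINT id is padded to one 16-byte block and encrypted into an opaque UUID-sized token; the key string is a placeholder:

    CREATE EXTENSION IF NOT EXISTS pgcrypto;

    WITH t AS (
        -- One AES-128 block over the zero-padded id; ECB is fine here
        -- because each 16-byte input is unique.
        SELECT encrypt(int8send(0::bigint) || int8send(42::bigint),
                       'my-16-byte-key!!'::bytea,
                       'aes-ecb/pad:none') AS opaque
    )
    -- Round-trip at the trust boundary: decrypt, take the low 8 bytes.
    SELECT opaque,
           ('x' || encode(substr(decrypt(opaque, 'my-16-byte-key!!'::bytea,
                                         'aes-ecb/pad:none'), 9, 8),
                          'hex'))::bit(64)::bigint AS id
    FROM t;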
So you've got a database ID, either serial or uuid? And you encrypt it when you send it to the user, maybe encrypted against their JWT or something to maintain stateless sessions?
And I guess if the user references specific resources by ID you'd have to translate back? Assuming the session has been maintained correctly, which I guess is a plus for stateful sessions. And it doesn't really matter if you get a collision on the encrypted output.
I spent enough time in the nibling comment talking about my doubts about that advice not to publicly share the identifying key. But I'll add one more point; it feels like a bunch of added complexity for marginal benefit.
Can you explain this a bit more or link to something? I don’t really understand. What’s encrypted? A guid? A monotonic integer ID? Is the encrypted ID only used for user facing stuff? How is it decrypted? What do you gain by this?
I don’t have a link. I’ve never seen a good writeup but the practice is really old.
It is literally encrypting whatever type you are using as a handle for records that you send the user, typically an integer, a UUID, or some slightly more complex composite key. The only restriction is that the key should not be longer than 128-bits so that you can encrypt it to a UUID-like value; in practice you can almost always guarantee this [0]. The encryption only happens at the trust boundary i.e. when sending it to a random user. That encrypted key is not stored or used at all internally, it just changes the value of the key the user sees.
Most obvious ways of generating handles in a database leak private state. This is routinely exploited almost to the point of being a meme. Encrypting the handles you show the user prevents that.
An advantage of this is that if you are systematically encrypting exported keys then you can attach sensitive metadata to those keys at runtime if you wish, which can be very convenient. You have 128 bits to work with and a unique serial only needs 64 at most. If you are paranoid, there are schemes that enable some degree of key authentication. And while well beyond the scope here, similar constructions work nicely for compact keys in federated data models.
At scale, the cost (storage, compute, etc) of all of this matters. Encryption of keys, if done intelligently, is atypically efficient on all accounts while providing explicit inspectable security guarantees.
[0] There are reasons you might want to expand the exported key to a 256-bit value, particularly in high-assurance type environments, but the advantage of 128-bits is that it is literally drop-in compatible with UUIDs almost everywhere.
> A single AES encryption block is the same size as a UUID and cheap to compute.
I didn’t realise this! The UUID spec mandates some values for specific digits, so I assume these would not be strictly valid UUIDs?
They would not be valid UUIDs, it is an opaque 128-bit value.
To be honest many companies are not using strictly standard conforming UUIDs everywhere, and UUID has become a synonym for opaque 128-bit identifier. Consumers of the value just need them to be unique. All standard UUID versions are sometimes prohibited in environments, except maybe UUIDv8, so this semantic ambiguity is helpful.
Technically, you could make it present as a standard UUIDv4 or UUIDv8 by setting a few bits, as long as you remember to add them back if you ever need to decrypt it. The entropy might be a bit off for a UUIDv4 if someone actually checks but you can guarantee the uniqueness.
Using AES to produce a UUID-like key is an old but useful trick. Both ARM and x86 do that encryption in hardware — it is cheaper to generate than the standardized UUID versions in most cases.
There is a security principle to not expose real identifiers to the outside world. It makes a crack in your system easier to open.
Idk, that reeks of security through obscurity to me. Your authorization/permission scheme has got to be fubar'd if you're relying on obscuring IDs to prevent someone from accessing a resource they shouldn't.
I'm sure I'm missing something obvious, but I'm not sure what other threat vectors there are, assuming it's in conjunction with other layers of security like a solid authorization/access control scheme.
I guess I'm not the biggest fan for a few reasons. I'd rather try and design a system such that it's secure even in the case of database leak/compromise or some other form of high transparency. I don't want to encourage a culture of getting used to obscurity and potentially depending on it instead of making sure that transparency can't be abused.
Also, it just feels wasteful. If you have two distinct IDs for a given resource, what are you building your foreign keys against? If you build against the hidden one and want to query the foreign table based on user input, you've got to either do a join or do a query to get the hidden key just to do another query. It just feels wasteful.
EDIT: apparently RFC 4122 even says not to assume UUIDs are hard to guess and shouldn't be used for security capabilities. So if it shouldn't be depended on for security, why add all this complexity to keep it secure?
The point you may be missing is that the key itself contains information about records in the database that you don’t have access to. There are many famous examples in literature (e.g. the German Tank Problem [0]) of useful attacks on known unique serials to infer hidden information without access. In a database context, the keys for the records you are authorized to see can tell you much about the parts of the database to which you have no access.
Strong randomization of record serials mitigates these attacks.
[0] https://en.wikipedia.org/wiki/German_tank_problem
I thought we were talking about UUIDv7, which is random enough to make this not a problem, right? The idea being to expose the uuid instead of the natural index.
It’s been downgraded as people use uuids more. That said, security through obscurity is an effective layer, particularly for slowing an attack. Slowing lateral movement is valuable.
Sorry, I'm a bit confused.
I'm in agreement that a natural key shouldn't be used as the primary key for a record.
I was responding to a comment about having a hidden "database ID" (which I interpreted as being a serial key?) and a public "UUID", and questioning the utility of that hidden database ID versus having a public UUIDv7 as the sole primary key, followed by questioning whether the utility of obscuring that primary UUIDv7 is worth the complexity of having to manage multiple artificial keys.
I agree that security through obscurity is a valuable layer in a multi-layered security position.
I guess I just don't think obscuring a UUID primary key is worth the added complexity in most systems.
I see it like adding a second front door to your house with a separate set of keys. Sure it'd be more secure, but it's an added pain and doesn't help if you don't have a sturdy doorframe, or smash-resistant windows.
> - Joins, lookups, indexes. Here data type can matter regarding performance and resource use.
I struggle to see a practical example.
> - Idempotency. Allowing a client to generate IDs can be a big help here (ie UUIDs)
Natural keys solve this.
> - Sharing. You may want to share a URL to something that requires the key, but not expose domain data (a URL to a user’s profile image shouldn’t expose their national ID).
Then you have another piece of data, which you relate to the natural key. Something like `exposed-name` (see the sketch after this reply).
> There is not one solution that handles all of these well
Natural keys solve these issues.
> Also, we all know that stakeholders will absolutely swear that there will never be two people with the same national ID. Oh, except unless someone died, then we may reuse their ID. Oh, and sometimes this remote territory has duplicate IDs with the mainland. Oh, and for people born during that revolution 50 years ago, we just kinda had to make stuff up for them.
If this happens, the designer had an error in his design, and should extend the design to accommodate the facts that escaped him at design time.
> Actually, the article is proposing a new principle
I'm putting it in words, but such knowledge has been common in the database community for ages, afaict.
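If I read the `exposed-name` idea right, a sketch looks like this (my interpretation, names invented): the natural key stays primary and only an opaque value ever leaves the system.

    CREATE TABLE person (
        national_id  TEXT PRIMARY KEY,    -- natural key, internal use only
        exposed_name UUID NOT NULL DEFAULT gen_random_uuid() UNIQUE,
        full_name    TEXT NOT NULL
    );

    -- URLs carry only the opaque value, never the national ID.
    SELECT full_name FROM person WHERE exposed_name = $1;  -- $1 from the URL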
> If this happens, the designer had an error in his design, and should extend the design to accommodate the facts that escaped him at design time.
Errors in the initial design should be assumed as the default. Wise software engineering should make change easy.
Constraints on natural keys are business logic, not laws of mathematical truth. They are subject to change and often violated in the real world. The database as record-keeping engine should only enforce constraints on artificial keys that maintain its integrity for tracking record identity, history, and references.
Your database may not be primarily a record-keeping engine, and your tradeoffs may be different.
The problem with natural keys is that nobody ever says, "My bad, I should have spelled my name the same way on this form as I did when I registered for the service, I promise not to do it again." Instead they say, "No matter how I spell it, you should have still known it was my name!"
Memory, and CPU, and even storage eventually, those would be the main practical examples of where having a key that's composed of something very small saves you space and thus, time.
Say we want to use a BIGINT key vs a VARCHAR(30)? Depending on your big key you might be talking about terabytes of additional data, just to store a key (1T rows @ BIGINT = 8 TB; 1T rows at 30 chars = 30 TB...). The data is also going to constantly shuffle (random inserts).
If you want to define the PK as the natural key with no separate column, then you get to do comparisons on all the natural key columns themselves, so instead of doing one 4- or 8-byte column comparison you get to do what? Five char comparisons?
Having worked extensively in ETL - when a developer tells me "there's no duplication about this real world process" what they mean is "there's no duplication in my mental model about this real world process"
> Memory, and CPU, and even storage eventually, those would be the main practical examples of where having a key that's composed of something very small saves you space and thus, time.
> Say we want to use a BIGINT key vs a VARCHAR(30)? Depending on your big key you might be talking about terabytes of additional data, just to store a key (1T rows @ BIGINT = 8 TB; 1T rows at 30 chars = 30 TB...). The data is also going to constantly shuffle (random inserts).
>> Joins, lookups, indexes
I don't see how what you brought up has anything to do with these.
But the main point is being missed here because of a physical vs logical conflation anyhow.
It would be helpful if the article used a natural key. Instead, it uses a primary key that is neither natural nor guaranteed, and is mutable. It makes assumptions that are not true, and that is one of the big dangers in using natural keys.
> If this happens, the designer had an error in his design, and should extend the design to accommodate the facts that escaped him at design time.
This again is a dangerous assumption. The danger here is the assumption that facts don't change. Facts do change. And facts that are true at design time are not necessarily true 1 day later, 1 year later, or 1 decade later, or more.
Again, when the example can't even use a natural key to present its idea of using natural keys, we have a problem.
This takes an overly simple view of what domains can look like. There are data models that necessarily violate these principles, and they aren’t all that rare.
Some examples:
> A relation should be identified by a natural key that reflects the entity’s essential, domain-defined identity
In some domains there is no natural key because the identity is literally an inference problem and relations are probabilistic. The objective of the data model is to aggregate enough records to discover and attribute natural keys with some level of confidence. A common class of data models with this property are entity resolution data models.
> All information in the database is represented explicitly and in exactly one way
Some data models have famously dual natures. Cartographic data models, for example, must be represented both as graph models (for routing and reachability relationships) and as geometric models (for spatial relationships). The “one true representation” has been a perennial argument in mapping for my entire life and both sides are demonstrably correct.
> Every base relation should be in its highest normal form (3, 5 or 6th normal form).
This is one of those things that sounds attractive because it assumes there are no ambiguities about domain boundaries or semantics, which doesn’t exist in practice. I bought into this idea too when I was a young and naive data modeler. Trying to tamp out these ambiguities adds an unbounded number of data model epicycles that add a lot of complexity and performance loss. At some point, strict normalization is not worth the cost in several aspects.
In almost all cases, it is far more important that the data model be efficient to work with than it be the abstract platonic ideal of a domain model. All of these principles have to work on real hardware in real operational environments with all of the messy limitations that implies.
> Trying to tamp out these ambiguities adds an unbounded number of data model epicycles that add a lot of complexity and performance loss
If you can talk about a business rule, you have a predicate. If you have a predicate, you can put it in 5th or 6th normal form, since all that means is that your relation expresses only and completely the predicate.
It seems that your definition of normalization is not the one that I am using above. What is it?
I strive to keep most things in at least the 3rd NF. Except stuff like addresses or names, those I intentionally don't push to be normalized, as they make up a single datum anyways, and normalizing creates more problems than it solves IME.
Addresses and names are nice, well-known examples for cross-domain data. It's not that attempts at normalizing these structural datums create problems per se, but rather there is no single true normalization, therefore wrong normalizations start causing problems.
Yeah, normalizing inherently introduces constraints on the data. For example, normalizing to first and last name implies no middle names/having a first and last in the first place.
Also, first and last names depend on the culture. Oh and people can have more than 1 name (as in distinct names, rather than multi part names. Some cultures use different names with different social circles).
Easier to just let them put their preferred name into a freeform text field.
> Except stuff like addresses or names, those I intentionally don't push to be normalized
IMO, it depends. While the idea of normalizing names is amusing, I probably wouldn't ever push for that either. For addresses, though, I would absolutely normalize everything beyond the street address. If nothing else, it enables you to quite cheaply (from a storage / memory perspective) add a lot of analytics on users that might be useful, but would be expensive to store with every record - things like city population, postal code median income, etc.
And in cases where you need to have your own company's address displayed per-user (disclosures for financial products, for example), it's absolutely a good idea. A full address, especially a business one that might have a Suite, Floor, etc. can easily be 60-80 chars, which over hundreds of millions of rows, adds up.
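A sketch of that split (invented names): the free-form street line stays on the user's row, while everything from the postal code up is normalized once and can carry analytics attributes cheaply.

    CREATE TABLE postal_code (
        postal_code_id  BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        code            TEXT NOT NULL,
        city            TEXT NOT NULL,
        city_population INTEGER,   -- analytics attributes stored once...
        median_income   NUMERIC    -- ...not repeated on every user row
    );

    CREATE TABLE user_address (
        user_id        BIGINT NOT NULL,
        street_address TEXT NOT NULL,   -- free-form: suite, floor, etc.
        postal_code_id BIGINT NOT NULL REFERENCES postal_code (postal_code_id)
    );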
> Principle of Essential Denotation (PED): A relation should be identified by a natural key that reflects the entity’s essential, domain-defined identity — not by arbitrary or surrogate values.
Is national_id really a natural key, or is it someone else's synthetic key? If so, should the owner of that database have opted for a natural key rather than a synthetic key?
I was going to comment on this. Natural keys sound like a good idea, and they should be enforced, maybe by using a unique constraint.
Natural keys are important. But the real world and the databases that represent them are messy. People’s identities get stolen. Data entry mistakes and integration between systems fail and leave the data in a schizophrenic state.
In my experience I find arguments about natural keys unproductive. I usually try to steer the conversation to the scenarios I mentioned above. Those who listen to me will have a combination of synthetic and natural keys. The first is used to represent system state. The second is used to represent business processes.
Natural keys are also all too often PII. A surrogate key that's just pure entropy is much safer to blast all over the place in logs and error messages and so on.
I usually encourage people to place all PII in a separate table. Only those who engage with customers e.g., verifying customers identities should have access. Furthermore images of customer identity cards are strictly forbidden. You can enter their passport number, name, address, birthdate etc. but copies of identity documents will make you a target of hackers and angry customers. The rep can ask the customer to show the document or in the worst case present a copy but the copy should immediately be deleted.
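One way to sketch that separation in PostgreSQL (role and table names invented); as the reply below notes, real-world access control rarely stays this clean:

    CREATE TABLE customer (
        customer_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
    );

    -- PII lives apart, keyed 1:1 to the customer row.
    CREATE TABLE customer_pii (
        customer_id BIGINT PRIMARY KEY REFERENCES customer (customer_id),
        full_name   TEXT,
        passport_no TEXT,
        birthdate   DATE
    );

    -- Only the identity-verification role may read it.
    REVOKE ALL ON customer_pii FROM PUBLIC;
    GRANT SELECT ON customer_pii TO identity_verification_role;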
"I usually encourage people to place all PII in a separate table. Only those who engage with customers e.g., verifying customers identities should have access"
This sounds nice but usually falls apart fast. "separate table" is neat but access at the user-level is generally not implemented at the DB layer, so which table it is in is unrelated. Also IME data access is usually "everyone up to the role that actually 'needs' it gets it". So e.g. if customer support has access to something, generally so does every single engineering team in the middle. Which is generally a shitton more people than the people who designed the access control mechanisms probably imagined as they bothered adding all this granularity.
Realistically, I think the threat model needs to be looked at from the other side: who's most likely to accidentally leak the data? Is it a support person having their laptop stolen? An engineer getting phished? An engineer accidentally sending Splunk PII in logs? How you address the actual threats your data faces often look almost completely unrelated to what you'd build if you sat down and said "ok big boss says we have to secure the data. what did he mean by this."
I do agree about not holding on to data you don't actually need tho.
Programming software that accepts a national ID in our woods is usually the trial by fire for junior developers when they find out that "unique national ID" isn't actually unique for many historic reasons :)
The "natural key" for a (natural) person is compound: full name and mother's full name, plus date, time and place of birth. Your birth certificate is your primary identification document.
However that still runs into problems of nondurability of the key in cultures that delay naming for a few years. To name one problem.
So yeah, use a big enough integer as your key, and have appropriate checks on groups of attributes like this.
However, if you are only interested in citizens, then a "natural" key is the citizen id issued by the government department responsible for doing that. (Citizen is a role played by a natural person so probably doesn't have too many attributes.) I still wouldn't use that as a primary key on the citizen table, though.
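In schema terms that advice comes out roughly like this (a sketch, columns invented): a surrogate primary key, with the compound "natural" identity and the citizen id demoted to constraints you can relax when reality disagrees.

    CREATE TABLE person (
        person_id        BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        full_name        TEXT NOT NULL,
        mother_full_name TEXT,
        born_at          TIMESTAMPTZ,
        birthplace       TEXT,
        citizen_id       TEXT UNIQUE   -- government-issued, still not the PK
    );

    -- The compound identity as a constraint, not a key; easy to drop
    -- when delayed naming or messy records break the assumption.
    ALTER TABLE person ADD CONSTRAINT person_natural_identity
        UNIQUE (full_name, mother_full_name, born_at, birthplace);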
> The "natural key" for a (natural) person is compound: full name and mother's full name, plus date, time and place of birth.
All it would take for a Bobby Tables[0] moment is a mother to have twins delivered by caesarean and given the same name... The same name thing has already been done[1].
Just a side note about the historical anecdote at the bottom of the post, which is related to Notre-Dame de Paris:
> 28 statues that portrayed the biblical Kings of Judah. [...] They didn’t portray French kings
That's wrong. Several texts from the revolution and before still exist that prove that these kings were identified as both kings of Judea and kings of France. For instance, David was Pépin le Bref. On one of the gates of the cathedral, the list of the French kings was engraved, starting with Clovis. That glorification of the monarchy, with parallels to the Bible, was common at the time: other French medieval cathedrals show the same analogies.
> A relation should be identified by a natural key that reflects the entity’s essential, domain-defined identity — not by arbitrary or surrogate values.
That's a very nice example of Theory vs Practice!
I agree that, in theory, the PK should be the natural unique key identifier of the domain. Your example with national_id is proper.
In practice:
- due to their varying underlying types, manipulating natural key identifiers through an application (passing values) is not always easy. UUID or INT are reliable and always work.
- sometimes we want a "poor man's" anonymization when passing and calculating data. As the PK is the identifier of the tuple, having personal data for a PK can be problematic.
- last but not least: using a PK that is subject to change is a terrible idea. An email address, for example: anybody can change it. When it's the PK, you are fucked having to propagate the change.
I have a joke that I often like to tell in this context:
The Devil captured a Physicist, an Engineer and a Mathematician. He gave each of them a big can of spam and locked them in empty rooms, saying "you will be here for 2 weeks - open the can and survive, or die of starvation". After 2 weeks the Devil opens the Physicist's cell. It's covered floor to ceiling in complex scribbles. One piece of wall is clean of etching, but a small dent is visible. The can of spam is opened and eaten clean; the Physicist sits in a corner, visibly annoyed. Next is the Engineer. The cell walls are covered in multiple dents and pieces of spam. The Engineer is bruised almost as much as the can, but it is ultimately opened and the Engineer is alive.
Finally the Devil opens the Mathematician's cell and finds him dead. Only "given a cylinder..." is etched on the wall.
---
The punchline isn't about engineering, but it always helped me to set the limits between software engineering and computer science.
Uh... I don't get it. Perhaps something was lost in translation?
It sounds like you're saying each person had to smash a container against the wall to release the food, and the Mathematician got side-tracked analyzing the container.
However:
1. Spam™ cans have a convenient pull-tab, you don't need to bash them against a wall.
2. Spam™ cans are typically not "cylinder" shaped... Unless the Mathematician was expecting a cylinder and didn't receive one, but in that case what did the Devil say that would have caused that expectation?
3. Why would the Engineer do worse at opening the can compared to the Physicist?
Seems like this article places too much emphasis on normalization, which is appropriate for many cases, but may be a huge cost and performance issue for requirements like reporting. You may probably need different kinds of schema and data storage structures for different requirements in the same application, which in turn may result in duplicated data, but results in acceptable trade offs.
" Every base relation should be in its highest normal form (3, 5 or 6th normal form). "
If I remember my database lessons correctly there is no strictly highest normal form. It progresses from 1NF to BCNF, but above that it is more choosing different trade-offs.
Even below it is always a trade-off with performance and that is why we most of the time aim for 3NF, and sometimes BCNF.
There are big disadvantages to choosing e.g. 5th normal form: any change in business requirements leads to a big rewrite and data conversion.
Never seen successful projects choosing beyond 3rd/BCNF.
Then don't use a relational database. Sorry for being rude, but joins are an integral part of the relational model. If you want a KV store, you should use a KV store; if you want a Document DB, you should use a Document DB.
Lots of people got hung up on the example, which I thought would be helpful to the discussion, but it certainly should not replace the main point, which is:
Relations, attributes and tuples are logical. A PK is a combination of one or more attributes representing a name that uniquely identifies tuples and, thus, is logical too[2], while performance is determined exclusively at the physical level, by implementation.
So generating SKs for performance reasons (see, for example, Natural versus Surrogate Keys: Performance and Usability, Performance of Surrogate Key vs Composite Keys) is logical-physical confusion (LPC)[3]. Performance can be considered in PK choice only when there is no logical reason for choosing one key over another.
I really don't like using natural keys as primary keys.
Natural keys sometimes need to change for unforeseen reasons, such as identity theft, and this is really tricky to manage if those keys are cascaded into many tables as foreign keys.
I’m going to have to disagree. What happens when national IDs are expanded to include letters? Or extra digits? I’ve lived through this mess a few times, and the migrations are painful and take a ton of time away from solving actual business problems. For example: US telephone numbers. The exchange (middle 3 digits) used to physically geolocate a consumer, because everybody had land lines. I worked on a project at an investment firm that was using it as a primary key to identify a user’s location. Holy moly, was that an unbelievably expensive and painful migration. Don’t do this.
Instead, follow this principle: never ever use an externally assigned identifier as a primary key in your database.
Instead: link tables with simple integers. Do not use these integers for any other purpose and assume they are unordered. Never let these integers leave the app. When exchanging information between systems or third parties, generate a suitable external identifier with an alphabetical prefix.
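A sketch of that principle (names invented; gen_random_bytes assumes pgcrypto): a plain integer links tables internally, and a prefixed token is minted only for exchange with third parties.

    CREATE EXTENSION IF NOT EXISTS pgcrypto;

    CREATE TABLE invoice (
        invoice_id  BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- internal only
        external_id TEXT NOT NULL UNIQUE
            DEFAULT ('inv_' || encode(gen_random_bytes(12), 'hex')),
        total       NUMERIC NOT NULL
    );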
No. Real life rarely has natural keys that are unique and do not change. For example the national id number in several countries can change in some circumstances...and that is already a synthetic key.
The Principle of Full Normalization is a useful guide that should not be religiously followed. Mostly, full normalization is good, but some objects like names and addresses resist normalization and thus can introduce significantly more complexity than they save by normalizing.
Where I live there’s no such thing as national ID. There’s a few documents that can be used as such depending on the purpose, and some of those change the number on every update!
Never trust something outside your system to be stable.
Many of the principles, and also the example provided for PED, cannot be mapped easily through an ORM library, and AFAIK Java JPA doesn't handle it either.
Why does it matter? I have seen many developers rely totally on the code to manage entities in the database, instead of relying on prepared statements and pure SQL queries. This obviously opens a door to poor optimisation, since these entity management libraries don't support certain SQL capabilities.
" Every base relation should be in its highest normal form (3, 5 or 6th normal form). "
If I remember my database lessons correctly there is no strictly highest normal form. It progresses from 1NF to BCNF, but above that it is more choosing different trade-offs.
Even below it is always a trade-off with performance and that is why we most of the time aim for 3NF, and sometimes BCNF.
> When a collection of such propositions is stored in a computer system, we call it a database.
Is it? A database is a place data is stored and retrieved. It is literally a data base. No more, no less. Whether it logically models a domain may be completely irrelevant to store and retrieve data.
I would argue that a database should not logically model a domain. Why? Because every database must store and retrieve data. Therefore data should be modeled to store and retrieve data as efficiently as possible. And the structure that most efficiently stores and retrieves data most likely does not logically model a domain.
It's ideological purity which has no place in the real world.
Fully normalised structures are slow, dangerous and place expensive and frustrating burdens on upstream users reducing the amount of value that can be extracted from the data.
I've been working around data for 20+ years and not once seen a dataset where tables could be blindly joined without some conditions. In these situations it can be better to just pre-join them so the business logic is captured.
You want to be pragmatic and have a healthy and constructive mix.
Since I "want to be pragmatic" I want my data not repeated, so that it cannot be inconsistent, and not NULL, to simplify logic. And of course any interesting join condition involves fields from different tables.
The example works if and only if there's one National ID per person.
That's not true for SSNs. It's not true in that it is false. My statement that it is false is, in point of fact, true, and therefore not up for debate.
> Most persons have only one SSN. In certain limited situations, SSA can assign you a new number. If you receive a new SSN, you should use the new number. However, your old and new number will remain linked in our records to ensure that your earnings are credited properly. This could affect your benefits.
Maybe there are countries where it is the case that nobody ever gets multiple National IDs. Maybe there are countries without fraud and where everyone can and will update their records when the government does. Maybe there is a veritable Utopia on Earth, a Cockaigne of validated data and reasonable deadlines.
The "natural ID" for people design reminds me of a story from a state department of education: They had two students, both named John Smith Jr. They were identical twins and attending the same class.
They had the same birth date, school, parents, phone number, street address, first name, last name, teachers, everything...
The story was that their dad was John Smith Sr in a long line of John Smiths going back a dozen generations. It was "a thing" for the family line, and there was no way he was going to break centuries of tradition just because he happened to have twins.
Note: In very junior grades the kids aren't expected to memorise and use a student ID because they haven't (officially) learned to read and write yet! (I didn't use one until University.)
Seems like a terrible idea, TBH - the reality is that values like "national ID" can be sensitive themselves. In the US, such a scheme would make even HTTP ACCESS LOGS into personally-identifiable-information requiring special handling.
Both truth and representation are very slippery, many-faceted concepts, encumbered with millennia of use and philosophy. Using them in this way is deceptive to the junior and useless to the senior.
> A relation should be identified by a natural key that reflects the entity’s essential, domain-defined identity — not by arbitrary or surrogate values.
I fairly strongly disagree with this. Database identifiers have to serve a lot of purposes, and natural key almost certainly isn’t ideal. Off the top my head, IDs can be used for:
- Joins, lookups, indexes. Here data type can matter regarding performance and resource use.
- Idempotency. Allowing a client to generate IDs can be a big help here (ie UUIDs)
- Sharing. You may want to share a URL to something that requires the key, but not expose domain data (a URL to a user’s profile image shouldn’t expose their national ID).
There is not one solution that handles all of these well. But using natural keys is one of the least good options.
Also, we all know that stakeholders will absolutely swear that there will never be two people with the same national ID. Oh, except unless someone died, then we may reuse their ID. Oh, and sometimes this remote territory has duplicate IDs with the mainland. Oh, and for people born during that revolution 50 years ago, we just kinda had to make stuff up for them.
So ideally I’d put a unique index on the national ID column. But realistically, it would be no unique constraint and instead form validation + a warning on anytime someone opened a screen for a user with a non-unique ID.
Then maybe a BIGINT for database ID, and a UUID4/7 for exposing to the world.
EDIT: Actually, the article is proposing a new principle. And so perhaps this could indeed be a viable one. And my comment above would describe situations where it is valid to break the principle. But I also suspect that this is so rarely a good idea that it shouldn’t be the default choice.
This isn’t a new principle, it was part of database design courses in the early 2000s at least. However from a couple of decades of bitter experience I say external keys should never be your primary keys. They’ll always change for some reason.
Yes you can create your tables with ON UPDATE CASCADE foreign keys, but are those really the only places the ID is used?
Sometimes your own data does present a natural key though, so if it’s fully within your control then it’s a decent idea to use.
Even internal keys.
For example, suppose you have an information management system where the user can define their own logical fields or attributes. Naturally those names should uniquely identify a field. That makes them an easy candidate for a natural key. But I would still use a surrogate key.
I've worked in two systems that had this feature. In one the field name was a primary key, and in the other they used a surrogate key and a separate uniqueness constraint. In both a requirement to let users rename fields was added later. In the one that used a surrogate key, this was an easy and straightforward change.
In the one that used field name as a natural key, the team decided that implementing the feature wasn't feasible. The name change operation, with its cascade of values needing to be updated for all the foreign key relationships, would have been way too big a transaction to be doing on the fly, in a live database, during high usage times of day. So instead they added a separate "display name" field and updated all the queries to coalesce that with the "official" name. Which is just ugly, and also horribly confusing. Especially in oddball cases where users did something like swapping the names of two fields.
Advice used to be that "natural keys can only be primary if they don't change" but there was always an exception to the rule after a few months.
On the other hand, I remember a CEO wanting to be #1 in the database, so even non-natural primary keys can change... haha.
> but are those really the only places the ID is used?
I'm curious, where else would they be used?
Use both. For each table, I use an internal id that is auto generated, and an external uuid.
> Allowing a client to generate IDs can be a big help here (ie UUIDs)
Trusting the client to generate a high-quality ID has a long history of being a bad idea in practice. It requires the client to not be misconfigured, to not be hacked, to not be malicious, to not have hardware bugs, etc. A single server can generate hundreds of millions of IDs per second and provides a single point of monitoring and control.
In context I read that as database client, meaning the application server (which is a client to the database) providing the service to the user. Having that be able to generate IDs could be useful when needing to refer to the same entity, even if there is data that has to exist in some separate database for some reason.
That is indeed what I had in mind, although I did leave it intentionally vague as everyone can asses what’s best for their own situation
I’ve once attempted to implement a solution where ids are generated by UUIDv5 from a certain owner and the relationship of the new item to the owner; that way, users cannot generate arbitrary ids but can still predict ahead of time their new ids to ease optimistic behaviour.
I think if you have (tenant, uuid) or (user, uuid) or whatever as your primary key, that's fine. If that tenant or user or whatever generates rubbish keys, that's their problem.
As someone who currently has to deal with a database that has NO surrogate keys, 100% agree with having surrogate keys always. Compound natural keys are just plain awful to deal with.
Altough, it doesn't help that we have to do string manipulation to extract part of those natural keys from other fields that contain a bunch of data as one long string.
We can debate the first usage (small type for storage/cpu optimization) but the other 2 are actually good examples of natural keys!
Note the quotes:
> Every software project needs to represent the reality of the business he is embedded in...
> Such database needs to be designed to properly reflect reality. This can’t be automated, since the semantics of the situation need to be encoded in a way that can be processed by a computer.
Having the business need to share an ID, not give internal information, make some data opaque, use a 'arbitary' id to share as 'textual pointer' to the actual data, etc are valid natural keys.
What is wrong is just add a surrogate key just because, when the fact is that a natural key (to the domain, the business, the requirements) is the actual answer.
I discover this fact when doing sync. The moment I get rid of UUIDs and similar and instead keep data-as-natural things become MORE clear:
ulid: 01H6GZ8YQJ code: A1 name: A
ulid: 01H6GZ8YQK code: A1 name: A
ulid: 01H6GZ8YQJL code: A1 name: A
// All same, obscured by wrong key
vs
code: A1 name: A
(ie: The actual data and what I need to model a sync is NOT the same. Is similar to how in 'git' your code is one thing that is clearly distinct to the internal git structures that encode the repo)
I agree - in many cases a surrogate key is better than a natural key.
However there is a problem when people start creating surrogate keys by default and stop thinking about the natural candidate key at all, nevermind not putting on the appropriate constraints.
Especially in the case where the natural key values are within your control. Also not using them for the key ( and FK ) can create the need for additional joins just to go from surrogate to natural key value.
The writing style was already obtuse and off-putting. Adding this 'new' principle (claimed as an innovation), leads me to believe it's not worth my time to read. Additionally the principles are great starting points but real-world uses may require deviation and any thought that absolutes should prevail indicates a lack of experience.
> Joins, lookups, indexes.
You both want to control these values within your database engine, at least so that they are actually unique within the domain, and there is no real reason for it to be user-controlled anyway, as they are used referentially.
> Idempotency.
User supplied tokens for idempotency are mostly useful within the broader context of application sitting on top of database, otherwise they become subject to the same requirements internally generated ones are, without control, which is a recipe for disaster.
> Sharing
Those are the same idempotency tokens from previous points with you as the supplier. In some cases you want to share them across prod/stage/dev environments, in some cases you may want to explicitly avoid duplicates, in some cases you don't care.
All these use cases are solved with mapping tables / classifiers.
Example: in an asset management system you need to expose and identifier of a user-registered computer, that is built using components procured from different suppliers, with their own identification schemas, e.g. exposing keyboard/mouse combo as one component with two subcomponents or two (possibly linked) components.
This requires you to use all those listed identifier types in different parts of the system. You can bake those in database/schema design, or employ some normalization and use "native" identifier and mapping tables.
I'm with you. I've used natural keys in the past, and they've always been a problem eventually.
On the other hand I've used surrogate keys for 20 years, and never encountered an issue that wasn't simple to resolve.
I get there are different camps here, and yes your context matters, but "I'm not really interested in why natural keys worked for you." They don't work for me. So arguments for natural keys are kinda meh.
I guess they work for some folk (shrug).
> will absolutely swear that there will never be two people with the same national ID...
I suddenly got flash-backs.
There are duplicate ISBN numbers for books, despite the system being carefully designed to avoid this.
There are ISBN numbers that have invalid checksums, but are valid ISBNs with the invalid number in the barcode and everything. Either the calculation was incorrectly done, or it was simply a mis-print.
The same book can have hundreds of ISBNs.
There is no sane way to determine if two such ISBNs are truly the same (page numbers and everything), or a reprint that has renumbered pages or even subtly different content with corrected typos, missing or added illustrations, etc...
Our federal government publishes a master database of "job id" numbers for each profession one could have. This is critical for legislation related to skilled migrants, collective workplace agreements, etc...
The states decided to add one digit to these numbers to further subdivide them. They did it differently, of course, and some didn't subdivide at all. Some of them have typos with "O" in place of "0" in a few places. Some states dropped the leading zeroes, and then added a suffix digit, which is fun.
On and on and on...
The real world is messy.
Any ID you don't generate yourself is fraught with risk. Even then there are issues such as what happens if the database is rolled back to a backup and then IDs are generated again for the missed data!
In my own work I see this all the time with FIPS codes and parcel identifiers -- mostly because someone has round-tripped their data through Excel which will autocast the identifiers to numeric types.
Federal GEOIDs are particularly tough because the number of digits defines the GEOID type and there are valid types for 10, 11 and 12-digit numbers, so dropping a leading zero wreaks havoc on any automated processing.
There's a lot of ways to create the garbage in GIGO.
- Author makes an ontological statement - Somehow someone feels its appropriate to talk about mechanisms
Why have both a database ID and UUIDv7, versus just a UUIDv7?
Actually, it should be a database ID and an encrypted database ID, which doesn’t require storing a second ID. Even better, you can make that key unique per session so that users can’t share keys. For security reasons, it is a bad idea to leak private state, which UUIDv7 does.
A single AES encryption block is the same size as a UUID and cheap to compute.
So you've got a database ID, either serial or uuid? And you encrypt it when you send it to the user, maybe encrypted against their JWT or something to maintain stateless sessions?
And I guess if the user references specific resources by ID you'd have to translate back? Assuming the session has been maintained correctly,which I guess is a plus for stateful sessions. And it doesn't really matter if you get a collision on the encrypted output.
I spent enough time in the nibling comment talking about my doubts about that advice not to publicly share the identifying key. But I'll add one more point; it feels like a bunch of added complexity for marginal benefit.
Can you explain this a bit more or link to something? I don’t really understand. What’s encrypted? A guid? A monotonic integer ID? Is the encrypted ID only used for user facing stuff? How is it decrypted? What do you gain by this?
I don’t have a link. I’ve never seen a good writeup but the practice is really old.
It is literally encrypting whatever type you are using as a handle for records that you send the user, typically an integer, a UUID, or some slightly more complex composite key. The only restriction is that the key should not be longer than 128-bits so that you can encrypt it to a UUID-like value; in practice you can almost always guarantee this [0]. The encryption only happens at the trust boundary i.e. when sending it to a random user. That encrypted key is not stored or used at all internally, it just changes the value of the key the user sees.
Most obvious ways of generating handles in a database leak private state. This is routinely exploited almost to the point of being a meme. Encrypting the handles you show the user prevents that.
An advantage of this is that if you are systematically encrypting exported keys then you can attach sensitive metadata to those keys at runtime if you wish, which can be very convenient. You have 128 bits to work with and a unique serial only needs 64 at most. If you are paranoid, there are schemes that enable some degree of key authentication. And while well beyond the scope here, similar constructions work nicely for compact keys in federate data models.
At scale, the cost (storage, compute, etc) of all of this matters. Encryption of keys, if done intelligently, is atypically efficient on all accounts while providing explicit inspectable security guarantees.
[0] There are reasons you might want to expand the exported key to a 256-bit value, particularly in high-assurance type environments, but the advantage of 128-bits is that it is literally drop-in compatible with UUIDs almost everywhere.
> A single AES encryption block is the same size as a UUID and cheap to compute.
I didn’t realise this! The UUID spec mandates some values for specific digits, so I assume this would not be strictly valid UUIDs?
They would not be valid UUIDs, it is an opaque 128-bit value.
To be honest, many companies are not using strictly standard-conforming UUIDs everywhere, and UUID has become a synonym for an opaque 128-bit identifier. Consumers of the value just need them to be unique. Every standard UUID version is prohibited in some environment or other, except maybe UUIDv8, so this semantic ambiguity is helpful.
Technically, you could make it present as a standard UUIDv4 or UUIDv8 by setting a few bits, as long as you remember to add them back if you ever need to decrypt it. The entropy might be a bit off for a UUIDv4 if someone actually checks but you can guarantee the uniqueness.
Using AES to produce a UUID-like key is an old but useful trick. Both ARM and x86 do that encryption in hardware — it is cheaper to generate than the standardized UUID versions in most cases.
There is a security principle to not expose real identifiers to the outside world. It makes a crack in your system easier to open.
Idk that reeks of security through obscurity to me. Your authorization/permission scheme has got to be fubar'd if you're relying on obscuring IDs to prevent someone from accessing a resource they shouldn't.
I'm sure I'm missing something obvious, but I'm not sure what other threat vectors there are, assuming it's in conjunction with other layers of security like a solid authorization/access control scheme.
I guess I'm not the biggest fan for a few reasons. I'd rather try and design a system such that it's secure even in the case of database leak/compromise or some other form of high transparency. I don't want to encourage a culture of getting used to obscurity and potentially depending on it instead of making sure that transparency can't be abused.
Also, it just feels wasteful. If you have two distinct IDs for a given resource, what are you building your foreign keys against? If you build them against the hidden one and want to query the foreign table based on user input, you've gotta either do a join or run a query to get the hidden key just to do another query. It just feels wasteful.
EDIT: apparently RFC 4122 even says not to assume UUIDs are hard to guess, so they shouldn't be used as security capabilities. So if it shouldn't be depended on for security, why add all this complexity to keep it secure?
The point you may be missing is that the key itself contains information about records in the database that you don’t have access to. There are many famous examples in literature (e.g. the German Tank Problem [0]) of useful attacks on known unique serials to infer hidden information without access. In a database context, the keys for the records you are authorized to see can tell you much about the parts of the database to which you have no access.
Strong randomization of record serials mitigates these attacks.
[0] https://en.wikipedia.org/wiki/German_tank_problem
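To make the attack concrete: if an attacker has seen k sequential serials and the largest is m, the classic minimum-variance unbiased estimate of the total count is m(1 + 1/k) - 1. A toy sketch:

```python
def estimate_total(serials: list[int]) -> float:
    """German-tank estimate of the total number of sequential IDs issued,
    given only the handful an attacker was able to observe."""
    m, k = max(serials), len(serials)
    return m * (1 + 1 / k) - 1

# Five order IDs scraped from the attacker's own account pages...
print(estimate_total([170, 655, 881, 1340, 2042]))  # ~2449 orders ever placed
```

Randomized (or encrypted, as above) serials give this estimator nothing to work with.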
I thought we were talking about UUIDv7, which is random enough to make this not a problem right?
The idea being to expose uuid instead of the natural index.
It’s been downgraded as people use UUIDs more.
That said, security through obscurity is an effective layer, particularly for slowing an attack.
Slowing lateral movement is valuable.
Sorry I'm a bit confused.
I'm in agreement that a natural key shouldn't be used as the primary key for a record.
I was responding to a comment about having a hidden "database ID" (which I interpreted as being a serial key?) and a public "UUID", and questioning the utility of that hidden database ID versus having a public UUIDv7 as the sole primary key, followed by questioning whether the utility of obscuring that primary UUIDv7 is worth the complexity of having to manage multiple artificial keys.
I agree that security through obscurity is a valuable layer in a multi-layered security position.
I guess I just don't think obscuring a UUID primary key is worth the added complexity in most systems.
I see it like adding a second front door to your house with a separate set of keys. Sure it'd be more secure, but it's an added pain and doesn't help if you don't have a sturdy doorframe, or smash-resistant windows.
The only thing UUIDv7 exposes is its creation time, which isn't tremendously useful or secret information.
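For reference, the exposure is concrete: per RFC 9562, the first 48 bits of a UUIDv7 are a Unix timestamp in milliseconds, so anyone holding the ID can read off the row's creation time. A small sketch (the example value is made up):

```python
import datetime
import uuid

def uuid7_created_at(u: uuid.UUID) -> datetime.datetime:
    """Read the creation time embedded in a UUIDv7's first 48 bits."""
    ms = int.from_bytes(u.bytes[:6], "big")
    return datetime.datetime.fromtimestamp(ms / 1000, tz=datetime.timezone.utc)

print(uuid7_created_at(uuid.UUID("01912345-89ab-7def-8123-456789abcdef")))
# -> an August 2024 timestamp for this value; whether that is sensitive
#    depends on what the row represents
```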
> - Joins, lookups, indexes. Here data type can matter regarding performance and resource use.
I struggle to see a practical example.
> - Idempotency. Allowing a client to generate IDs can be a big help here (ie UUIDs)
Natural keys solve this
> - Sharing. You may want to share a URL to something that requires the key, but not expose domain data (a URL to a user’s profile image shouldn’t expose their national ID).
Then you have another piece of data, which you relate to the natural key. Something like `exposed-name`.
> There is not one solution that handles all of these well
Natural keys solve these issues.
> Also, we all know that stakeholders will absolutely swear that there will never be two people with the same national ID. Oh, except unless someone died, then we may reuse their ID. Oh, and sometimes this remote territory has duplicate IDs with the mainland. Oh, and for people born during that revolution 50 years ago, we just kinda had to make stuff up for them.
If this happens, the designer had an error in his design, and should extend the design to accommodate the facts that escaped him at design time.
> Actually, the article is proposing a new principle
I'm putting it in words, but such knowledge has been common in the database community for ages, afaict.
> If this happens, the designer had an error in his design, and should extend the design to accommodate the facts that escaped him at design time.
Errors in the initial design should be assumed as the default. Wise software engineering should make change easy.
Constraints on natural keys are business logic, not laws of mathematical truth. They are subject to change and often violated in the real world. The database as record-keeping engine should only enforce constraints on artificial keys that maintain its integrity for tracking record identity, history, and references.
Your database may not be primarily a record-keeping engine, and your tradeoffs may be different.
> Errors in the initial design should be assumed as the default. Wise software engineering should make change easy.
I don't think I said that errors would not happen.
The problem with natural keys is that nobody ever says, "My bad, I should have spelled my name the same way on this form as I did when I registered for the service, I promise not to do it again." Instead they say, "No matter how I spell it, you should have still known it was my name!"
> I struggle to see a practical example.
Memory, and CPU, and even storage eventually, those would be the main practical examples of where having a key that's composed of something very small saves you space and thus, time.
Say we want to use a BIGINT key vs a VARCHAR(30)? Depending on your key you might be talking about terabytes of additional data just to store it (1T rows @ BIGINT = 8 TB; 1T rows @ 30 chars = 30 TB). The data is also going to constantly shuffle (random inserts).
If you want to define the PK as the natural key with no separate column, then you get to do comparisons on all the natural-key columns themselves; so instead of one 4- or 8-byte column comparison you get to do what, five char comparisons?
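A back-of-envelope sketch of that arithmetic (the sizes are the assumed raw key widths; real engines add page, pointer, and padding overhead on top):

```python
ROWS = 1_000_000_000_000              # the 1T rows from the example above

def tib(nbytes: int) -> float:
    return nbytes / 1024**4

print(f"BIGINT keys:      {tib(ROWS * 8):.1f} TiB")   # ~7.3 TiB
print(f"VARCHAR(30) keys: {tib(ROWS * 30):.1f} TiB")  # ~27.3 TiB

# Every secondary index and every foreign-key column in referencing tables
# repeats the key, so the gap multiplies across the schema.
```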
Having worked extensively in ETL - when a developer tells me "there's no duplication about this real world process" what they mean is "there's no duplication in my mental model about this real world process"
> Memory, and CPU, and even storage eventually, those would be the main practical examples of where having a key that's composed of something very small saves you space and thus, time.
> Say we want to use a BIGINT key vs a VARCHAR(30)? Depending on your key you might be talking about terabytes of additional data just to store it (1T rows @ BIGINT = 8 TB; 1T rows @ 30 chars = 30 TB). The data is also going to constantly shuffle (random inserts).
>> Joins, lookups, indexes
I don't see how what you brought up has anything to do with these.
But the main point is being missed here because of a physical vs logical conflation anyhow.
> Natural keys solve this
It would be helpful if the article used a natural key. Instead, it uses a primary key that is neither natural nor guaranteed, and is mutable. It makes assumptions that are not true, and that is one of the big dangers in using natural keys.
> If this happens, the designer had an error in his design, and should extend the design to accommodate the facts that escaped him at design time.
This again is a dangerous assumption. The danger here is the assumption that facts don't change. Facts do change. And facts that are true at design time are not necessarily true 1 day later, 1 year later, or 1 decade later, or more.
Again, when the example can't even use a natural key to present its idea of using natural keys, we have a problem.
This takes an overly simple view of what domains can look like. There are data models that necessarily violate these principles, and they aren’t all that rare.
Some examples:
> A relation should be identified by a natural key that reflects the entity’s essential, domain-defined identity
In some domains there is no natural key because the identity is literally an inference problem and relations are probabilistic. The objective of the data model is to aggregate enough records to discover and attribute natural keys with some level of confidence. A common class of data models with this property are entity resolution data models.
> All information in the database is represented explicitly and in exactly one way
Some data models have famously dual natures. Cartographic data models, for example, must be represented both as graph models (for routing and reachability relationships) and as geometric models (for spatial relationships). The “one true representation” has been a perennial argument in mapping for my entire life and both sides are demonstrably correct.
> Every base relation should be in its highest normal form (3, 5 or 6th normal form).
This is one of those things that sounds attractive only because it assumes away ambiguities about domain boundaries and semantics, and that kind of clarity doesn't exist in practice. I bought into this idea too when I was a young and naive data modeler. Trying to tamp out these ambiguities adds an unbounded number of data model epicycles that add a lot of complexity and performance loss. At some point, strict normalization is not worth the cost on several fronts.
In almost all cases, it is far more important that the data model be efficient to work with than it be the abstract platonic ideal of a domain model. All of these principles have to work on real hardware in real operational environments with all of the messy limitations that implies.
> Trying to tamp out these ambiguities adds an unbounded number of data model epicycles that add a lot of complexity and performance loss
If you can talk about a business rule, you have a predicate. If you have a predicate, you can put it in 5th or 6th normal form, since all that means is that your relation expresses only and completely the predicate.
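For readers who haven't chased the definitions: 6NF in practice means each relation carries one irreducible predicate, i.e. a key plus at most one dependent attribute. A hypothetical sketch (SQLite via Python) showing both the decomposition and the join cost the earlier comment complains about:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# One irreducible predicate per relation:
#   "employee E exists", "E is named N", "E earns S".
con.executescript("""
    CREATE TABLE employee        (emp_id INTEGER PRIMARY KEY);
    CREATE TABLE employee_name   (emp_id INTEGER PRIMARY KEY REFERENCES employee,
                                  name TEXT NOT NULL);
    CREATE TABLE employee_salary (emp_id INTEGER PRIMARY KEY REFERENCES employee,
                                  salary INTEGER NOT NULL);
    INSERT INTO employee VALUES (1);
    INSERT INTO employee_name VALUES (1, 'Alice');
    INSERT INTO employee_salary VALUES (1, 52000);
""")
# Reassembling the original row costs one join per attribute.
rows = con.execute("""
    SELECT e.emp_id, n.name, s.salary
    FROM employee e
    JOIN employee_name n   USING (emp_id)
    JOIN employee_salary s USING (emp_id)
""").fetchall()                        # [(1, 'Alice', 52000)]
```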
It seems that your definition of normalization is not the one that I am using above. What is it?
I strive to keep most things in at least the 3rd NF. Except stuff like addresses or names, those I intentionally don't push to be normalized, as they make up a single datum anyways, and normalizing creates more problems than it solves IME.
Addresses and names are nice, well-known examples of cross-domain data. It's not that attempts at normalizing these structured datums create problems per se; rather, there is no single true normalization, so wrong normalizations start causing problems.
Yeah, normalizing inherently introduces constraints on the data. For example, normalizing names into first and last implies there are no middle names, and that everyone has a first and a last name in the first place.
Also, first and last names depend on the culture. Oh, and people can have more than one name (as in distinct names rather than multi-part names; some cultures use different names with different social circles).
Easier to just let them put their preferred name into a freeform text field.
> Except stuff like addresses or names, those I intentionally don't push to be normalized
IMO, it depends. While the idea of normalizing names is amusing, I probably wouldn't ever push for that either. For addresses, though, I would absolutely normalize everything beyond the street address. If nothing else, it enables you to quite cheaply (from a storage / memory perspective) add a lot of analytics on users that might be useful, but would be expensive to store with every record - things like city population, postal code median income, etc.
And in cases where you need to have your own company's address displayed per-user (disclosures for financial products, for example), it's absolutely a good idea. A full address, especially a business one that might have a Suite, Floor, etc., can easily be 60-80 chars, which, over hundreds of millions of rows, adds up.
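A sketch of the shape being described, with hypothetical table and column names (SQLite via Python): the street line stays per-user, and everything coarser is stored once per locality along with the analytics columns:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Shared locality data: stored once, joined by every user row.
    CREATE TABLE postal_area (
        postal_code   TEXT PRIMARY KEY,
        city          TEXT NOT NULL,
        region        TEXT NOT NULL,
        population    INTEGER,       -- the cheap per-locality analytics
        median_income INTEGER
    );
    -- Only the truly per-user part lives with the user.
    CREATE TABLE user_address (
        user_id     INTEGER PRIMARY KEY,
        street_line TEXT NOT NULL,   -- e.g. '123 Main St, Suite 400'
        postal_code TEXT NOT NULL REFERENCES postal_area
    );
""")
```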
> I would absolutely normalize everything beyond the street address. If nothing else,
I'd personally rather store those datums in additional fields/tables, extracted from the address.
> Principle of Essential Denotation (PED): A relation should be identified by a natural key that reflects the entity’s essential, domain-defined identity — not by arbitrary or surrogate values.
Is national_id really a natural key, or is it someone else's synthetic key? If so, should the owner of that database have opted for a natural key rather than a synthetic key? More arguments for synthetic over natural keys: https://blog.ploeh.dk/2024/06/03/youll-regret-using-natural-...
I was going to comment on this. Natural keys sound like a good idea, and they should be enforced, maybe by using a unique constraint.
Natural keys are important. But the real world, and the databases that represent it, are messy. People’s identities get stolen. Data entry mistakes happen, integrations between systems fail, and the data is left in a contradictory state.
In my experience I find arguments about natural keys unproductive. I usually try to steer the conversation to the scenarios I mentioned above. Those who listen to me will have a combination of synthetic and natural keys. The first is used to represent system state. The second is used to represent business processes.
Natural keys are also all too often PII. A surrogate key that's just pure entropy is much safer to blast all over the place in logs and error messages and so on.
I usually encourage people to place all PII in a separate table. Only those who engage with customers, e.g. verifying customer identities, should have access. Furthermore, images of customer identity cards are strictly forbidden. You can enter their passport number, name, address, birthdate, etc., but copies of identity documents will make you a target of hackers and angry customers. The rep can ask the customer to show the document or, in the worst case, present a copy, but the copy should be deleted immediately.
"I usually encourage people to place all PII in a separate table. Only those who engage with customers e.g., verifying customers identities should have access"
This sounds nice but usually falls apart fast. "separate table" is neat but access at the user-level is generally not implemented at the DB layer, so which table it is in is unrelated. Also IME data access is usually "everyone up to the role that actually 'needs' it gets it". So e.g. if customer support has access to something, generally so does every single engineering team in the middle. Which is generally a shitton more people than the people who designed the access control mechanisms probably imagined as they bothered adding all this granularity.
Realistically, I think the threat model needs to be looked at from the other side: who's most likely to accidentally leak the data? Is it a support person having their laptop stolen? An engineer getting phished? An engineer accidentally sending Splunk PII in logs? How you address the actual threats your data faces often looks almost completely unrelated to what you'd build if you sat down and said "ok big boss says we have to secure the data. what did he mean by this."
I do agree about not holding on to data you don't actually need tho.
PII in a separate db. Encrypted like you would a credit card number.
BTW: email+password should be separated too. An early draft of GDPR specifically mentioned that, though the final version got less into the weeds.
I’m sure if you vibe code any of this, it will all be plaintext, lol.
Natural key for people: let's use your DNA map.
Programming software that accepts a national ID in our woods is usually the trial by fire for junior developers when they find out that "unique national ID" isn't actually unique for many historic reasons :)
The "natural key" for a (natural) person is compound: full name and mother's full name, plus date, time and place of birth. Your birth certificate is your primary identification document.
However that still runs into problems of nondurability of the key in cultures that delay naming for a few years. To name one problem.
So yeah, use a big enough integer as your key, and have appropriate checks on groups of attributes like this.
However, if you are only interested in citizens, then a "natural" key is the citizen id issued by the government department responsible for doing that. (Citizen is a role played by a natural person so probably doesn't have too many attributes.) I still wouldn't use that as a primary key on the citizen table, though.
I know someone who doesn’t know when she was born, nor who her mother is.
She doesn’t have a birth certificate.
She was born in a country that was enduring several years of brutal war.
I know another person whose national ID was changed. Systems that use national ID as primary key failed to accept this change.
That natural key isn’t guaranteed to be unique.
Depends on the universe of discourse adopted.
> I still wouldn't use that as a primary key on the citizen table, though.
Why not?
https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...
> The "natural key" for a (natural) person is compound: full name and mother's full name, plus date, time and place of birth.
All it would take for a Bobby Tables[0] moment is a mother to have twins delivered by caesarean and given the same name... The same name thing has already been done[1].
[0] https://xkcd.com/327/
[1] https://www.walesonline.co.uk/news/real-life/mum-gives-twin-...
Just a side note about the historical anecdote at the bottom of the post, which is related to Notre-Dame de Paris:
> 28 statues that portrayed the biblical Kings of Judah. [...] They didn’t portray French kings
That's wrong. Several texts from the revolution and before still exist that prove these kings were identified as both kings of Judah and kings of France. For instance, David was Pépin le Bref. On one of the gates of the cathedral, the list of French kings was engraved, starting with Clovis. That glorification of the monarchy, with parallels to the Bible, was common at the time: other French medieval cathedrals show the same analogies.
Thanks, I didn't know that. Can you link to one such text?
> A relation should be identified by a natural key that reflects the entity’s essential, domain-defined identity — not by arbitrary or surrogate values.
That's a very nice example of Theory vs. Practice!
I agree: in theory, the PK should be the natural unique identifier of the domain. Your example with national_id is apt.
In practice:
- due to their varying underlying types, passing natural-key identifiers through an application is not always easy. UUID or INT values are reliable and always work.
- sometimes we want a "poor man's" anonymization when passing and calculating data. As the PK is the identifier of the tuple, having personal data as the PK can be problematic.
- last but not least: using a PK that is subject to change is a terrible idea. An email address, for example, can change for anybody. When it's the PK, you're screwed when you have to make or propagate the change.
And in this eternal war of the gods, Practice always wins.
I have a joke in context that I often like to tell:
The Devil captured a Physicist, an Engineer, and a Mathematician. He gave each of them a big can of Spam and locked them in an empty room, saying: "You will be here for 2 weeks. Open the can and survive, or starve." After 2 weeks the Devil opens the Physicist's cell. It's covered floor to ceiling in complex scribbles. One patch of wall is clean of etchings, but a small dent is visible. The can of Spam is opened and eaten clean; the Physicist sits in a corner, visibly annoyed. Next is the Engineer. The cell walls are covered in dents and pieces of Spam. The Engineer is bruised almost as much as the can, but it is ultimately open and the Engineer is alive.
Finally the Devil opens the Mathematician's cell and finds him dead. Only "given the cylinder" is etched on the wall.
---
The punchline isn’t about engineering, but it has always helped me draw the line between software engineering and computer science.
Uh... I don't get it. Perhaps something was lost in translation?
It sounds like you're saying each person had to smash a container against the wall to release the food, and the Mathematician got side-tracked analyzing the container.
However:
1. Spam™ cans have a convenient pull-tab; you don't need to bash them against a wall.
2. Spam™ cans are typically not "cylinder" shaped... Unless the Mathematician was expecting a cylinder and didn't receive one, but in that case what did the Devil say that would have caused that expectation?
3. Why would the Engineer do worse at opening the can compared to the Physicist?
Seems like this article places too much emphasis on normalization, which is appropriate in many cases but may be a huge cost and performance issue for requirements like reporting. You will probably need different kinds of schemas and data storage structures for different requirements in the same application, which in turn may result in duplicated data, but with acceptable trade-offs.
" Every base relation should be in its highest normal form (3, 5 or 6th normal form). "
If I remember my database lessons correctly, there is no single highest normal form. It progresses from 1NF to BCNF, but above that it is more about choosing different trade-offs.
Even below that, it is always a trade-off with performance, which is why we aim for 3NF most of the time, and sometimes BCNF.
There are big disadvantages to choosing e.g. 5th normal form: any change in business requirements leads to a big rewrite and data conversion. I've never seen a successful project go beyond 3NF/BCNF.
That’s what I was taught as well. And even then I use it more as a rule of thumb
Putting aside performance implications, I get kinda irritated by having to do joins for basic queries all the time.
Then don't use a relational database. Sorry for being rude, but joins are an integral part of the relational model. If you want a KV store, you should use a KV store; if you want a Document DB, you should use a Document DB.
If you're doing reporting, create a reporting view that does all those joins once!
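Something like this, with made-up table names (SQLite via Python):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders    (order_id INTEGER PRIMARY KEY,
                            customer_id INTEGER, total INTEGER);
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);

    -- Write the joins once; report queries read from the view.
    CREATE VIEW order_report AS
    SELECT o.order_id, c.name AS customer_name, o.total
    FROM orders o
    JOIN customers c USING (customer_id);
""")
rows = con.execute("SELECT * FROM order_report").fetchall()
```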
"Databases are representations of reality"
The national ID example is funny. Let me give you a dose of reality concerning national IDs.
- I've seen cases of my country's national ID numbers containing duplicates due to human error.
- National IDs can change. In my country, children being adopted get a different ID after adoption.
- There exist people without a national ID. Or people that don't want you to know their national ID.
Lots of people got hung up on the example, which I thought would be helpful to the discussion but certainly should not replace the main point, which is:
Relations, attributes and tuples are logical. A PK is a combination of one or more attributes representing a name that uniquely identifies tuples and, thus, is logical too[2], while performance is determined exclusively at the physical level, by implementation.
So generating SKs for performance reasons (see, for example, Natural versus Surrogate Keys: Performance and Usability, Performance of Surrogate Key vs Composite Keys) is logical-physical confusion (LPC)[3]. Performance can be considered in PK choice only when there is no logical reason for choosing one key over another.
https://www.dbdebunk.com/2018/04/a-new-understanding-of-keys...
I really don't like using natural keys as primary keys.
Natural keys sometimes need to change for unforeseen reasons, such as identity theft, and this is really tricky to manage if those keys are cascaded into many tables as foreign keys.
Natural keys are often not unique either. Using the national ID example, there are millions of duplicate SSNs issued within USA. https://www.computerworld.com/article/1687803/not-so-unique....
So don't use natural keys as primary keys. Use surrogate keys for that, and store the natural key as an ordinary attribute, ideally with a unique constraint.
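In schema form, that recommendation might look like the following sketch (SQLite via Python; table and column names are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE person (
        person_id   INTEGER PRIMARY KEY,  -- surrogate key: FKs point here
        national_id TEXT UNIQUE,          -- natural key: constrained, never referenced
        name        TEXT NOT NULL
    )
""")
con.execute("INSERT INTO person (national_id, name) VALUES ('AB123456', 'Alice')")

# If a national_id must change (identity theft, reissue), it's a one-row
# UPDATE; nothing cascades, because no foreign key references national_id.
con.execute("UPDATE person SET national_id = 'CD789012' WHERE person_id = 1")
```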
I’m going to have to disagree. What happens when national IDs are expanded to include letters? Or extra digits? I’ve lived through this mess a few times, and the migrations are painful and take a ton of time away from solving actual business problems. For example: US telephone numbers. The exchange (middle 3 digits) used to physically geolocate a consumer, because everybody had landlines. I worked on a project at an investment firm that was using it as a primary key to identify a user’s location. Holy moly, was that an unbelievably expensive and painful migration. Don’t do this.
Instead, follow this principle: never ever use an externally assigned identifier as a primary key in your database.
Instead: link tables with simple integers. Do not use these integers for any other purpose, and assume they are unordered. Never let these integers leave the app. When exchanging information between systems or third parties, generate a suitable external identifier with an alphabetical prefix.
No. Real life rarely has natural keys that are unique and do not change. For example, the national ID number in several countries can change in some circumstances... and that is already a synthetic key.
Ohhhhh absolutely not, thank you. I want my IDs to have absolutely no meaning whatsoever.
All of these "rules" will change once they hit the reality of utilization. It reminds me of this:
"We need to normalize the database for better performance."
"We need to denormalize the database for better performance."
The Principle of Full Normalization is a useful guide that should not be religiously followed. Mostly, full normalization is good, but some objects like names and addresses resist normalization and thus can introduce significantly more complexity than they save by being normalized.
The article sounds like it was written from a purely theoretical perspective, not from what actually happens and what is required in real life.
Bad luck if you don’t yet have (or know) your national ID.
National id is not something issued at birth in the country I live in. It’s something applied for at a certain age.
Where I live there’s no such thing as national ID. There’s a few documents that can be used as such depending on the purpose, and some of those change the number on every update!
Never trust something outside your system to be stable.
Many of the principles, and also the example provided for PED, cannot be mapped easily through an ORM library, and AFAIK Java JPA doesn't handle it either.
Why does it matter? I have seen many developers rely entirely on the code to manage entities in the database, instead of relying on prepared statements and pure SQL queries. This obviously opens the door to poor optimisation, since these entity-management libraries don't support certain SQL capabilities.
That’s a non-argument. Just use a better ORM. Hibernate has been able to do that for about 20 years.
That said, I’m not a fan of natural keys as primary keys. Especially composite keys. This just takes everybody back to the 80s/early 90s.
It only makes sense when there’s a huge storage benefit
Find me an ORM that can do:
* window functions
* SELECT FOR UPDATE
* row-level security
Prepared statements and query builders are the better ORM.
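As a concrete instance of the first item in that list, here's a window function issued as a plain parameterized statement (SQLite via Python; the table and data are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE payments (id INTEGER PRIMARY KEY, user_id INTEGER, amount INTEGER);
    INSERT INTO payments (user_id, amount) VALUES (1, 10), (1, 30), (2, 20);
""")
# Latest payment per user via ROW_NUMBER(): awkward or impossible in many
# ORMs, one prepared statement in SQL. The ? placeholder is the prepared part.
rows = con.execute("""
    SELECT user_id, amount FROM (
        SELECT user_id, amount,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY id DESC) AS rn
        FROM payments
    ) WHERE rn <= ?
""", (1,)).fetchall()
print(rows)  # [(1, 30), (2, 20)] (row order may vary)
```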
" Every base relation should be in its highest normal form (3, 5 or 6th normal form). "
If I remember my database lessons correctly there is no strictly highest normal form. It progresses from 1NF to BCNF, but above that it is more choosing different trade-offs.
Even below it is always a trade-off with performance and that is why we most of the time aim for 3NF, and sometimes BCNF.
> When a collection of such propositions is stored in a computer system, we call it a database.
Is it? A database is a place data is stored and retrieved. It is literally a data base. No more, no less. Whether it logically models a domain may be completely irrelevant to store and retrieve data.
I would argue that a database should not logically model a domain. Why? Because every database must store and retrieve data. Therefore data should be modeled to store and retrieve data as efficiently as possible. And the structure that most efficiently stores and retrieves data most likely does not logically model a domain.
> Principle of Full Normalization (POFN): Every base relation should be in its highest normal form (3, 5 or 6th normal form)
No, it shouldn't.
I have implemented database schemas (without knowing database theory), and these principles are a revelation to me.
A question I have is: given a schema, are there automated verifiers for validating that it adheres to these principles? A schema "linter" of sorts.
There seem to be parallels to linear algebra here (orthogonal bases, decompositions, etc)
Please, feel free to elaborate so we can all learn
It's ideological purity which has no place in the real world.
Fully normalised structures are slow, dangerous, and place expensive and frustrating burdens on upstream users, reducing the amount of value that can be extracted from the data.
I've been working around data for 20+ years and not once seen a dataset where tables could be blindly joined without some conditions. In these situations it can be better to just pre-join them so the business logic is captured.
You want to be pragmatic and have a healthy and constructive mix.
Since I "want to be pragmatic" I want my data not repeated, so that it cannot be inconsistent, and not NULL, to simplify logic. And of course any interesting join condition involves fields from different tables.
The example works if and only if there's one National ID per person.
That's not true for SSNs. It's not true in that it is false. My statement that it is false is, in point of fact, true, and therefore not up for debate.
The government even acknowledges this:
https://www.ssa.gov/OP_Home/handbook/handbook.14/handbook-14...
> 1401.7 Can a person have more than one SSN?
> Most persons have only one SSN. In certain limited situations, SSA can assign you a new number. If you receive a new SSN, you should use the new number. However, your old and new number will remain linked in our records to ensure that your earnings are credited properly. This could affect your benefits.
Maybe there are countries where it is the case that nobody ever gets multiple National IDs. Maybe there are countries without fraud and where everyone can and will update their records when the government does. Maybe there is a veritable Utopia on Earth, a Cockaigne of validated data and reasonable deadlines.
The "natural ID" for people design reminds me of a story from a state department of education: They had two students, both named John Smith Jr. They were identical twins and attending the same class.
They had the same birth date, school, parents, phone number, street address, first name, last name, teachers, everything...
The story was that their dad was John Smith Sr in a long line of John Smiths going back a dozen generations. It was "a thing" for the family line, and there was no way he was going to break centuries of tradition just because he happened to have twins.
Note: In very junior grades the kids aren't expected to memorise and use a student ID because they haven't (officially) learned to read and write yet! (I didn't use one until University.)
Same first name? For twins?
I find it very difficult to believe this is not prevented by some law.
Me too, but it seems to be an oddly common occurrence: https://www.google.com/search?q=identical+twins+with+the+sam...
Seems like a terrible idea, TBH - the reality is that values like "national ID" can be sensitive themselves. In the US, such a scheme would make even HTTP ACCESS LOGS into personally-identifiable-information requiring special handling.
"Databases are representations of reality"
"tell the truth that is out there"
Both truth and representation are very slippery, many-faceted concepts, encumbered with millennia of use and philosophy. Using them in this way is deceptive to the junior and useless to the senior.