The CRUD Problem - Types on Data Validation

Continuing with the CRUD problem, the entire thing is data-oriented, right?

When you have important data, the most pressing requirement you usually have is to make sure it's correct. A process that checks this is called "data validation", and on the current systems it usually happen in several different stages, on several different places, with widely different capabilities, none of them complete.

Data validation comes on several different forms.

  • It comes as types systems, where you declare your requirements and some standard piece of software checks them through all of your code; either at runtime or beforehand.

  • It comes as custom verification you write on your data-handling code.

  • It comes as post-fact verification, where reports and alerts tell people of some problem.

  • It comes as your users calling support and yelling...

And yet, well except for that last format, all of that diversity hides the same kind of code running the same kind of checks.

Let's look at the database

If we are to centralize all of that stuff, we should look into the format that is most generic and declarative (luckily, those attributes tend to come together). That will make not only the solution the one capable of replacing the other systems, but also the easiest one to express the requirements right and to verify them for coherence.

That one is the type system.

And from those, the one most fit for data validation is the database's type system. So it's worth focusing on that one.

So, what does your database type system looks like?

Some part of it is composed of data structure declarations like this:

create table person (
	id bigserial primary key
	, name text not null
	, email text not null
	);

Those are the same kind of declarations you find in any language. They tend to be more diverse and concrete than other environments, this is partially because SQL has very bad support for generics, and partially because it improves your data validation. But outside of that, there isn't much of a difference.

Most databases will use the data structure for code verification, like any language with a static type system. What doesn't gain you much, because the SQL code is usually dynamic by itself. But that's just a facet of the rewrite-everything-3-times problem, and something we should expect not to exist.

The data structure will dictate how your data is stored and queried. Meanwhile, in some other layer, some other data structure will dictate how it's presented to the user; and in another layer, another structure tells the computer how to move the data around. That's the function of data structure, and why every language has some kind of it.

But some other parts of your database types are declared like those:

alter table person
	add constraint ak_person_email
	unique (email);
alter table person
	add constraint ch_person_name_length
	check (length(name) > 2 and length(name) < 1000);

Those are exclusively about your data content, not the structure, and something that most languages can't handle at all. Dependent type systems go into that level of detail, but lack the same generality.

Database constraints allow for an almost free verification of nearly any data feature at all. The cost is that this verification is dynamic; constraints don't help you verify your code, are checked for all data, and don't help determine the behavior of your system.

(But are they necessarily dynamic? Well, some are, and computing theory is obsessed with the idea that any problem that can't be fully solved just can't be solved at all. That idea is wrong, by the way.)

So, why do people use other systems?

The largest reason, by far, is the write-everything-3-times problem. Web software is organized in a way where database verification must never fail. Thus any verification there must be done in some other layer in a form that is at least as complete.

As one can expect, this lead to people not doing much verification there. But again, this is the kind of constraint that we must expect not to have on our systems, not one to learn how to cope with.

The other reasons are all about the limitations of this type system.

Database constraints are supposed to always be true, independently of time and the system's environment. This means they must be pure. Databases also usually impose that they must operate over a single table. And also, they must be fast enough to verify all of your data without slowing your software down.

But the nice thing is, if we allow rules that are not always true, none of those restrictions apply. And we can still use those rules to validate our data, verify our code, document it, and guide its execution. All we have to do is decide what to do with bad data.