What is bitemporal data?
Bitemporal Data is data that has two time components associated with it. Data in the real world is immutable, bitemporal data is how we can capture this nature of data in our databases.
Understanding uni-temporal data
I am going to lay out some statements.
- The pyramid of Giza is the tallest man made structure.
- The Eiffel tower is the tallest man made structure.
- The Empire state building is the tallest man made structure.
- Burj Khalifa is the tallest man made structure.
Which of these statements are true?
If you’re like me then you would probably say the last one: “Burj Khalifa is the tallest man made structure”. But, all of these statements are true… at some point in time. It is critical to realize that when we ask the question “which is the tallest man made structure?” there’s an unsaid part to this question, “which is the tallest man made structure (at this point of time)?”.
In reality, facts or data have a time component to it, a time at which the fact was valid. We just don’t tend to model this in our DBs and instead store the most recent version of the fact.
Why don’t we model facts with time?
Models are ways to simplifying our understanding of certain things. The process of simplification involves making tradeoffs about what data points we ignore and the ones we keep in our model. Ignoring the time component simplifies CRUD operations, and if you never have to answer a question like “what is the tallest man made structure (in the year 1969)?” then ignoring the time component makes sense.
Also, this way of modelling data started at a time where memory was expensive, and overwriting data was the only feasible option.
Modelling the tallest structure data
I’m using postgresql for this. Let’s start with a simple table with two columns,
valid_from | structure_name
valid_from
will store the timestamp when the fact became valid.
You can implicitly assume that the fact is valid till the valid_from
date of the next fact (next row).
=>CREATE TABLE tallest_structure (
valid_from date,
structure_name text);
CREATE TABLE
=>INSERT INTO tallest_structure
VALUES ('January 1, 2570 BC', 'Pyramid of Giza'),
('March 15, 1889', 'Eiffel Tower'),
('April 11, 1931', 'Empire State Building'),
('October 1, 2009', 'Burj Khalifa');
INSERT 0 4
=> SELECT * FROM tallest_structure;
valid_from | structure_name
---------------+-----------------------
2570-01-01 BC | Pyramid of Giza
1889-03-15 | Eiffel Tower
1931-04-11 | Empire State Building
2009-10-01 | Burj Khalifa
Now we can ask ourselves, which was the tallest building in the year 1935
SELECT
structure_name
FROM
tallest_structure
WHERE
valid_from < '1935-01-01'
ORDER BY
valid_from DESC
LIMIT 1;
structure_name
-----------------------
Empire State Building
(1 row)
You can immediately see how complex the query became.
To be honest, the sql query does not actually express our intentions.
I’ll get to that in a bit.
Also, our representation isn’t accurate.
The reality is that there were several structures that were the tallest building “between” the empire state building and Burj Khalifa.
Our table claims that from 1931-04-11 to 2009-10-01 the empire state building was the tallest.
I’ll admit this is an unusual case, because for an application, the database is its source of truth.
But this example does point out that using only one column valid_from
does not allow us
to express discontinuous time periods.
Now, let’s use 3 columns to represent the same data.
valid_from | valid_to | structure_name
=> CREATE TABLE tallest_structure2 (
valid_from date,
valid_to date,
structure_name text
);
CREATE TABLE
=>INSERT INTO tallest_structure2
VALUES ('January 1, 2570 BC', 'January 1 1221', 'Pyramid of Giza'),
('March 15, 1889', 'May 27, 1930', 'Eiffel Tower'),
('April 11, 1931', 'January 1, 1954', 'Empire State Building'),
('October 1, 2009', 'infinity', 'Burj Khalifa');
INSERT 0 4
Let’s write a query to see the tallest building in the year 1935.
=> SELECT
structure_name
FROM
tallest_structure2
WHERE
'1935-01-01' BETWEEN valid_from AND valid_to;
structure_name
-----------------------
Empire State Building
(1 row)
This query expresses our intent better, we want the name of the structure where the date 1935-01-01 is in the time period of when a fact was true (valid).
Let’s see which building was tallest in the year 2000.
=> SELECT
structure_name
FROM
tallest_structure2
WHERE
'2000-01-01' BETWEEN valid_from AND valid_to;
structure_name
----------------
(0 rows)
That’s the correct answer because we haven’t stored that fact.
I’ll admit that this example was a tad contrived because you’re not likely to run into this scenario but it does show you the benefit of using a
valid_from
and valid_to
column for your facts.
What we have recorded in the tables are facts that we know are true. But what about facts that we thought were true at the time of recording it, but later turned out to be incorrect?
Understanding Bitemporal data
Imagine you’re a doctor. On the 13th of October 1999, a patient comes in with a critical illness, let’s call her Susan. This critical illness can only be cured by one drug named “curomysin” (not a real drug). But some people have a fatal allergy to curomysin and you have to test them for that. Being the meticulous doctor you are, you order an allergy test. The allergy test comes back negative (on 14th October). So you tell the nurse to administer curomysin. After sometime the nurse notices that there’s something wrong with Susan, you have a hunch, so you order the allergy test again. This time it comes back positive (on 15th October). Horrified, you run to the room to tell the nurse to stop the drip of curomysin. You reach the room, you hear the alarms going off like crazy, and then the dreaded flatline… Susan just died. The allergy test for curomysin has a one-in-a-million chance of giving a false negative.
Now, there’ll be a M&M session where you’ll be asked the question “was Susan allergic to curomysin?”. You’ll most likely say, on 14th October the test results came and we thought that Susan was not allergic, but then the test on 15th October said that Susan was allergic.
You see? the answer is not straightforward. There is one fact, but two “versions” of it. Both “versions” are important to record (for the sake of your job and sanity).
This is where the second time component comes in, a time period where we believed a fact was true. Let’s represent this in the database. We’ll use a structure like this
valid_from | valid_to | believed_from | believed_to | curomysin_test_result
Creat [sic] the table .
=>CREATE TABLE susan_allergy (
valid_from date,
valid_to date,
believed_from date,
believed_to date,
curomysin_test_result text
);
CREATE TABLE
Add the first version of the fact, when we thought Susan was not allergic.
=>INSERT INTO susan_allergy
VALUES ('1999-10-14', 'infinity', '1999-10-14', 'infinity', 'negative');
INSERT 0 1
=> SELECT * FROM susan_allergy;
valid_from | valid_to | believed_from | believed_to | curomysin_test_result
------------+----------+---------------+-------------+-----------------------
1999-10-14 | infinity | 1999-10-14 | infinity | negative
(1 row)
Here we say that the fact is valid from 14th October 1999 till forever, and we believe this from 14th October 1999 till forever.
Now let’s record the second version of the fact, when Susan’s test results came back positive. For this, we update a row and insert a new row.
=>UPDATE
susan_allergy
SET
believed_to = '1999-10-15';
UPDATE 1
=> SELECT * FROM susan_allergy;
valid_from | valid_to | believed_from | believed_to | curomysin_test_result
------------+----------+---------------+-------------+-----------------------
1999-10-14 | infinity | 1999-10-14 | 1999-10-15 | negative
(1 row)
=>INSERT INTO susan_allergy
VALUES ('1999-10-14', 'infinity', '1999-10-15', 'infinity', 'positive');
INSERT 0 1
=> SELECT * FROM susan_allergy;
valid_from | valid_to | believed_from | believed_to | curomysin_test_result
------------+----------+---------------+-------------+-----------------------
1999-10-14 | infinity | 1999-10-14 | 1999-10-15 | negative
1999-10-14 | infinity | 1999-10-15 | infinity | positive
(2 rows)
There are some confusing things in this table. Why do both facts have the same valid_from
and valid_to
?
doesn’t this mean that both are valid?
For me, the key to understanding this was to realize that bitemporal data is driven by what we “believe” or
the believed_from
and believed_to
columns.
- I read the first row like this “from 14th October to 15th October I believed that this fact (negative test result) is true from 14th October to forever.”
- The second row can be read like this “from 15th October 1999 to now I believed that this fact (positive test result) is true from 14th October to forever.”
The columns believed_from
and believed_to
tell us when we believed something
and the other columns tell us what we believed.
So, on 14th October 1999, what we believed was that Susan was NOT allergic, and the fact was valid forever. and then we only believed this from 14th October to 15th October. On 15th October 1999, what we believed was that Susan was allergic, and the fact was valid forever.
This is confusing at first so we’ll look at more examples.
Bitemporal Insert, Update, and Correct.
Joey is someone that likes to date a lot, and he has a… how do I put this nicely…a reptuation. One of his exes was hurt quite badly and is now obsessed with who Joey is currently dating. She decides to keep a record of this. Let’s call her Janice.
=>CREATE TABLE joeys_girlfriends (
valid_from date,
valid_to date,
believed_from date,
believed_to date,
girlfriend_name text
);
Insert
Since March 10th, Joey started dating Karen. Janice notes that.
INSERT INTO joeys_girlfriends
VALUES ('2022-03-10', 'infinity', '2022-03-10', 'infinity', 'Karen');
INSERT 0 1
=> SELECT * FROM joeys_girlfriends;
valid_from | valid_to | believed_from | believed_to | girlfriend_name
------------+----------+---------------+-------------+-----------------
2022-03-10 | infinity | 2022-03-10 | infinity | Karen
(1 row)
Update
Janice then gets a call saying that Joey is now dating Lisa.
Bitemporal update is tough, because we have to write 3 queries.
first, we update the believed_to
for the current fact.
Since going forward, we no longer believe that the fact will be valid forever.
=>UPDATE
joeys_girlfriends
SET
believed_to = '2022-03-15'
WHERE
girlfriend_name = 'Karen';
UPDATE 1
=> SELECT * FROM joeys_girlfriends;
valid_from | valid_to | believed_from | believed_to | girlfriend_name
------------+----------+---------------+-------------+-----------------
2022-03-10 | infinity | 2022-03-10 | 2022-03-15 | Karen
(1 row)
Second, we need to add a fact saying that karen was the valid girlfriend_name from 10th March to 15th March, and that we believe it from 15th March to forever.
=>INSERT INTO joeys_girlfriends
VALUES ('2022-03-10', '2022-03-15', '2022-03-15', 'infinity', 'Karen');
INSERT 0 1
=> SELECT * FROM joeys_girlfriends;
valid_from | valid_to | believed_from | believed_to | girlfriend_name
------------+------------+---------------+-------------+-----------------
2022-03-10 | infinity | 2022-03-10 | 2022-03-15 | Karen
2022-03-10 | 2022-03-15 | 2022-03-15 | infinity | Karen
(2 rows)
Finally, we need to add the new fact, that Joey is dating Lisa.
=>INSERT INTO joeys_girlfriends
VALUES ('2022-03-15', 'infinity', '2022-03-15', 'infinity', 'Lisa');
INSERT 0 1
=> SELECT * FROM joeys_girlfriends;
valid_from | valid_to | believed_from | believed_to | girlfriend_name
------------+------------+---------------+-------------+-----------------
2022-03-10 | infinity | 2022-03-10 | 2022-03-15 | Karen
2022-03-10 | 2022-03-15 | 2022-03-15 | infinity | Karen
2022-03-15 | infinity | 2022-03-15 | infinity | Lisa
(3 rows)
(Phew, that was a lot of work updating something.)
Let’s pause and try to make sense of this.
First row - we believe from 10th March to 15th March that Karen was Joey’s girlfriend from 10th March to forever. Second row - from 15th March to forever we believe that Karen was Joey’s girlfriend from 10th March to 15th March. Third row - from 15th March to forever we believe that Lisa was Joey’s girlfriend from 15th March to forever.
Remember, Bitemporal tables are driven by what we believe. We don’t update(change) in place what we believe, but instead up till when we believe it.
Correction
On 16th March, Janice comes to know that she was lied to and that Joey is actually dating Emma. So she makes a correction.
You’ve seen correction before, in the medical example. Try to think what it’ll be like before you read on.
To make a correction we update the believed_to
of the wrong fact.
=>UPDATE
joeys_girlfriends
SET
believed_to = '2022-03-16'
WHERE
girlfriend_name = 'Lisa';
UPDATE 1
=> SELECT * FROM joeys_girlfriends;
valid_from | valid_to | believed_from | believed_to | girlfriend_name
------------+------------+---------------+-------------+-----------------
2022-03-10 | infinity | 2022-03-10 | 2022-03-15 | Karen
2022-03-10 | 2022-03-15 | 2022-03-15 | infinity | Karen
2022-03-15 | infinity | 2022-03-15 | 2022-03-16 | Lisa
(3 rows)
Then we insert the correct fact, that has the same valid_from
of the incorrect fact.
=>INSERT INTO joeys_girlfriends
VALUES ('2022-03-15', 'infinity', '2022-03-16', 'infinity', 'Emma');
INSERT 0 1
=> select * from joeys_girlfriends;
valid_from | valid_to | believed_from | believed_to | girlfriend_name
------------+------------+---------------+-------------+-----------------
2022-03-10 | infinity | 2022-03-10 | 2022-03-15 | Karen
2022-03-10 | 2022-03-15 | 2022-03-15 | infinity | Karen
2022-03-15 | infinity | 2022-03-15 | 2022-03-16 | Lisa
2022-03-15 | infinity | 2022-03-16 | infinity | Emma
(4 rows)
Basically, there are two types of Bitemporal updates, one is an update, where the fact changes from a correct value to another correct value, and the other is a correction to a fact.
How do we query this data?
Finding Joey’s current girlfriend.
SELECT
girlfriend_name
FROM
joeys_girlfriends
WHERE
valid_from <= '2022-03-20'
AND valid_to >= '2022-03-20'
AND believed_from <= '2022-03-20'
AND believed_to >= '2022-03-20';
girlfriend_name
-----------------
Emma
(1 row)
Bitemporal delete
What does it mean to delete a fact from our memory?
We forget things, but is it the same as deleting a fact?
I don’t think so.
I think the closest parallel I can think of is that we believe the fact is no longer valid, we are “closing” the valid_to
interval.
There are two types of delete operations in the bitemporal world: Bitemporal inactivation and Bitemporal delete.
Bitemporal Inactivation
Let’s say we have a table called people_that_exist
and stores data about all the people currently alive.
So when someone dies, we “delete” them from this table.
=>CREATE TABLE people_that_exist (
valid_from date,
valid_to date,
believed_from date,
believed_to date,
name text
);
CREATE TABLE
Let’s track a man named Shawn Foe (cousin of John Doe). He was born in 1920.
=> SELECT * FROM people_that_exist ;
valid_from | valid_to | believed_from | believed_to | name
------------+----------+---------------+-------------+-----------
1920-01-01 | infinity | 1920-01-01 | infinity | Shawn Foe
(1 row)
Shawn died a peaceful death in the year 2022, so we delete him.
There are two steps involved in this.
First we update the believed_to
of the first row.
=>UPDATE
people_that_exist
SET
believed_to = '2022-01-01'
WHERE
name = 'Shawn Foe';
UPDATE 1
Then we insert a new row.
=>INSERT INTO people_that_exist
VALUES ('1920-01-01', '2022-01-01', '2022-01-01', 'infinity', 'Shawn Foe');
=> SELECT * FROM people_that_exist ;
valid_from | valid_to | believed_from | believed_to | name
------------+------------+---------------+-------------+-----------
1920-01-01 | infinity | 1920-01-01 | 2022-01-01 | Shawn Foe
1920-01-01 | 2022-01-01 | 2022-01-01 | infinity | Shawn Foe
(2 rows)
If you query the table for all the people that you believe are alive as of 2025-01-01
Shawn Foe will still show up, but if you query for all people that you believe are alive as of 2025-01-01
and are actually alive (valid) then
Shawn Foe won’t show up.
Bitemporal Delete
In bitemporal deletion we close the believed_to
interval also.
So Shawn Foe won’t show up when we query all the people that you believe are alive as of 2025-01-01
.
=> SELECT * FROM people_that_exist ;
valid_from | valid_to | believed_from | believed_to | name
------------+------------+---------------+-------------+-----------
1920-01-01 | infinity | 1920-01-01 | 2022-01-01 | Shawn Foe
1920-01-01 | 2022-01-01 | 2022-01-01 | 2022-01-02 | Shawn Foe
(2 rows)
In this situation bitemporal delete makes more sense because when we query the names of people that we believe exist as of now
we don’t want to see Shawn Foe’s name.
Depending on the data you have you might have to chose which delete operation is more suitable to your use case.
Nomenclature
When talking about Bitemporal data, people don’t use the term believed_from
or believed_to
, I used them because it is more intuitive(at least for me).
valid_from
and valid_to
are called effective time period and believed_to
and believed_from
are called assertive time period.
Other terms for it are valid time and record time respectively.
If you’ve ever created a table with the columns created_at
and updated_at
these columns are the same as believed_to
and believed_from
or record time.
Primary keys
In non-bitemporal tables, Generally, an ID (uuid) is used as a Primary key for a row. This is fine, because you update the row in-place. In Bitemporal tables, we add new rows when updating an entity, so we need to modify the primary key for the table. The primary key for a bitemporal table includes the primary key for a normal table (like an ID), and it’s effective and assertive time interval.
Foreign Keys
Foreign Keys are tricky because you have to ensure that the effective and assertive period of a parent entity are “contained within” the effective and assertive period of the entity it references.
Closing Thoughts
Bitemporal data is pretty cool, and simplifies things like generating audit reports, but it does involve managing a lot of complexity. There are also other considerations you must make, like performance. For e.g each bitemporal update, updates one row and adds two rows.
There are databases that were designed with bitemporal data in mind that handle all this complexity for you;
like Datomic, and XTDB.
There are databases that were designed with temporal data in mind that handle all this complexity for you;
like Datomic(Unitemporal Database), and XTDB(Bitemporal Database).
If you’re already using Postgres, and don’t want to migrate to a new database,
there’s a library you can use called pg_bitemporal
the authors also gave a talk that I have heavily used as reference for this blogpost.
If you have suggestions, corrections or general feedback you can tweet @the_lazy_folder or email me.
EDIT: Thanks to @ghoseb for pointing out that Datomic is Unitemporal and not Bitemporal. Thanks to @stopachka for pointing out a dangling sentence. Thanks to @EskoLuontola for pointing out an inconsistency in a date.