What is this substack about? Here is the list of best ideas from the first 25 issues.
Some social networks have a concept of friendship which is mutual by definition. You can send somebody a “friend request”, and if they accept it, then you two become “friends”.
(There is a separate concept of followers, that could be unidirectional: you could be my follower, but I may or may not be your follower; this is trivially modelled.)
I’d like to discuss a mutual friendship today, because it poses a couple of interesting questions. First, let’s begin with the requirements. We have the following typical operations and queries:
Establish mutual friendship;
Delete mutual friendship;
Get the list of friends of a certain user;
Find out if two users are friends.
Our sample users are Alice (id=5) and Bob (id=3).
Two-row model
We can represent mutual friendship as a pair of directed friendships: Alice is a friend of Bob, Bob is a friend of Alice. We can design the following physical table:
CREATE TABLE friendship (
user1_id INTEGER NOT NULL,
user2_id INTEGER NOT NULL,
PRIMARY KEY (user1_id, user2_id)
);
To establish mutual friendship, we need to execute the following SQL statements:
INSERT INTO friendship (user1_id, user2_id)
VALUES ((3, 5), (5, 3));
With the resulting table that looks like this:
| user1_id | user2_id |
|---------------------|
| 3 | 5 |
| 5 | 3 |
-----------------------
Getting the list of Alice’s friends is straightforward:
SELECT user2_id
FROM friendship
WHERE user1_id = 5;
Finding out if Alice and Bob are friends is straightforward:
SELECT 1
FROM friendship
WHERE user1_id = 5 AND user2_id = 3;
Or, equivalently, using the tuple-based syntax:
SELECT 1
FROM friendship
WHERE (user1_id, user2_id) = (3, 5);
Deleting the friendship requires deleting both rows:
DELETE FROM friendship
WHERE (user1_id = 3 AND user2_id = 5) OR (user1_id = 5 AND user2_id = 3);
Let’s look again at the table contents where Alice and Bob’s friendship is registered:
| user1_id | user2_id |
|---------------------|
| 3 | 5 |
| 5 | 3 |
-----------------------
Well, well, well, but isn’t it a (shudder) data duplication? Our code guarantees that for every row there is a corresponding symmetric row; we may think that this second row is redundant. Let’s investigate the single-row design then.
Single-row model
The physical table looks similarly:
CREATE TABLE mutual_friendship (
user1_id INTEGER NOT NULL,
user2_id INTEGER NOT NULL,
PRIMARY KEY (user1_id, user2_id)
);
Establishing mutual friendship
We want to have a single row for every mutual friendship. So, would it be:
INSERT INTO mutual_friendship (user1_id, user2_id)
VALUES ((3, 5));
or
INSERT INTO mutual_friendship (user1_id, user2_id)
VALUES ((5, 3));
? Choosing a random order is not a good idea, because there is a chance that both possibilities would eventually get inserted, and what would that mean? And how will it affect our typical queries?
We can decide that we always insert a smaller ID into user1_id, and larger ID into user2_id. It’s even possible to enforce this on the database level by defining the CHECK constraint on our table:
CREATE TABLE mutual_friendship (
user1_id INTEGER NOT NULL,
user2_id INTEGER NOT NULL,
PRIMARY KEY (user1_id, user2_id),
CONSTRAINT user1_user2_ids CHECK (user1_id < user2_id)
);
(But of course, this is not necessary if you can guarantee this invariant in your code in some other way).
Getting the list of friends
If our invariants are guaranteed, we can get the list of Alice’s friends by the following query:
SELECT user2_id
FROM mutual_friendship
WHERE user1_id = 5
UNION ALL
SELECT user1_id
FROM mutual_friendship
WHERE user2_id = 5
Ugh, this is interesting.
The first version of this query was buggy, because I carelessly used the obvious-looking condition “WHERE user1_id = 5 OR user2_id = 5”. This condition is wrong.
By the way, from the practical point of view, to improve the performance of this query we should also create a second table index. The first one is provided by the primary key constraint. The second one is going to be “INDEX (user2_id, user1_id)”. This index will improve the performance of the following query too.
Are they friends?
To find out if Alice and Bob are friends, we will use the following query:
SELECT 1
FROM mutual_friendship
WHERE user1_id = 3 AND user2_id = 5
For this query to work, our code must make sure that the first ID is less than the second ID. So if we have some kind of function that accepts two arbitrary IDs, this function needs to sort both IDs and pass them in the right order.
If we don’t want or cannot sort the IDs in our code, we have to consider both options in the SQL:
SELECT 1
FROM mutual_friendship
WHERE (user1_id = $ID1 AND user2_id = $ID2) OR
(user1_id = $ID2 AND user2_id = $ID1)
(Here “$ID1” and “$ID2” are some kind of abstract variables.)
Deleting friendship record
Deleting the record of friendship is similar to the previous query:
DELETE
FROM mutual_friendship
WHERE user1_id = 3 AND user2_id = 5
This only works if we can pre-process a pair of IDs in our code and thus guarantee that the first ID is smaller than the second. Otherwise, we must explicitly write the OR-condition.
The difference between two models
Let’s summarize the peculiarities of both representations.
Two-row model:
Establishing friendship: a two-part INSERT command is needed;
Deleting friendship: a two-part DELETE command is needed;
Getting the list of friends: straightforward query;
Are they friends?: straightforward query;
Storage requirements: data size is twice as large;
Potential invariant violations: only one of two rows is present;
Single-row model:
Establishing friendship: pre-processing is needed before INSERT;
Deleting friendship: pre-processing is needed before DELETE;
Getting the list of friends: two-part query is needed;
Are they friends?: pre-processing (or two-part query) is needed;
Storage requirements: optimal data size, but an additional index would be needed;
Potential invariant violations: two symmetric rows; wrong order of IDs.
We see here that for both models we need to implement invariant-preserving functions to establish and delete friendships. Then we must ensure that the data could be changed only via those functions.
As for querying: the two-row model seems to be substantially simpler for ad-hoc manual and analytic queries, if you need those (and it’s a nice thing to have). The single-row model will quickly become tedious or would again require pre-processing.
Is there data duplication in the two-row model?
Preventing data duplication is one of the main things people talk about when they talk about database normalization. If you would discuss choosing between two representations with somebody, this question would most certainly come up, and we have to decide if the two-row model representation is acceptable.
We can illustrate the idea of data duplication by an example from Wikipedia. Suppose that we have a table of tournament winners with the following structure:
Tournament name;
Tournament year;
Winner name (let’s pretend it to be a primary key);
Winner date of birth;
This structure is not in the third normal form: if the same person won multiple tournaments, we would store their DOB multiple times. This is the source of data duplication and inconsistencies: what if we accidentally put two different dates of birth for the same person?
To normalize this schema we need to create a separate table for athletes, with the following structure:
Athlete name (a primary key);
Athlete date of birth;
And then remove the “Winner date of birth” from the table of tournament winners.
What happened here? Now, as the data duplication is eliminated, when a single piece of data needs to be changed, it will lead to changing a single table value. If we discover that the date of birth of one of the athletes is incorrect, we only need to change one value. In non-3NF form this would require changing an unlimited number of rows, as many as the number of times they won tournaments.
The keyword here seems to be “unlimited”, and it allows us to propose that there is no (bad) data duplication in the two-row model. When we add or remove a single friendship, we need to add or remove a constant number of rows, namely two. It’s not unlimited.
I’m not sure if that would be an acceptable argument for everyone, and I’m not even sure if the original question is valid. Both of those models frankly feel somehow weird, they go strongly against the usual effortlessness of relational database modeling. Maybe this is because of the additional invariants that are not handled directly by the table structure?
Frankly, I kind of expect that somebody will just present a much more elegant solution for this problem, and we all would learn something new.
We’re going to investigate some other similar cases in future posts. Our hope is to find the right way to describe this kind of situations.
P.S.: If you’re new to this substack, take a look at the list of best ideas from the first 25 issues.
Here is a trick I seem to have learn here! I wonder if this could be modeled with JSON. A nice property of JSON object is that keys are an unordered set, which fits our data modelisation here:
```
CREATE TABLE mutual_friendship (
friendship JSONB NOT NULL PRIMARY KEY
);
INSERT INTO mutual_friendship(friendship) VALUES ('{"3": {}, "5":{}}');
INSERT INTO mutual_friendship(friendship) VALUES ('{"4": {}, "3":{}}'); -- FAIL!
SELECT * FROM mutual_friendship WHERE friendship ? '3';
SELECT * FROM mutual_friendship WHERE friendship ?& '{"3", "5"}';
```
* Establishing friendship: Single request, no pre-processing
* Deleting friendship: Single request, no pre-processing
* Getting the list of friends: Single query with `?` operator;
* Are they friends?: Single query with `?&`;
* Storage requirements: Not sure how much space would be required for these simple JSON object. Extra GIN index required.
* Potential invariant violations: The id in the object cannot be foreign keys.
This could be resolved by having two extra column to act as foreign key container (And only used as such, this way the what column store what id doesn't matter) and a CHECK for `(user1_id, user2_id) = jsonb_object_keys(friendship)`