Migrations: Getting rid of links
What is this substack about? Here are the highlights from the first 25 issues.
Getting rid of links is very similar to getting rid of attributes (the topic of the previous post). This post is going to be complementary to the “Adding links” post, and it will rely somewhat on the material of both of those posts. The two main themes we’ll be discussing are naming and dependencies.
As we already said, each link depends on two anchors (or nouns, or entities). As with attributes, to remove the link we must a) find all places in the codebase where the link data is used, and remove them; and b) get rid of the link data. Each link has two corresponding mini-sentences, for example “User likes Post” and “Post is liked by User”. In relational setting the link could be implemented as either a separate two-column table, or as a column of one of the anchor tables.
Naming
The first problem that we have is to find all the places in the codebase that refer to the link. There are several common ways of naming those links, and we want to focus on some of the problems with those.
Often your codebase will have some sort of User class, and a Post class. The link “User likes Post” has N:M cardinality (many users can like many posts), so it’s going to be stored in a separate table. In this particular case we’re often more interested in a list of likes that the post has, so we’re going to have a sort of pseudo-attribute (accessor) of the Post class, for example Post.likes. This accessor may return an array of users who liked the post.
The table could be named “likes”, for example (that’s the name that I have chosen intuitively in the last project that had this feature). There are probably two fields in this table: “user_id” and “post_id”. So, we made our verb “to like” into a plural noun “likes”. There is probably another class, named similarly to the name of the table (Like?), and it manages both sides of the link: it can return a list of the posts that the user liked, a list of users who liked the post, and provides “do like” and “do unlike” methods.
One problem with this naming is that it’s not very greppable. The resulting code becomes very short (which is good), but this succinctness may be considered a linguistic optimization for readability (which could be just as premature as the programming optimization). While you work on a new feature, it’s kind of cool to express your thoughts naturally and fluently, but this may eventually become too dense for some typical needs.
Another problem is that the pseudo-attributes approach discourages batch operations (we mentioned it before in “Designing data access layer, pt. I”). Often you want to render many posts on the same page, with their own list of likes. For that you need a function that receives a list of post IDs and returns a list of likes for each of them. If your preferred access method is “my_post.likes”, however, then you would often feel that the first method is more awkward. If your preferred (or the only) method of accessing list of likes is batch-oriented (that is, it accepts a list of post IDs) then you’ll sometimes be able to extract more performance. Also, having a single access method would be more greppable, because the name of such function would be longer and more specific.
One last problem that we will mention here is that the pseudo-attributes approach creates an unnecessary code dependency. In the “Adding links” post we mentioned that links depend on both anchors. Taking into account an idea of concatenability, we come to the conclusion that class definitions of both anchors (Post and User) must be free of mentions of the links between them. The link (and associated methods) must be defined in a separate class that depends on anchor classes.
Getting rid of the link data
The confusion explained in the previous section is understandable if you remember that people often have a table-centric approach to data modeling. This is especially noticeable for 1:N links, for example “Videos have Comments”. Often the “comments” table would have a video_id column that refers to the videos.id column.
Suppose that we decide that having comments under our videos is not worth it. The dependency graph dictates that we need to get rid of the link first. But if we remove the link, the Comment anchor (and its attributes) becomes basically useless. It is very tempting to get rid of everything at once: a) “Videos have comments” link; b) “Text of the comment” and other attributes; c) Comment anchor itself. Basically, this means that we can just drop the “comments” table. This is understandable from the practical point of view, but for the purposes of our systematization we do not consider this process optimization.
So, to remove the 1:N link we rename the foreign key column. For example, comments.video_id gets renamed to comments.TO_DROP_video_id. If nothing breaks then we physically drop the column.
To remove the N:M link (represented as a separate two-column table), we rename the table itself. This is also an atomic operation in most relational tables. Again, if nothing breaks after some time, we physically drop the renamed table.
In the next post we’re going to discuss getting rid of anchors.