240 tables and no documentation?
A bit of advice on extracting the logical schema from database tables
Someone on r/Database asked for advice on preparing the model of a database of 240+ tables with no coherent documentation. Here is a slightly edited version of my response. I guess that the situation is pretty typical, including the fact the person tasked with this is an intern.
240 tables is quite a lot. You may want to manage everyone's expectations and set a goal for maybe about half of that: that would already be an achievement!
Okay, here is how I would have approached this problem.
First, forget about full-database graphical schemas: they are cluttered, they don't really explain anything, and they are noisy with the implementation artifacts.
Dump your SQL schema as a series of CREATE TABLE operators. Remove all the repetitive junk (such as "if not exists" clauses, special variables, etc.). The idea is to make it readable.
You may want to print it out. You need physical pieces of paper that you can move around and group, just like they do in the movies when solving complicated murder cases.
Create a separate list of just the table names, sorted alphabetically. Print out this list in a larger font. Try to find some logic in it, make yourself acquainted with all the "main characters". Sorted list will automatically bring some of the related tables closer to each other (but there could be other grouping patterns that you will discover).
You may want to have some color markers at that point, too.
If you know how to use Git that would be perfect: create an empty repository, put your SQL schema there. You're going to be deleting the lines that you processed, understood and included in the output, so that you have less and less lines to work with.
As an output you would have a Google Doc, or spreadsheet, or maybe some other text-based format. Just don't use graphic tools: they're not very well suited for the job like this.
Logical structure and physical tables
I would suggest using a simple 3-component vocabulary:
nouns (aka entities, aka anchors);
attributes (such as "name of the user" or "is this order active?"), and
links between two nouns.
Your main output would be a list of nouns, with attributes under each noun, and a separate list of links. Also, you should keep a log of questions and notes that you had while working on this task: it’s going to be useful later.
Typical tables in the typical database contain ID + several attributes of a certain noun. They can be quite wide sometimes, but may well be narrow, depending on how the modeling process went over the years.
Some tables are link tables: they contain two ID fields from two different nouns, and maybe a bit more: timestamp, a primary key, or one or more attributes such as "seq. number". Often such tables have just two fields, which is old-school but perfect. Link tables are used for N:M links, but may sometimes be used for 1:N links. See also below.
Begin with nouns. You probably already know the main nouns that are used in your company's business. User, Order, Payment, Post, Comment, InvoiceItem, whatever. Write them down. Find the tables where those nouns are defined, they're going to be in your "spotlight" at the beginning.
Now you need to classify each field in the tables that you work with. A typical noun table has the following kinds of fields:
primary keys (not null, auto-increment): ID of a noun, would be used in link tables and in other side tables;
nouns ID (without auto-increment): those refer back to the main table of the noun;
attribute field: the actual data about the noun: name of the user, payment amount, name of the item, text of the post, etc, etc.
timestamps: all kinds of timestamps that show when the table row was created or updated. Timestamps are most often recorded automatically, using some kind of “current timestamp” function; they also tend to have predefined names such as “created_at”.
NB: some timestamps could actually be attributes. For example, if your database records the time and date of birth, it's going to be the attribute.
link to the second noun: this is used for 1:N links. E.g. if you have a table for posts, you may record the author's ID as posts.author_id field (alternatively, you may have created a special 2-column table for that, but that happens rarely);
secondary, or derived, data: all kinds of data duplication, precomputed values, etc. This is something that was added for speed, or convenience. For such fields you should write down where the primary data is stored.
junk: unused fields, fields that are not mentioned in the code base, fields with weird names like 'tmp_to_delete', fields always containing the same value, etc.
fields that you genuinely do not understand — try to find somebody with the local knowledge who can help you with making sense of those;
So, again, you need to convert the physical table structure to the logical schema based on nouns/attributes/links. Write down the nouns; write down their attributes; write down links separately (they always belong to both nouns, so don't try to put them under nouns).
Remove the processed fields from the initial text files (attributes that you wrote down; junk; secondary). Update the list of junk fields that you’ve found; update the list of secondary fields. Commit your work if you use Git. When you're done with a table, add another one to your spotlight.
Here is an example of this kind of analysis that I use as an example in my data modeling course:
CREATE TABLE `fr_comments` ( -- anchor: Comment `id` int(11) NOT NULL AUTO_INCREMENT, -- anchor ID: Comment `fr_post_id` int(11) NOT NULL, -- link: Post has Comments (1:N) `fr_user_id` int(11) NOT NULL, -- link: User writes Comments (1:N) `body` text NOT NULL, -- attribute: Comment.text `md5_id` varchar(255) NOT NULL, -- ??? wtf `published_at` datetime DEFAULT NULL, -- attr: Comment.publication_date `created_at` datetime(6) NOT NULL, -- automatic timestamp `comment_id` int(11) DEFAULT NULL, -- link: Comment generated from OriginalComment `wordcount` int(11) DEFAULT NULL, -- secondary data, derived from the `body` attribute PRIMARY KEY (`id`), ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Just from this one table we are able to extract the following:
# Nouns noun Comment attr Comment.body attr Comment.publication_date noun Post ... noun OriginalComment ... noun User ... ... # Links link Post has Comments link User writes Comments link Comment corresponds to the OriginalComment # Notes * what is md5_id?
Some more suggestions: This work will take time, weeks maybe. Sometimes you will have to pause and do something else. You will really need to have some time for background processing, to talk to people, to review your intermediate results. You will not be able to sit down and write down 240 tables at once.
How do you choose the tables that are in your spotlight? Sometimes you have to abandon your current spotlight, just because you're not making progress. You'll get unstuck if you switch to a different area of the database.
At different times you could try all possible selection methods to choose the spotlight candidates:
begin with the "most important" tables;
try a logical group of tables, e.g. beginning with a common prefix;
try a random order :)
try one of the widest tables first;
try one of the narrow tables first;
Count the number of lines in your original SQL schema text file. Count the number of tables still to go. Count the number of nouns/attributes/links that you've extracted from the schema so far. Enter those numbers in a spreadsheet every couple of days, and watch the graphs grow.
After the first couple of modeling sessions, rethink your strategy: does this advice help? You will probably learn which tactics work for you, and which don't.
Hope it helps,
P.S.: The UNION ALL post seems to be more time-consuming than I thought, so meanwhile I decided to take some time out and post something more practical.
P.P.S.: follow me on Twitter: https://twitter.com/alexeymakhotkin.