Minimal Modeling and GenAI
A minimal model can be used as a source of high-quality textual content to feed into your LLM.
It seems that a lot of people really want to ask questions about their data using natural language, and receive appropriate responses. Let’s narrow that problem down to “generate a SQL query based on a free-form question in natural language, using an LLM”. E.g., we may ask “how many users do we have?” and receive “SELECT COUNT(*) FROM users” as a response.
To generate such queries, the LLM needs to understand the database schema (in a broad sense) and the business-level vocabulary (e.g., what ARR means as applied to a specific database).
First, let’s see what sort of tokens we can extract from a typical, unprepared database. Then we’ll discuss what Minimal Modeling brings to the table.
Extracting tokens from unprepared database
The main source of tokens in any database is the SQL schema definition. Typically it is just a sequence of “CREATE TABLE” statements. The most interesting tokens are the names of tables and columns; sometimes there are also stray comments written by people. Data types may be an additional source of information about the meaning.
Understanding table/column names is sometimes very straightforward, sometimes virtually impossible, both for people and for an LLM:
You can have simple words such as “users” and “tax_rate”.
For legacy systems, you would sometimes have abbreviated names. For example, in the classic “EMP” database you may find names such as “mgr” and “hisal”. They mean “ID of the manager” and “upper salary range”. Whether an LLM would be able to figure out what they mean is an open question.
For some other legacy systems you may have more or less meaningless auto-generated names such as “L168”.
Sometimes the names are in a foreign language, e.g. “Gesprekdoel”.
Etc., etc.
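As a minimal sketch of this first step, here is how schema tokens could be pulled out programmatically. This uses SQLite and a made-up two-table schema purely for illustration; a real pipeline would read your actual database’s catalog:

```python
import sqlite3

# Hypothetical sketch: pull table and column names out of a database
# so they can be fed to an LLM as tokens. The schema here is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE emp (empno INTEGER, mgr INTEGER, hisal NUMERIC);
""")

tokens = []
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
for table in tables:
    tokens.append(table)
    # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk)
    for col in conn.execute(f"PRAGMA table_info({table})"):
        tokens.append(f"{table}.{col[1]} ({col[2]})")

print(tokens)
```

The output is exactly the raw material discussed above: readable names like “users.email”, plus opaque ones like “emp.hisal” that the LLM may or may not be able to decode.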
Metadata in table data
In many real-world schemas people define an additional level of metadata. One very common example is EAV (Entity-Attribute-Value). Under this approach, a table would look like:
CREATE TABLE restaurant_attributes (
  restaurant_id INTEGER NOT NULL,
  attr_name VARCHAR(32) NOT NULL,
  attr_value VARCHAR(256) NOT NULL,
  PRIMARY KEY (restaurant_id, attr_name),
  INDEX (attr_name)
);
Such a table may contain an arbitrary number of various attributes. Their names are stored in the “attr_name” column. So, to get a list of all possible attribute names, we may have to execute a query such as “SELECT DISTINCT attr_name FROM restaurant_attributes”, and feed this list to the LLM. We also have to explain to the LLM that querying this table requires an SQL fragment such as “SELECT attr_value FROM restaurant_attributes WHERE attr_name = 'operating_license_number' AND restaurant_id = ?”.
Please note that we cannot do this for each and every table: for most tables the contents are the actual data (such as item names and people’s addresses), not metadata (such as attribute names).
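A minimal sketch of this extraction, using SQLite and made-up attribute data:

```python
import sqlite3

# Hypothetical sketch: collect the attribute names stored in an EAV
# table, so the LLM can learn this extra level of metadata.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE restaurant_attributes (
        restaurant_id INTEGER NOT NULL,
        attr_name VARCHAR(32) NOT NULL,
        attr_value VARCHAR(256) NOT NULL,
        PRIMARY KEY (restaurant_id, attr_name)
    );
    INSERT INTO restaurant_attributes VALUES
        (1, 'operating_license_number', 'NYC-4711'),
        (1, 'cuisine', 'italian'),
        (2, 'cuisine', 'thai');
""")

# The distinct names, not the values, are the metadata we want.
attr_names = [r[0] for r in conn.execute(
    "SELECT DISTINCT attr_name FROM restaurant_attributes ORDER BY attr_name")]
print(attr_names)
```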
Metadata in JSON
Modern databases often use JSON-typed columns. JSON is another example of an additional level of metadata. To find out the structure of a JSON column, we need to select a random sample of data from it, mask out the actual data values (preserving their types), and feed the names of the keys as tokens to the LLM.
Note that it’s not clear how to make sure that all possible keys are found by this random sampling. Rarely-used keys may be accidentally skipped, and the LLM will then not know about them.
Another way to deal with that is to find or write down a JSON schema that corresponds to that column, and feed it into the LLM.
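A minimal masking sketch, assuming a JSON sample has already been fetched from the column (the sample document here is made up):

```python
import json

# Hypothetical sketch: mask actual JSON values, preserving only the
# key structure and value types, before feeding samples to an LLM.
def mask(value):
    if isinstance(value, dict):
        return {k: mask(v) for k, v in value.items()}
    if isinstance(value, list):
        return [mask(v) for v in value]
    return type(value).__name__  # e.g. 'str', 'int', 'bool'

sample = json.loads(
    '{"license": "NYC-4711", "seats": 40, "delivery": true, '
    '"tags": ["italian", "pizza"]}')
print(mask(sample))
```

The keys and type names survive; the potentially sensitive values (“NYC-4711”) do not.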
Enums etc.
Yet another source of metadata is enums. Suppose that you have some sort of orders that can be in different statuses, such as “pending”, “paid”, “delivered”, etc. Usually you would have some sort of “orders.status” column that contains values from a small fixed set.
How does the LLM know the set of possible statuses? It’s easy if the column is defined as some sort of “enum” type: then the list of possible values is directly available as part of the SQL schema. But sometimes you would use a more opaque column type such as “VARCHAR”, with the list of allowed values maintained somewhere else. You have to prepare that list for LLM consumption, so that it becomes possible to answer questions such as “What is the percentage of pending orders that we have?”
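When the column is a plain “VARCHAR”, a first approximation is to query the distinct values that actually occur (a sketch with made-up orders below). Note the same caveat as with JSON sampling: a status that never occurs in the current data will be missed:

```python
import sqlite3

# Hypothetical sketch: recover the de-facto enum of a plain VARCHAR
# status column by querying its distinct values.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        status VARCHAR(16) NOT NULL
    );
    INSERT INTO orders (status) VALUES
        ('pending'), ('paid'), ('paid'), ('delivered'), ('pending');
""")

statuses = sorted(r[0] for r in conn.execute(
    "SELECT DISTINCT status FROM orders"))
print(statuses)
```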
Sample queries
To train an LLM, we need to gather as many human-created token sequences as possible. You probably have those in the form of SQL queries that were written by people. It’s even possible that some of those queries are accompanied by comments, either in-line or as commit messages.
You can get all sorts of queries from your pipeline definitions etc.
There are probably a few more minor sources of textual tokens that you can extract from an unprepared database. In particular, we could think about analyzing samples of actual data and extracting some additional structural information from them. We need to be careful here, though: if you feed user-provided data from the database directly into an LLM, you could probably get an interesting equivalent of the “; DROP TABLE users” exploit, i.e., prompt injection.
Making sense of tokens
Now let’s discuss how we can assign semantics to all those “hisal”, “L168”, “operating_license_number”, etc. Most probably, a big part of understanding the meaning of those strings would come from the public text corpora that were used to train the base LLM.
For example, I’m pretty sure that most LLMs can fully describe the schema of the EMP database, because it has been discussed in the literature. So they probably have no trouble understanding what “hisal” means.
But we’re interested in making sense of non-public databases. My proposal is, of course, to use Minimal Modeling.
The Minimal Modeling approach represents every database as four tables: a list of anchors, a list of attributes, a list of links, and a list of secondary data. Each table (the word is awfully overloaded) consists of a small number of columns. Here is how we could describe a sample of the EMP database.
Anchors
Attributes
Links
Secondary data
Let’s skip an example of documenting secondary data for now. Secondary data is usually created ad hoc to improve the performance or convenience of certain queries. Teaching the LLM when to query secondary data and when to use the primary data directly is a separate topic.
Also, I’m going to write up a separate post on documenting secondary data, because there was a lot of progress with thinking about that recently.
Minimal Model as source of textual tokens
The tables above are organized for human consumption. To feed them to an LLM, you will have to rewrite them in some structured textual form, but this is a straightforward exercise.
What those tables give you is a source of human-written tokens. The most important are the “Anchors/Name”, “Attributes/Question” and “Links/Sentences” columns. They contain something that only humans can know: business-level semantics, expressed in natural language.
The second part is the “SQL query” columns. They show:
where the IDs for each anchor are stored;
how to get attribute values;
how to get the list of IDs for a link.
Using this information, the LLM can learn the meaning of each table column.
Additionally, you get the information on data type, cardinality (1:1, 1:N, N:M), and some sample values.
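For illustration, here is a minimal sketch of serializing one such row into LLM-ready text. The field names, the sample attribute, and the wording of the output are all assumptions, not a fixed format:

```python
# Hypothetical sketch: turn one Minimal Modeling attribute row into a
# plain-text snippet for the LLM prompt. All names here are made up
# for illustration.
attribute = {
    "anchor": "User",
    "question": "What is this user's email?",
    "data_type": "string",
    "example": "alice@example.com",
    "sql": "SELECT email FROM users WHERE id = ?",
}

def describe(attr):
    # One line per attribute keeps the prompt compact and uniform.
    return (
        f"Attribute of anchor \"{attr['anchor']}\". "
        f"Question: {attr['question']} "
        f"Type: {attr['data_type']}, example: {attr['example']}. "
        f"Query: {attr['sql']}"
    )

print(describe(attribute))
```

Concatenating such snippets for every anchor, attribute and link gives the LLM both the human-written semantics (the question) and the machine-checkable grounding (the query).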
Required effort
Each anchor, attribute and link is described by one row.
Anchors require three pieces of information:
name;
SQL query;
ID example.
Attributes require five pieces of information:
anchor name;
question (single sentence);
data type;
example value;
SQL query.
Links require four pieces of information:
linked anchors;
two sentences;
cardinality;
SQL query.
All three lists are incremental: rows can be added by different people independently. Adding a row takes 3–5 minutes if you are familiar with that part of the database; if you’re still learning the database yourself, it can take longer.
Conclusion
Minimal Modeling can be used as a source of high-quality textual content to feed into your LLM. The default textual representation of a typical database’s structure may not be enough, because only column identifiers are available. The main problems with interpreting column names are abbreviations, unreadable naming strategies, and “schemaless” approaches such as JSON and EAV.
Note that the situation is basically the same for LLMs and for people. If you ask somebody who is new to the company to prepare some data, using plain English, they would first have to find out where the data is actually stored. They would run into conceptually the same problems as an LLM would. It’s not surprising that the Minimal Modeling approach helps both people and LLMs.
I’m planning to do a full-size experiment: completely describe one non-public database that I know very well, and see how successful an LLM is at generating queries against that database. That would be fun.