Application code as a derivation function
When one dataset is derived from another, it goes through some kind of transformation function. For example:
- • A secondary index is a kind of derived dataset with a straightforward transformation function: for each row or document in the base table, it picks out the values in the columns or fields being indexed, and sorts by those values (assuming a B- tree or SSTable index, which are sorted by key, as discussed in Chapter 3).
- • A full-text search index is created by applying various natural language processing functions such as language detection, word segmentation, stemming or lem- matization, spelling correction, and synonym identification, followed by building a data structure for efficient lookups (such as an inverted index).
- • In a machine learning system, we can consider the model as being derived from the training data by applying various feature extraction and statistical analysis functions. When the model is applied to new input data, the output of the model is derived from the input and the model (and hence, indirectly, from the training data).
- • A cache often contains an aggregation of data in the form in which it is going to be displayed in a user interface (UI). Populating the cache thus requires knowledge of what fields are referenced in the UI; changes in the UI may require updating the definition of how the cache is populated and rebuilding the cache.
The derivation function for a secondary index is so commonly required that it is built into many databases as a core feature, and you can invoke it by merely saying CREATE INDEX. For full-text indexing, basic linguistic features for common languages may be built into a database, but the more sophisticated features often require domain- specific tuning. In machine learning, feature engineering is notoriously application- specific, and often has to incorporate detailed knowledge about the user interaction and deployment of an application .
When the function that creates a derived dataset is not a standard cookie-cutter function like creating a secondary index, custom code is required to handle the application-specific aspects. And this custom code is where many databases struggle. Although relational databases commonly support triggers, stored procedures, and user-defined functions, which can be used to execute application code within the database, they have been somewhat of an afterthought in database design (see “Transmitting Event Streams” on page 440).