Performance benefits of surrogate keys in Dimensional Models
by Pete Stiglich, EWSolutions
There are many reasons for implementing surrogate keys in Dimensional Models such as insulating dimensions from changes to source systems and enabling historical versioning of dimension members. However, query performance is another primary reason for incorporating surrogate keys into your Dimensional Models that should not be neglected.
A surrogate key is a non-intelligent, system generated, numeric (integer or smallint) value assigned as the primary key of a dimension. An alternate key (aka natural key) must always be defined as well for your dimensions in order to understand the granularity of the dimension – something vitally important for enabling conformed dimensions across multiple fact tables and areas of analysis.
You may sometimes hear that surrogate keys are not needed due to increases in performance capabilities in hardware – however, due to the ever expanding volume of data, the increasing volume of queries, and the data volumes queried it is important to make query joins as efficient as possible.
According to the 2006 OLAP Survey, the most commonly reported problem is poor query performance. Using surrogate keys is the foremost means to optimize dimensional queries on a RDBMS platform. The primary reasons are:
• Simplified, high performance joins
• Reduced I/O operations
• RDBMS optimizations
Simplified, high performance Joins
Using a surrogate key will simplify the join between a fact and dimension table. A single, small numeric value (usually 4 bytes (integer) or 2 bytes (smallint)) is scanned in the fact table or fact table index, rather than a large character value or multiple attributes of mixed datatypes. Obviously, the join on the single, small numeric value will be faster. It is very common for dimensions to have multi-part natural keys comprised of the business key (attributes that are well known to business users as providing uniqueness in source systems) plus requisite meta data tags (e.g. source system identifier, effective date). The natural key describes the uniqueness and granularity of the dimension table.
These multi-part natural keys often consist of character values that can add up to a significant amount of space when your fact tables consist of millions or billions of records. It is very common for a Business Intelligence (BI) query to need to retrieve a very large number of records from the database, even though the results presented to the user are almost always summarized.
In Dimensional models, it is very common to need to view the state of a dimension member as it appeared as of fact occurrence date. If the Type II Slowly Changing Dimension method is used for handling changes in a dimension, changes to a dimension member will result in the existing record being “deactivated” – e.g. the inactive date is set to the previous date, and the current values are inserted into a new record. Without a surrogate key, an equi-join cannot be performed to accurately join the fact and dimension tables – instead, an inefficient date range operation must be performed.
select d.col1, d.col2, sum (f.amt)
from dimension d, fact f
where d.col3 = f.col3
and f.tran_date between d.effective_date and d.inactive_date
and d.col4 = “Widgets”
group by d.col1, d.col2
When a surrogate key is used, the task of optimizing queries is reduced – there is only a single possible way to join a fact table to the dimension. A side benefit of this is that users cannot incorrectly join a dimension and fact table. In the above example, if the user forgot to include the range join on the fact transaction date, double counting could result.
Simplifying dimension joins by utilizing a surrogate key will be appreciated by anyone who has to troubleshoot or tune complex analytical queries generated by BI tools. A fact table usually has many associated dimension tables – it would not be unusual to have fifteen (15) dimensions that could join to a fact table. In addition, many dimensions may have multiple relationships (roles) to a fact table. There are usually many types of dates that need to be represented with a fact record – each of which would need to join to the date dimension. For example, on an Order fact table you might have an Order Placed Date, Order Shipped Date, Order Invoice Date, etc.
Reduced I/O Operations
While disk is (relatively) cheap, usually very many fact table rows have to be read into memory in order to satisfy queries – if the records are not already in memory. Relational databases perform I/O using pages. A page can contain many data or index rows, so if a page contains a row required to satisfy a query, the data page (with all the rows contained in it) is read into memory. Using natural keys increase the width of the fact table, so fewer fact table rows can be stored on a data page. The result is increased I/O operations - usually the main cause of performance bottlenecks.
A significant amount of the time spent during ETL processing can be attributed to index builds. Utilizing surrogate keys will reduce amount of time spent in building and maintaining indexes and will reduce the amount of space allocated for indexes. As a result, the indexes will be more compact and efficient.
Enables RDBMS optimizations
Using surrogate keys help enable database optimizations developed specifically for use with dimensional models, e.g. bitmap indexing, star transformations.
Some database features developed specifically for Star Schemas require single part foreign keys. Due to the performance impact of enabling Referential Integrity (RI) for a dimensional model, foreign key constraints are usually not enforced in the database. Foreign keys in the fact table, however, usually do have a bitmap or b-tree index on the column. Unlike a B-Tree index, most bitmap indices cannot be comprised of multiple columns. Bitmap indices were developed specifically for BI application and can have a dramatic impact on query performance.
In order for a RDBMS query optimizer to execute a query using a Star Transformation, a single part foreign key with a bitmap index is required. Star Transformations will “rewrite” a query in order to optimize it by using subqueries that take advantage of the bitmap indices on the foreign keys.
Conclusion
The increases in performance capabilities of hardware and software plus the low cost of DASD may tempt one to forego using surrogate keys. More powerful infrastructure does not negate the need for good data architecture. Using natural keys in dimensional models will usually be a mistake. Use of surrogate keys should be considered the standard for dimensional models, unless there are very specific, valid, and justified reasons. Surrogate keys have significant performance benefits, in addition to other data architecture benefits.
About the Author
Pete Stiglich is a Senior Consultant with EWSolutions with over 20 years of IT experience in the fields of Data Modeling, Data Warehousing/Business Intelligence, Meta Data Management, Enterprise Architecture, Customer Relationship Management (CRM), Customer Data Integration (CDI), Database Design and Administration, Data Quality, and Transaction Processing in the Medical, Retail, Banking, Manufacturing, Telecom, Insurance, and Government industries.
Pete has developed and taught courses on Conceptual Data Modeling, Dimensional Data Modeling, SQL, Data Quality, and XML. Pete’s articles on Data Modeling have been published in Real World Decision Support and InfoAdvisors. Pete has presented at DAMA at the international and local level.
Pete can be reached at ewsolutions.com