Slowly Changing Dimensions (SCD) in Data Warehousing are dimensions that change slowly over time, rather than at regular intervals. Think, for example, of a customer’s address. This is an attribute that will not change on every regular data refresh, but it could change at some point in the future. In Data Warehouses -and potentially in any other system that uses an ETL approach to get data- there is a need to track those changes, in order to be able to perform accurate aggregated reporting.
Example Use Case
Let us assume that we are developing a Data Warehouse for the sales department of an organization. In this scenario we would most likely also want to report sales by geography (this can include dimensions/attributes such as region, city, zip code and even country or continent). Now assume that we have a record in our customer dimension with the -oversimplified- attributes Name: John Doe and Address: New York, USA. Whenever we report sales for New York, John Doe’s purchases will -correctly- be included in the results. At some point John Doe decides to move to Los Angeles, so our operational data source updates his record with the new address. The next time the ETL pipeline runs, it will update John Doe’s address in our Data Warehouse as well, overwriting the old one. From then on, in the sales-per-city report John Doe’s purchases will all be aggregated under Los Angeles. That is correct for current and future purchases, but Los Angeles will also be credited with sales that actually happened in a different city, and New York will lose all those sales. This will cause a lot of issues and confusion for the consumers of the reports. Just think, for example, that the bonuses of the sales representatives could be based on those reports; this report would lead to an unfair bonus distribution for the sales representatives in those two cities.
Dealing with SCD – Theory
There are several approaches to tackle those cases. Typically you will hear them referred to as Type x, where x is a number. Let us discuss them before we move on to implementing the handling of an SCD.
Type 0
This is a passive approach: we simply do nothing when such a change happens. So, for our previous example, John Doe will always be in New York in our Data Warehouse.
Type 1
Overwriting the old value. With this approach we simply update our records with the newer data and keep no track of the history of a record in our database. This is what we actually did in our example.
Before Change:
CustomerKey | NativeKey | Name | Address |
1 | 100 | John Doe | New York |
After Change:
CustomerKey | NativeKey | Name | Address |
1 | 100 | John Doe | Los Angeles |
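In SQL terms, a Type 1 change is a plain overwrite. A minimal sketch, assuming the table layout used in the implementation section below:

```sql
-- Type 1: overwrite in place; the New York value is lost.
UPDATE customer_dimension
SET address = 'Los Angeles'
WHERE native_id = 100;
```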
Type 2
In this approach, we create a new record whenever a change is introduced. This way, we keep the whole history of the record in our existing dimension. To do this, we need a unique identifier for the record in the data source. This should already be there as the native/business key (the primary key of the operational data source). We also need a flag attribute that defines whether the record is current, and some implementations additionally add start/end date columns. Those date columns default to today for the start date and to something like 31-12-9999 for the end date. With this approach there is always exactly one record flagged as current for a given native key -and, if we have added the date columns, only one record with 31-12-9999 as its end date.
This is overall a very clean approach, but it involves several database operations, which makes it somewhat expensive. We can reduce that cost by only adding the current flag and dropping the date columns.
Before Change:
CustomerKey | NativeKey | Name | Address | StartDate | EndDate | IsCurrent |
1 | 100 | John Doe | New York | 01-01-2021 | 31-12-9999 | true |
After Change:
CustomerKey | NativeKey | Name | Address | StartDate | EndDate | IsCurrent |
1 | 100 | John Doe | New York | 01-01-2021 | 20-02-2021 | false |
2 | 100 | John Doe | Los Angeles | 21-02-2021 | 31-12-9999 | true |
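With the date columns in place, point-in-time questions become simple range lookups. A sketch, assuming the columns shown above (in the snake_case naming of the implementation section):

```sql
-- Which address was valid for this customer on 10-02-2021?
SELECT address
FROM customer_dimension
WHERE native_id = 100
  AND DATE '2021-02-10' BETWEEN start_date AND end_date;
```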
Type 3
Type 2, with its variations, is probably the most commonly used technique to handle SCD. Type 3, on the contrary, is probably the least used one. Here we add a new column for each attribute whose changes we want to track, in the form of a current/previous column pair. While with Type 2 we can track changes to the whole record, with Type 3 we are limited to the attributes we have explicitly defined that we want to track. One more downside of this approach is that we can only keep the latest change -which, on the other hand, may be exactly the requirement in our use case.
Before Change:
CustomerKey | NativeKey | Name | CurrentAddress | PreviousAddress |
1 | 100 | John Doe | New York | New York |
After Change:
CustomerKey | NativeKey | Name | CurrentAddress | PreviousAddress |
1 | 100 | John Doe | Los Angeles | New York |
Type 4
Usage of an additional table. In this approach we keep only the most recent (current) data in our dimension and use a history table to keep the changes that have happened (including or excluding the current records; this depends on how we plan to expose the data for consumption, but it is helpful for tracking start dates). Typically the history table is a mirror of the main dimension with start and end date columns (defaulting as we discussed earlier).
Main dimension/table:
CustomerKey | NativeKey | Name | Address |
1 | 100 | John Doe | Los Angeles |
History dimension/table:
CustomerKey | NativeKey | Name | Address | StartDate | EndDate |
1 | 100 | John Doe | New York | 01-01-2021 | 20-02-2021 |
1 | 100 | John Doe | Los Angeles | 21-02-2021 | 31-12-9999 |
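In this layout, historical questions are answered from the history table alone. A sketch, assuming the columns shown above:

```sql
-- Full address history for one customer, oldest version first.
SELECT address, start_date, end_date
FROM customer_history
WHERE native_id = 100
ORDER BY start_date;
```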
Type 6
We did not forget a type. Type 6 is a combination of Types 1, 2 and 3 (hence 1 + 2 + 3 = 6). To implement it, our dimension needs the set-up of both Type 2 and Type 3. Then, for each update in the data, we insert the new record into the table, mark the previous record as not current and/or set its end date, and overwrite the tracked current column in the earlier records (this is the Type 1 aspect). This is overall a very rare case and adds a lot of overhead, both in implementation and in resources.
Before change:
CustomerKey | NativeKey | Name | CurrentAddress | PreviousAddress | StartDate | EndDate | IsCurrent |
1 | 100 | John Doe | New York | New York | 01-01-2021 | 31-12-9999 | true |
After Change:
CustomerKey | NativeKey | Name | CurrentAddress | PreviousAddress | StartDate | EndDate | IsCurrent |
1 | 100 | John Doe | Los Angeles | New York | 01-01-2021 | 20-02-2021 | false |
2 | 100 | John Doe | Los Angeles | New York | 21-02-2021 | 31-12-9999 | true |
This concludes the theoretical part of Slowly Changing Dimensions. As we can see, there is a lot of flexibility in how to handle them. Typically we would choose between Types 1, 2 and 3; those three approaches cover the vast majority of the scenarios we will come across.
Implementation
Let us now check some examples of the mentioned approaches to handling SCD. For context, let us assume that we want to write a script that moves data from our staging area to our Data Warehouse, and that the data is based on the example shown above (customer data). The initial tables we will use are:
create table customer_staging
(
    id      serial not null
        constraint customer_staging_pk primary key,
    name    varchar not null,
    address varchar not null
);

create table customer_dimension
(
    customer_key serial not null
        constraint customer_dimension_pk primary key,
    native_id    integer not null,
    name         varchar not null,
    address      varchar not null
);
The SQL examples are written for PostgreSQL and would require some changes depending on the SQL dialect you are working with.
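If you would like to follow along, you can seed the staging table with a row matching the running example (the values are, of course, made up):

```sql
-- Illustrative seed data for the running example.
INSERT INTO customer_staging (name, address)
VALUES ('John Doe', 'New York');
```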
Type 0
In the case where we are not required to perform any changes, we can simply get the data from the data source and insert anything that is not already there. Most likely we will also need to make sure that the native key is unique in this case.
INSERT INTO customer_dimension (native_id, name, address)
SELECT id, name, address
FROM customer_staging
WHERE id NOT IN (SELECT native_id FROM customer_dimension);
Type 1
For Type 1, what we actually want is an upsert: if the record exists, update it; if it does not, insert it. Ideally we want to split the records into new ones and ones to be updated, as two different parts, to have more control over the process.
WITH data_to_insert (id, name, address) AS (
    SELECT id, name, address
    FROM customer_staging
    WHERE id NOT IN (SELECT native_id FROM customer_dimension)
), data_to_update (id, address) AS (
    SELECT cs.id, cs.address
    FROM customer_staging cs
    JOIN customer_dimension cd
        ON cs.id = cd.native_id AND cs.address != cd.address
), inserted_records AS (
    INSERT INTO customer_dimension (native_id, name, address)
    SELECT id, name, address
    FROM data_to_insert
    RETURNING *
), updated_records AS (
    UPDATE customer_dimension
    SET address = data_to_update.address
    FROM data_to_update
    WHERE native_id = data_to_update.id
    RETURNING *
)
SELECT (SELECT count(*) FROM inserted_records) AS new_records,
       (SELECT count(*) FROM updated_records) AS modified_records;
So what we do here is split the data of the staging table. We get any new records (based on the existence of the native key in the dimension table); then we check for records that exist on both sides but have a different address -or any other attribute we may need to check- and keep them in a separate CTE. Then we simply use those two CTEs to perform the relevant actions, INSERT and UPDATE, based on the data they hold. In the end we return a count for each of the two. The above script is a good base to build on from now on.
Type 2
Before implementing Type 2, we want to make some changes to our table. We will add a start_date column defaulting to today, an end_date column defaulting to something like 31-12-9999, and an is_current column defaulting to true.
ALTER TABLE customer_dimension ADD COLUMN start_date date DEFAULT CURRENT_DATE;
ALTER TABLE customer_dimension ADD COLUMN end_date date DEFAULT '9999-12-31';
ALTER TABLE customer_dimension ADD COLUMN is_current boolean DEFAULT TRUE;
Now that we have this set up, let us see how we need to modify the previous script to handle those as well.
WITH data_to_insert (id, name, address) AS (
    SELECT id, name, address
    FROM customer_staging
    WHERE id NOT IN (SELECT native_id FROM customer_dimension)
), data_to_update (key, id, name, address) AS (
    SELECT cd.customer_key, cs.id, cs.name, cs.address
    FROM customer_staging cs
    JOIN customer_dimension cd
        ON cs.id = cd.native_id
        AND cs.address != cd.address
        AND cd.is_current = TRUE
), update_previous_records AS (
    UPDATE customer_dimension
    SET is_current = FALSE,
        end_date = CURRENT_DATE
    FROM data_to_update
    WHERE customer_key = data_to_update.key
    RETURNING *
), insert_new_records AS (
    INSERT INTO customer_dimension (native_id, name, address)
    SELECT id, name, address
    FROM data_to_insert
    RETURNING *
), insert_updated_records AS (
    INSERT INTO customer_dimension (native_id, name, address)
    SELECT id, name, address
    FROM data_to_update
    RETURNING *
)
SELECT (SELECT count(*) FROM insert_new_records) AS new_records,
       (SELECT count(*) FROM insert_updated_records) AS inserted_modified_records,
       (SELECT count(*) FROM update_previous_records) AS updated_modified_records;
Regarding the new records, we do not do anything different. The records that have changes in their address, though, we treat differently. First we find them in the data_to_update CTE, then we update their is_current and end_date attributes in update_previous_records. Finally, from data_to_update we take the records and insert them as current, with the default start and end dates.
Type 3
For this approach we need to modify our table again. We will drop the columns added for Type 2, rename the address column to current_address, and finally add a previous_address column as well.
ALTER TABLE customer_dimension DROP COLUMN start_date;
ALTER TABLE customer_dimension DROP COLUMN end_date;
ALTER TABLE customer_dimension DROP COLUMN is_current;
ALTER TABLE customer_dimension RENAME COLUMN address TO current_address;
ALTER TABLE customer_dimension ADD COLUMN previous_address VARCHAR;
Note: if you are following along with the examples, it is a good idea to TRUNCATE the customer dimension table now, so that you do not keep several copies of the records you were experimenting with.
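For reference, one way to do that in PostgreSQL (RESTART IDENTITY also resets the serial surrogate key, so the examples start from a clean state):

```sql
-- Empty the dimension and reset its surrogate key sequence.
TRUNCATE TABLE customer_dimension RESTART IDENTITY;
```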
In order to facilitate this we are required to do some changes in our script.
WITH data_to_insert (id, name, current_address, previous_address) AS (
    SELECT id, name, address, address
    FROM customer_staging
    WHERE id NOT IN (SELECT native_id FROM customer_dimension)
), data_to_update (key, address) AS (
    SELECT cd.customer_key, cs.address
    FROM customer_staging cs
    JOIN customer_dimension cd
        ON cs.id = cd.native_id AND cs.address != cd.current_address
), insert_new_records AS (
    INSERT INTO customer_dimension (native_id, name, current_address, previous_address)
    SELECT id, name, current_address, previous_address
    FROM data_to_insert
    RETURNING *
), update_records AS (
    UPDATE customer_dimension
    SET previous_address = current_address,
        current_address = data_to_update.address
    FROM data_to_update
    WHERE customer_key = data_to_update.key
    RETURNING *
)
SELECT (SELECT count(*) FROM insert_new_records) AS new_records,
       (SELECT count(*) FROM update_records) AS updated_records;
Again, for the new records we do not do anything new -except, of course, to add the new field’s value in the INSERT clause, which will be the same for both the current and the previous column for every new record. For the records that have changes, we simply copy the value of current_address into the previous_address field and then update current_address with the new value.
Type 4
For Type 4, as mentioned in the theoretical part, we will need a new table, and we will modify our existing table to keep the data as it was in the beginning (before the Type 2 and Type 3 changes). So our tables would look something like this.
create table customer_dimension
(
    customer_key serial not null
        constraint customer_dimension_pk primary key,
    native_id    integer not null,
    name         varchar not null,
    address      varchar not null
);

create table customer_history
(
    customer_key integer not null,
    native_id    integer not null,
    name         varchar not null,
    address      varchar not null,
    start_date   date default CURRENT_DATE,
    end_date     date default '9999-12-31'
);
Note that we do not use any constraints in the history table.
This will combine some pieces of the code that we have already written.
WITH data_to_insert (id, name, address) AS (
    SELECT id, name, address
    FROM customer_staging
    WHERE id NOT IN (SELECT native_id FROM customer_dimension)
), insert_new_records_dim AS (
    INSERT INTO customer_dimension (native_id, name, address)
    SELECT id, name, address
    FROM data_to_insert
    RETURNING *
)
SELECT (SELECT count(*) FROM insert_new_records_dim) AS new_records;

WITH insert_new_records_history AS (
    INSERT INTO customer_history (customer_key, native_id, name, address)
    SELECT customer_key, native_id, name, address
    FROM customer_dimension
    WHERE customer_key NOT IN (SELECT customer_key FROM customer_history)
    RETURNING *
), data_to_update (key, name, native_id, address, previous_address) AS (
    SELECT cd.customer_key, cd.name, cd.native_id, cs.address, cd.address
    FROM customer_staging cs
    JOIN customer_dimension cd
        ON cs.id = cd.native_id AND cs.address != cd.address
), update_hist_records AS (
    UPDATE customer_history
    SET end_date = CURRENT_DATE
    FROM data_to_update
    WHERE customer_key = data_to_update.key
      AND customer_history.native_id = data_to_update.native_id
      AND customer_history.address = data_to_update.previous_address
      AND end_date = '9999-12-31'
    RETURNING *
), update_records AS (
    UPDATE customer_dimension
    SET address = data_to_update.address
    FROM data_to_update
    WHERE customer_key = data_to_update.key
    RETURNING *
), insert_recent_records_history AS (
    INSERT INTO customer_history (customer_key, native_id, name, address)
    SELECT key, native_id, name, address
    FROM data_to_update
    RETURNING *
)
SELECT (SELECT count(*) FROM insert_new_records_history) AS new_hist_records,
       (SELECT count(*) FROM update_hist_records) AS updated_hist_records,
       (SELECT count(*) FROM update_records) AS update_records,
       (SELECT count(*) FROM insert_recent_records_history) AS most_recent_hist_records;
Notice how we first add the new records and only then start the process of keeping history and updating records. This is one way to do it. Another would be to track the changes in intermediate tables and retrieve the data from there to perform the history updates and inserts. Perhaps the latter is cleaner, but that may be subjective. In addition, keep in mind that if you are handling all of this externally from some software, you can use an ORM; most of them offer history/versioning functionality that typically requires minimal configuration.
Type 6
For this approach, we will modify our dimension table with the following:
ALTER TABLE customer_dimension ADD COLUMN start_date date DEFAULT CURRENT_DATE;
ALTER TABLE customer_dimension ADD COLUMN end_date date DEFAULT '9999-12-31';
ALTER TABLE customer_dimension ADD COLUMN is_current boolean DEFAULT TRUE;
ALTER TABLE customer_dimension RENAME COLUMN address TO current_address;
ALTER TABLE customer_dimension ADD COLUMN previous_address VARCHAR;
And our script will be:
WITH data_to_insert (id, name, current_address, previous_address) AS (
    SELECT id, name, address, address
    FROM customer_staging
    WHERE id NOT IN (SELECT native_id FROM customer_dimension)
), data_to_update (key, native_id, name, address, previous_address) AS (
    SELECT cd.customer_key, cs.id, cs.name, cs.address, cd.current_address
    FROM customer_staging cs
    JOIN customer_dimension cd
        ON cs.id = cd.native_id
        AND cs.address != cd.current_address
        AND cd.is_current = TRUE
), update_records AS (
    -- Type 1 aspect: overwrite current_address in every row of this native key;
    -- close (end-date) only the row that was current until now.
    UPDATE customer_dimension
    SET current_address = data_to_update.address,
        end_date = CASE WHEN customer_dimension.is_current
                        THEN CURRENT_DATE
                        ELSE customer_dimension.end_date END,
        is_current = FALSE
    FROM data_to_update
    WHERE customer_dimension.native_id = data_to_update.native_id
    RETURNING *
), insert_updated_records AS (
    -- Type 2 + 3 aspects: insert the new current row, carrying the old address
    -- over as previous_address.
    INSERT INTO customer_dimension (native_id, name, current_address, previous_address)
    SELECT DISTINCT native_id, name, address, previous_address
    FROM data_to_update
    RETURNING *
), insert_new_records AS (
    INSERT INTO customer_dimension (native_id, name, current_address, previous_address)
    SELECT id, name, current_address, previous_address
    FROM data_to_insert
    RETURNING *
)
SELECT (SELECT count(*) FROM insert_updated_records) AS inserted_updated_records,
       (SELECT count(*) FROM insert_new_records) AS new_records,
       (SELECT count(*) FROM update_records) AS updated_records;
The tricky part here is to update the correct records with the correct information every time.
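A simple way to build confidence is a sanity check after each run; this query (a sketch against the schema above) should return no rows, since every native key must have exactly one current record:

```sql
-- Native keys that do not have exactly one current record (expected: none).
SELECT native_id
FROM customer_dimension
GROUP BY native_id
HAVING count(*) FILTER (WHERE is_current) != 1;
```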
Conclusion
That was a long post but, hopefully, you now have a better understanding of SCD, which is a critical aspect of such projects and is often underused or misused. Overall, the suggestion is to go with Type 2 when you design your schema and proceed from there, according to your requirements.