Windows server

The data warehouse runs entirely on a Windows server (called TF-DW) that was provisioned by Prestige. This server is responsible for all the data processing pipelines. It also runs the software that interacts with external APIs, such as those used for credit checks, as much of the Ruby code is shared between that software and the data warehouse. Prestige manages user access to TF-DW.

Pipeline operations

The imported data in the data warehouse is stored in a SQL Server instance hosted on Azure. Each pipeline uses its own schema in this database to namespace its tables. Additionally, the source data is stored in one or more data lakes. This allows the data warehouse to be rebuilt completely from scratch, either for disaster recovery or if a change in requirements invalidates the existing design in some way.

The data pipelines are run by calling a command line program with arguments. A single run of a pipeline is a single invocation of the command; repeated runs are simply repeated invocations. The Windows Task Scheduler automates this as follows:

  • Dancerace: daily at 07:55
  • Sentinel: daily at 08:17
  • Dynamics: 35 mins past every hour between 07:35 and 19:35, Mon to Fri

The pipeline runner program is called tfdw. (Commands like this live in the bin/ subdirectory of the repo, so they are prefixed with that path.) Its help text details its arguments, as well as the arguments passed to each data pipeline; view it by running ./bin/tfdw --help.

Each invocation of ./bin/tfdw calls a single pipeline. Passing --environment tells the runner which configuration details to use. Passing --data-lakes tells the pipeline to use one or more storage backends for the data, and passing --loggers configures one or more loggers for the pipeline. Loggers can write to a file or to the Azure logging API, or notify people on certain events, such as errors.
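The authoritative list of flags lives in ./bin/tfdw --help; as a rough illustration only, flags like these could be parsed with Ruby's standard OptionParser (the function and option names below are hypothetical, not the real tfdw implementation):

```ruby
require "optparse"

# Hypothetical sketch of parsing tfdw-style flags. The real argument
# handling is in the tfdw runner; its --help output is authoritative.
def parse_tfdw_args(argv)
  options = { data_lakes: [], loggers: [] }
  OptionParser.new do |opts|
    opts.on("--environment ENV", "Which configuration details to use") do |env|
      options[:environment] = env
    end
    opts.on("--data-lakes LIST", Array, "Storage backends for the data") do |lakes|
      options[:data_lakes] = lakes
    end
    opts.on("--loggers LIST", Array, "Loggers, e.g. Ntfy:ERROR") do |loggers|
      options[:loggers] = loggers
    end
  end.parse!(argv)
  options
end

opts = parse_tfdw_args(%w[--environment production --data-lakes Azure,Local --loggers Azure:INFO,Ntfy:ERROR])
# opts[:data_lakes] => ["Azure", "Local"]; opts[:loggers] => ["Azure:INFO", "Ntfy:ERROR"]
```

Note how the multi-valued flags take comma-separated lists, matching the "multiple backends / multiple loggers" behaviour described above.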

Pipelines are configured to run automatically and have been designed with data atomicity in mind: the data is never left in a broken or half-written state. As these are complex procedures with many moving parts, errors do occur, and many of them have been factored into the design. For instance, fetching snapshot data from Dynamics usually succeeds, but when fetching a table fails due to an issue outside our control, the pipeline throws away all of the fetched data and tries again on the next run; missing a single hourly snapshot was deemed unimportant.
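The all-or-nothing rule for expected fetch failures can be sketched as follows (a minimal illustration, not the real pipeline code; the fetcher and table names are made up):

```ruby
# Sketch of the all-or-nothing snapshot rule: if fetching any table
# fails, the whole snapshot is discarded and nothing is written, so the
# next scheduled run starts again from a clean state.
def fetch_snapshot(tables, fetcher)
  snapshot = {}
  tables.each do |table|
    snapshot[table] = fetcher.call(table)
  end
  snapshot           # only returned if every table fetched successfully
rescue StandardError
  nil                # discard all fetched data; retry on the next run
end

flaky = ->(t) { t == "bad" ? raise("API error") : ["row"] }
fetch_snapshot(%w[ok bad], flaky)   # => nil (snapshot discarded)
fetch_snapshot(%w[ok], flaky)       # => { "ok" => ["row"] }
```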

Other unforeseen errors are handled somewhat differently. In this case, to make sure the data isn’t corrupted, we pause future snapshot imports until someone has investigated and made sure everything is “safe”. This is because the SCD Type 2 system that we have implemented depends on the integrity of prior snapshots. We record snapshot import success and failure for a pipeline in the tfdw_metadata table in a pipeline’s schema.
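The "pause until investigated" behaviour amounts to checking the last recorded outcome before starting an import. A hypothetical sketch (the status strings and row shape here are illustrative, not the actual tfdw_metadata schema):

```ruby
# Hypothetical sketch: consult the most recent recorded snapshot outcome
# and refuse to import while the pipeline is in an uninvestigated state.
def import_allowed?(metadata_rows)
  last = metadata_rows.max_by { |row| row[:run_at] }
  last.nil? || last[:status] != "unknown_error"
end

history = [
  { run_at: Time.utc(2024, 1, 1, 8), status: "success" },
  { run_at: Time.utc(2024, 1, 2, 8), status: "unknown_error" },
]
import_allowed?(history)  # => false: imports stay paused until an operator intervenes
```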

In this case, an operator will manually disable the scheduled task for the pipeline (to prevent multiple pipeline runs on the same schema) and run the pipeline in the production environment, passing flags like --debug (which drops to an interactive shell for investigation when an error occurs) and --skip-check (which allows the pipeline to proceed beyond the “unknown error” state once the operator is satisfied). Once this is complete, the scheduled task can be re-enabled.

Logging

Log levels are fairly standard hierarchical affairs. The current levels are:

DEBUG, INFO, DONE, WARN, ERROR, FATAL & UNKNOWN

As mentioned, each pipeline can utilise multiple loggers. The Azure Monitor API (the Azure logger) is used for centralised logging, as it is searchable and filterable via the web UI. Another notable logger is a pub-sub system called ntfy (the Ntfy logger). This is typically used to alert us when an important event occurs. For example, --loggers Ntfy:DONE will notify when a pipeline completes (or raises a warning, error, etc. - see the hierarchy above) and --loggers Ntfy:ERROR will notify only when a pipeline throws an error (or worse).
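The level hierarchy behaves as you would expect: a logger configured at a given level fires for that level and everything above it. A minimal sketch (illustrative only, not the real logger code):

```ruby
# Sketch of the hierarchical log levels: a logger configured at a level
# fires for events at that level and every level above it.
LEVELS = %w[DEBUG INFO DONE WARN ERROR FATAL UNKNOWN].freeze

def notify?(configured_level, event_level)
  LEVELS.index(event_level) >= LEVELS.index(configured_level)
end

notify?("DONE", "ERROR")  # => true  (Ntfy:DONE also reports errors)
notify?("ERROR", "DONE")  # => false (Ntfy:ERROR stays quiet on completion)
```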

Dynamics pipeline

This pipeline fetches all Dynamics/Leasepath data for the defined tables via the Microsoft Dynamics 365 HTTPS API. If all the tables are fetched successfully, the data is compressed, saved in JSON format to the Azure data lake, and processing begins for each table.

First, the columns for the table are read from the API and new ones are added to the data warehouse schema. This allows Time Finance to add columns to tables in Dynamics/Leasepath and these changes are reflected in the data warehouse immediately and without requiring code changes to the pipeline. Then, each row is read and imported using SCD Type 2 to track changes to the data.
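The core of SCD Type 2 is that an update never overwrites a row; instead the current version is closed off and a new version is appended. A simplified sketch of that idea (the column names valid_from/valid_to and the in-memory history are illustrative, not the warehouse's actual schema):

```ruby
require "time"

# Sketch of SCD Type 2 versioning: when a tracked row changes, the
# current version is closed (valid_to set) and a new open-ended version
# is appended, preserving the full history of the row.
def scd2_upsert(history, key, incoming, now: Time.now.utc)
  current = history.find { |r| r[:key] == key && r[:valid_to].nil? }
  return history if current && current[:data] == incoming  # no change

  current[:valid_to] = now if current  # close the superseded version
  history + [{ key: key, data: incoming, valid_from: now, valid_to: nil }]
end

h = []
h = scd2_upsert(h, 1, { status: "active" }, now: Time.utc(2024, 1, 1))
h = scd2_upsert(h, 1, { status: "closed" }, now: Time.utc(2024, 2, 1))
# h now holds two versions: the first closed at 2024-02-01, the second open.
```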

Each of the source tables is imported directly - that is, each column in the source data is mapped to a matching column in the data warehouse.

Dancerace pipeline

This relies on a scheduled task running on the Riskfactor Windows machine to put daily ZIP archives of CSV files onto the data warehouse SFTP server. We aim to have Dancerace send the files directly to the data warehouse, but they have not yet supplied complete data for more than a month, so that work is on hold.

These files are first read and decompressed by the TF-DW pipeline from SFTP and saved to the Azure data lake (stored in an Azure Storage Container) as individual CSVs. Each of the CSVs is then processed to import the data. The source data is not especially helpful for importing: it is not possible to identify changes to individual rows, as there is nothing we can use as a primary key (not even composite columns). This means we cannot differentiate between a row being deleted and a row being updated. Therefore, with regard to SCD Type 2, any update to a row results in a deletion and an insertion as normal, but there is no way to link these two operations.
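With no usable key, rows can only be compared as whole values, so each import reduces to a set difference between the previous and current snapshots. A sketch of that limitation (the row data is made up for illustration):

```ruby
# Sketch of the keyless import: with no primary key, rows are compared
# as whole values, so an updated row appears as one delete plus one
# unrelated insert - the two cannot be linked.
def diff_keyless(previous, current)
  deletes = previous - current   # rows no longer present
  inserts = current - previous   # rows not seen before
  { deletes: deletes, inserts: inserts }
end

old_rows = [["ACME", "100.00"], ["BETA", "250.00"]]
new_rows = [["ACME", "150.00"], ["BETA", "250.00"]]
diff_keyless(old_rows, new_rows)
# => { deletes: [["ACME", "100.00"]], inserts: [["ACME", "150.00"]] }
# An update to ACME's value is indistinguishable from a delete + insert.
```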

Each of the source files is imported directly - that is, each column in the source data is mapped to a matching column in the data warehouse.

Sentinel 3:1 pipeline (called Sentinel321)

This relies on a scheduled task running on the Riskfactor Windows machine to convert daily SQL Server backups from the three Sentinel instances into the correct format and transfer them to the data warehouse SFTP server.

These files are first read by the TF-DW pipeline from SFTP into a local data lake (stored on TF-DW). Then each of the 12 databases (4 per instance) is restored into a local SQL Server instance. Once the data is in place on TF-DW, the pipeline merges the instances into the data warehouse database under the sentinel321 schema.

This merging process extracts the data required for all of the relationships in the existing semantic model that is used to query and report on Sentinel data in Power BI. Each of the tables in the semantic model is imported directly - that is, each column in the source data is imported directly into a matching column in the data warehouse.

To avoid conflicts in the data, the actual SQL relationships are defined using additional columns that the pipeline adds to each table. Each table has an additional UUIDv4 column called tfdw_sentinel_id. This is used as the primary key (so it is unique across all three instances) and is referenced by other tables. For example, rows in the CustomerVulnerability table reference the AgreementTable via the tfdw_sentinel_agreement_table_id column.
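The reason a generated UUID is needed is that the same source key can occur in more than one instance. A sketch of the idea using Ruby's SecureRandom (illustrative only; the real pipeline's keying code may differ):

```ruby
require "securerandom"

# Sketch of the merge keying: every imported row gets its own UUIDv4
# tfdw_sentinel_id, so rows stay unique even when the three Sentinel
# instances reuse the same source ids.
def assign_merge_ids(rows)
  rows.map { |row| row.merge(tfdw_sentinel_id: SecureRandom.uuid) }
end

# Two different instances can both contain an agreement with source id 1:
agreements = assign_merge_ids([{ source_id: 1 }, { source_id: 1 }])
agreements.map { |r| r[:tfdw_sentinel_id] }.uniq.size  # => 2, no collision
```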

Additionally, each row in each table is marked with two source columns that record the original Sentinel instance database that held the data: tfdw_metadata_sentinel_source_instance & tfdw_metadata_sentinel_source_db.

As the primary purpose of the Sentinel 3:1 pipeline is to hold the merged instance data before transferring to Leasepath, no SCD process is in place. Instead, all Sentinel instance data is imported fresh each day.

Deployment

Our deployment procedure executes commands over an SSH connection. The command itself is quite simple: ./bin/deploy $GIT_HASH.

This checks out a specific version of the code, archives it into a ZIP file and transfers it to the TF-DW server. It is then unzipped into a release directory before configuration details are copied over and the required libraries are installed and updated. Once that is complete and successful, we update the symlink to make the release live and log the deployment to track which version of the code is on the server. Rollbacks are simply deployments of a different version of the code.
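The symlink swap at the end is what makes the cut-over (and any rollback) effectively instant. A sketch of that final step under assumed paths (a POSIX-style "releases/<hash>" layout plus a "current" symlink; the real layout lives in bin/deploy and may differ):

```ruby
require "fileutils"
require "tmpdir"

# Sketch of the final deployment step: create a versioned release
# directory, then repoint the "current" symlink via a rename, which is
# atomic on POSIX filesystems. Paths here are illustrative.
def activate_release(root, git_hash)
  release_dir = File.join(root, "releases", git_hash)
  FileUtils.mkdir_p(release_dir)         # the unzipped release would live here
  tmp_link = File.join(root, "current.tmp")
  FileUtils.rm_f(tmp_link)
  File.symlink(release_dir, tmp_link)    # build the new link off to the side
  File.rename(tmp_link, File.join(root, "current"))  # atomic swap
  release_dir
end

Dir.mktmpdir do |root|
  activate_release(root, "abc123")
  File.readlink(File.join(root, "current"))  # points at releases/abc123
end
```

Rolling back is then just calling the same step with an earlier hash, which repoints the symlink without touching the other release directories.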