Overview
The cdm_Archive_to_STG notebook is a critical component of the cdm_today and cdm_Archive pipeline. Its primary role is to create a staging table representing a point-in-time snapshot of source data, which the cdm__Archive_upsert notebook then uses to accurately update the main dimension table.
This staging layer acts as a buffer between raw archived data and the SCD2 logic, ensuring that new records and changes to existing records are processed cleanly and consistently. By dynamically rewriting queries to pull from archived Parquet files, the notebook guarantees historical accuracy and prepares data for controlled upsert operations.
Why Do We Need It?
In modern data warehousing, Slowly Changing Dimension Type 2 (SCD2) logic is essential for maintaining historical accuracy in dimension tables. However, the upsert process requires a stable, snapshot-based dataset to compare against existing records. Without a staging layer, you risk:
- Incorrect versioning: Changes might overwrite historical data instead of creating new versions.
- Data drift: Queries against live tables can return inconsistent results during pipeline execution.
- Compliance gaps: Auditors need point-in-time data for regulatory reporting.
This notebook solves these challenges by:
- Generating archive-aware queries that pull data from historical storage rather than live tables.
- Creating a staging table in Delta format for efficient comparison and upsert operations.
- Preserving snapshot dates for auditability and downstream analytics.
What Happens Behind the Scenes?
1. Parameter Setup
The notebook accepts pipeline parameters such as:
- `Datamart` (e.g., `project`)
- `Viewname` (e.g., `core_details`)
- `Archive_date` (defaults to today if not provided)
- `archive_run` (controls whether to use a specific date or today)
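As an illustration, the parameter handling might look like the sketch below. The variable names mirror the parameters above, but the defaulting logic is an assumption based on the `archive_run` description, not the notebook's exact code.

```python
from datetime import date

# Pipeline parameters (hypothetical values; in practice these arrive via
# the notebook's parameter cell, populated by the pipeline run).
Datamart = "project"
Viewname = "core_details"
Archive_date = ""          # empty when the pipeline passes no explicit date
archive_run = "today"      # "today" or a flag selecting a specific date

# Default to today's date when no explicit Archive_date is supplied.
if archive_run == "today" or not Archive_date:
    Archive_date = date.today().strftime("%Y-%m-%d")
```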
2. Dynamic Query Rewriting
- Reads the original SQL query from the Silver/sql container.
- Replaces `FROM` and `JOIN` clauses with OPENROWSET paths pointing to archived Parquet files in the Archive container.
- Substitutes `getdate()` with the `Archive_date` for accuracy.
This ensures the query retrieves historical data rather than live data.
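As an illustration, the rewrite could be implemented with simple string substitution. Everything below (table names, container paths, the regular expression) is hypothetical and only sketches the technique:

```python
import re

Datamart = "project"          # hypothetical; see the parameter sketch above
Archive_date = "2025-01-31"   # hypothetical snapshot date

# Original Silver query as it might be read from the Silver/sql container.
original_sql = """
SELECT c.CUSTOMER_ID, c.EXPIRY_DATE, getdate() AS LOAD_DATE
FROM dbo.core_details c
JOIN dbo.customer_status s ON s.CUSTOMER_ID = c.CUSTOMER_ID
"""

# Hypothetical OPENROWSET template pointing at the Archive container.
archive_path = (
    "https://<storage-account>.dfs.core.windows.net/archive/"
    f"{Datamart}/{{table}}/{Archive_date}/*.parquet"
)

def to_openrowset(match: re.Match) -> str:
    keyword, table, alias = match.group(1), match.group(2), match.group(3)
    bulk = archive_path.format(table=table)
    return f"{keyword} OPENROWSET(BULK '{bulk}', FORMAT = 'PARQUET') AS {alias}"

# Redirect FROM/JOIN clauses to archived Parquet files and pin getdate()
# to the snapshot date so the query is fully point-in-time.
rewritten_sql = re.sub(r"(FROM|JOIN)\s+dbo\.(\w+)\s+(\w+)", to_openrowset, original_sql)
rewritten_sql = rewritten_sql.replace("getdate()", f"'{Archive_date}'")
```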
3. Execute Snapshot Query
- Runs the modified query against the Silver Lakehouse engine.
- Returns a DataFrame representing the snapshot of the data for the specified date.
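One way this step could look, assuming the engine is reached over ODBC (the driver, endpoint, and authentication settings are placeholders, not the notebook's actual configuration):

```python
import pandas as pd
import pyodbc

rewritten_sql = "SELECT 1 AS CUSTOMER_ID"  # placeholder; see the rewrite sketch above

# Hypothetical connection to the Silver Lakehouse SQL endpoint.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<silver-sql-endpoint>;Database=Silver;"
    "Authentication=ActiveDirectoryIntegrated;"
)

# Execute the rewritten query; the result is the point-in-time snapshot.
snapshot_df = pd.read_sql(rewritten_sql, conn)
```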
4. Prepare Data for Staging
- Converts date columns (e.g., `EXPIRY_DATE`) to proper types.
- Transforms the DataFrame into a PyArrow table for Delta Lake compatibility.
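A sketch of this preparation step, assuming a pandas DataFrame called `snapshot_df` (the column values are illustrative):

```python
import pandas as pd
import pyarrow as pa

# Illustrative snapshot; in the notebook this is the query result above.
snapshot_df = pd.DataFrame(
    {"CUSTOMER_ID": [1, 2], "EXPIRY_DATE": ["2025-01-31", "2025-06-30"]}
)

# Cast string dates to proper date objects so Delta stores them as dates.
snapshot_df["EXPIRY_DATE"] = pd.to_datetime(snapshot_df["EXPIRY_DATE"]).dt.date

# Convert to a PyArrow table, the input format the delta-rs writer expects.
arrow_table = pa.Table.from_pandas(snapshot_df, preserve_index=False)
```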
5. Write to Staging Table
- Writes the snapshot data to the stg schema in the Silver Lakehouse as a Delta table.
- Uses `overwrite` mode to ensure the staging table reflects the latest snapshot run.
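With the `deltalake` (delta-rs) package, the write might look as follows; the ABFSS path and table naming are assumptions:

```python
import pyarrow as pa
from deltalake import write_deltalake

# Illustrative table; in the notebook this is the PyArrow table built above.
arrow_table = pa.table({"CUSTOMER_ID": [1, 2]})

# Hypothetical path to the stg schema in the Silver Lakehouse.
staging_path = (
    "abfss://silver@<storage-account>.dfs.core.windows.net/stg/project_core_details"
)

# Overwrite so the staging table always holds exactly one fresh snapshot.
write_deltalake(staging_path, arrow_table, mode="overwrite")
```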
Pipeline Context
This notebook runs before the SCD2 upsert process:
- Staging Table: Holds the snapshot data for comparison.
- Main Table: Maintains historical versions of records using SCD2 logic.
Why Is This Important?
- Auditability: Captures point-in-time data for compliance and reporting.
- Accuracy: Provides a clean snapshot for SCD2 processing.
- Performance: Reduces complexity by isolating snapshot logic from upsert logic.
Key Benefits
- Automated archive query generation.
- Seamless integration with Silver Lakehouse.
- Delta format for efficient upsert operations.
- Supports historical reporting and compliance requirements.
Best Practices
- Schedule this notebook immediately before the SCD2 upsert job to ensure the snapshot is fresh.
- Validate archive paths regularly to avoid broken queries.
- Monitor row counts in the staging table to detect anomalies before the upsert runs; a simple guard is sketched below.
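For instance, a lightweight row-count guard could run right after the staging write (the path and threshold are illustrative, not part of the notebook):

```python
from deltalake import DeltaTable

# Hypothetical staging table path; adjust to your Lakehouse layout.
staging_path = "abfss://silver@<storage-account>.dfs.core.windows.net/stg/project_core_details"

row_count = DeltaTable(staging_path).to_pyarrow_table().num_rows

# Fail fast if the snapshot looks anomalous before the SCD2 upsert runs.
MIN_EXPECTED_ROWS = 1
if row_count < MIN_EXPECTED_ROWS:
    raise ValueError(f"Staging snapshot has only {row_count} rows; aborting upsert.")
```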