The Senior DE Architect: Pipelines vs. Platforms
Beginners learn how to move data from A to B. Seniors build platforms that allow data to move reliably from anywhere to everywhere. This guide focuses on the “Senior Architecture” that prevents a pipeline from becoming a “Big Ball of Mud.”
🏗️ 1. The Core Shift: From ETL to ELT
In a modern Senior DE’s world, raw data is loaded first, and the “Transform” (the ‘T’ in ETL) happens afterwards inside the Data Warehouse (BigQuery, Snowflake, Redshift), typically as SQL managed by dbt (data build tool).
Why Seniors Love ELT:
- Scalability: The Data Warehouse distributes work across many nodes, so it handles parallel processing better than a single-process Python script.
- SQL-First: Everyone (Analyst, Data Scientist, DE) speaks SQL.
- Reusability: You don’t have to rewrite the transformation logic for every new source.
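The load-then-transform order can be sketched in a few lines. This is only an illustration: Python's built-in `sqlite3` stands in for the warehouse, and the table and column names (`raw_orders`, `orders_clean`) are hypothetical. In a real project the `SELECT` would more likely live in a dbt model.

```python
import sqlite3

# sqlite3 is a stand-in for the warehouse; all names here are made up.
conn = sqlite3.connect(":memory:")

# Extract + Load: land the source rows untouched, everything as text.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, country TEXT)")
raw_rows = [("1", "20.00", "de"), ("2", "5.00", "US"), ("3", "12.50", "us")]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# Transform: done in SQL, inside the engine, after the load.
conn.execute("""
    CREATE TABLE orders_clean AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount AS REAL)      AS amount,
           UPPER(country)            AS country
    FROM raw_orders
""")

revenue_by_country = dict(conn.execute(
    "SELECT country, SUM(amount) FROM orders_clean GROUP BY country"
).fetchall())
print(revenue_by_country)
```

Because the raw table is kept, the transformation can be changed and re-run without going back to the source system.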
🏗️ 2. The Medallion Architecture: Keeping it Clean
A Senior doesn’t just dump data into a table. They use the Bronze-Silver-Gold framework:
- Bronze (Raw): 1:1 copy of the source data. No changes. If something breaks in the future, you can re-run everything from here.
- Silver (Cleaned): Data is normalized, types are corrected, and duplicates are removed. The “Source of Truth.”
- Gold (Business): Highly optimized tables (star or snowflake schemas) ready for BI tools and ML models.
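As a rough sketch of how the three layers relate, the example below uses plain Python lists in place of real tables. The sample rows and the helper names `to_silver` and `to_gold` are assumptions made for illustration only.

```python
from collections import defaultdict

# Bronze: rows exactly as received (hypothetical export), duplicates and all.
bronze = [
    {"id": "1", "store": " berlin ", "amount": "10.0"},
    {"id": "1", "store": " berlin ", "amount": "10.0"},  # duplicate delivery
    {"id": "2", "store": "Paris", "amount": "4.5"},
]

def to_silver(rows):
    """Fix types, normalize text, drop duplicate ids (first occurrence wins)."""
    seen, silver = set(), []
    for row in rows:
        key = int(row["id"])
        if key in seen:
            continue
        seen.add(key)
        silver.append({
            "id": key,
            "store": row["store"].strip().title(),
            "amount": float(row["amount"]),
        })
    return silver

def to_gold(rows):
    """Aggregate to the shape a dashboard would read: revenue per store."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["store"]] += row["amount"]
    return dict(totals)

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Note that `bronze` is never modified: if the cleaning rules in `to_silver` turn out to be wrong, both downstream layers can be rebuilt from it.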
🏗️ 3. The “Big Three” of Scalable Engineering
1. Data Modeling (Dimensional Modeling)
Seniors master the Star Schema (Fact tables and Dimension tables).
- Facts: Events (Sales, Clicks).
- Dimensions: Details (Product name, Store location).
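A minimal star schema might look like the sketch below, again using `sqlite3` as a stand-in engine. The tables `fact_sales`, `dim_product`, and `dim_store` and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimensions: descriptive details, one row per product or store.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);

    -- Fact: one row per sale event, pointing at the dimensions by key.
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        quantity   INTEGER,
        amount     REAL
    );

    INSERT INTO dim_product VALUES (1, 'Espresso'), (2, 'Latte');
    INSERT INTO dim_store   VALUES (10, 'Berlin'), (20, 'Paris');
    INSERT INTO fact_sales  VALUES
        (100, 1, 10, 2, 5.0),
        (101, 2, 10, 1, 3.5),
        (102, 1, 20, 3, 7.5);
""")

# A typical BI query: join the fact to a dimension, then aggregate.
sales_by_product = conn.execute("""
    SELECT p.product_name, SUM(f.quantity), SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()
print(sales_by_product)
```

The fact table stays narrow (keys and measures), while names and locations live once in the dimensions.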
2. Idempotency (The “Restart” Rule)
A Senior pipeline must be idempotent: running it twice with the same input leaves the target in the same state as running it once. If a job fails halfway through, you should be able to restart it without creating duplicate records.
- ✅ Senior Move: Use `INSERT OVERWRITE` or `MERGE` instead of just `INSERT`.
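One way to get the restart behavior is to replace a whole partition inside a single transaction. The sketch below emulates an overwrite with delete-then-insert in `sqlite3`, since SQLite has no `INSERT OVERWRITE`. The table `daily_sales` and the helper `load_partition` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (sale_date TEXT, order_id INTEGER, amount REAL)")

def load_partition(conn, sale_date, rows):
    """Replace one day's data atomically: delete the partition, then insert it."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM daily_sales WHERE sale_date = ?", (sale_date,))
        conn.executemany(
            "INSERT INTO daily_sales VALUES (?, ?, ?)",
            [(sale_date, order_id, amount) for order_id, amount in rows],
        )

batch = [(1, 10.0), (2, 4.5)]
load_partition(conn, "2024-01-01", batch)
load_partition(conn, "2024-01-01", batch)  # simulated retry of the same job

row_count = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
print(row_count)
```

With a plain `INSERT`, the retry would have doubled the rows. Here the second run simply rewrites the same partition.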
3. Data Quality (The “Circuit Breaker”)
Don’t wait for the CEO to tell you the data is wrong.
- Great Expectations: Automate tests like “Is this column always positive?” or “Are there any NULLs in my primary key?”
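The two example tests above can be sketched in plain Python to show the circuit-breaker idea. This is deliberately not the Great Expectations API. The names `DataQualityError`, `null_rows`, `non_positive_rows`, and `validate` are all hypothetical.

```python
class DataQualityError(Exception):
    """Raised to stop the pipeline before bad rows reach downstream tables."""

def null_rows(rows, column):
    """Rows where the column is missing or None."""
    return [r for r in rows if r.get(column) is None]

def non_positive_rows(rows, column):
    """Rows where the column has a value that is zero or negative."""
    return [r for r in rows if r.get(column) is not None and r[column] <= 0]

def validate(rows):
    """Run every check; raise if any check found offending rows."""
    failures = {
        "order_id must not be NULL": null_rows(rows, "order_id"),
        "amount must be positive": non_positive_rows(rows, "amount"),
    }
    failures = {name: bad for name, bad in failures.items() if bad}
    if failures:
        raise DataQualityError(f"{len(failures)} check(s) failed: {sorted(failures)}")
    return rows

good = [{"order_id": 1, "amount": 10.0}]
bad = [{"order_id": None, "amount": 10.0}, {"order_id": 2, "amount": -3.0}]

validate(good)  # passes silently
try:
    validate(bad)
    tripped = False
except DataQualityError as err:
    tripped = True
    print(err)
```

Raising an exception, rather than logging a warning, is what makes this a circuit breaker: the orchestrator marks the task as failed and downstream tables are not refreshed with bad data.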
🏗️ 4. The Toolset: When to use What?
| Requirement | The Senior Tool |
|---|---|
| Simple Scheduling | Airflow / Dagster |
| Massive Scale | PySpark |
| Streaming Data | Kafka / Flink |
| SQL Transformations | dbt |
| Local Development | DuckDB |
🚀 The Senior’s “No-Go” List
- Don’t use Python for everything: If you can do it in SQL inside the Warehouse, do it there. It’s often faster and cheaper, because the data never has to leave the Warehouse.
- Don’t hardcode paths: Use a data catalog (like Unity Catalog or the AWS Glue Data Catalog) to manage your data assets.
- Don’t ignore the Cost: Every byte you store and every second of compute costs money. A Senior optimizes for ROI, not just performance.
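The “don’t hardcode paths” rule can be illustrated with a toy in-memory registry. This is only a sketch of the idea and not the Unity Catalog or Glue API. The `CATALOG` dictionary, the bucket name, and the `resolve` helper are hypothetical.

```python
# Hypothetical registry: logical table name -> physical location.
CATALOG = {
    "sales.bronze.orders": "s3://example-bucket/bronze/orders/",
    "sales.silver.orders": "s3://example-bucket/silver/orders/",
}

def resolve(table_name, catalog=CATALOG):
    """Look up where a table lives; pipeline code only knows the logical name."""
    try:
        return catalog[table_name]
    except KeyError:
        raise LookupError(f"Unknown table: {table_name!r}") from None

silver_path = resolve("sales.silver.orders")
print(silver_path)
```

If the storage layout changes, only the catalog entry is updated and the pipeline code stays the same.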