Case Study 01 · TheGuarantors · Fintech / Housing

Centralized
dbt Platform

Built TheGuarantors' first analytics engineering platform from scratch — replacing fragmented Excel exports and competing metric definitions with a single, trusted data infrastructure that the whole company relied on.

Company
TheGuarantors
Timeline
June 2022 – Present
Role
Senior Analytics Engineer
(Previously Data Engineer / Senior Data Analyst)
Stack
dbt · Databricks · SQL · Python · Power BI · Atlan
01

The Problem

When I joined TheGuarantors, the company had grown quickly but its data infrastructure hadn't kept pace. Marketing pulled CSV exports from eight different platforms every Monday morning and spent four to five hours stitching them together in Excel. Product managers waited three to five days for experiment results because analysts were backlogged with data extraction requests.

The more critical problem was trust. Different departments calculated the same metrics differently. "Customer acquisition cost" meant one thing to Finance and something else entirely to Marketing. Leadership meetings regularly stalled while people debated whose numbers were right — which meant decisions were slow, and the underlying data problems kept compounding.

There was no centralized warehouse, no documented data models, and no single source of truth. The company needed an analytics engineering layer — not more dashboards on top of messy data, but clean, tested, documented data models that everyone could trust.

"Different departments calculated the same metrics differently. Leadership meetings regularly stalled while people debated whose numbers were right."

02

Architecture & Approach

The core principle was simple: build from the source outward, and don't expose anything to the business until it has tests on it.

Data Flow — Source to Consumption

Sources (Salesforce · Marketing APIs · Internal DBs · Event Streams)
→ stg_* staging models
→ int_* intermediate models
→ mart_* business marts
→ Power BI / Analysts

I structured the dbt project in three layers. Staging models are source-specific: one model per source table, renaming columns to consistent conventions and casting types. Nothing else — no business logic here. Intermediate models handle reusable transformations: sessionization logic, identity resolution (mapping anonymous IDs to authenticated user IDs), and common business rules. Mart models are the business-facing layer: Kimball-style fact and dimension tables designed around the questions real teams ask.
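To illustrate the staging convention, a typical staging model looks roughly like this — a sketch only, with hypothetical source, table, and column names:

```sql
-- models/staging/stg_salesforce__opportunities.sql
-- Illustrative sketch: source, table, and column names are hypothetical.
with source as (

    select * from {{ source('salesforce', 'opportunity') }}

),

renamed as (

    select
        id                              as opportunity_id,
        accountid                       as account_id,
        cast(amount as decimal(18, 2))  as amount_usd,
        cast(closedate as date)         as close_date,
        stagename                       as stage_name
    from source
    where isdeleted = false

)

select * from renamed
```

Renaming, casting, and source filtering only — any business logic at this layer would leak into every downstream model.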

Every mart model gets a full suite of dbt tests before it's promoted to production: uniqueness, not-null, referential integrity, and custom threshold tests for business logic. If a test fails in CI, the model doesn't deploy.
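In schema.yml terms, the test policy looks roughly like this — model and column names are hypothetical, and the range test assumes the dbt-utils package:

```yaml
# models/marts/schema.yml -- illustrative; names are hypothetical
version: 2

models:
  - name: mart_customers
    description: "One row per customer, keyed on the ID assigned at account creation."
    columns:
      - name: customer_id
        description: >
          Unique identifier assigned at account creation,
          distinct from prospect IDs generated before signup.
        tests:
          - unique
          - not_null
      - name: lifetime_value_usd
        tests:
          - dbt_utils.accepted_range:  # threshold-style business-logic test
              min_value: 0
```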

On top of the model layer, I built an event sessionization pipeline that handled anonymous-to-user ID mapping — critical for attributing pre-signup behavior to converted customers. This gave product and marketing their first reliable view of the full customer funnel.
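The session-boundary rule at the heart of that pipeline can be sketched in a few lines of Python — the 30-minute timeout here is an assumed value for illustration, not the production setting:

```python
from datetime import datetime, timedelta

# Assumed session timeout; the real cutoff is a business-rule decision.
SESSION_TIMEOUT = timedelta(minutes=30)

def sessionize(event_times: list[datetime]) -> list[int]:
    """Assign a session number to each event timestamp for one user.

    Events are sorted, and a gap longer than SESSION_TIMEOUT starts a
    new session -- the same boundary rule the dbt models express with
    window functions.
    """
    session_ids = []
    session_id = 0
    prev = None
    for ts in sorted(event_times):
        if prev is None or ts - prev > SESSION_TIMEOUT:
            session_id += 1
        session_ids.append(session_id)
        prev = ts
    return session_ids
```

In the warehouse this runs per user via a partitioned window, but the logic is the same: compare each event to the previous one and open a new session when the gap exceeds the timeout.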

03

How We Built It

The platform was built in five phases over roughly four months, with each phase delivering something usable before moving to the next. The goal was to have working infrastructure in stakeholders' hands as early as possible — not to design a perfect system in isolation.

01
Data Audit & Source Inventory
Catalogued all active data sources: Salesforce, internal PostgreSQL databases, marketing APIs, and event streams. Identified 15+ distinct sources feeding into manual reports, documented their schemas, update frequencies, and the business questions each was supposed to answer. Surfaced three sources with known reliability issues that needed fixing before modeling.
02
Staging Layer — Source-Faithful Models
Built one dbt staging model per source table: rename columns to consistent conventions, cast types, apply source filters. No business logic at this layer. Added uniqueness and not-null tests on every primary key. This layer ran in CI from day one — any source schema change would fail fast rather than silently corrupt downstream models.
03
Intermediate Layer — Event Pipeline & Identity Resolution
Built the event sessionization pipeline and anonymous-to-user ID mapping here. This was the most complex phase — resolving pre-signup behavioral events to authenticated users required careful business rule definition for edge cases (multi-device users, session timeouts, account merges). The resulting models gave the company its first complete customer behavioral timeline.
04
Mart Layer — Business-Facing Dimensional Models
Built Kimball-style fact and dimension tables organized around business domains: customer acquisition, deal pipeline, marketing attribution, and operational KPIs. Each mart model had a full test suite plus human-readable descriptions in schema.yml, imported into Atlan so any stakeholder could look up a metric definition without opening dbt.
05
Governance, Monitoring & Rollout
Deployed the anomaly detection framework, established the data change process (PR review + downstream notification for mart changes), and migrated all existing Power BI reports onto the new mart layer. Ran old and new in parallel for three weeks to validate consistency, then deprecated the manual Excel processes as each team confirmed the new data matched expectations.
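The parallel-run validation in phase 05 amounts to a metric-by-metric comparison between the legacy output and the new marts. A minimal sketch, with hypothetical metric names and a 1% tolerance chosen purely for illustration:

```python
def compare_reports(old: dict, new: dict, tolerance: float = 0.01) -> dict:
    """Flag metrics whose legacy and new values disagree.

    Returns {metric: (old_value, new_value)} for every metric that is
    missing from the new marts or differs by more than `tolerance`
    relative to the legacy value.
    """
    mismatches = {}
    for metric, old_val in old.items():
        new_val = new.get(metric)
        if new_val is None:
            mismatches[metric] = (old_val, None)
        elif abs(new_val - old_val) > tolerance * max(abs(old_val), 1e-9):
            mismatches[metric] = (old_val, new_val)
    return mismatches
```

Each team signed off once its metrics came back clean for the parallel-run window, which is what made deprecating the Excel processes uncontroversial.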
04

Implementation Detail

A representative example: the anomaly detection layer I built used a rolling-average approach inside dbt, with a downstream Python script firing alerts when threshold deviations were detected. Here's a simplified version of the core pattern:

models/monitoring/anomaly_detection.sql
-- Rolling 7-day baseline vs. today's actual
WITH daily_metrics AS (
  SELECT
    metric_date,
    metric_name,
    metric_value,
    AVG(metric_value) OVER (
      PARTITION BY metric_name
      ORDER BY metric_date
      ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
    ) AS rolling_avg,
    STDDEV(metric_value) OVER (
      PARTITION BY metric_name
      ORDER BY metric_date
      ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
    ) AS rolling_stddev
  FROM {{ ref('mart_daily_metrics') }}
),

flagged AS (
  SELECT
    *,
    ABS(metric_value - rolling_avg) / NULLIF(rolling_stddev, 0)
      AS z_score
  FROM daily_metrics
)

SELECT *
FROM flagged
WHERE
  metric_date = CURRENT_DATE()
  AND z_score > 2.5  -- flag 2.5σ deviations
  AND rolling_avg IS NOT NULL

This model runs nightly. When it returns rows, a downstream Python script fires a Slack alert to the relevant team with the metric name, expected range, and actual value. It was this framework that caught the 40% drop in main-page traffic — the product team received the alert, diagnosed the issue, and deployed a fix within 24 hours.
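A minimal sketch of that alerting step, assuming flagged rows arrive as dicts keyed by the model's column names — the webhook URL is a placeholder, and the expected range is reconstructed from the same 2.5σ threshold the model uses:

```python
import json
from urllib import request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder

def format_alert(row: dict) -> str:
    """Build the alert text: metric name, expected range, actual value."""
    lo = row["rolling_avg"] - 2.5 * row["rolling_stddev"]
    hi = row["rolling_avg"] + 2.5 * row["rolling_stddev"]
    return (
        f"Anomaly in {row['metric_name']} on {row['metric_date']}: "
        f"actual {row['metric_value']:,.0f}, "
        f"expected {lo:,.0f} to {hi:,.0f} (z = {row['z_score']:.1f})"
    )

def send_alert(row: dict) -> None:
    """POST the formatted message to a Slack incoming webhook."""
    payload = json.dumps({"text": format_alert(row)}).encode("utf-8")
    req = request.Request(
        SLACK_WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    request.urlopen(req)
```

Slack incoming webhooks accept exactly this shape of JSON payload; routing to "the relevant team" would be a lookup from metric name to channel-specific webhook.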

05

Data Quality & Governance

Shipping dbt models is the easy part. Getting an organization to trust them is harder. I approached governance in three ways.

First, every model had tests before anyone consumed it. I wrote a test coverage policy: staging models need uniqueness and not-null tests on primary keys; mart models need the full suite plus at least one business-logic custom test. CI blocks deployment on failures — no exceptions.

Second, I built metric definitions into the documentation itself. Every model in the mart layer has a description in schema.yml that defines the business rule in plain language — not just "customer_id" but "the unique identifier assigned at account creation, distinct from prospect IDs generated before signup." I then imported these into an Atlan catalog so non-technical stakeholders could search definitions without opening dbt.

Third, I established a data change process: any modification to a mart-layer model required a PR review and a Slack notification to downstream consumers. This sounds bureaucratic, but it's what separates a data platform that teams abandon after six months from one that becomes load-bearing infrastructure.

06

Results

90%
Metric Consistency
Reduction in cross-team metric inconsistencies after a single source of truth was established
−5 hrs
Weekly Prep Time
Per marketer, every week; across 12 marketers that's 60 hours/week returned to strategy
+20%
Marketing ROI
Driven by spend analysis that became possible once attribution data was reliable
$50k+
Revenue Protected
Estimated loss prevented by anomaly detection catching a 40% traffic drop in under 24 hours
−50%
Onboarding Time
New analyst ramp time reduced via searchable data catalog and documented models
+15%
On-Time Delivery
Project delivery improvement after the real-time KPI monitoring framework launched
07

Lessons Learned

  • 01 Start with the consumer, not the source. I initially modeled based on what the data looked like. The second iteration — starting with what business users actually needed to answer — was far more durable. The mart layer became stable because it was organized around business questions, not system structure.
  • 02 Documentation is a first-class deliverable. Every hour I spent writing clear model descriptions paid back tenfold in avoided Slack messages. Analysts stopped asking "what does this field mean?" and started asking better questions.
  • 03 Ship an imperfect v1 fast, then iterate. The first version of the platform had gaps and technical debt. But it was live in weeks, not months — and real usage surfaced the actual pain points far more accurately than requirements documents would have.
  • 04 Identity resolution is harder than it looks. The anonymous-to-user ID mapping work was the most complex part of the event pipeline. The edge cases — users switching devices, multiple accounts, session timeouts — required careful business rule definition before any SQL was written.