The Data Quality Dilemma in Adobe’s Data Collection

Navigating the multi-stage transformation pipeline: A strategic guide to managing data transformations across Adobe’s ecosystem

When working with Adobe’s data collection ecosystem, you quickly realize you’re not dealing with a simple ETL (Extract, Transform, Load) process. Instead, you’re managing what I call an E-T-T-T-L-T-L-T process – Extract, Transform, Transform, Transform, Load, Transform, Load, Transform. With transformation capabilities available at nearly every stage from Adobe Launch to Customer Journey Analytics, the question isn’t “can you clean your data at a particular stage?” but rather “should you?”

Let me share a strategic framework I’ve developed for managing data quality across Adobe’s collection pipeline, based on practical experience and lessons learned from scattered transformations.

First, Let’s Talk About “E-T-T-T-L-T-L-T”

Before we dive in, we need to address what I mean by this acronym and why traditional ETL thinking doesn’t apply here.

Breaking down the E-T-T-T-L-T-L-T:

  • E (Extract): Data is extracted from user interactions – the browser captures user behavior, page context, and interaction events
  • T (Transform 1): Adobe Launch applies business rules and validation
  • T (Transform 2): Dynamic Datastream filters and gates
  • T (Transform 3): Datastream server-side maps and normalizes
  • L (Load 1): Data is loaded into AEP Raw Dataset
  • T (Transform 4): Data Distiller transforms raw to curated
  • L (Load 2): Curated data is loaded/connected to CJA
  • T (Transform 5): CJA Dataviews apply semantic transformations

Why traditional ETL thinking fails here

Adobe Launch isn’t “extracting” in the traditional data warehouse sense – it’s capturing and generating event data based on user interactions. The extraction is passive (reading DOM, cookies, dataLayer), but the primary function is event production and business rule application.

“Loading” happens multiple times:

  1. Ingestion into AEP Raw Dataset – First persistent storage
  2. Materialization of Curated Dataset – Data Distiller creates new datasets
  3. Connection to CJA – Curated dataset made available to CJA (more of a “connect” than a “load”)

A more practical mental model:

While E-T-T-T-L-T-L-T accurately describes the technical flow, I find it more useful to think in terms of stages and their purposes:

  • Stage 1 (Launch): Collection & Business Rules
  • Stage 2 (Dynamic Datastream): Ingestion Filtering
  • Stage 3 (Datastream): Pre-Storage Normalization
  • Stage 4 (AEP): Dataset Curation
  • Stage 5 (CJA): Presentation Layer

Why this matters: The key insight isn’t perfect terminology – it’s recognizing that you have transformation capabilities at five distinct stages, and you need a strategy for which transformations belong where. That’s what this guide provides.

For the rest of this post, I’ll use the clearer stage-based terminology to avoid confusion, but remember: behind the scenes, it’s really an E-T-T-T-L-T-L-T process.

The multi-stage challenge

Adobe’s data collection architecture offers transformation capabilities at five distinct stages, all resting on one foundation:

  • Foundation: XDM Schema
  1. Adobe Launch (Client-Side) – Collection & Business Rules
  2. Dynamic Datastream Configuration – Ingestion Filtering
  3. Adobe Datastream (Server-Side) – Pre-Storage Normalization
  4. Adobe Experience Platform (AEP) – Dataset Curation
  5. Customer Journey Analytics (CJA) – Presentation Layer

The problem? When transformations are scattered across all these stages, troubleshooting becomes a nightmare. “Where did this field get modified?” becomes an archaeological expedition through five different systems.

The guiding principles

Before diving into specific strategies, let’s establish five foundational principles that should guide all data quality decisions:

1. Raw = Immutable

Your raw event data should be stored exactly as it arrives. This provides auditability, enables backfilling, and serves as your debugging foundation. No destructive changes at the raw layer, ever.

2. Single Source of Truth

All production reports and downstream processes should ideally read from one curated dataset layer (typically in AEP). This is your “golden” dataset that everyone trusts.

3. Transform Once, Present Many

Correct and standardize data at exactly one point in your pipeline (preferably the curated dataset layer). Reporting tools should only apply presentational or contextual adjustments.

4. Clear Ownership

Every layer needs a clear owner – whether that’s client-side developers, data engineering, or analytics teams. Every data issue should have an obvious responsible party.

5. Observability & Data Quality KPIs

Implement automated checks for completeness, schema conformance, duplicates, and latency. Set up alerts so issues surface immediately.

The strategic framework: What should happen where?

Let me break down each stage with clear guidance on its purpose and appropriate transformations:

The Foundation: XDM Schema as your data quality contract

Before you implement any transformations, you need a solid foundation. Your XDM (Experience Data Model) schema isn’t just documentation – it’s your first and most important line of defense for data quality.

Why XDM Schema Matters

Think of your XDM schema as a contract between all stages of your pipeline. It defines what data you expect to receive, what format it should be in, what constraints it must satisfy, and what’s required vs. optional.

This is your starting point. Everything else in your E-T-T-T-L-T-L-T pipeline either enforces this contract or builds upon it.
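To make the “contract” idea concrete, here is a minimal sketch in Python of what schema-level enforcement looks like: required fields, types, and constraints. The field names and the rule format are hypothetical illustrations, not Adobe’s actual XDM syntax – in practice the XDM schema itself enforces this at ingestion.

```python
# Sketch of the "schema as contract" idea: a minimal validator that checks
# required fields, types, and min-value constraints the way an XDM schema
# would. Field names and the rule format are hypothetical, not XDM syntax.

CONTRACT = {
    "eventType": {"type": str, "required": True},
    "timestamp": {"type": str, "required": True},
    "commerce.order.priceTotal": {"type": (int, float), "required": False, "min": 0},
}

def get_path(event: dict, dotted: str):
    """Resolve a dotted field path like 'commerce.order.priceTotal'."""
    node = event
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

def validate(event: dict) -> list[str]:
    """Return a list of contract violations (empty list = valid)."""
    errors = []
    for field, rule in CONTRACT.items():
        value = get_path(event, field)
        if value is None:
            if rule["required"]:
                errors.append(f"missing required field: {field}")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(f"wrong type for {field}")
        elif "min" in rule and value < rule["min"]:
            errors.append(f"{field} below minimum {rule['min']}")
    return errors
```

The point is that every downstream stage can assume these guarantees hold, instead of re-checking them.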

Stage 1: Adobe Launch (Collection & Business Rules)

Primary Purpose: Capture user interactions and apply business logic for event firing

What to do here:

  • Business rules and conditional logic – This is where you define what events fire when and under what conditions (e.g., “fire purchase event only when order confirmation page loads AND transaction ID exists”)
  • Schema validation (ensure required fields are present)
  • Minimal field sanitization (trim whitespace, basic normalization like lowercasing)
  • Timestamp generation
  • Consent/opt-out verification

What to avoid:

  • Heavy enrichment or lookups
  • Complex joins or concatenation
  • PII persistence
  • Resource-intensive operations (remember: this runs on your user’s device!)

Owner: Frontend/Tagging Team

Why this is your business rules home: Launch is where your business logic lives – the conditional firing of events based on user behavior and context. This is appropriate here because you’re capturing the business intent at the moment it happens. However, keep the execution lightweight to protect page performance.
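To illustrate the kind of logic that belongs here, the purchase-event rule from above (“fire only on the order confirmation page AND when a transaction ID exists”) can be sketched as follows. Launch rules are configured in its UI and run as JavaScript; this Python version only mirrors the decision logic, and the URL pattern and dataLayer field names are assumptions.

```python
# Conceptual sketch of a Launch-style firing rule: fire the purchase event
# only on the order-confirmation page AND when a transaction ID exists.
# URL pattern and dataLayer field names are hypothetical assumptions.

def should_fire_purchase(page_url: str, data_layer: dict) -> bool:
    on_confirmation = "/checkout/confirmation" in page_url
    has_transaction = bool(data_layer.get("transaction", {}).get("id"))
    return on_confirmation and has_transaction

def build_purchase_event(data_layer: dict) -> dict:
    """Minimal sanitization only: trim and lowercase, nothing heavier."""
    txn = data_layer["transaction"]
    return {
        "eventType": "commerce.purchases",
        "transactionID": str(txn["id"]).strip(),
        "currency": str(txn.get("currency", "")).strip().lower(),
    }
```

Note that the sanitization stays deliberately cheap – this code path runs on the user’s device on every page load.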

Stage 2: Dynamic Datastream Configuration (Ingestion Filtering)

Primary Purpose: Gatekeeping at the ingestion level and feature management

What to do here:

  • Filter obvious noise events (e.g. crawlers and bots)
  • Exclude deprecated events – When you’ve decided to stop tracking certain events (feature toggles, sunset features, deprecated tracking), filter them out here rather than removing instrumentation
  • Initial privacy masking (drop raw PII fields before forwarding)
  • Consent enforcement as a safety net

Owner: Tagging/Platform Team in collaboration with Privacy/Legal

Why use this stage? This is your bouncer – keeping bad data from ever entering your expensive storage and processing systems. It’s also perfect for managing the lifecycle of tracking: when a feature is toggled off or deprecated, you can exclude it here without touching Launch code, making it easier to re-enable if needed.
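The bouncer role can be sketched as a simple pass/fail gate. Dynamic Datastream rules are configured in Adobe’s UI, so this Python version only illustrates the filtering logic; the user-agent signatures and the deprecated-event list are illustrative assumptions.

```python
# Sketch of the "bouncer" role: drop bot traffic, deprecated event types,
# and non-consented events before they ever reach storage. Signatures and
# the deprecated list are illustrative, not an Adobe configuration format.

BOT_SIGNATURES = ("bot", "crawler", "spider", "headless")
DEPRECATED_EVENTS = {"legacy_checkout_v1"}

def passes_ingestion_filter(event: dict) -> bool:
    ua = event.get("userAgent", "").lower()
    if any(sig in ua for sig in BOT_SIGNATURES):
        return False  # obvious non-human traffic
    if event.get("eventType") in DEPRECATED_EVENTS:
        return False  # sunset feature: keep it out of the lake
    if not event.get("consent", {}).get("analytics", False):
        return False  # consent enforcement as a safety net
    return True
```

Because the gate lives in configuration rather than in Launch code, re-enabling a sunset event is a config change, not a deployment.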

Stage 3: Adobe Datastream (Pre-Storage Normalization)

Primary Purpose: Field mapping and fixed value assignment before data lands in storage

What to do here:

  • Field mapping – Map fields to Adobe Analytics, or map your client-side XDM to your server-side XDM (if they differ)
  • Set fixed values – Add constants like environment indicators, data source identifiers, or processing timestamps
  • Basic normalization to match schema expectations (e.g. date formats)
  • Server-side timestamp generation
  • Hashing/anonymization of PII before storage (if collected)
  • Lightweight enrichment (geo lookups, device lookups)

The Power and the Pitfall:

Datastream is extremely powerful – you can do extensive transformations here, including complex conditional logic with if/else functions. However, this is often a forgotten transformation layer, and therein lies the danger.

My recommendation: Keep it simple. Use Datastream for straightforward mappings and basic normalization only. Avoid deep nested if/else logic and complex business rules here. Why?

  • Discoverability: Transformations buried in Datastream configuration are easy to forget and hard to troubleshoot
  • Complexity creep: What starts as simple logic can quickly become unmanageable
  • Better alternatives exist: Complex transformations belong in AEP Data Distiller where they’re more visible, testable, and maintainable

The right balance: Think of Datastream as your translation layer – converting incoming data into your XDM schema structure with basic standardization. Save the heavy lifting for Stage 4.

Owner: Data Engineering / Platform Team

Why here? Server-side processing doesn’t impact user experience, and you’re standardizing data before it hits your lake. This is your translation layer – taking diverse input formats and creating consistency before storage.
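The pre-storage transformations above can be sketched in one small function. Datastream mappings are configured in Adobe’s mapping UI; this Python version only mirrors the logic (fixed values, a server-side timestamp, and one-way hashing of an email address), and the field names are assumptions.

```python
import hashlib
from datetime import datetime, timezone

# Sketch of pre-storage normalization: fixed values, a server-side
# timestamp, and SHA-256 hashing of an email address before it lands in
# the lake. Field names are hypothetical; Datastream configures this in
# its mapping UI rather than in code.

def normalize_for_storage(event: dict) -> dict:
    out = dict(event)
    out["dataSource"] = "web"           # fixed value: data source identifier
    out["environment"] = "production"   # fixed value: environment indicator
    out["serverTimestamp"] = datetime.now(timezone.utc).isoformat()
    if "email" in out:
        # one-way hash of the PII field; the raw address never hits storage
        raw = out.pop("email").strip().lower()
        out["emailHash"] = hashlib.sha256(raw.encode()).hexdigest()
    return out
```

Normalizing the address (trim, lowercase) before hashing matters: without it, the same person would produce different hashes across events.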

Stage 4: Adobe Experience Platform (Dataset Curation)

This stage of the framework applies if you work with Data Distiller.
Primary Purpose: Transform raw datasets into analysis-ready curated datasets

Here’s the reality of how AEP works in practice:

Raw Dataset: This is your immutable source – the data exactly as it arrives from data collection (after Datastream processing). This is what gets ingested into AEP.

Curated Dataset(s): Using Data Distiller, you transform your raw dataset into curated datasets specifically prepared for Customer Journey Analytics. Why? Because you typically don’t want all raw data flowing into CJA.

What to do here:

  • Data selection and filtering – Choose which events and fields CJA actually needs
  • Identity stitching via Identity Graph
  • Historical corrections and backfills when errors are discovered
  • Complex aggregations or pre-calculations for performance
  • Data type standardization and cleanup

Critical Consideration: Dataset Duplication and Licensing

When you maintain both raw and curated datasets with similar data, you’re effectively storing data twice. Always check your AEP license – you’re typically licensed for total volume/rows, so understand the cost implications:

  • Raw dataset: Full fidelity, all events
  • Curated dataset: Subset optimized for CJA
  • The delta between them is your “waste” from a storage perspective, but your “insurance” from an auditability perspective

Owner: Data Engineering + Analytics (joint ownership), with Governance oversight

Why this is your transformation hub: AEP provides Query Service and Data Distiller to handle the heavy lifting of transforming raw data into analysis-ready datasets. This is where you prepare data specifically for CJA’s consumption patterns, balancing completeness against cost and performance.
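Data Distiller jobs are written in SQL via Query Service; the Python sketch below only mirrors the curation logic so the shape of a raw-to-curated job is visible: keep only CJA-relevant event types and drop duplicate purchases with the same transaction ID inside a 5-minute window (the window and event names follow the worked example later in this post).

```python
# Sketch of a raw -> curated transformation: event selection plus
# deduplication of purchases (same transactionID within a 5-minute
# window). In practice this runs as SQL in Data Distiller / Query Service.

KEEP_EVENTS = {"purchase", "product_view", "add_to_cart"}
WINDOW_SECONDS = 5 * 60

def curate(raw_events: list[dict]) -> list[dict]:
    curated, last_seen = [], {}
    for evt in sorted(raw_events, key=lambda e: e["ts"]):
        if evt["eventType"] not in KEEP_EVENTS:
            continue  # diagnostic and other non-CJA events stay out
        if evt["eventType"] == "purchase":
            txn = evt.get("transactionID")
            prev = last_seen.get(txn)
            if prev is not None and evt["ts"] - prev < WINDOW_SECONDS:
                continue  # duplicate purchase inside the window
            last_seen[txn] = evt["ts"]
        curated.append(evt)
    return curated
```

The raw dataset stays untouched; the job materializes a new, smaller dataset, which is exactly the duplication/licensing trade-off described above.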

Stage 5: Customer Journey Analytics (Presentation Layer)

Primary Purpose: Semantic layer and self-service enablement

What to do here:

  • Derived fields – CJA’s powerful derived field functionality lets you create new dimensions and metrics without touching the lake
  • Calculated metrics for business KPIs
  • User-friendly component naming and descriptions
  • Report-specific filters and segments
  • Attribution model application
  • Component organization for ease of discovery

The Self-Service Perspective:

Here’s the critical mindset shift: CJA Dataviews are your semantic layer. You’re not changing the underlying data; you’re creating a business-friendly interface to technical data that flowed through your entire pipeline.

Think of it this way:

  • The lake contains technical field names like commerce.purchases.value
  • Dataviews translate these into business concepts like “Revenue”

Why this matters for user-friendliness:

When done well, Dataviews enable self-service analytics. Business users shouldn’t need to understand your technical implementation – they should work with familiar business terminology. However, there’s a balance to strike:

Do in Dataviews:

  • Business-friendly naming and descriptions
  • Context-appropriate calculated metrics
  • Common filtering scenarios as pre-built components
  • Attribution models aligned to business questions

Be careful about:

  • Over-transforming data that should have been fixed upstream
  • Creating so many variations that users don’t know which to use
  • Duplicating transformations across multiple Dataviews (use shared dimensions and metrics)
  • Using Dataviews as a band-aid for poor data quality upstream

Owner: Analytics/Reporting Team

The hard part: CJA’s flexibility is both its strength and its challenge. You can do many transformations here, but restraint is wisdom. Ask yourself: “Am I making this more self-service friendly, or am I compensating for problems that should be fixed in the curated dataset?” The former is good; the latter creates technical debt.
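As an example of a presentation-layer transformation that belongs here, consider the “Product Line” derived field from the workflow example later in this post. CJA derived fields are built in the Dataview rule builder (CASE WHEN-style logic); this Python sketch only mirrors that logic, and the SKU prefixes are hypothetical.

```python
# CASE WHEN-style derived field: map a technical SKU prefix to a
# business-friendly "Product Line" dimension. Prefixes are hypothetical;
# in CJA this lives in the Dataview's derived-field rule builder.

def product_line(sku: str) -> str:
    if sku.startswith("SHO-"):
        return "Shoes"
    elif sku.startswith("APP-"):
        return "Apparel"
    elif sku.startswith("ACC-"):
        return "Accessories"
    else:
        return "Other"
```

Note what this does not do: it doesn’t correct bad SKUs. If SKUs arrive malformed, that’s a curation (Stage 4) problem, not a Dataview problem.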

Practical Example: The Full E-T-T-T-L-T-L-T Workflow

Let’s walk through a realistic example of how data flows through your pipeline:

Extract – User Interaction:
  • User completes purchase on website
  • Browser captures event context (page URL, timestamp, user agent)
Transform 1 – Collection (Launch):
  • Launch evaluates business rules: “Is this checkout confirmation page AND does transaction object exist?”
  • If conditions met, fires purchase event with order details
  • Validates required fields (transaction ID, revenue, product array)
  • Checks consent status (user has analytics consent)
  • Sets client-side timestamp
  • Sends event to Dynamic Datastream
Transform 2 – Filtering (Dynamic Datastream):
  • Checks event type and source
  • Excludes deprecated “legacy_checkout_v1” events (feature sunsetted 3 months ago)
  • Excludes known bots and crawlers
  • Forwards purchase event (passes all filters) to Datastream
Transform 3 – Normalization (Datastream Server-Side):
  • Maps field names (depending on your setup, e.g. if you have one common continuous XDM schema), such as setting the ECID as the persistent ID
  • Sets fixed values: dataSource = "web", environment = "production"
  • Normalizes date formats
  • Hashes email addresses
Load 1 – Ingestion (AEP Raw Dataset):
  • Structured event ingested into AEP
  • Stored in raw dataset exactly as received from Datastream
  • Immutable record preserved for audit and replay
Transform 4 – Curation (AEP Data Distiller):

Raw Dataset → Curated Dataset transformation (e.g. runs hourly):

  • Filters to events needed for CJA: keeps purchase, product_view, add_to_cart; excludes diagnostic events
  • Identity stitching: merges anonymous web visitor with known customer via Identity Graph
  • Deduplication: removes duplicate purchase events (same transactionID within 5-minute window)
  • Business rule application: flags “valid_transaction” (revenue > 0, has shipping address)
  • Historical correction: applies retroactive fix for currency bug discovered last week
Load 2 – Connection (CJA Dataset):
  • “web_events_curated” dataset (smaller than raw, optimized for CJA)
  • Connected to CJA as data source
  • Available for Dataview configuration
Transform 5 – Presentation (CJA Dataviews):

Business users access curated data through friendly interface:

  • Derived field: “Product Line” extracted from SKU using CASE WHEN logic
  • Reference data join: adds product category, brand, and margin data from product catalog (lookup dataset)
  • Calculated metric: “Average Order Value” = Total Revenue / Number of Orders
  • Component renamed: Technical marketing.trackingCode appears as user-friendly “Campaign Source”
  • Attribution applied: Last-touch attribution model for revenue credit
  • Segment created: “High-Value Customers” (lifetime revenue > CHF1000)
  • Business users build reports using familiar terminology, never seeing technical implementation

Quick Wins: Control rules you can implement today

  1. XDM Schema Audit: Review your current schema and add type constraints, patterns, and min/max values where missing
  2. Business Rule Documentation: Document every conditional event firing rule in Launch with clear comments explaining the business logic (via the API, notes within Adobe Launch, or a manually maintained change-log)
  3. Required Fields Check: Alert when the completion rate for critical fields in the raw dataset falls below 100%
  4. Deprecation Log: Maintain a log of events excluded in Dynamic Datastream with reasons, dates, and re-enablement conditions
  5. Cardinality Monitoring: Alert on sudden spikes in unique page IDs, user IDs, or product SKUs (often indicates tagging errors)
  6. Dataview Naming Convention: Standardize component naming (e.g., all calculated metrics start with “Calc:”, all segments with “Seg:”)
  7. Storage Audit Dashboard: Monthly review of raw vs. curated dataset sizes, row counts, and cost projections
  8. Transformation Count by Stage: Track how many transformations exist at each stage (goal: most should be in AEP curation)
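Quick win #3 is the easiest to automate. Here is a minimal sketch of a completeness check over a batch of raw events; the critical-field list and the 100% threshold are the assumptions from the list above.

```python
# Sketch of quick win #3: flag critical fields whose completion rate in a
# batch of raw events falls below the threshold (100% by default).
# The critical-field names are illustrative assumptions.

CRITICAL_FIELDS = ("transactionID", "eventType", "timestamp")

def completeness(events: list[dict], field: str) -> float:
    """Share of events where the field is present and non-empty."""
    if not events:
        return 1.0
    filled = sum(1 for e in events if e.get(field) not in (None, ""))
    return filled / len(events)

def completeness_alerts(events: list[dict], threshold: float = 1.0) -> list[str]:
    return [
        f"{field}: {completeness(events, field):.1%} complete"
        for field in CRITICAL_FIELDS
        if completeness(events, field) < threshold
    ]
```

Wired to a daily query against the raw dataset, this turns silent tagging regressions into same-day alerts.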

Conclusion: Master Your E-T-T-T-L-T-L-T Pipeline

Adobe’s data collection ecosystem gives you unprecedented flexibility to transform data at multiple stages. The key is not to avoid using these capabilities, but to use them strategically with clear purpose, proper placement, and thorough documentation.

The Golden Rules:

  1. Start with schema – Define your data quality contract first
  2. Keep raw data as immutable as possible
  3. Transform structural issues in AEP curation – that’s your powerhouse
  4. Use CJA for semantic translation, not data cleanup
  5. Document every transformation in your registry
  6. Monitor your costs: raw + curated = 2x storage

By following this framework, you’ll create a maintainable, auditable, and cost-effective data pipeline that delivers clean, trustworthy data to your analytics users – without the archaeological expeditions to figure out where transformations are hiding.

Most importantly, you’ll enable true self-service analytics where business users work with familiar concepts, never needing to understand the eight-step technical pipeline that makes it all work. That’s the ultimate goal: data quality that’s invisible to end users because it just works.
