How to Streamline ETL Processes in Travel and Hospitality with Luigi?

Updated on August 28, 2023

Read — 5 minutes

ETL in Data Warehousing: A 2026 Guide

The ETL process in data warehousing extracts raw data from source systems, transforms it to match a target schema and business rules, then loads it into a central warehouse for analytics and reporting. It is the pipeline layer that turns incompatible, fragmented data from databases, SaaS tools, APIs, and IoT devices into something analysts and AI models can reliably use.

Every data team eventually hits the same wall. Sales data lives in Salesforce. Finance pulls from a different system entirely. The logistics team has its own database, built on a schema nobody outside the department fully understands. And someone, usually a data engineer who has been around long enough to know what is coming, has to figure out how to make all of it usable in one place.

That is not a new problem. It has been the core challenge of enterprise data management for decades. What has changed is the scale. Global data creation is on track to exceed 220 zettabytes in 2026. More sources, more formats, faster accumulation. The teams responsible for managing that data need infrastructure that does not buckle under it.

ETL, which stands for extract, transform, load, is still the foundation most of that infrastructure runs on. Not always in its traditional form, and often sitting alongside newer approaches like ELT. But the underlying problem it solves has not changed.

The ETL Pipeline: Five Stages, Not Three

The textbook version of ETL has three steps. Extract, transform, load: clean, simple, easy to explain in a diagram. The production version has five, and the two that get left out of the diagram are frequently the two where things go wrong.

Extract

Extraction pulls raw data from source systems. In practice that means relational databases, cloud applications, flat files, REST APIs, streaming platforms, IoT sensors, and whatever combination of systems a given organisation has built up, often over many years, often without a unified data strategy behind any of it.

Two approaches exist. Full extraction pulls everything, every run. It is straightforward to implement and tends to become unmanageable as datasets grow. Incremental extraction pulls only what has changed since the last run, and what that looks like in practice depends on the source. For relational databases, it is usually log-based CDC, which reads transaction logs (WAL in Postgres, binlog in MySQL) to capture inserts, updates, and deletes as they happen. For SaaS sources like Salesforce or HubSpot, incremental sync typically relies on API queries filtered by a high-watermark column such as updated_at, which is closer to timestamp polling than CDC in the strict sense. Trigger-based CDC is a third variant, where database triggers write changes to shadow tables for downstream consumption. For any pipeline running against large or frequently updated sources, incremental in one form or another is the practical default.

Validate and Stage

Extracted data does not go straight into transformation. It lands in a staging area first, a temporary workspace where validation rules run before anything downstream is affected.

What does validation actually catch? Nulls in required fields. Type mismatches. Duplicate records that made it through. Values that fall outside expected ranges. Referential integrity problems where records reference IDs that do not exist in the parent table. None of these are exotic failure modes. They are ordinary data quality issues that appear constantly across real-world source systems.

The reason the staging layer matters is simple: errors caught here are contained. The same errors, discovered after three months of reporting has been built on top of them, are a different problem entirely.

A few production realities sit on top of the basic checks. Schema drift, where a source system quietly adds or renames a column without warning anyone downstream, will break any pipeline that assumes static structure, and handling it explicitly beats finding out the hard way at 3am. Loads need to be idempotent, meaning that a rerun produces the same result instead of duplicating rows, and that is not a property you get for free. Backfill, the job of filling in historical data after a new source comes online or transformation logic changes, is a genuinely different workload from regular incremental sync and worth planning for as its own thing. Data contracts, explicit agreements between producers and consumers about what a field means and how it is allowed to change, have moved from aspirational to something most serious teams actually enforce.

Transform

This is the stage most people mean when they talk about ETL work. Raw data gets restructured, cleaned, standardised, and reshaped to match the warehouse schema and whatever business rules apply.

The scope of transformation varies enormously by context. A basic pipeline might just remap field names and convert date formats. A complex one handles currency normalisation across multiple sources, deduplication logic that requires fuzzy matching, aggregation across incompatible time granularities, and compliance requirements such as masking PII before it reaches storage, generating the audit trail that a regulator will eventually request, and enforcing data minimisation rules under GDPR.

That last category tends to be underestimated at the design stage. Compliance logic built into transformation is far easier to audit and maintain than compliance logic scattered across downstream systems.

Load

Transformed data moves into the target system. Cloud data warehouses such as Snowflake, BigQuery, and Amazon Redshift are the most common destination now, though data lakes and lakehouse architectures are also widely used depending on the use case.

Four loading strategies are common in practice. Full load replaces warehouse contents entirely each run, which is fine for small reference tables and wasteful for anything larger. Incremental append adds only new records and works well for immutable event data where history never changes. Incremental merge, usually called upsert, inserts new records and updates existing ones matched by primary key, typically via a SQL MERGE statement. Partition overwrite replaces specific partitions wholesale, which is efficient when data is naturally partitioned by date or region. The right choice depends on data volume, how frequently the source changes, query patterns, and storage cost tolerance. There is no universally correct answer; it depends on the specific architecture.

The Load stage also has to handle something textbook descriptions tend to gloss over: dimensions that change over time. Slowly Changing Dimensions, or SCDs, are the standard vocabulary for it. Type 1 simply overwrites the old value, which is easy and loses history entirely. Type 2 keeps the full history by adding a new row whenever a value changes, with effective-from and effective-to columns marking the validity window. Anyone doing analytics where historical accuracy matters, say attributing a sale to the account owner at the time of the deal rather than whoever owns the account today, needs SCD Type 2 for at least some tables. dbt has had native support for it through snapshots since 2016, which is one of the reasons it spread so quickly as the standard transformation layer.

Monitor and Audit

A pipeline that runs without observability is, functionally, a pipeline that fails without warning.

Row count reconciliation between source and target. Latency tracking. Alerting when jobs fail or run longer than expected. Data lineage documentation that lets teams trace any record back to its origin. Scheduled quality checks that verify data matches expectations, not just that the pipeline completed.

Row count reconciliation is the simplest check, and it only really catches gross failures. More reliable approaches include hash or checksum comparison across partitions, and sample-based row diff. Fivetran, for instance, continuously samples around 10,000 random rows to verify source and target consistency without having to compare the full table. Freshness SLAs deserve the same weight as failure alerts, too. Data that is present but stale often looks fine to downstream dashboards, which is the worst kind of silent drift in reporting, because no one notices until a decision has already been made on the wrong numbers.

Tools like Apache Airflow and Dagster handle scheduling and orchestration. Frameworks like Great Expectations apply continuous quality rules. This layer is unglamorous. It is also what separates a warehouse teams trust from one they quietly work around.

ETL vs ELT: Where They Actually Differ

The distinction is about sequencing, not technology.

ETL transforms data before it enters the warehouse. A processing layer between source and destination handles business logic first, then sends clean data downstream. This model made sense when warehouse compute was expensive and storage was constrained. Loading raw, unprocessed data was not a viable option.

ELT reverses it. Data gets extracted and loaded into the warehouse in raw form, then transformed in place using the warehouse's own compute. The approach became practical when cloud warehouses such as Snowflake, BigQuery, and Redshift arrived with elastic scaling. Loading raw data became cheap and fast. SQL transformations running inside the warehouse became more maintainable than transformation logic sitting in a separate layer.

dbt (data build tool) formalised this shift. By treating SQL models as version-controlled, testable code running inside the warehouse, dbt gave data teams a software engineering discipline for what had previously been a collection of scripts and stored procedures. By 2026, ELT with dbt as the transformation layer is the default for most cloud-native data teams.

ETL still makes sense in specific situations: when data must be masked before it reaches any storage system at all, when the target is an on-premises warehouse that cannot accept raw input, when transformation logic requires specialised compute that SQL cannot handle, or when data residency rules prevent raw data from entering a cloud environment. Most mature organisations run both. ELT for the bulk of pipelines, ETL for the flows where pre-load processing is a hard requirement.

One newer pattern worth mentioning alongside these two is zero-ETL. The term refers to cloud-native integrations where data flows from a source system directly into the warehouse without a separate pipeline layer in between. AWS Aurora replicating to Redshift is one example, Snowflake's direct connection to Salesforce Data Cloud is another. It does not replace ETL or ELT for most cases, but for specific source-to-warehouse flows inside a single vendor ecosystem, it cuts out a meaningful amount of pipeline code.

Real-Time vs Batch: Picking the Right Model

Not every pipeline needs to move data in real time. Applying streaming infrastructure to problems that batch processing handles well is one of the more expensive habits in data engineering.

Batch pipelines move data at scheduled intervals: hourly, nightly, weekly, depending on the use case. End-of-month financial reporting. Weekly marketing attribution. Catalogue sync that runs at 2am when source system load is low. Batch is simpler to build, easier to debug when something breaks, and considerably cheaper to operate than streaming alternatives.

Streaming ETL processes data continuously. Apache Kafka or AWS Kinesis carry the events; Flink or Spark Streaming handle transformation on the fly. Results appear in the warehouse within seconds of the originating event. For fraud detection, where the decision window might be a single transaction, that latency difference matters enormously. Same for live inventory during a high-traffic trading period, or patient monitoring systems where delays have clinical implications.

The distinction between true streaming and microbatch is worth understanding before anyone picks a tool. Flink handles events one at a time with very low latency and is the closest thing to continuous processing. Spark Structured Streaming, despite the name, actually processes data in small batches (typically every few seconds), which is usually enough for analytics use cases and tends to give better throughput for the compute cost. Kafka itself has two layers that get confused a lot: Kafka Connect moves data in and out of Kafka topics through source and sink connectors and is effectively an ingestion layer, while Kafka Streams is a client library for building stream-processing applications directly on top of Kafka without needing a separate cluster.

Two concepts cause more pain in streaming pipelines than anything else. The first is delivery semantics. At-least-once is the default in most systems and it means a given event may end up being processed more than once during a failover, which forces every downstream consumer to be idempotent. Exactly-once is achievable but comes with real cost in throughput and complexity, so it is worth being deliberate about where you actually need it. The second is time. Event time and processing time are not the same thing, events arrive late, out of order, or both, and streaming systems handle this through watermarks (how long the system is willing to wait for stragglers) and windowing (how events are grouped into finite chunks for aggregation). Misconfigured watermarks tend to show up as numbers that quietly drift from reality, and by the time anyone notices, the dashboards have been wrong for a while.

The mistake is not choosing streaming over batch or vice versa. It is choosing streaming because it sounds more sophisticated, then spending engineering time maintaining infrastructure that a nightly batch job would have handled perfectly well.

Most organisations that operate at any meaningful scale end up running both. Streaming for the handful of data flows where latency genuinely affects outcomes. Batch for everything else.

ETL in Cloud Environments

Cloud warehousing changed what ETL looks like in practice and who is responsible for building it.

On-premises ETL meant dedicated servers, licenced software, infrastructure teams, and significant upfront capital expenditure. Cloud-native ETL, usually ELT in modern stacks, replaced most of that with managed services billed on consumption.

The current standard stack: a managed ingestion tool (Fivetran or Airbyte) extracts from sources and loads into the warehouse; dbt runs transformations as SQL models inside the warehouse; Airflow or Dagster orchestrates scheduling and monitoring; Snowflake, BigQuery, or Redshift provides compute and storage. Each layer can be swapped independently. That modularity is a large part of why this architecture has become dominant. It avoids the monolithic vendor lock-in that plagued earlier generations of data infrastructure.

One thing cloud ETL design consistently underweights: data residency. Organisations operating under GDPR, or handling cross-border flows between EU and non-EU jurisdictions, need explicit planning around which data moves where and in what state. Cloud convenience and regulatory compliance do not always align. Discovering the conflict after the pipeline is in production is substantially more expensive than designing for it upfront.

One architectural shift worth noting before moving on to tools: open table formats such as Apache Iceberg and Delta Lake are increasingly part of the modern warehouse stack. They bring database-style transactional guarantees (ACID, schema evolution, time travel) to data stored as files in cloud object storage. Iceberg in particular is now supported across major platforms including Snowflake, Databricks, AWS via S3 Tables, and Google Cloud via BigLake, which makes lakehouse architectures far more practical than they used to be.

ETL Tools in 2026: What Teams Are Actually Choosing

The ETL tools comparison landscape has consolidated. Different tools dominate different parts of the pipeline, and most teams assemble a stack rather than looking for a single platform that does everything.

Fivetran built its reputation on reliability and zero-maintenance operation. Over 700 pre-built connectors. Automated schema management. Change data capture that handles database-level incremental sync without engineering involvement. Teams choose Fivetran when they need predictable, production-grade ingestion and do not want to staff a team to maintain it. Pricing is based on Monthly Active Rows, manageable at moderate volumes and worth modelling carefully before committing at scale. As of January 2026, Fivetran moved to per-connector MAR billing, which eliminates the bulk discount that used to apply across connectors; deletes now count toward paid MAR as well, and there is a $5 minimum charge per standard connection. For multi-connector setups, those changes shift the cost model meaningfully, so any projection done before that needs to be revisited. In October 2025, Fivetran and dbt Labs announced a definitive agreement to merge, subject to regulatory approval. Once closed, the deal will tie ingestion and transformation into a single commercial offering; for teams already running both, that is a meaningful consolidation, though until closing the two companies continue to operate independently.

Airbyte started as open source in 2020 and has grown to over 600 connectors, with community contributions covering sources that Fivetran does not offer. Self-hosted deployment is free; Airbyte Cloud provides a managed alternative. The flexibility is real. So is the operational burden. Self-hosted Airbyte requires engineering capacity to run reliably, and that cost rarely appears in the initial calculation. Teams choosing Airbyte for cost reasons should model the full picture, not just the licence cost.

dbt handles transformation only. It does not extract or load anything. It runs SQL models inside the warehouse and treats them as version-controlled, testable code. dbt Core is the open-source CLI; dbt Cloud adds a managed environment with scheduling and collaboration features. In 2025, dbt Labs launched dbt Fusion, a Rust-based rewrite of the engine currently in public beta, focused on faster performance and native SQL understanding. In ELT workflows, dbt is now effectively the standard for the transformation layer. Its adoption has moved from differentiator to baseline expectation in most data engineering job descriptions.

Custom pipelines built on Kafka, Spark, Airflow, or purpose-built frameworks remain the right answer when off-the-shelf tools genuinely cannot meet the requirement. Proprietary source systems with no available connector. Streaming use cases with transformation logic that SQL cannot express. Data sovereignty requirements that rule out SaaS platforms entirely. Custom development costs more and requires sustained engineering ownership. For the cases where the requirement falls outside what managed tooling covers, it is not a luxury. It is the only option that actually works.

Where ETL Matters Most in Practice

The business case for ETL is not abstract. It shows up in specific places where fragmented or delayed data is already costing something: a decision made on stale information, a forecast built on incomplete inputs, a compliance audit that surfaces data quality problems nobody knew existed.

Ecommerce operations consolidating data from storefronts, logistics platforms, and customer service tools into a single warehouse for demand forecasting and inventory management. Financial services teams reconciling transactions across core banking systems and third-party data providers, where transformation logic handles regulatory requirements, multiple currencies, and audit trail generation simultaneously. Healthcare organisations integrating patient records across clinical systems, lab feeds, and insurance claims, with pre-load masking and HL7/FHIR compliance built into the transformation layer.

Machine learning teams depend on ETL in a way that is easy to underestimate. The pipeline quality is inseparable from the model quality. Inconsistent source data, inadequately transformed, produces models that behave inconsistently, and debugging a model trained on a poorly managed dataset is significantly harder than fixing the pipeline before training begins.

When Managed Tools Are Not Enough

Fivetran and Airbyte solve the standard problem well. They were designed for it.

The standard problem: sources have connectors, transformation logic fits within dbt's SQL framework, compliance requirements do not restrict cloud data movement, and pipeline volumes are within the pricing tier that makes commercial sense. A lot of organisations fit that profile.

Some do not.

A logistics operation tracking tens of thousands of vehicles in real time cannot run on nightly batch ingestion. A financial institution under strict data residency requirements cannot route unmasked records through a US-hosted SaaS platform. A healthcare provider integrating a proprietary clinical system has no connector to wait for. It does not exist.

When the actual requirement falls outside what managed tooling covers, the question shifts from "which tool" to "what does custom development actually involve." That means pipeline architecture designed for the specific sources and data volumes, transformation logic that handles the compliance and governance requirements precisely, and engineering ownership of infrastructure that off-the-shelf tools would have maintained automatically.

It is more expensive upfront. For the cases where it is the right answer, it is the only answer that works at production scale.

Conclusion

ETL in data warehousing has been the foundational data movement layer for decades. The tooling has changed, with cloud warehouses, managed connectors, and ELT-native transformation frameworks, but the underlying problem is the same: disparate source systems, incompatible formats, and a business that needs clean, reliable data to make decisions.

Getting the architecture right means picking the correct processing model for each data flow, choosing tools that match the actual compliance and operational requirements, and building monitoring infrastructure that surfaces problems before they reach the analysts depending on the warehouse.

For most organisations, a combination of managed tooling and standard cloud infrastructure handles the majority of this well. For those with architecturally complex requirements such as proprietary sources, strict data residency, high-throughput streaming, or custom transformation logic, the managed options do not fully cover the ground, and purpose-built development is the more durable path.

Frequently Asked Questions

What is the ETL process in data warehousing?

The ETL process in data warehousing extracts raw data from source systems, transforms it to match the target schema and business rules, then loads it into a central data warehouse for analytics and reporting. Production pipelines typically have five stages rather than three: extraction, staging and validation, transformation, load, and ongoing monitoring. The staging layer, often omitted from introductory descriptions, is where data quality problems are caught before they affect downstream reporting.

What is the difference between ETL and ELT?

ETL transforms data before loading it. A dedicated processing layer handles business logic first, then sends clean data to the warehouse. ELT loads raw data into the warehouse first, then transforms it using the warehouse's own compute. ELT has become the default for cloud data warehouses like Snowflake, BigQuery, and Redshift, where elastic compute makes in-warehouse transformation practical and cost-effective. ETL remains appropriate where data must be masked or governed before reaching storage, with healthcare, financial services, and cross-border data flows under GDPR being the most common cases.

What are the main steps in an ETL pipeline?

Five steps, not three: extraction from source systems; staging and validation in a temporary workspace where data quality rules run before transformation; transformation to restructure and clean data for the target schema; load into the data warehouse; and monitoring, which covers row reconciliation, latency tracking, alerting, and lineage documentation. The monitoring stage is frequently under-built in early implementations and is usually the first thing that causes problems when pipelines scale.

Which ETL tools are most widely used in 2026?

Most teams build a stack. Fivetran or Airbyte handles ingestion; dbt runs SQL transformations inside the warehouse; Apache Airflow or Dagster manages orchestration and scheduling; Snowflake, BigQuery, or Redshift provides the warehouse layer. Fivetran suits teams that want zero-maintenance managed connectors; Airbyte suits teams that need flexibility, open-source control, or coverage for niche sources. dbt is effectively standard for transformation in ELT workflows. Custom pipelines on Kafka or Spark cover use cases that fall outside what managed tooling reaches.

Is real-time ETL always better than batch processing?

No, and applying streaming infrastructure to problems that batch processing handles well is a common and expensive mistake. Real-time ETL is justified when latency directly affects business outcomes: fraud detection, live inventory management, patient monitoring. For most reporting and analytics workloads, batch processing is simpler, cheaper, and more resilient. Most production environments eventually run both, with streaming for the small subset of flows where latency matters and batch for everything else.

When does building a custom ETL pipeline make more sense than a managed tool?

When the managed tools cannot actually meet the requirement. That means proprietary source systems with no available connector, compliance obligations preventing raw data from entering a cloud SaaS environment, streaming use cases requiring transformation logic that SQL cannot express, or data volumes where per-row pricing becomes structurally uneconomical. Custom development costs more upfront and requires engineering ownership ongoing, but for requirements that genuinely fall outside standard tooling, it is the only approach that holds up in production.