Building a Robust Data Pipeline | Biweekly Engineering - Episode 34

How Agoda builds its high performing data pipelines

Welcome to the 34th episode of Biweekly Engineering!

This episode features an article from Agoda on designing and maintaining high-performing data pipelines.

First, I will provide some context on data pipelines—what they are about. After that, we will have a brief overview of the lessons learnt from Agoda.

Ready, set, go!

Amsterdam Central Station on a lovely evening

Data Pipelines: A Brief Overview

What is a Data Pipeline?

A data pipeline is a series of processes that move data from one system to another for storage, transformation, or analysis.

It typically involves data extraction from a source, transformation (e.g., cleaning, aggregating), and loading into a destination like a database or data warehouse.

This idea is popularly known as ETL: Extract → Transform → Load.
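
To make the idea concrete before we get to a fuller example, here is a minimal, purely illustrative ETL script in Python; the file names and columns are made up for the sketch and are not from any real pipeline.

```python
# A tiny illustrative ETL job (file paths and column names are hypothetical).
import pandas as pd

# Extract: read raw events from a source file (newline-delimited JSON).
raw = pd.read_json("raw_events.json", lines=True)

# Transform: drop duplicate events and aggregate revenue per product.
cleaned = raw.drop_duplicates(subset="event_id")
summary = cleaned.groupby("product_id", as_index=False)["amount"].sum()

# Load: write the result to a destination (a CSV standing in for a warehouse table).
summary.to_csv("product_revenue.csv", index=False)
```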

A Real Life Example

Let’s walk through an example from a real use-case in some detail. This is important context, especially if you are not so familiar with data pipelines.

Scenario

A retail company collects customer interaction data from its e-commerce website. They want to use this data for generating insights into customer behavior, such as purchase trends and product preferences.

The above scenario is pretty much a valid use-case in any e-commerce platform. How would we build a data pipeline for this?

Building the Data Pipeline

We can think of a 4-step approach to build a data pipeline for our scenario.

Diagram taken from Databricks

Data Source (Extract Phase)

The data is collected from multiple sources, such as:

  • Website logs: Clickstream data, tracking user interactions on the site.

  • Sales database: Information about completed purchases, customer details, and payment methods.

  • Third-party services: Social media, ad clicks, or marketing campaign performance.

This data might be in different formats (e.g., JSON from the website, CSV files from databases).
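
As a rough sketch of this extract step, here is how the different formats could be read with PySpark (one common engine for such pipelines); the paths and dataset layout are assumptions for illustration only.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("extract").getOrCreate()

# Clickstream events arrive as newline-delimited JSON from the website logs.
clicks = spark.read.json("s3://raw/clickstream/2024-06-01/")

# Sales exports arrive as CSV files with a header row.
sales = spark.read.option("header", True).csv("s3://raw/sales/2024-06-01/")
```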

Data Transformation (Transform Phase)

  • Data cleaning: Remove duplicates, fill in missing values, and correct inconsistencies (e.g., standardizing date formats).

  • Data integration: Combine the data from various sources into a unified format, merging customer interaction logs with purchase data.

  • Aggregation: Summarize the data to make it useful for analysis, such as calculating total sales per product, or grouping customers by region.

  • Enrichment: Additional data, such as demographic info from external sources, could be added to enrich the dataset.
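
Continuing the sketch with the clicks and sales DataFrames from the extract step above, the cleaning, integration, and aggregation steps could look roughly like this (column names are hypothetical):

```python
from pyspark.sql import functions as F

# Cleaning: drop duplicate events and standardize the date format.
clicks_clean = (
    clicks
    .dropDuplicates(["event_id"])
    .withColumn("event_date", F.to_date("event_timestamp"))
)

# Integration: merge interaction logs with purchase data on a shared customer key.
joined = clicks_clean.join(sales, on="customer_id", how="left")

# Aggregation: total sales per product per day.
daily_product_sales = (
    joined
    .groupBy("product_id", "event_date")
    .agg(F.sum("amount").alias("total_sales"))
)
```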

Data Loading (Load Phase)

The transformed data is then loaded into a destination like:

  • Data warehouse: A centralized repository (e.g., Amazon Redshift or Google BigQuery) where the company stores cleaned and processed data.

  • Analytics tool: The data is sent to tools like Power BI or Tableau for visualization and analysis, enabling the marketing team to understand customer behavior or identify trends.
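
And to finish the sketch, the load step persists the aggregated result. The path below is a placeholder; in practice the destination could just as well be a warehouse table in Redshift or BigQuery.

```python
# Load: write the aggregated result to a warehouse-backed location (illustrative path).
(
    daily_product_sales
    .write
    .mode("overwrite")
    .parquet("s3://warehouse/daily_product_sales/")
)
```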

Automation and Scheduling

Lastly, this entire pipeline is automated to run regularly (e.g., daily), ensuring up-to-date data is always available for reporting and analytics.
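
The example does not prescribe a particular scheduler, but as one common option, a daily run could be wired up with Apache Airflow roughly like this (the DAG id and script path are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="ecommerce_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # run the pipeline once per day
    catchup=False,
) as dag:
    run_etl = BashOperator(
        task_id="run_etl",
        bash_command="spark-submit /opt/jobs/ecommerce_etl.py",
    )
```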

Note how the first three stages are basically an ETL process. ETL is the core of any data pipeline, and data engineers in real life spend a lot of time developing ETL jobs.

To read more, check this piece from Databricks.

Designing & Maintaining High Performing Data Pipelines with Agoda

Let us now discuss today’s article on Agoda’s approach to building and running high-performing data pipelines at scale.

Considerations at Agoda for Building a Data Pipeline

The post mentions 5 core considerations that are prerequisites for any data pipeline built at Agoda. Let’s discuss them briefly.

Data

Where is the data stored? What is the data behavior?

The data source and behavior influence how data is processed and stored. In one Agoda data pipeline, they split the process into three sub-processes based on data characteristics.

The pipeline loads data into the Fact Booking table, which stores booking-level information. The design follows the Star Schema method, handling static, dynamic, and dimensional data differently.

The focus at this stage is to understand the nature of the data in order to design an efficient pipeline.

Resource Used

How many resources should be allocated to our data pipeline?

The resources used in a data pipeline refer to the amount allocated for processing. Allocating too few resources risks pipeline delays or failures, while allocating too many wastes capacity and negatively impacts other pipelines.

At Agoda, most pipelines run on Apache Spark, where resource configurations can be fine-tuned for optimal efficiency. Resources are also allocated based on data volume, with high-volume processes run less frequently to optimize resource usage.

Bottom line: tune your resources, for they are costly!
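
To illustrate what such tuning looks like in practice, here is a small Spark configuration sketch. Each knob below is a standard Spark setting, but the values are made-up placeholders to show where tuning happens, not Agoda’s actual settings.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("fact_booking_load")
    .config("spark.executor.instances", "20")       # how many executors the job gets
    .config("spark.executor.cores", "4")            # CPU cores per executor
    .config("spark.executor.memory", "8g")          # memory per executor
    .config("spark.sql.shuffle.partitions", "400")  # parallelism of shuffles and joins
    .getOrCreate()
)
```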

Partitioning

How should we partition our table (in Hadoop)?

In Hadoop, partitioning involves dividing data into smaller subsets called partitions, often based on specific table columns. Proper partitioning improves writing performance and reduces data pipeline run time.

For the Fact Booking table, Agoda uses the booking date's month (datamonth) as the partitioning column to ensure balanced data distribution.

Note that partitioning is a must-have for efficient data processing and querying.
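
As an illustration of month-based partitioning in Spark (not Agoda’s exact code), a write partitioned by a derived datamonth column might look like this; the paths are hypothetical and the column names follow the article’s examples.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("fact_booking_partitioning").getOrCreate()

# Hypothetical staging source for booking-level records.
bookings = spark.read.parquet("hdfs:///staging/bookings/")

# Derive a month-level partition column from the booking timestamp.
bookings = bookings.withColumn("datamonth", F.date_format("booking_datetime", "yyyy-MM"))

# partitionBy creates one directory per month, so downstream jobs and queries
# only touch the months they actually need.
bookings.write.mode("overwrite").partitionBy("datamonth").parquet(
    "hdfs:///warehouse/fact_booking/"
)
```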

Job Scheduling

How frequently should our data pipeline run?

Job scheduling automates data pipelines by determining when and how often they should run. The frequency depends on factors like data usage, SLA requirements, and how long the pipeline takes to complete.

Additionally, the freshness of the data source affects how frequently the job should be scheduled. This can get tricky. It’s always recommended to align with product requirements.

Data Dependency

Does the pipeline depend on the data of other tables?

Data pipelines can have dependencies on upstream pipelines, meaning one pipeline's data relies on another pipeline's output. So before building a pipeline, it is just as important as the previous considerations to map out all of its dependencies.
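
To make this concrete, here is one common way to express an upstream dependency, using an Airflow sensor; the DAG and task names are purely illustrative and not taken from the article.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="fact_booking_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Wait for the (hypothetical) upstream dimension pipeline to finish its daily run.
    wait_for_upstream = ExternalTaskSensor(
        task_id="wait_for_dim_customer",
        external_dag_id="dim_customer_pipeline",
        external_task_id=None,  # None means: wait for the whole upstream DAG run
    )

    build_fact_booking = BashOperator(
        task_id="build_fact_booking",
        bash_command="spark-submit /opt/jobs/fact_booking.py",
    )

    wait_for_upstream >> build_fact_booking
```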

How Agoda Maintains a Data Pipeline

As you can expect, maintaining a data pipeline takes considerable effort. Let’s have a look.

Monitoring

Once a data pipeline is in place, monitoring it is essential to ensure smooth operation and quick resolution of any issues.

At Agoda, the BI team mainly tracks the duration and status of submitted pipelines. Spark logs provide details like start and end times, success or failure, and other critical information. The Hadoop Data team manages the platform and log data, while the BI team consolidates this log data into destination tables.

Finally, dashboards are created to provide real-time monitoring of the pipeline's performance.

Data Quality

This is probably the hardest part of maintaining a high performing data pipeline. What’s the point of such a complex system if the data quality is poor?

Agoda approaches data quality using a variety of measures:

  • Data Freshness: Data freshness refers to how timely the data is, ensuring it's available when needed.

    At Agoda, an in-house tool tracks freshness by monitoring key columns, like booking_datetime in the Fact Booking table, and alerts the team if the data is older than the 6-hour SLA.

  • Data Integrity and Completeness: Data integrity ensures the uniqueness of records, while data completeness ensures no NULL or empty values are present. For example, a column like id should uniquely identify each record, with no duplicates, and critical fields like order_id should not be NULL.

    At Agoda, data is validated before being written into target tables, and any data failing integrity or completeness checks is not processed further.

  • Data Accuracy and Consistency: Data accuracy ensures the current data aligns with previous trends. At Agoda, the Business Intelligence team uses a tool called ThirdEye to detect anomalies by comparing current values to predicted trends. When deviations exceed a set threshold, ThirdEye alerts the team.

    For data consistency, the team uses Quilliup to ensure the data matches between the source and destination tables. The Quilliup test is triggered after the data pipeline runs to verify consistency.

  • Data Monitoring: Data monitoring involves collecting, analyzing, and responding to data quality test results that are derived from the previous stages.

    Agoda uses an in-house alert system called Hedwig, which notifies the team via email and Slack when data quality tests fail. Hedwig also automates JIRA ticket creation, helping track the issue's details, root cause, and resolution.
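
To give a feel for what such checks can look like in code, here is a small PySpark sketch in the spirit of the freshness, integrity, and completeness checks above; the thresholds, table, and column names are assumptions for illustration, not Agoda’s actual tooling.

```python
from datetime import datetime, timedelta

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq_checks").getOrCreate()
fact_booking = spark.read.parquet("hdfs:///warehouse/fact_booking/")

# Freshness: fail if the newest booking is older than a 6-hour SLA.
latest = fact_booking.agg(F.max("booking_datetime")).first()[0]
assert latest is not None and datetime.utcnow() - latest <= timedelta(hours=6), \
    "Freshness check failed: data is older than the 6-hour SLA"

# Integrity: the id column should uniquely identify each record (no duplicates).
assert fact_booking.count() == fact_booking.select("id").distinct().count(), \
    "Integrity check failed: duplicate ids found"

# Completeness: critical fields such as order_id should never be NULL.
assert fact_booking.filter(F.col("order_id").isNull()).count() == 0, \
    "Completeness check failed: NULL order_id values found"
```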

As we can clearly see, Agoda has a whole set of useful tools to ensure data quality. From my experience with data pipelines, I can tell that the overall process they have built is quite mature. :)

So that wraps up today’s episode! I hope you have enjoyed learning about data pipelines and how they are built in the industry. See you in the next episode!

So long and thanks for reading! 🍃 
