AWS Data Pipeline

Project Overview

This project is an end-to-end serverless data pipeline built on AWS that ingests public procurement data from the Texas Open Data portal, stores versioned raw data in S3, and transforms it into an analytics-ready format using Athena and Parquet.

The pipeline runs automatically on a schedule, processes new data incrementally, and is fully managed using Terraform. It demonstrates how to design and deploy a modern cloud-based data engineering workflow from ingestion to transformation.

AWS data pipeline from EventBridge to curated S3 data

Problem

Public procurement data is available through open data portals, but it is not structured for easy analysis or monitoring over time. The raw data is updated frequently, lacks versioning, and requires manual effort to track changes or analyze trends.

The goal of this project was to build a scalable system that automatically collects, stores, and organizes this data in a way that supports querying, validation, and future analytics.

Approach

I built a serverless data pipeline using AWS services and Python to automate the full data flow:

SNS alerting for the ingest stage to surface pipeline health and failures quickly.
SNS alerting for the curate stage to track transformation runs and downstream issues.
S3 bucket structure separating the raw, curated, and Athena query-results layers.
Ingest Lambda for scheduled collection and raw data writes.
Curate Lambda for transformation and Parquet dataset generation.
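A minimal sketch of what the ingest Lambda could look like. The bucket name, dataset URL, and key layout here are illustrative assumptions, not the project's actual values; the real function also handles pagination and alerting.

```python
import json
import urllib.request
from datetime import date

# Hypothetical values -- the real bucket and dataset endpoint differ.
RAW_BUCKET = "example-procurement-raw"
DATASET_URL = "https://data.texas.gov/resource/EXAMPLE.json"

def raw_key(run_date: date) -> str:
    """Build a date-partitioned S3 key so each run lands in its own snapshot."""
    return f"raw/dt={run_date.isoformat()}/records.json"

def handler(event, context):
    """EventBridge-triggered entry point: fetch the dataset, write raw JSON to S3."""
    import boto3  # provided by the Lambda runtime
    with urllib.request.urlopen(f"{DATASET_URL}?$limit=50000") as resp:
        records = json.load(resp)
    boto3.client("s3").put_object(
        Bucket=RAW_BUCKET,
        Key=raw_key(date.today()),
        Body=json.dumps(records).encode("utf-8"),
    )
    return {"written": len(records)}
```

The date-partitioned key is what makes each run a distinct, versioned snapshot rather than an overwrite.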

All infrastructure was defined and deployed using Terraform, making the system reproducible and scalable.

Results

The final system runs automatically each day and produces a fully queryable dataset with both raw and curated layers. Each run generates a new snapshot of procurement data, which can be inspected or analyzed using SQL.

Query results showing the raw and curated layers exposed for inspection and validation.
Aggregate query output highlighting the analytics-ready dataset produced by the pipeline.

This project demonstrates key data engineering concepts, including data lake design, partitioned storage, schema-on-read querying, serverless orchestration, and infrastructure as code.

Through this project, I gained hands-on experience with AWS services such as Lambda, S3, Athena, Glue, EventBridge, and CloudWatch, as well as real-world challenges like API pagination, data type inconsistencies, and incremental pipeline design.
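The pagination challenge mentioned above can be handled with an offset-based generator. Socrata-style $limit/$offset paging is assumed here as an illustration; `fetch_page` is any caller-supplied function:

```python
from typing import Callable, Iterator

def paginate(fetch_page: Callable[[int, int], list],
             page_size: int = 1000) -> Iterator[dict]:
    """Yield records page by page until the API returns a short or empty page.

    `fetch_page(limit, offset)` is any callable that returns one page of
    records -- e.g. a wrapper around the portal's $limit/$offset parameters.
    """
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        yield from page
        if len(page) < page_size:  # short page means we've reached the end
            break
        offset += page_size
```

Stopping on a short page avoids one extra empty request compared with looping until an empty response.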

Links

View Project on GitHub