Data Pipeline for Processing CSV Files Using S3, Lambda, Glue, and QuickSight


Project Overview:

To design and implement a serverless data pipeline that ingests CSV files, transforms them into a clean format, and makes them available for visualization in Amazon QuickSight (Quick Suite), all with minimal manual intervention. CSV files uploaded to the Amazon S3 bucket csv-input-bucket5 trigger a Lambda function, which launches a Glue job to clean and transform the data before saving it back to the Amazon S3 bucket csv-output9-bucket5. A manifest file then connects the processed data to Amazon QuickSight, where it can be visualized in a pie chart. The result is a fully automated, scalable workflow that demonstrates cloud engineering, automation, and data visualization skills.

Architecture workflow:

1. Amazon S3 (csv-input-bucket5)


- Stores raw CSV files uploaded by the user
- Acts as the entry point for the pipeline
- Amazon S3 is a regional, serverless storage service
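Since every upload to the input bucket is what kicks off the pipeline, a minimal upload helper sketches the entry point (the `object_key` naming scheme here is an assumption; any `.csv` object in the bucket would trigger the flow):

```python
import os

INPUT_BUCKET = "csv-input-bucket5"

def object_key(path):
    """Derive the S3 object key from a local file path (just the filename here)."""
    return os.path.basename(path)

def upload_csv(path):
    # Requires AWS credentials; boto3 is imported lazily so the helper
    # can be inspected without an AWS environment.
    import boto3
    boto3.client("s3").upload_file(path, INPUT_BUCKET, object_key(path))
```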


2. Creating IAM Roles


Lambda Role

- This role allows the Lambda function to call StartJobRun on AWS Glue
- Without this role, Lambda can't trigger the ETL process when a new CSV is uploaded
- Note that IAM roles are global, not region-specific
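A minimal permissions policy for the Lambda role might look like the following (the job name `csv-clean-job` and the wildcard account/region in the ARN are illustrative assumptions; scope them to your actual Glue job):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "glue:StartJobRun",
      "Resource": "arn:aws:glue:*:*:job/csv-clean-job"
    }
  ]
}
```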


Glue Role

- Allows Glue to read input data from the Amazon S3 input bucket
- Grants permission to write output data to the Amazon S3 output bucket
- Without it, Glue can't read, transform, or write data
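A sketch of the S3 portion of the Glue role's policy, using the two bucket names from this project (in practice the role also needs the AWS-managed AWSGlueServiceRole policy attached for Glue's own operations):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::csv-input-bucket5",
        "arn:aws:s3:::csv-input-bucket5/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::csv-output9-bucket5/*"
    }
  ]
}
```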


3. AWS Lambda



- Triggered automatically when a new CSV file is uploaded to the Amazon S3 input bucket
- Starts the Glue job without manual execution
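A minimal sketch of such a Lambda handler: it pulls the bucket and key out of the S3 event and starts the Glue job via boto3. The job name `csv-clean-job` and the `--input_path` argument are assumptions; substitute your own job name and whatever arguments your Glue script expects.

```python
GLUE_JOB_NAME = "csv-clean-job"  # hypothetical job name

def extract_s3_object(event):
    """Pull the bucket name and object key out of an S3 PUT event record."""
    record = event["Records"][0]["s3"]
    return record["bucket"]["name"], record["object"]["key"]

def lambda_handler(event, context, glue=None):
    # The boto3 client is created lazily so the event parsing stays
    # testable without AWS credentials.
    if glue is None:
        import boto3
        glue = boto3.client("glue")
    bucket, key = extract_s3_object(event)
    run = glue.start_job_run(
        JobName=GLUE_JOB_NAME,
        Arguments={"--input_path": f"s3://{bucket}/{key}"},
    )
    return {"JobRunId": run["JobRunId"]}
```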


4. AWS Glue



- Reads the raw CSV from S3
- Cleans and transforms the data
- Writes the processed data back into an output S3 bucket in CSV format
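The cleaning step can be illustrated with plain Python (the actual Glue job would typically use PySpark and the awsglue library; the specific transformations below, trimming whitespace and dropping rows with missing values, are assumptions about what "clean" means here):

```python
import csv
import io

def clean_csv(raw_text):
    """Trim whitespace from every cell and drop rows with missing values."""
    reader = csv.reader(io.StringIO(raw_text))
    rows = [[cell.strip() for cell in row] for row in reader]
    # Keep only rows that are non-empty and have no blank cells
    cleaned = [row for row in rows if row and all(row)]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(cleaned)
    return out.getvalue()
```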


5. Amazon S3 output bucket



- Stores the transformed/cleaned data
- Organized into folders for easy access and scalability


6. Manifest file


- A JSON file that tells QuickSight where to find the processed data in Amazon S3
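A minimal manifest for this pipeline could look like the following (the `processed/` prefix is an assumed folder name in the output bucket; adjust it to wherever the Glue job writes its results):

```json
{
  "fileLocations": [
    { "URIPrefixes": ["s3://csv-output9-bucket5/processed/"] }
  ],
  "globalUploadSettings": {
    "format": "CSV",
    "containsHeader": "true"
  }
}
```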


7. Amazon QuickSight (Quick Suite)



- Connects to the output Amazon S3 bucket via the manifest file
- Loads the cleaned dataset
- Provides interactive dashboards, charts, and reports


Key Features

- Automation: Every new CSV upload triggers the pipeline automatically.
- Scalability: Can handle multiple files and scale with AWS services.
- Flexibility: Output can be JSON, CSV, or Parquet depending on needs.
- Visualization: QuickSight dashboards make insights accessible and shareable.
- Security: IAM roles and S3 access points ensure least-privilege access.