Data Pipeline for Processing CSV Files Using S3, Lambda, Glue, and QuickSight


Project Overview:

To design and implement a serverless data pipeline that ingests CSV files, transforms them into a clean format, and makes them available for visualization in Amazon QuickSight (Quick Suite), all with minimal manual intervention. CSV files uploaded to the Amazon S3 bucket csv-input-bucket5 trigger a Lambda function, which launches a Glue job to clean and transform the data before saving it back to the Amazon S3 bucket csv-output9-bucket5. A manifest file then connects the processed data to Amazon QuickSight, where it can be visualized in a pie chart. The result is a fully automated, scalable workflow that demonstrates cloud engineering, automation, and data visualization skills.

Architecture workflow:

1. Amazon S3 (csv-input-bucket5)


- Stores raw CSV files uploaded by the user
- Acts as the entry point for the pipeline
- Amazon S3 is a regional, serverless storage service
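Since every upload to the input bucket is what kicks off the pipeline, a minimal upload helper sketches the entry point (the `object_key` naming scheme here is an assumption; any `.csv` object in the bucket would trigger the flow):

```python
import os

INPUT_BUCKET = "csv-input-bucket5"

def object_key(path):
    """Derive the S3 object key from a local file path (just the filename here)."""
    return os.path.basename(path)

def upload_csv(path):
    # Requires AWS credentials; boto3 is imported lazily so the helper
    # can be inspected without an AWS environment.
    import boto3
    boto3.client("s3").upload_file(path, INPUT_BUCKET, object_key(path))
```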


2. Creating IAM Roles


Lambda Role

- This role allows the Lambda function to call StartJobRun on AWS Glue
- Without this role, Lambda can't trigger the ETL process when a new CSV is uploaded
- Note that IAM roles are global, not region-specific
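A minimal permissions policy for the Lambda role might look like the following (the job name `csv-clean-job` and the wildcard account/region in the ARN are illustrative assumptions; scope them to your actual Glue job):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "glue:StartJobRun",
      "Resource": "arn:aws:glue:*:*:job/csv-clean-job"
    }
  ]
}
```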


Glue Role

- Allows Glue to read input data from the Amazon S3 input bucket
- Grants permission to write output data to the Amazon S3 output bucket
- Without it, Glue can't read, transform, or write data
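A sketch of the S3 portion of the Glue role's policy, using the two bucket names from this project (in practice the role also needs the AWS-managed AWSGlueServiceRole policy attached for Glue's own operations):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::csv-input-bucket5",
        "arn:aws:s3:::csv-input-bucket5/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::csv-output9-bucket5/*"
    }
  ]
}
```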


3. AWS Lambda



- Triggered automatically when a new CSV file is uploaded to the Amazon S3 input bucket
- Starts the Glue job without manual execution
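A minimal sketch of such a Lambda handler: it pulls the bucket and key out of the S3 event and starts the Glue job via boto3. The job name `csv-clean-job` and the `--input_path` argument are assumptions; substitute your own job name and whatever arguments your Glue script expects.

```python
GLUE_JOB_NAME = "csv-clean-job"  # hypothetical job name

def extract_s3_object(event):
    """Pull the bucket name and object key out of an S3 PUT event record."""
    record = event["Records"][0]["s3"]
    return record["bucket"]["name"], record["object"]["key"]

def lambda_handler(event, context, glue=None):
    # The boto3 client is created lazily so the event parsing stays
    # testable without AWS credentials.
    if glue is None:
        import boto3
        glue = boto3.client("glue")
    bucket, key = extract_s3_object(event)
    run = glue.start_job_run(
        JobName=GLUE_JOB_NAME,
        Arguments={"--input_path": f"s3://{bucket}/{key}"},
    )
    return {"JobRunId": run["JobRunId"]}
```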


4. AWS Glue



- Reads the raw CSV from S3
- Cleans and transforms the data
- Writes the processed data back into an output S3 bucket in CSV format
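The cleaning step can be illustrated with plain Python (the actual Glue job would typically use PySpark and the awsglue library; the specific transformations below, trimming whitespace and dropping rows with missing values, are assumptions about what "clean" means here):

```python
import csv
import io

def clean_csv(raw_text):
    """Trim whitespace from every cell and drop rows with missing values."""
    reader = csv.reader(io.StringIO(raw_text))
    rows = [[cell.strip() for cell in row] for row in reader]
    # Keep only rows that are non-empty and have no blank cells
    cleaned = [row for row in rows if row and all(row)]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(cleaned)
    return out.getvalue()
```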


5. Amazon S3 output bucket



- Stores the transformed/cleaned data
- Organized into folders for easy access and scalability


6. Manifest file


- A JSON file that tells QuickSight where to find the processed data in Amazon S3
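A minimal manifest for this pipeline could look like the following (the `processed/` prefix is an assumed folder name in the output bucket; adjust it to wherever the Glue job writes its results):

```json
{
  "fileLocations": [
    { "URIPrefixes": ["s3://csv-output9-bucket5/processed/"] }
  ],
  "globalUploadSettings": {
    "format": "CSV",
    "containsHeader": "true"
  }
}
```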


7. Amazon QuickSight (Quick Suite)



- Connects to the output Amazon S3 bucket via the manifest file
- Loads the cleaned dataset
- Provides interactive dashboards, charts, and reports


Key Features

- Automation: Every new CSV upload triggers the pipeline automatically.
- Scalability: Can handle multiple files and scale with AWS services.
- Flexibility: Output can be JSON, CSV, or Parquet depending on needs.
- Visualization: QuickSight dashboards make insights accessible and shareable.
- Security: IAM roles and S3 access points ensure least-privilege access.