AWS data lake

About the project

Migration of an ETL pipeline to AWS
Batch pipelines to fill a data lake using AWS EMR
- Using different zones: landing, structured, processed, export
- Using delta updates and watermarking to avoid reprocessing data
- Using AWS Step Functions for scheduling and orchestration
- Using AWS SQS and AWS SNS for messaging and notification
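The delta-update and watermarking idea can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the `Record` type, the `modified_at` column and the `select_delta` helper are hypothetical names, and a real pipeline would persist the watermark between runs instead of keeping it in memory.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Record:
    """Hypothetical source row with a last-modified timestamp."""
    key: str
    modified_at: datetime


def select_delta(
    records: List[Record], watermark: Optional[datetime]
) -> Tuple[List[Record], Optional[datetime]]:
    """Return the records changed after the watermark and the advanced watermark."""
    # A missing watermark means "first run": everything is new.
    delta = [r for r in records if watermark is None or r.modified_at > watermark]
    # Advance the watermark to the newest timestamp seen; keep it if nothing changed.
    new_watermark = max((r.modified_at for r in delta), default=watermark)
    return delta, new_watermark


source = [
    Record("a", datetime(2024, 1, 1)),
    Record("b", datetime(2024, 1, 2)),
    Record("c", datetime(2024, 1, 3)),
]

first_batch, watermark = select_delta(source, None)        # initial load
second_batch, watermark2 = select_delta(source, watermark)  # nothing new to process
```

The strict `>` comparison assumes timestamps are unique enough that no row arrives later with a timestamp equal to the stored watermark; otherwise such rows would be skipped.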
Import from
- Oracle, Microsoft SQL Server
- CSV and Excel files
Export to
- Parquet files on AWS S3
- PostgreSQL, managed by AWS RDS
- CSV and Excel files
Using AWS Athena and Athena notebooks for interactive queries
Using Python to access data stored in Athena
- Access data using SQLAlchemy Core
- Access data using SQLAlchemy ORM
- Access data using Pandas on top of SQLAlchemy
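The Core and Pandas access paths listed above might look roughly like this. It is a sketch under stated assumptions: an in-memory SQLite engine stands in for Athena so the example runs without AWS credentials (a real setup would need an Athena dialect for SQLAlchemy and a matching connection URL), and the `trips` table with its columns is hypothetical.

```python
import pandas as pd
from sqlalchemy import (
    Column, Float, Integer, MetaData, String, Table, create_engine, func, select,
)

# Assumption: SQLite in memory replaces the Athena connection for this sketch.
engine = create_engine("sqlite:///:memory:")
metadata = MetaData()

# Hypothetical reporting table.
trips = Table(
    "trips",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("region", String, nullable=False),
    Column("energy_kwh", Float, nullable=False),
)
metadata.create_all(engine)

# Insert a few sample rows inside a transaction.
with engine.begin() as conn:
    conn.execute(
        trips.insert(),
        [
            {"id": 1, "region": "north", "energy_kwh": 12.5},
            {"id": 2, "region": "north", "energy_kwh": 7.5},
            {"id": 3, "region": "south", "energy_kwh": 4.0},
        ],
    )

# SQLAlchemy Core: build the query as an expression, not as a string.
stmt = (
    select(trips.c.region, func.sum(trips.c.energy_kwh).label("total_kwh"))
    .group_by(trips.c.region)
    .order_by(trips.c.region)
)

with engine.connect() as conn:
    rows = conn.execute(stmt).all()   # plain Core result rows
    frame = pd.read_sql(stmt, conn)   # the same statement as a Pandas DataFrame
```

Because the statement is built once with Core, the same object can feed both the row-based and the DataFrame-based path.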
Create charts to visualize reporting data
- Using Matplotlib
- Using Leaflet/Folium
Create interactive dashboards using Power BI for quick data access
Metadata configuration using the Dhall configuration language
- To describe the system at a meta level (tables, attributes, relations, constraints, etc.)
- To drive orchestration
- To generate code
- To configure authorization
Generate Scala case classes to represent records, to be used in Spark Datasets
- To get type safety
- To provide test data as code
- To simplify comparison in unit tests
- To provide quick information (type, nullable, comment, precision/scale, …) for attributes
- To get constants for attribute names instead of hard-coded values
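Generating case classes from metadata can be illustrated with a small generator. This is a hedged sketch, not the project's generator: the metadata is a plain Python list instead of Dhall, and the `Attribute` type, the `render_case_class` helper and the `Station` example are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Attribute:
    """Hypothetical attribute description, standing in for the Dhall metadata."""
    name: str
    scala_type: str
    nullable: bool = False


def render_case_class(class_name: str, attributes: List[Attribute]) -> str:
    """Render a Scala case class plus an object holding the attribute-name constants."""
    fields = []
    constants = []
    for attr in attributes:
        # Nullable attributes become Option[...] so null handling is explicit in the type.
        field_type = f"Option[{attr.scala_type}]" if attr.nullable else attr.scala_type
        fields.append(f"  {attr.name}: {field_type}")
        constants.append(f'  val {attr.name.upper()}: String = "{attr.name}"')
    return "\n".join(
        [
            f"final case class {class_name}(",
            ",\n".join(fields),
            ")",
            "",
            f"object {class_name}Columns {{",
            *constants,
            "}",
        ]
    )


scala_source = render_case_class(
    "Station",
    [
        Attribute("station_id", "Long"),
        Attribute("comment", "String", nullable=True),
    ],
)
print(scala_source)
```

The field names are kept identical to the attribute names here so that they would line up with column names; a real generator might apply its own naming rules.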
Generate SQL scripts
- To create tables in PostgreSQL
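The SQL script generation can be sketched in the same spirit. The assumptions are again loud: the `Column` type, the `render_create_table` helper and the `station` table are hypothetical, the metadata is a Python list instead of Dhall, and identifiers are emitted without quoting or escaping.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Column:
    """Hypothetical column description, standing in for the Dhall metadata."""
    name: str
    sql_type: str
    nullable: bool = True
    primary_key: bool = False


def render_create_table(table: str, columns: List[Column]) -> str:
    """Render a CREATE TABLE statement; identifiers are not quoted in this sketch."""
    lines = []
    for col in columns:
        parts = [col.name, col.sql_type]
        if not col.nullable:
            parts.append("NOT NULL")
        lines.append("  " + " ".join(parts))
    # Collect primary-key columns into a single table-level constraint.
    key_columns = [col.name for col in columns if col.primary_key]
    if key_columns:
        lines.append("  PRIMARY KEY (" + ", ".join(key_columns) + ")")
    return f"CREATE TABLE {table} (\n" + ",\n".join(lines) + "\n);"


ddl = render_create_table(
    "station",
    [
        Column("station_id", "BIGINT", nullable=False, primary_key=True),
        Column("name", "VARCHAR(100)", nullable=False),
        Column("capacity_kw", "NUMERIC(10, 2)"),
    ],
)
print(ddl)
```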
Create charts using Mermaid to visualize data flows
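As an illustration of a generated data-flow chart, the following sketch renders the four zones named earlier as a Mermaid flowchart. The `render_flowchart` helper is hypothetical, and the strictly linear flow between zones is an assumption made for the example.

```python
from typing import List

# Zone names taken from the pipeline description; the linear order is an assumption.
ZONES = ["landing", "structured", "processed", "export"]


def render_flowchart(zones: List[str]) -> str:
    """Render consecutive zones as a left-to-right Mermaid flowchart."""
    lines = ["flowchart LR"]
    # Pair each zone with its successor to create one edge per hop.
    for source, target in zip(zones, zones[1:]):
        lines.append(f"    {source} --> {target}")
    return "\n".join(lines)


diagram = render_flowchart(ZONES)
print(diagram)
```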
Create data model charts using Graphviz to visualize tables and their relations
Infrastructure-as-code using Terraform

Roles

System Architect, Data Architect, Data Engineer
Design, development, integration, test, performance tuning

Industry

Transportation, energy management

Tags

etl apache-spark scala sql sbt dhall python poetry pyspark pandas boto3 sqlalchemy pytest cloud amazon-web-services amazon-athena amazon-athena-query amazon-athena-notebooks amazon-emr amazon-emr-serverless amazon-rds amazon-s3 amazon-kinesis amazon-sqs amazon-sns amazon-glue-job amazon-glue-crawler amazon-elastic-container-service amazon-sagemaker aws-cli aws-lambda aws-step-functions postgresql sql-server oracle terraform snowflake-schema scd-1 scd-2 dimension-history data-visualization matplotlib folium leaflet powerbi tdd bdd scalatest localstack mermaid graphviz git agile scrum jira