About the project
- Migration of an ETL pipeline into AWS
- Batch pipelines to fill a data lake using Amazon EMR
- Using different zones: landing, structured, processed, export
- Using delta updates and watermarking to avoid reprocessing data
- Using AWS Step Functions for scheduling and orchestration
- Using Amazon SQS and Amazon SNS for messaging and notification
- Import from
- Export to
- Using Amazon Athena and Athena notebooks for interactive queries
- Using Python to access data stored in Athena
  - Access data using SQLAlchemy Core
  - Access data using SQLAlchemy ORM
  - Access data using Pandas on top of SQLAlchemy
- Create charts to visualize reporting data
- Create interactive dashboards using Power BI for quick data access
- Metadata configuration using the Dhall configuration language
  - To describe the system at a meta level (tables, attributes, relations, constraints, etc.)
  - For orchestration
  - To generate code
  - For authorization
- Generate Scala case classes to represent records, to be used in Spark Datasets
  - To get type safety
  - To provide test data as code
  - To simplify comparison in unit tests
  - To provide quick information (type, nullable, comment, precision/scale, …) for attributes
  - To get constants for attribute names instead of hard-coded values
- Generate SQL scripts
  - To create tables in PostgreSQL
- Create charts using Mermaid to visualize data flows
- Create data model charts using Graphviz to visualize tables and their relations
- Infrastructure-as-code using Terraform
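
The watermarking item in the list above can be illustrated with a minimal sketch. It is not taken from the project code: the `Record` shape, the `updated_at` field, and the `select_delta` helper are hypothetical names chosen for illustration, and plain Python lists stand in for the Spark jobs the project actually runs. The idea shown is only the general one: remember the highest change timestamp already processed and pick up only newer records on the next run.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Record:
    # Hypothetical record shape; "updated_at" stands in for whatever
    # change timestamp the source system provides.
    key: str
    updated_at: datetime


def select_delta(
    records: Iterable[Record], watermark: Optional[datetime]
) -> Tuple[List[Record], Optional[datetime]]:
    """Return the records newer than the watermark and the advanced watermark."""
    # With no watermark yet (first run), every record is part of the delta.
    delta = [r for r in records if watermark is None or r.updated_at > watermark]
    if not delta:
        # Nothing new: keep the old watermark unchanged.
        return delta, watermark
    # Advance the watermark to the newest timestamp that was just processed.
    return delta, max(r.updated_at for r in delta)


source = [
    Record("a", datetime(2024, 1, 1)),
    Record("b", datetime(2024, 1, 2)),
    Record("c", datetime(2024, 1, 3)),
]

# First run: no watermark, so the full data set is loaded.
first_batch, wm = select_delta(source, None)

# Second run on unchanged data: nothing is reprocessed.
second_batch, wm2 = select_delta(source, wm)

# A new record arrives; only that record is picked up.
source.append(Record("d", datetime(2024, 1, 4)))
third_batch, wm3 = select_delta(source, wm2)
```

In a real pipeline the watermark would have to be persisted between runs (for example in a small state table), and records sharing the exact watermark timestamp need care; both points are outside this sketch.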
Roles
- System Architect, Data Architect, Data Engineer
- Design, development, integration, test, performance tuning
Industry
- Transportation, energy management
Tags
etl
apache-spark
scala
sql
sbt
dhall
python
poetry
pyspark
pandas
boto3
sqlalchemy
pytest
cloud
amazon-web-services
amazon-athena
amazon-athena-query
amazon-athena-notebooks
amazon-emr
amazon-emr-serverless
amazon-rds
amazon-s3
amazon-kinesis
amazon-sqs
amazon-sns
amazon-glue-job
amazon-glue-crawler
amazon-elastic-container-service
amazon-sagemaker
aws-cli
aws-lambda
aws-step-functions
postgresql
sql-server
oracle
terraform
snowflake-schema
scd-1
scd-2
dimension-history
data-visualization
matplotlib
folium
leaflet
powerbi
tdd
bdd
scalatest
localstack
mermaid
graphviz
git
agile
scrum
jira