ETL pipeline into an Azure data lake for Power BI
Development of an ETL pipeline into an Azure data lake for Power BI reporting, Insurance, October 2020 – December 2020
About the project
Development of an ETL pipeline from multiple source information systems into an Azure data lake for reporting with Power BI
- Implement jobs that generate record classes from the metadata of existing databases
- Extract data from different sources such as Oracle databases and Palantir Foundry
- Implement batch jobs to transform data
- Use Kafka and Spark streaming as an alternative to batch jobs (see the streaming sketch after this list)
- Use Delta Lake for updating and deleting records (merge sketch below)
- Validate and aggregate data from different sources into records suitable for reporting
- Provide data suitable for Power BI and expose it through Hive tables (Hive sketch below)
- Test-Driven Development, Behavior-Driven Development
- Specification-based testing: tests as executable specifications (ScalaTest sketch below)
- Package the Spark jobs, including their dependencies, into a single JAR file using a hierarchical Maven project
- Run the Spark jobs on Databricks using that single JAR file
- Performance optimizations: data co-location, Spark partition tuning, parallelization of database extracts (JDBC sketch below)
- Provide alternative implementations using Kafka Streams and Kafka Connect/ksqlDB (Kafka Streams sketch below)
- Project management: Agile, Scrum/ScrumXP as defined by the Scaled Agile Framework (SAFe)
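Code sketches (illustrative)
The sketches below illustrate the kind of code behind the bullets above; all paths, topic names, schemas and identifiers are assumptions chosen for illustration, not the project's actual artifacts.

A minimal Spark Structured Streaming job (assuming Structured Streaming rather than the older DStream API) that reads records from a Kafka topic and appends them to a Delta table:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    object ClaimsStreamingJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("claims-streaming").getOrCreate()

        // Hypothetical schema; the real record classes were generated from database metadata
        val claimSchema = new StructType()
          .add("claim_id", StringType)
          .add("amount", DoubleType)
          .add("updated_at", TimestampType)

        spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092") // assumed broker address
          .option("subscribe", "claims")                     // assumed topic name
          .load()
          .select(from_json(col("value").cast("string"), claimSchema).as("claim"))
          .select("claim.*")
          .writeStream
          .format("delta")
          .option("checkpointLocation", "/mnt/datalake/_checkpoints/claims") // assumed path
          .start("/mnt/datalake/bronze/claims")                              // assumed path
          .awaitTermination()
      }
    }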
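Updates and deletes in the lake can be applied with a Delta Lake merge; a sketch assuming hypothetical policy tables, an is_deleted change flag, and the spark session from the streaming sketch above:

    import io.delta.tables.DeltaTable

    val target  = DeltaTable.forPath(spark, "/mnt/datalake/silver/policies")             // assumed path
    val changes = spark.read.format("delta").load("/mnt/datalake/bronze/policy_changes") // assumed path

    target.as("t")
      .merge(changes.as("c"), "t.policy_id = c.policy_id") // assumed key column
      .whenMatched("c.is_deleted = true").delete()         // deletes propagated from the source
      .whenMatched().updateAll()                           // updates overwrite the existing record
      .whenNotMatched().insertAll()                        // new records are appended
      .execute()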
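Aggregated results can be exposed to Power BI by registering them as tables in the Hive metastore; a sketch with assumed database, table and path names:

    // Register a curated Delta table in the Hive metastore so Power BI can query it
    spark.sql("CREATE DATABASE IF NOT EXISTS reporting")

    spark.read.format("delta")
      .load("/mnt/datalake/gold/claims_summary") // assumed path
      .write
      .format("delta")
      .mode("overwrite")
      .saveAsTable("reporting.claims_summary")   // assumed table name, visible to Power BI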
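A specification-style ScalaTest suite in the spirit of "tests as executable specifications"; the aggregation under test is a hypothetical stand-in for the real Spark transformation:

    import org.scalatest.flatspec.AnyFlatSpec
    import org.scalatest.matchers.should.Matchers

    class ClaimsAggregationSpec extends AnyFlatSpec with Matchers {

      "The claims aggregation" should "sum claim amounts per policy" in {
        val claims = Seq(("P1", 100.0), ("P1", 50.0), ("P2", 20.0))
        aggregateClaims(claims) should contain allOf ("P1" -> 150.0, "P2" -> 20.0)
      }

      it should "produce no rows for an empty input" in {
        aggregateClaims(Seq.empty) shouldBe empty
      }

      // Hypothetical reference implementation standing in for the real Spark job
      private def aggregateClaims(claims: Seq[(String, Double)]): Map[String, Double] =
        claims.groupBy(_._1).map { case (policy, rows) => policy -> rows.map(_._2).sum }
    }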
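Database extracts can be parallelized by letting Spark partition the JDBC read; a sketch with assumed Oracle connection details and a numeric partition column:

    // Read an Oracle table with 16 parallel JDBC partitions instead of a single query
    val policies = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE") // assumed connection string
      .option("dbtable", "POLICIES")                            // assumed table
      .option("user", sys.env("ORACLE_USER"))
      .option("password", sys.env("ORACLE_PASSWORD"))
      .option("partitionColumn", "POLICY_ID")                   // assumed numeric key
      .option("lowerBound", "1")
      .option("upperBound", "10000000")                         // assumed key range
      .option("numPartitions", "16")
      .load()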
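The Kafka Streams alternative can be sketched as a small topology that filters and forwards records between topics; application id, broker and topic names are assumptions:

    import java.util.Properties
    import org.apache.kafka.streams.scala.ImplicitConversions._
    import org.apache.kafka.streams.scala.serialization.Serdes._
    import org.apache.kafka.streams.scala.StreamsBuilder
    import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

    object ClaimsTopology extends App {
      val props = new Properties()
      props.put(StreamsConfig.APPLICATION_ID_CONFIG, "claims-topology") // assumed application id
      props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")  // assumed broker address

      val builder = new StreamsBuilder()
      builder.stream[String, String]("claims-raw")            // assumed input topic
        .filter((_, value) => value != null && value.nonEmpty) // drop empty messages
        .to("claims-clean")                                    // assumed output topic

      new KafkaStreams(builder.build(), props).start()
    }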
Roles
- System Architect, Data Engineer
- Design, development, integration, test, performance tuning
Industry
- Insurance
Technology stack
- Azure Data Lake, Databricks, Apache Spark, Delta Lake, Apache Kafka, Kafka Streams, Kafka Connect, ksqlDB, Hive, Oracle Database, Palantir Foundry, Power BI, Maven, ScalaTest
References
- https://www.scaledagileframework.com/scrumxp/
- https://www.scaledagileframework.com/test-driven-development/
- https://www.scaledagileframework.com/behavior-driven-development/
- https://www.toolsqa.com/software-testing/specification-based-testing/
- https://www.scalatest.org/user_guide/tests_as_specifications
- https://www.scalatest.org/user_guide/selecting_a_style