ETL pipeline into an Azure data lake for Power BI
Development of an ETL pipeline into an Azure data lake for Power BI reporting, Insurance, October 2020 – December 2020
About the project
Development of an ETL pipeline from multiple source information systems into an Azure data lake for reporting with Power BI
- Implement jobs that generate record classes from the metadata of existing databases
- Extract data from different sources such as Oracle databases and Palantir Foundry
- Implement batch jobs to transform data
- Use Kafka and Spark streaming as an alternative to batch jobs (see the streaming sketch after this list)
- Use Delta Lake for updating and deleting records (merge sketch below)
- Validate and aggregate data from different sources into records suitable for reporting
- Provide data suitable for Power BI and expose it through Hive tables (Hive sketch below)
- Test-Driven Development, Behavior-Driven Development
- Specification-based testing: tests as executable specifications (ScalaTest sketch below)
- Package the Spark jobs, including their dependencies, into a single JAR file using a hierarchical Maven project
- Run the Spark jobs on Databricks using that single JAR file
- Performance optimizations: data co-location, Spark partition tuning, parallelization of database extracts (JDBC sketch below)
- Provide alternative implementations using Kafka Streams and Kafka Connect/ksqlDB (Kafka Streams sketch below)
- Project management: Agile, Scrum/ScrumXP as defined by the Scaled Agile Framework (SAFe)
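Code sketches (illustrative)
The sketches below illustrate the kind of code behind the bullets above; all paths, topic names, schemas and identifiers are assumptions chosen for illustration, not the project's actual artifacts.

A minimal Spark Structured Streaming job (assuming Structured Streaming rather than the older DStream API) that reads records from a Kafka topic and appends them to a Delta table:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    object ClaimsStreamingJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("claims-streaming").getOrCreate()

        // Hypothetical schema; the real record classes were generated from database metadata
        val claimSchema = new StructType()
          .add("claim_id", StringType)
          .add("amount", DoubleType)
          .add("updated_at", TimestampType)

        spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092") // assumed broker address
          .option("subscribe", "claims")                     // assumed topic name
          .load()
          .select(from_json(col("value").cast("string"), claimSchema).as("claim"))
          .select("claim.*")
          .writeStream
          .format("delta")
          .option("checkpointLocation", "/mnt/datalake/_checkpoints/claims") // assumed path
          .start("/mnt/datalake/bronze/claims")                              // assumed path
          .awaitTermination()
      }
    }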
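Updates and deletes in the lake can be applied with a Delta Lake merge; a sketch assuming hypothetical policy tables, an is_deleted change flag, and the spark session from the streaming sketch above:

    import io.delta.tables.DeltaTable

    val target  = DeltaTable.forPath(spark, "/mnt/datalake/silver/policies")             // assumed path
    val changes = spark.read.format("delta").load("/mnt/datalake/bronze/policy_changes") // assumed path

    target.as("t")
      .merge(changes.as("c"), "t.policy_id = c.policy_id") // assumed key column
      .whenMatched("c.is_deleted = true").delete()         // deletes propagated from the source
      .whenMatched().updateAll()                           // updates overwrite the existing record
      .whenNotMatched().insertAll()                        // new records are appended
      .execute()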
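Aggregated results can be exposed to Power BI by registering them as tables in the Hive metastore; a sketch with assumed database, table and path names:

    // Register a curated Delta table in the Hive metastore so Power BI can query it
    spark.sql("CREATE DATABASE IF NOT EXISTS reporting")

    spark.read.format("delta")
      .load("/mnt/datalake/gold/claims_summary") // assumed path
      .write
      .format("delta")
      .mode("overwrite")
      .saveAsTable("reporting.claims_summary")   // assumed table name, visible to Power BI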
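A specification-style ScalaTest suite in the spirit of "tests as executable specifications"; the aggregation under test is a hypothetical stand-in for the real Spark transformation:

    import org.scalatest.flatspec.AnyFlatSpec
    import org.scalatest.matchers.should.Matchers

    class ClaimsAggregationSpec extends AnyFlatSpec with Matchers {

      "The claims aggregation" should "sum claim amounts per policy" in {
        val claims = Seq(("P1", 100.0), ("P1", 50.0), ("P2", 20.0))
        aggregateClaims(claims) should contain allOf ("P1" -> 150.0, "P2" -> 20.0)
      }

      it should "produce no rows for an empty input" in {
        aggregateClaims(Seq.empty) shouldBe empty
      }

      // Hypothetical reference implementation standing in for the real Spark job
      private def aggregateClaims(claims: Seq[(String, Double)]): Map[String, Double] =
        claims.groupBy(_._1).map { case (policy, rows) => policy -> rows.map(_._2).sum }
    }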
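Database extracts can be parallelized by letting Spark partition the JDBC read; a sketch with assumed Oracle connection details and a numeric partition column:

    // Read an Oracle table with 16 parallel JDBC partitions instead of a single query
    val policies = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE") // assumed connection string
      .option("dbtable", "POLICIES")                            // assumed table
      .option("user", sys.env("ORACLE_USER"))
      .option("password", sys.env("ORACLE_PASSWORD"))
      .option("partitionColumn", "POLICY_ID")                   // assumed numeric key
      .option("lowerBound", "1")
      .option("upperBound", "10000000")                         // assumed key range
      .option("numPartitions", "16")
      .load()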
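The Kafka Streams alternative can be sketched as a small topology that filters and forwards records between topics; application id, broker and topic names are assumptions:

    import java.util.Properties
    import org.apache.kafka.streams.scala.ImplicitConversions._
    import org.apache.kafka.streams.scala.serialization.Serdes._
    import org.apache.kafka.streams.scala.StreamsBuilder
    import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

    object ClaimsTopology extends App {
      val props = new Properties()
      props.put(StreamsConfig.APPLICATION_ID_CONFIG, "claims-topology") // assumed application id
      props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")  // assumed broker address

      val builder = new StreamsBuilder()
      builder.stream[String, String]("claims-raw")            // assumed input topic
        .filter((_, value) => value != null && value.nonEmpty) // drop empty messages
        .to("claims-clean")                                    // assumed output topic

      new KafkaStreams(builder.build(), props).start()
    }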
Roles
- System Architect, Data Engineer
- Design, development, integration, test, performance tuning
Industry
- Insurance
Technology stack
- Azure Data Lake, Databricks, Apache Spark, Delta Lake, Apache Kafka, Kafka Streams, Kafka Connect, ksqlDB, Hive, Oracle Database, Palantir Foundry, Power BI, Maven, ScalaTest
References
- https://www.scaledagileframework.com/scrumxp/
- https://www.scaledagileframework.com/test-driven-development/
- https://www.scaledagileframework.com/behavior-driven-development/
- https://www.toolsqa.com/software-testing/specification-based-testing/
- https://www.scalatest.org/user_guide/tests_as_specifications
- https://www.scalatest.org/user_guide/selecting_a_style