ETL pipeline for big data reporting

Development of an ETL pipeline for big data reporting, Corporate sustainability reporting/Environmental Accounting, December 2018 – June 2020

About the project

Development of an ETL pipeline from an information system into a big data database for reporting.

Customers use an information system to store data like water or gas consumption of different locations (stores for example) in different periods of time. Ater a user has entered data, these changes are propagated in “near real-time” into the ETL pipeline. The pipeline prepares the data for the Big Data Reporting front end. Users can interactively drill up/down in the front end or build dashboards. The front end contains a scripting language to write user-defined functions.

Establish Spark streaming from the information system via Kafka into Parquet files
- The messages in Kafka represent changes of records (like insert, update, delete)
- Handle updated or deleted records in Parquet files on an “append-only” HDFS base using meta data in additional Parquet files
- For each update or delete operation an additional record is written which tags old versions with a deleted flag
- Deleted flags are considered in all transformations using SQL joins to hide old records
Run Spark streaming job on YARN
All services (Kafka, Hadoop, Presto etc.) are hosted on the Open Telekom Cloud
Transform incoming data into records suitable for reporting
Multi tenancy
Handle both small transactions fast (focus on latency), while being capable of processing huge transacttions as well (through put)
Performance optimizations (Spark partitioning), stress tests
Provide a docker-compose environment to contain all services for development and integration tests
Prepare application to switch the filesystem from HDFS to Amazon S3
The reporting front end is implemented in Akka; Akka is used to compute the values of functions, and the functions can be implemented by power users

ETL pipeline for big data reporting

About the project

Roles

Industry, industrial sector

Stack overflow

Tags