ETL pipeline for big data reporting
Development of an ETL pipeline for big data reporting, Corporate sustainability reporting/Environmental Accounting, December 2018 – June 2020
About the project
Development of an ETL pipeline from an information system into a big data database for reporting.
Customers use an information system to store data like water or gas consumption of different locations (stores for example) in different periods of time. Ater a user has entered data, these changes are propagated in “near real-time” into the ETL pipeline. The pipeline prepares the data for the Big Data Reporting front end. Users can interactively drill up/down in the front end or build dashboards. The front end contains a scripting language to write user-defined functions.
- Establish Spark streaming from the information system via Kafka into Parquet files
- The messages in Kafka represent changes of records (like insert, update, delete)
- Handle updated or deleted records in Parquet files on an “append-only” HDFS base using meta data in additional Parquet files
- For each update or delete operation an additional record is written which tags old versions with a deleted flag
- Deleted flags are considered in all transformations using SQL joins to hide old records
- Run Spark streaming job on YARN
- All services (Kafka, Hadoop, Presto etc.) are hosted on the Open Telekom Cloud
- Transform incoming data into records suitable for reporting
- Multi tenancy
- Handle both small transactions fast (focus on latency), while being capable of processing huge transacttions as well (through put)
- Performance optimizations (Spark partitioning), stress tests
- Provide a docker-compose environment to contain all services for development and integration tests
- Prepare application to switch the filesystem from HDFS to Amazon S3
- The reporting front end is implemented in Akka; Akka is used to compute the values of functions, and the functions can be implemented by power users
Roles
- System Architect, Data Engineer
- Design, development, integration, test, performance tuning
Industry, industrial sector
- Corporate sustainability reporting/Environmental Accounting
Stack overflow
- https://stackoverflow.com/questions/60484693/how-to-configure-spark-2-4-correctly-with-user-provided-hadoop
- https://stackoverflow.com/questions/56989068/what-do-foreachbatches-contain-in-a-streaming-query-from-multiple-kafka-topics