ETL pipeline for network infrastructure and customer service reporting
ETL pipeline using Spark/Impala/Kudu for MicroStrategy-based reporting, Telecommunications, April 2021
About the project
- Implement Spark jobs
  - To provide data for network infrastructure reports
  - To provide data for customer service reports
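
A minimal sketch of such a Spark job, under assumed names (raw.network_events, reporting.network_daily, and the Kudu master address are placeholders, not the project's actual objects): aggregate raw network events per site and day and write the result to the Kudu-backed table that the MicroStrategy reports query.

    import org.apache.spark.sql.{SparkSession, functions => F}

    object NetworkReportJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("network-infrastructure-report")
          .getOrCreate()

        // Hypothetical source table with one row per network event.
        val events = spark.table("raw.network_events")

        // Aggregate per site and day for the reports.
        val daily = events
          .groupBy(F.col("site_id"), F.to_date(F.col("event_ts")).as("event_date"))
          .agg(
            F.count(F.lit(1)).as("event_count"),
            F.avg("latency_ms").as("avg_latency_ms"))

        // Write into the Kudu-backed reporting table (requires the kudu-spark package).
        daily.write
          .format("kudu")
          .option("kudu.master", "kudu-master:7051")            // placeholder address
          .option("kudu.table", "impala::reporting.network_daily")
          .mode("append")
          .save()

        spark.stop()
      }
    }
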
- Design of the data model
  - Establish naming conventions to simplify access
  - Specify the content to define the meaning/scope of attributes
  - Remove redundancies to avoid duplicate data
  - Design the data model for efficient queries in MicroStrategy
  - Use Kudu partitioning for optimal performance
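
A sketch of the partitioning idea using the Kudu client API from Scala; the table, columns, bucket count, and master address are illustrative assumptions, not the project's actual design.

    import org.apache.kudu.{ColumnSchema, Schema, Type}
    import org.apache.kudu.client.{CreateTableOptions, KuduClient}
    import scala.collection.JavaConverters._

    object CreateReportingTable {
      def main(args: Array[String]): Unit = {
        val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build() // placeholder

        val schema = new Schema(List(
          new ColumnSchema.ColumnSchemaBuilder("site_id", Type.INT64).key(true).build(),
          new ColumnSchema.ColumnSchemaBuilder("event_date", Type.UNIXTIME_MICROS).key(true).build(),
          new ColumnSchema.ColumnSchemaBuilder("event_count", Type.INT64).build()
        ).asJava)

        // Mixed partitioning: hashing on site_id spreads writes evenly across
        // tablets, while a range dimension on event_date lets scans for one
        // reporting period touch only a few tablets. Concrete range bounds per
        // period would be added via CreateTableOptions#addRangePartition.
        val options = new CreateTableOptions()
          .addHashPartitions(List("site_id").asJava, 8)
          .setRangePartitionColumns(List("event_date").asJava)

        client.createTable("reporting.network_daily", schema, options)
        client.close()
      }
    }
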
- Add metadata to extend the transformations (foreign-key relations, table/column comments, etc.)
  - To get a single source for this information
  - Generate data model charts from various perspectives (job, table, report, lineage, …) using Graphviz
  - Generate SQL DDL statements (CREATE TABLE) based on the metadata defined in the Scala sources
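
A sketch of what such a single metadata source can look like in Scala (the case-class layout, table, and column names are illustrative assumptions); DDL and Graphviz output are then derived from one definition.

    // Illustrative metadata model: declared once, used for DDL and charts.
    case class Column(name: String, sqlType: String, comment: String = "",
                      references: Option[String] = None)  // "db.table.column" of a foreign key
    case class Table(name: String, comment: String, columns: Seq[Column])

    object MetaModel {
      val tables = Seq(
        Table("reporting.network_daily", "Daily aggregates per site", Seq(
          Column("site_id", "BIGINT", "Network site", references = Some("reporting.site.site_id")),
          Column("event_date", "TIMESTAMP", "Aggregation day"),
          Column("event_count", "BIGINT", "Events per site and day")
        ))
      )

      // Generate a CREATE TABLE statement (primary key clause omitted for brevity).
      def ddl(t: Table): String = {
        val cols = t.columns.map(c => s"  ${c.name} ${c.sqlType} COMMENT '${c.comment}'")
        s"CREATE TABLE ${t.name} (\n${cols.mkString(",\n")}\n) STORED AS KUDU;"
      }

      // Generate a Graphviz chart of the foreign-key relations between tables.
      def dot(ts: Seq[Table]): String = {
        val edges = for {
          t <- ts; c <- t.columns; ref <- c.references
        } yield s"""  "${t.name}" -> "${ref.split('.').dropRight(1).mkString(".")}";"""
        (Seq("digraph model {") ++ edges ++ Seq("}")).mkString("\n")
      }
    }
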
- Migration of Parquet files (HDFS) to Impala (Kudu storage backend)
  - To unify the storage format
  - To avoid queries across different database engines
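
A minimal sketch of such a migration step under assumed names (the HDFS path, master address, and table name are placeholders): read the historical Parquet data and upsert it into the Kudu-backed table.

    import org.apache.kudu.spark.kudu.KuduContext
    import org.apache.spark.sql.SparkSession

    object ParquetToKudu {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("parquet-to-kudu").getOrCreate()
        val kudu = new KuduContext("kudu-master:7051", spark.sparkContext) // placeholder

        // Hypothetical location of the historical Parquet files.
        val df = spark.read.parquet("hdfs:///data/reporting/network_daily")

        // Upsert is idempotent, so the migration can be re-run safely.
        kudu.upsertRows(df, "impala::reporting.network_daily")

        spark.stop()
      }
    }
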
- Implement batch jobs
  - Batch jobs notify Oozie via file triggers to start downstream jobs
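
A sketch of the file-trigger handshake, with placeholder paths: the batch job ends by creating an empty flag file, which an Oozie coordinator watches as its done-flag to launch the follow-up job.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object WriteTriggerFile {
      def main(args: Array[String]): Unit = {
        // Placeholder path; one flag directory per dataset and day.
        val flag = new Path("hdfs:///triggers/network_daily/2021-04-01/_SUCCESS")
        val fs = FileSystem.get(flag.toUri, new Configuration())
        fs.create(flag, true).close() // empty file; its existence starts the next job
      }
    }
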
- Spark unit and integration tests
  - To check the quality/correctness when a new version is installed
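
A minimal sketch of such a unit test with ScalaTest and a local SparkSession (the transformation and data are illustrative, not the project's):

    import org.apache.spark.sql.SparkSession
    import org.scalatest.funsuite.AnyFunSuite

    class AggregationSpec extends AnyFunSuite {
      // Local session so the test runs without a cluster.
      private lazy val spark = SparkSession.builder()
        .master("local[2]")
        .appName("aggregation-test")
        .getOrCreate()

      test("daily aggregation counts events per site") {
        import spark.implicits._
        val events = Seq(("s1", "2021-04-01"), ("s1", "2021-04-01"), ("s2", "2021-04-01"))
          .toDF("site_id", "event_date")

        val result = events.groupBy("site_id", "event_date").count()

        assert(result.where($"site_id" === "s1").head.getLong(2) === 2L)
      }
    }
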
- Performance tuning
  - Kudu partitioning (hash, range, mixed), Impala queries
  - Provide job group and call site information for the Spark UI to support performance-tuning tasks
  - Optimize queries (Hive partition pruning, Spark repartitioning)
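
A sketch of the Spark UI tagging, with placeholder names: job group and call site replace the generic stack-trace labels, so stages in the UI can be attributed to a concrete processing step while tuning.

    import org.apache.spark.sql.SparkSession

    object TunedJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("tuned-job").getOrCreate()
        val sc = spark.sparkContext

        // Shown in the Spark UI instead of anonymous stage descriptions.
        sc.setJobGroup("network-daily", "aggregate raw events into reporting.network_daily")
        sc.setCallSite("NetworkReportJob.aggregate")

        val events = spark.read.parquet("hdfs:///data/raw/network_events") // placeholder path

        // Repartition by the aggregation key before the expensive stage
        // to avoid skewed, oversized tasks.
        val prepared = events.repartition(200, events("site_id"))
        prepared.groupBy("site_id").count()
          .write.mode("overwrite")
          .parquet("hdfs:///data/tmp/site_counts")                        // placeholder path

        sc.clearJobGroup()
        spark.stop()
      }
    }
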
Roles
- System Architect, Data Engineer
- Design, development, integration, testing, performance tuning
Industry, industrial sector
- Telecommunications