ETL pipeline for network infrastructure and customer service reporting
ETL pipeline using Spark/Impala/Kudu for MicroStrategy-based reporting, Telecommunications, April 2021
About the project
- Implement Spark jobs
  - To provide data for network infrastructure reports
  - To provide data for customer service reports
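
A minimal sketch of such a Spark job, under assumed names (raw.network_events, reporting.network_daily, and the Kudu master address are placeholders, not the project's actual objects): aggregate raw network events per site and day and write the result to the Kudu-backed table that the MicroStrategy reports query.

    import org.apache.spark.sql.{SparkSession, functions => F}

    object NetworkReportJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("network-infrastructure-report")
          .getOrCreate()

        // Hypothetical source table with one row per network event.
        val events = spark.table("raw.network_events")

        // Aggregate per site and day for the reports.
        val daily = events
          .groupBy(F.col("site_id"), F.to_date(F.col("event_ts")).as("event_date"))
          .agg(
            F.count(F.lit(1)).as("event_count"),
            F.avg("latency_ms").as("avg_latency_ms"))

        // Write into the Kudu-backed reporting table (requires the kudu-spark package).
        daily.write
          .format("kudu")
          .option("kudu.master", "kudu-master:7051")            // placeholder address
          .option("kudu.table", "impala::reporting.network_daily")
          .mode("append")
          .save()

        spark.stop()
      }
    }
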
- Design of the data model
  - Establish naming conventions to simplify access
  - Specify the content to define the meaning/scope of attributes
  - Remove redundancies to avoid duplicate data
  - Design the data model for efficient queries in MicroStrategy
  - Use Kudu partitioning for optimal performance
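
A sketch of the partitioning idea using the Kudu client API from Scala; the table, columns, bucket count, and master address are illustrative assumptions, not the project's actual design.

    import org.apache.kudu.{ColumnSchema, Schema, Type}
    import org.apache.kudu.client.{CreateTableOptions, KuduClient}
    import scala.collection.JavaConverters._

    object CreateReportingTable {
      def main(args: Array[String]): Unit = {
        val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build() // placeholder

        val schema = new Schema(List(
          new ColumnSchema.ColumnSchemaBuilder("site_id", Type.INT64).key(true).build(),
          new ColumnSchema.ColumnSchemaBuilder("event_date", Type.UNIXTIME_MICROS).key(true).build(),
          new ColumnSchema.ColumnSchemaBuilder("event_count", Type.INT64).build()
        ).asJava)

        // Mixed partitioning: hashing on site_id spreads writes evenly across
        // tablets, while a range dimension on event_date lets scans for one
        // reporting period touch only a few tablets. Concrete range bounds per
        // period would be added via CreateTableOptions#addRangePartition.
        val options = new CreateTableOptions()
          .addHashPartitions(List("site_id").asJava, 8)
          .setRangePartitionColumns(List("event_date").asJava)

        client.createTable("reporting.network_daily", schema, options)
        client.close()
      }
    }
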
- Add metadata to extend the transformations (foreign-key relations, table/column comments, etc.)
  - To get a single source for this information
  - Generate data model charts from various perspectives (job, table, report, lineage, …) using Graphviz
  - Generate SQL DDL statements (CREATE TABLE) based on the metadata defined in the Scala sources
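
A sketch of what such a single metadata source can look like in Scala (the case-class layout, table, and column names are illustrative assumptions); DDL and Graphviz output are then derived from one definition.

    // Illustrative metadata model: declared once, used for DDL and charts.
    case class Column(name: String, sqlType: String, comment: String = "",
                      references: Option[String] = None)  // "db.table.column" of a foreign key
    case class Table(name: String, comment: String, columns: Seq[Column])

    object MetaModel {
      val tables = Seq(
        Table("reporting.network_daily", "Daily aggregates per site", Seq(
          Column("site_id", "BIGINT", "Network site", references = Some("reporting.site.site_id")),
          Column("event_date", "TIMESTAMP", "Aggregation day"),
          Column("event_count", "BIGINT", "Events per site and day")
        ))
      )

      // Generate a CREATE TABLE statement (primary key clause omitted for brevity).
      def ddl(t: Table): String = {
        val cols = t.columns.map(c => s"  ${c.name} ${c.sqlType} COMMENT '${c.comment}'")
        s"CREATE TABLE ${t.name} (\n${cols.mkString(",\n")}\n) STORED AS KUDU;"
      }

      // Generate a Graphviz chart of the foreign-key relations between tables.
      def dot(ts: Seq[Table]): String = {
        val edges = for {
          t <- ts; c <- t.columns; ref <- c.references
        } yield s"""  "${t.name}" -> "${ref.split('.').dropRight(1).mkString(".")}";"""
        (Seq("digraph model {") ++ edges ++ Seq("}")).mkString("\n")
      }
    }
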
- Migration of Parquet files (HDFS) to Impala (Kudu storage backend)
  - To unify the storage format
  - To avoid queries across different database engines
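
A minimal sketch of such a migration step under assumed names (the HDFS path, master address, and table name are placeholders): read the historical Parquet data and upsert it into the Kudu-backed table.

    import org.apache.kudu.spark.kudu.KuduContext
    import org.apache.spark.sql.SparkSession

    object ParquetToKudu {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("parquet-to-kudu").getOrCreate()
        val kudu = new KuduContext("kudu-master:7051", spark.sparkContext) // placeholder

        // Hypothetical location of the historical Parquet files.
        val df = spark.read.parquet("hdfs:///data/reporting/network_daily")

        // Upsert is idempotent, so the migration can be re-run safely.
        kudu.upsertRows(df, "impala::reporting.network_daily")

        spark.stop()
      }
    }
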
- Implement batch jobs
  - Batch jobs notify Oozie via file triggers to start downstream jobs
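
A sketch of the file-trigger handshake, with placeholder paths: the batch job ends by creating an empty flag file, which an Oozie coordinator watches as its done-flag to launch the follow-up job.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object WriteTriggerFile {
      def main(args: Array[String]): Unit = {
        // Placeholder path; one flag directory per dataset and day.
        val flag = new Path("hdfs:///triggers/network_daily/2021-04-01/_SUCCESS")
        val fs = FileSystem.get(flag.toUri, new Configuration())
        fs.create(flag, true).close() // empty file; its existence starts the next job
      }
    }
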
- Spark unit and integration tests
  - To check the quality/correctness when a new version is installed
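
A minimal sketch of such a unit test with ScalaTest and a local SparkSession (the transformation and data are illustrative, not the project's):

    import org.apache.spark.sql.SparkSession
    import org.scalatest.funsuite.AnyFunSuite

    class AggregationSpec extends AnyFunSuite {
      // Local session so the test runs without a cluster.
      private lazy val spark = SparkSession.builder()
        .master("local[2]")
        .appName("aggregation-test")
        .getOrCreate()

      test("daily aggregation counts events per site") {
        import spark.implicits._
        val events = Seq(("s1", "2021-04-01"), ("s1", "2021-04-01"), ("s2", "2021-04-01"))
          .toDF("site_id", "event_date")

        val result = events.groupBy("site_id", "event_date").count()

        assert(result.where($"site_id" === "s1").head.getLong(2) === 2L)
      }
    }
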
- Performance tuning
  - Kudu partitioning (hash, range, mixed), Impala queries
  - Provide job group and call site information for the Spark UI to support performance-tuning tasks
  - Optimize queries (Hive partition pruning, Spark repartitioning)
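
A sketch of the Spark UI tagging, with placeholder names: job group and call site replace the generic stack-trace labels, so stages in the UI can be attributed to a concrete processing step while tuning.

    import org.apache.spark.sql.SparkSession

    object TunedJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("tuned-job").getOrCreate()
        val sc = spark.sparkContext

        // Shown in the Spark UI instead of anonymous stage descriptions.
        sc.setJobGroup("network-daily", "aggregate raw events into reporting.network_daily")
        sc.setCallSite("NetworkReportJob.aggregate")

        val events = spark.read.parquet("hdfs:///data/raw/network_events") // placeholder path

        // Repartition by the aggregation key before the expensive stage
        // to avoid skewed, oversized tasks.
        val prepared = events.repartition(200, events("site_id"))
        prepared.groupBy("site_id").count()
          .write.mode("overwrite")
          .parquet("hdfs:///data/tmp/site_counts")                        // placeholder path

        sc.clearJobGroup()
        spark.stop()
      }
    }
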
Roles
- System Architect, Data Engineer
- Design, development, integration, testing, performance tuning
Industry, industrial sector
- Telecommunications