Measurement data pipeline
March 2022
About the project
- Build a pipeline to visualize data from manufacturing systems
  - Manufacturing systems produce data for individual parts at each process step
- Use Kafka as the source to provide near-real-time streaming
- Use Flink to implement the streaming application
  - Parse, validate, and transform incoming data
  - Stateful transformations
  - Real-time joins using event-time semantics, watermarks, and idleness handling
  - Checkpointing to prevent data loss in case of failures
  - Graceful error handling so the pipeline keeps running
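The event-time wiring behind the joins can be sketched roughly as follows; `Measurement`, `partId`, and the watermark/window parameters are illustrative placeholders, not the actual project values:

```java
// Watermarks tolerate 10 s of out-of-order data; idle partitions
// are marked idle so they do not stall watermark progress.
WatermarkStrategy<Measurement> watermarks = WatermarkStrategy
        .<Measurement>forBoundedOutOfOrderness(Duration.ofSeconds(10))
        .withTimestampAssigner((m, ts) -> m.eventTimeMillis)
        .withIdleness(Duration.ofMinutes(1));

DataStream<Measurement> left =
        env.fromSource(kafkaSource, watermarks, "measurements");

// Event-time join of two process steps for the same part
left.coGroup(right)
    .where(m -> m.partId)
    .equalTo(m -> m.partId)
    .window(TumblingEventTimeWindows.of(Time.seconds(30)))
    .apply(new JoinSteps()); // CoGroupFunction emitting the combined record
```

A `coGroup` (rather than a plain `join`) also receives elements that found no partner, which is useful when one process step is missing for a part.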
- Use a Medallion architecture to store data
  - Bronze tables to store data “as-is”
  - Silver tables to provide an enterprise view and for self-service analytics
  - Gold tables fully prepared for reports
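The bronze-to-silver step can be pictured as a view over the raw table; table and column names here are hypothetical examples, not the project schema:

```sql
-- Bronze: raw payloads exactly as received from Kafka
-- Silver: parsed, validated, conformed records for self-service analytics
CREATE VIEW silver_measurements AS
SELECT part_id,
       process_step,
       CAST(measured_value AS DOUBLE PRECISION) AS measured_value,
       measured_at
FROM bronze_measurements
WHERE measured_value IS NOT NULL;
```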
- Use Flink's broadcast state
  - To change the configuration of the transformations at run time
  - Without downtime or service interruption
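The broadcast-state pattern can be sketched as below; `Measurement`, `ConfigUpdate`, and the state keys are illustrative, assuming configuration updates arrive on their own Kafka topic:

```java
MapStateDescriptor<String, String> configDescriptor =
        new MapStateDescriptor<>("config", Types.STRING, Types.STRING);

BroadcastStream<ConfigUpdate> configStream = configUpdates.broadcast(configDescriptor);

DataStream<Result> results = measurements
        .connect(configStream)
        .process(new BroadcastProcessFunction<Measurement, ConfigUpdate, Result>() {
            @Override
            public void processBroadcastElement(ConfigUpdate update, Context ctx,
                                                Collector<Result> out) throws Exception {
                // Every parallel task receives and stores the same update
                ctx.getBroadcastState(configDescriptor).put(update.key, update.value);
            }

            @Override
            public void processElement(Measurement m, ReadOnlyContext ctx,
                                       Collector<Result> out) throws Exception {
                // Read the currently active setting on every element
                String limit = ctx.getBroadcastState(configDescriptor).get("limit");
                out.collect(transform(m, limit));
            }
        });
```

Because the configuration flows through the job as a regular stream, changes take effect without restarting or redeploying the pipeline.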
- Implement a custom sink for InfluxDB
  - To perform bulk writes for high performance and throughput
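The bulk-write idea behind the sink can be sketched independently of the InfluxDB client: records are buffered and handed to a bulk writer once a batch is full. `BatchBuffer` and its parameters are illustrative, not the project's actual class:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Collects records and hands them to a bulk writer once the batch is full. */
class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> bulkWriter; // e.g. one InfluxDB write call per batch
    private final List<T> buffer = new ArrayList<>();

    BatchBuffer(int batchSize, Consumer<List<T>> bulkWriter) {
        this.batchSize = batchSize;
        this.bulkWriter = bulkWriter;
    }

    void add(T record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Also called on checkpoint/close so no buffered records are lost. */
    void flush() {
        if (!buffer.isEmpty()) {
            bulkWriter.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}
```

In the real sink the flush would additionally be triggered on a timer and on checkpoints, so that slow periods do not delay data indefinitely.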
- Implement a custom sink for Oracle
  - Flink's JDBC sink does not handle BLOBs correctly
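The core of the custom sink is plain JDBC with explicit BLOB binding; table and column names below are examples:

```java
import java.io.ByteArrayInputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class OracleBlobWriter {

    /** Writes one record whose payload column is a BLOB. */
    void write(Connection conn, String partId, byte[] payload) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO measurements (part_id, payload) VALUES (?, ?)")) {
            ps.setString(1, partId);
            // Bind the BLOB explicitly as a binary stream with a known length
            ps.setBinaryStream(2, new ByteArrayInputStream(payload), payload.length);
            ps.executeUpdate();
        }
    }
}
```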
- Write the output stream to different systems for inspection and analysis
  - InfluxDB
  - Oracle
  - PostgreSQL
  - Kafka
- Visualize data
  - Using the InfluxDB web UI
  - Using Grafana
- Write the error stream to Kafka
  - To provide feedback to operators in case of failures
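A common way to split off such an error stream in Flink is a side output; the tag name, `parse` helper, and sink variable below are illustrative:

```java
// Records that fail parsing or validation go to a side output
// instead of failing the whole job
final OutputTag<String> errorTag = new OutputTag<String>("errors") {};

SingleOutputStreamOperator<Measurement> parsed = raw.process(
        new ProcessFunction<String, Measurement>() {
            @Override
            public void processElement(String value, Context ctx,
                                       Collector<Measurement> out) {
                try {
                    out.collect(parse(value));
                } catch (Exception e) {
                    ctx.output(errorTag, value + " | " + e.getMessage());
                }
            }
        });

// The error stream feeds a Kafka topic that operators monitor
parsed.getSideOutput(errorTag).sinkTo(errorKafkaSink);
```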
- Use Flink/Flink SQL to implement batch jobs
  - To remove old data from Oracle
  - To import data regularly, using Kubernetes CronJobs
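The retention job boils down to a statement of this shape against Oracle; the table, column, and 90-day window are hypothetical examples:

```sql
-- Retention: drop measurement rows older than 90 days
DELETE FROM measurements
WHERE measured_at < SYSDATE - 90;
```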
- Implement tests for quality assurance and to prevent regressions
  - Unit tests
  - Integration tests using H2 and Testcontainers
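An integration test with Testcontainers follows this shape (JUnit 5; the class name, container image, and assertions are examples):

```java
// The container is started before the tests and torn down afterwards
@Testcontainers
class JdbcSinkIT {

    @Container
    static final PostgreSQLContainer<?> db = new PostgreSQLContainer<>("postgres:14");

    @Test
    void writesOneRowPerMeasurement() throws Exception {
        try (Connection conn = DriverManager.getConnection(
                db.getJdbcUrl(), db.getUsername(), db.getPassword())) {
            // run the sink under test against conn,
            // then assert on the table contents
        }
    }
}
```

Running against a real database in Docker catches driver- and dialect-specific issues (such as the BLOB handling above) that an in-memory H2 database can miss.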
- Provide a docker-compose setup for a full development environment with all services
  - ZooKeeper, Kafka
  - PostgreSQL
  - Oracle
  - InfluxDB
  - Grafana
  - Flink JobManager and TaskManager
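Trimmed sketch of such a compose file; image tags are examples, and the environment variables, ports, and Oracle service required for a working setup are omitted:

```yaml
services:
  zookeeper:
    image: zookeeper:3.7
  kafka:
    image: confluentinc/cp-kafka:7.0.1
    depends_on: [zookeeper]
  postgres:
    image: postgres:14
  influxdb:
    image: influxdb:1.8
  grafana:
    image: grafana/grafana:8.4.0
  jobmanager:
    image: flink:1.14
    command: jobmanager
  taskmanager:
    image: flink:1.14
    command: taskmanager
    depends_on: [jobmanager]
```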
- Deployment to OpenShift 4 (Kubernetes)
  - Test deployments to Minikube
  - Test deployments to CodeReady Containers (CRC, OpenShift 4)
  - Rancher test installation and evaluation
- Use Kubernetes CronJobs to trigger Flink batch jobs
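A CronJob of roughly this shape submits the batch job on a schedule; the name, schedule, image tag, entry class, and jar path are all hypothetical:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cleanup-old-data
spec:
  schedule: "0 2 * * *"     # nightly at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: flink-batch
              image: flink:1.14
              # Submits the batch job to the running Flink cluster
              args: ["flink", "run", "-c", "com.example.CleanupJob",
                     "/opt/jobs/pipeline.jar"]
```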
Roles
- System Architect, Data Architect, Data Engineer
- Design, development, integration, test, performance tuning
Industry, industrial sector
- Manufacturing systems engineering
Stack Overflow
- https://stackoverflow.com/questions/72219901/how-to-drain-the-window-after-a-flink-join-using-cogroup
- https://stackoverflow.com/questions/74027742/how-can-i-update-a-configuration-in-a-flink-transformation