Data ingestion pipeline to update product data
Streaming pipeline using Kafka on AWS; marketplace of a retail e-commerce company; June 2021
About the project
- Overall goal: design a pipeline to update product data continuously using Kafka Streams on AWS
  - For example product information, price, availability, attributes for filters, etc.
- Extend the existing ETL batch system so that it can support delta processing
  - As an intermediate step towards a streaming solution
  - Store and manage delete operations, as this information was required to support incremental changes (see the sketch below)
  - Using MongoDB as the state store
  - Using mongo-java-server for MongoDB integration tests
  - Using spring-test for Spring Boot integration tests
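A minimal sketch of how delete operations could be recorded in MongoDB and exercised with mongo-java-server's in-memory backend, as used for the integration tests. The database, collection, and field names ("state", "deletedProducts", "productId") are assumptions for illustration, not the project's actual schema.

```java
import java.net.InetSocketAddress;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import de.bwaldvogel.mongo.MongoServer;
import de.bwaldvogel.mongo.backend.memory.MemoryBackend;

public class DeleteStateStoreTest {

    public static void main(String[] args) {
        // In-memory MongoDB provided by mongo-java-server, no real database needed
        MongoServer server = new MongoServer(new MemoryBackend());
        InetSocketAddress address = server.bind();

        try (MongoClient client = MongoClients.create(
                "mongodb://" + address.getHostName() + ":" + address.getPort())) {

            // Hypothetical collection that records delete operations for delta processing
            MongoCollection<Document> deletes =
                    client.getDatabase("state").getCollection("deletedProducts");

            // Record that a product was deleted in the current run
            deletes.insertOne(new Document("productId", "SKU-123")
                    .append("deletedAt", System.currentTimeMillis()));

            // Downstream delta processing can later read the recorded deletes
            System.out.println("stored deletes: " + deletes.countDocuments());
        } finally {
            server.shutdown();
        }
    }
}
```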
- Implement a streaming pipeline using AWS Lambdas to support continuous incremental changes
  - As an intermediate step, building on the existing batch system
  - Using S3, bucket notifications, SQS, SNS, SQS subscriptions, and message filtering based on tags (see the sketch below)
  - Using Python and Java for the AWS Lambdas
  - Using Quarkus as the framework, and as the basis for building native executables for AWS Lambda
  - Deployment to AWS using CloudFormation
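A minimal sketch of a Java Lambda triggered by an SQS queue in this kind of setup. The handler name is hypothetical, and the exact body format depends on the SNS subscription settings (raw message delivery or SNS envelope); the sketch only logs the notification.

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.SQSEvent;

/**
 * Lambda handler triggered by an SQS queue that is subscribed to the pipeline's
 * SNS topic; each message carries an S3 event notification.
 */
public class ProductUpdateHandler implements RequestHandler<SQSEvent, Void> {

    @Override
    public Void handleRequest(SQSEvent event, Context context) {
        for (SQSEvent.SQSMessage message : event.getRecords()) {
            // In the real pipeline the body would be parsed and the referenced
            // S3 object fetched and processed; here it is only logged.
            context.getLogger().log("received notification: " + message.getBody());
        }
        return null;
    }
}
```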
- Store results in Solr incrementally
  - To provide continuous updates, so that changes become visible as fast as possible
  - Solr is used as the full-text search engine
  - Handle out-of-order events: ignore old updates using conditional updates of a document based on its version (document-based versioning, see the sketch below)
  - Optimize performance by adjusting the auto-commit settings
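A minimal SolrJ sketch of such a versioned update, assuming the collection's update chain is configured with Solr's DocBasedVersionConstraintsProcessorFactory. The endpoint, collection name, and the version field name product_version_l are assumptions.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class VersionedSolrUpdate {

    public static void main(String[] args) throws Exception {
        // Hypothetical Solr endpoint and collection
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/products").build()) {

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "SKU-123");
            doc.addField("price", 19.99);
            // User-defined version field: with DocBasedVersionConstraintsProcessorFactory
            // on the update chain, Solr ignores updates whose version is older than the
            // version already stored for this document
            doc.addField("product_version_l", 42L);

            solr.add(doc);
            // No explicit commit here; rely on the tuned autoCommit/autoSoftCommit settings
        }
    }
}
```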
- Aggregate XML files into larger chunks
  - Suitable for Spark
  - To be able to analyze processed messages using PySpark/spark-xml (see the sketch below)
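The analysis itself was done with PySpark/spark-xml; the following is only an equivalent sketch in Java showing how the aggregated chunks could be read with spark-xml. The S3 path, the rowTag, and the eventType column are placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ProcessedMessagesAnalysis {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("processed-messages-analysis")
                .master("local[*]")              // local run for ad-hoc analysis
                .getOrCreate();

        // Read the aggregated XML chunks; rowTag and path are assumptions
        Dataset<Row> messages = spark.read()
                .format("xml")                   // provided by the spark-xml package
                .option("rowTag", "product")
                .load("s3a://analytics-bucket/aggregated/*.xml");

        // Example statistic: number of processed messages per event type
        // ("eventType" is a placeholder column name)
        messages.groupBy("eventType").count().show();

        spark.stop();
    }
}
```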
- Implement a proof-of-concept using Kafka Streams
  - To check design options
  - To evaluate stream processing performance, including RocksDB
  - Set up Managed Streaming for Apache Kafka (MSK) on AWS
- Implement a proof-of-concept using Redis as a possible alternative to RocksDB
  - To evaluate the pros and cons of a non-distributed key/value store (see the sketch below)
  - Set up AWS ElastiCache for Redis
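The project description does not name a Redis client; the sketch below assumes Lettuce and a local Redis (or an ElastiCache endpoint) to show the kind of key/value access that was compared against a RocksDB state store. Key names are illustrative.

```java
import io.lettuce.core.RedisClient;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.api.sync.RedisCommands;

public class RedisStateStorePoc {

    public static void main(String[] args) {
        // Hypothetical endpoint; for ElastiCache, use the cluster's primary endpoint URI
        RedisClient client = RedisClient.create("redis://localhost:6379");

        try (StatefulRedisConnection<String, String> connection = client.connect()) {
            RedisCommands<String, String> redis = connection.sync();

            // Store and read back intermediate state, mimicking what a RocksDB
            // state store would hold inside a Kafka Streams instance
            redis.set("product:SKU-123:lastVersion", "42");
            String lastVersion = redis.get("product:SKU-123:lastVersion");
            System.out.println("last seen version: " + lastVersion);
        } finally {
            client.shutdown();
        }
    }
}
```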
- Design and implement a streaming ETL pipeline
- Extract data from the upstream systems
  - Using AWS Lambdas implemented in Java (see the sketch below)
  - Using Micronaut and Reactor as frameworks
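A minimal sketch of such an extraction Lambda using a Reactor pipeline. It uses the plain AWS RequestHandler interface for brevity; in the actual project Micronaut provided configuration management and dependency injection. The handler name and the upstream call are placeholders.

```java
import java.time.Duration;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import reactor.core.publisher.Mono;

public class ProductExtractHandler implements RequestHandler<String, String> {

    @Override
    public String handleRequest(String productId, Context context) {
        return Mono.fromCallable(() -> fetchFromUpstream(productId)) // wrap the upstream call
                .timeout(Duration.ofSeconds(10))                     // fail fast if the upstream system is slow
                .doOnNext(payload -> context.getLogger().log("extracted " + productId))
                .block();                                            // Lambda handlers are synchronous at the boundary
    }

    private String fetchFromUpstream(String productId) {
        // Placeholder for the real upstream call (e.g. an HTTP request)
        return "{\"productId\":\"" + productId + "\"}";
    }
}
```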
- Transform data using Kafka Streams (see the topology sketch below)
  - Migrate code from the batch system
  - Validate, filter, and enrich incoming data
  - Using RocksDB as the store for distributed state management
  - Using Micronaut for configuration management, dependency injection, declarative Kafka producers/consumers, etc.
  - Using Jackson to read and write both XML and JSON
  - Provide Docker images for installation
  - Deploy to AWS using EC2, later migrated to AWS Fargate
  - Using Kafka Streams Test Utilities to implement unit tests
  - Using Testcontainers and WireMock to implement integration tests
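A minimal Kafka Streams topology sketch for this stage: validate and enrich product updates, and materialize the latest state per product in a persistent (RocksDB-backed) store. Topic names, serdes, and the validation/enrichment logic are assumptions standing in for the code migrated from the batch system.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class ProductTransformTopology {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Raw product updates keyed by product id; topic names are assumptions
        KStream<String, String> updates = builder.stream(
                "product-updates-raw", Consumed.with(Serdes.String(), Serdes.String()));

        updates
                .filter((productId, payload) -> payload != null && !payload.isBlank()) // validation placeholder
                .mapValues(ProductTransformTopology::enrich)                           // enrichment placeholder
                .to("product-updates-enriched", Produced.with(Serdes.String(), Serdes.String()));

        // Keep the latest state per product in a persistent store (RocksDB by default)
        builder.table(
                "product-updates-enriched",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.as("product-state-store"));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "product-transform");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // MSK brokers in the real setup

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }

    private static String enrich(String payload) {
        // Placeholder for the validation/enrichment logic migrated from the batch system
        return payload;
    }
}
```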
- Load data into Solr using Kafka Connect
  - To update data in Solr using the Kafka Connect Solr sink
- Write metrics to CloudWatch
  - To provide statistics and to analyze streaming performance (see the sketch below)
  - Event types like new, update, delete
  - Event categories like price and availability
  - Using Grafana dashboards as an additional user interface
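A minimal sketch of publishing such a metric with the AWS SDK for Java v2, using event type and event category as dimensions. The namespace, metric name, dimension names, and values are assumptions.

```java
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.MetricDatum;
import software.amazon.awssdk.services.cloudwatch.model.PutMetricDataRequest;
import software.amazon.awssdk.services.cloudwatch.model.StandardUnit;

public class PipelineMetrics {

    public static void main(String[] args) {
        try (CloudWatchClient cloudWatch = CloudWatchClient.create()) {

            // One data point: processed "update" events in the "price" category
            MetricDatum datum = MetricDatum.builder()
                    .metricName("ProcessedEvents")
                    .unit(StandardUnit.COUNT)
                    .value(125.0)
                    .dimensions(
                            Dimension.builder().name("EventType").value("update").build(),
                            Dimension.builder().name("EventCategory").value("price").build())
                    .build();

            cloudWatch.putMetricData(PutMetricDataRequest.builder()
                    .namespace("ProductPipeline")
                    .metricData(datum)
                    .build());
        }
    }
}
```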
- Implement a proof-of-concept for integration tests using LocalStack (see the sketch below)
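One way such a proof-of-concept can be wired up is with the Testcontainers LocalStack module, pointing the AWS SDK at the emulated S3 endpoint instead of real AWS. The image tag and bucket name below are assumptions.

```java
import org.testcontainers.containers.localstack.LocalStackContainer;
import org.testcontainers.containers.localstack.LocalStackContainer.Service;
import org.testcontainers.utility.DockerImageName;

import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public class LocalStackS3Poc {

    public static void main(String[] args) {
        // Start LocalStack with only the S3 service enabled; image tag is an assumption
        try (LocalStackContainer localstack = new LocalStackContainer(
                DockerImageName.parse("localstack/localstack:3.0"))
                .withServices(Service.S3)) {

            localstack.start();

            // Point the AWS SDK at the LocalStack endpoint instead of real AWS
            S3Client s3 = S3Client.builder()
                    .endpointOverride(localstack.getEndpointOverride(Service.S3))
                    .credentialsProvider(StaticCredentialsProvider.create(
                            AwsBasicCredentials.create(localstack.getAccessKey(), localstack.getSecretKey())))
                    .region(Region.of(localstack.getRegion()))
                    .build();

            s3.createBucket(b -> b.bucket("pipeline-test-bucket"));
            System.out.println("buckets: " + s3.listBuckets().buckets());
        }
    }
}
```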
- Run load tests to find bottlenecks
  - Apply performance optimizations
- Use Spark to analyze streaming data
  - To check the processed data
  - To get statistics for system design and performance optimization
Roles
- System Architect, Data Engineer
- Design, development, integration, testing, performance tuning
Industry, industrial sector
- Retail
- E-Commerce
- Marketplace