Data ingestion pipeline to update product data
Streaming pipeline using Kafka on AWS, Marketplace, retail e-commerce company, June 2021
About the project
- Overall goal: Design a pipeline to update product data continuously using Kafka Streams on AWS
  - For example product information, price, availability, attributes for filters, etc.
 
- Extend the existing ETL batch system so that it can support delta processing
  - As an intermediate step towards a streaming solution
  - Store and manage delete operations, as this information was required to support incremental changes
  - Using MongoDB as the state store
  - Using mongo-java-server for MongoDB integration tests (see the sketch below)
  - Using spring-test for Spring Boot integration tests
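
A minimal sketch of such a mongo-java-server based integration test, with made-up collection and field names:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import de.bwaldvogel.mongo.MongoServer;
import de.bwaldvogel.mongo.backend.memory.MemoryBackend;
import org.bson.Document;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

import java.net.InetSocketAddress;

import static org.junit.jupiter.api.Assertions.assertEquals;

class DeleteStoreTest {

    private MongoServer server;
    private MongoClient client;

    @BeforeEach
    void setUp() {
        // In-memory MongoDB, no real database or Docker container needed
        server = new MongoServer(new MemoryBackend());
        InetSocketAddress address = server.bind();
        client = MongoClients.create(
                "mongodb://" + address.getHostName() + ":" + address.getPort());
    }

    @AfterEach
    void tearDown() {
        client.close();
        server.shutdown();
    }

    @Test
    void storesDeleteOperationsForDeltaProcessing() {
        // Hypothetical collection and fields that record delete operations for the delta runs
        MongoCollection<Document> deletes =
                client.getDatabase("etl").getCollection("deleted_products");

        deletes.insertOne(new Document("productId", "P-123"));

        Document stored = deletes.find(Filters.eq("productId", "P-123")).first();
        assertEquals("P-123", stored.getString("productId"));
    }
}
```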
 
- Implement a streaming pipeline using AWS Lambdas to support continuous incremental changes
  - As an intermediate step, building on the existing batch system
  - Using S3 bucket notifications, SNS, SQS, SNS-to-SQS subscriptions, and message filtering based on tags (see the handler sketch below)
  - Using Python and Java for the AWS Lambdas
  - Using Quarkus as the framework, and as the basis for building native executables for AWS Lambda
  - Deployment to AWS using CloudFormation
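
A minimal sketch of the Java side of that pipeline: a Lambda handler consuming the S3 event notifications after they have been fanned out via SNS and SQS (class name and processing are placeholders; with Quarkus, such a handler can also be built as a native executable):

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.SQSEvent;

// Illustrative handler; the real parsing and processing are omitted
public class ProductUpdateHandler implements RequestHandler<SQSEvent, Void> {

    @Override
    public Void handleRequest(SQSEvent event, Context context) {
        for (SQSEvent.SQSMessage message : event.getRecords()) {
            // The body carries the S3 event notification (wrapped by SNS when the
            // queue is subscribed to a topic)
            context.getLogger().log("received message: " + message.getBody());
            // ... fetch the object from S3, apply the delta, forward it downstream
        }
        return null;
    }
}
```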
 
- Store results in Solr incrementally
  - To provide continuous updates, so that changes become visible as fast as possible
  - Solr is used as the full-text search engine
  - Handle out-of-order events: ignore stale updates via conditional updates based on a document's version (document-based versioning, see the sketch below)
  - Optimize performance by adjusting the auto-commit settings
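
A sketch of the conditional update with SolrJ, assuming document-based versioning is configured in solrconfig.xml via DocBasedVersionConstraintsProcessorFactory with ignoreOldUpdates=true; the collection name and the custom version field are hypothetical:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SolrIncrementalUpdate {

    public static void main(String[] args) throws Exception {
        // Collection and field names are illustrative
        try (SolrClient solr = new Http2SolrClient.Builder("http://localhost:8983/solr").build()) {

            SolrInputDocument doc = new SolrInputDocument();
            doc.setField("id", "P-123");
            doc.setField("price", 19.99);
            // Custom version field checked by DocBasedVersionConstraintsProcessorFactory:
            // an update carrying an older version than the stored document is dropped
            doc.setField("product_version_l", 42L);

            solr.add("products", doc);
            // No explicit commit: visibility is controlled by the autoCommit /
            // autoSoftCommit settings in solrconfig.xml
        }
    }
}
```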
 
- Aggregate XML files into larger chunks
  - Suitable for Spark
  - To be able to analyze the processed messages using PySpark/spark-xml (see the sketch below)
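
The analysis itself used PySpark with spark-xml; a rough sketch of such a read via Spark's Java API (input path, rowTag, and the eventType column are made up):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AnalyzeProcessedMessages {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("analyze-processed-messages")
                .master("local[*]")
                .getOrCreate();

        // spark-xml turns one XML element per rowTag into one row;
        // path, rowTag, and column names are hypothetical
        Dataset<Row> messages = spark.read()
                .format("xml")
                .option("rowTag", "product")
                .load("s3a://pipeline-bucket/aggregated/*.xml");

        messages.groupBy("eventType").count().show();

        spark.stop();
    }
}
```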
 
- Implement a proof-of-concept using Kafka Streams
  - To check design options
  - To evaluate stream processing performance, including RocksDB
  - Set up Managed Streaming for Apache Kafka (MSK) on AWS
 
- Implement a proof-of-concept using Redis as a possible alternative to RocksDB
  - To evaluate the pros and cons of a non-distributed key/value store
  - Set up AWS ElastiCache for Redis
 
- Design and implement a streaming ETL pipeline
- Extract data from the upstream systems
  - Using AWS Lambdas implemented in Java
  - Using Micronaut and Reactor as frameworks (see the client sketch below)
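
A sketch of the extraction side with Micronaut's declarative HTTP client returning Reactor types (the upstream service, path, and query parameter are made up):

```java
import io.micronaut.http.annotation.Get;
import io.micronaut.http.annotation.QueryValue;
import io.micronaut.http.client.annotation.Client;
import reactor.core.publisher.Mono;

// Declarative client for a hypothetical upstream product service;
// Micronaut generates the implementation at compile time
@Client("https://upstream.example.com")
public interface ProductClient {

    // Fetch the raw payload of products changed since the given timestamp
    @Get("/products/changed")
    Mono<String> changedProducts(@QueryValue String since);
}
```

A Lambda entry point (for example one extending MicronautRequestHandler) can then inject this client and block on the Mono before returning.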
 
- Transform data using Kafka Streams
  - Migrate code from the batch system
  - Validate, filter, and enrich incoming data (see the topology sketch below)
  - Using RocksDB as the store for distributed state management
  - Using Micronaut for configuration management, dependency injection, declarative Kafka producers/consumers, etc.
  - Using Jackson to read and write both XML and JSON
  - Provide Docker images for installation
  - Deploy to AWS on EC2, later migrated to AWS Fargate
  - Using the Kafka Streams test utilities to implement unit tests
  - Using Testcontainers and WireMock to implement integration tests
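
A condensed sketch of that topology: drop invalid records and enrich the rest via a join against a table backed by a RocksDB state store (topic names, the plain String payloads, and the join logic stand in for the real domain types):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.state.KeyValueStore;

public class ProductTopology {

    // Topic names and the String payloads are placeholders
    static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Reference data (e.g. filter attributes per category), materialized in a
        // RocksDB-backed state store
        KTable<String, String> categories = builder.table(
                "categories",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("categories-store"));

        KStream<String, String> products = builder.stream(
                "raw-products", Consumed.with(Serdes.String(), Serdes.String()));

        products
                // validate/filter: drop records without a payload
                .filter((productId, payload) -> payload != null && !payload.isEmpty())
                // enrich: join each product with the reference data from the state store
                .leftJoin(categories, (payload, category) ->
                        category == null ? payload : payload + ";category=" + category)
                .to("enriched-products", Produced.with(Serdes.String(), Serdes.String()));

        return builder.build();
    }
}
```

Such a topology can be unit-tested with the TopologyTestDriver from the Kafka Streams test utilities.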
 
- Load data into Solr using Kafka Connect
  - To update data in Solr using the Kafka Connect Solr sink connector
 
- Write metrics to CloudWatch
  - To provide statistics and to analyze the streaming performance (see the sketch below)
  - Event types like new, update, and delete
  - Event categories like price and availability
  - Using Grafana dashboards as an additional user interface
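
A sketch of publishing such counters with the AWS SDK for Java v2 (namespace, metric name, and dimension names are made up):

```java
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.MetricDatum;
import software.amazon.awssdk.services.cloudwatch.model.PutMetricDataRequest;
import software.amazon.awssdk.services.cloudwatch.model.StandardUnit;

public class StreamingMetrics {

    private final CloudWatchClient cloudWatch = CloudWatchClient.create();

    // Publish one counter, tagged with the event type (new/update/delete)
    // and the event category (e.g. price, availability);
    // namespace, metric, and dimension names are illustrative
    void publishEventCount(String eventType, String category, double count) {
        MetricDatum datum = MetricDatum.builder()
                .metricName("ProcessedEvents")
                .unit(StandardUnit.COUNT)
                .value(count)
                .dimensions(
                        Dimension.builder().name("EventType").value(eventType).build(),
                        Dimension.builder().name("EventCategory").value(category).build())
                .build();

        cloudWatch.putMetricData(PutMetricDataRequest.builder()
                .namespace("ProductPipeline")
                .metricData(datum)
                .build());
    }
}
```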
 
- Implement a proof-of-concept for integration tests using LocalStack (see the sketch below)
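
One way to run LocalStack from JVM integration tests is the Testcontainers LocalStack module; a sketch under that assumption (image tag, services, and bucket name are illustrative):

```java
import org.testcontainers.containers.localstack.LocalStackContainer;
import org.testcontainers.containers.localstack.LocalStackContainer.Service;
import org.testcontainers.utility.DockerImageName;
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public class LocalStackSmokeTest {

    public static void main(String[] args) {
        try (LocalStackContainer localstack = new LocalStackContainer(
                DockerImageName.parse("localstack/localstack:3.0"))
                .withServices(Service.S3, Service.SQS)) {

            localstack.start();

            // Point the AWS SDK at the emulated S3 endpoint instead of AWS
            S3Client s3 = S3Client.builder()
                    .endpointOverride(localstack.getEndpointOverride(Service.S3))
                    .credentialsProvider(StaticCredentialsProvider.create(
                            AwsBasicCredentials.create(
                                    localstack.getAccessKey(), localstack.getSecretKey())))
                    .region(Region.of(localstack.getRegion()))
                    .build();

            // Bucket name is illustrative
            s3.createBucket(b -> b.bucket("pipeline-test-bucket"));
            System.out.println("buckets: " + s3.listBuckets().buckets());
        }
    }
}
```
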
- Run load tests to find bottlenecks
  - Apply performance optimizations
 
- Use Spark to analyze streaming data
  - To check the data
  - To get statistics for system design and performance optimization
 
 
Roles
- System Architect, Data Engineer
- Design, development, integration, testing, performance tuning
 
Industry, industrial sector
- Retail
- E-Commerce
- Marketplace