Data ingestion pipeline to update product data
Streaming pipeline using Kafka on AWS; marketplace of a retail e-commerce company; June 2021
About the project
- Overall goal: design a pipeline to update product data continuously using Kafka Streams on AWS
  - For example product information, price, availability, attributes for filters, etc.
- Extend the existing ETL batch system so that it can support delta processing
  - As an intermediate step towards a streaming solution
  - Store and manage delete operations, as this information was required to support incremental changes (see the sketch below)
  - Using MongoDB as the state store
  - Using mongo-java-server for MongoDB integration tests
  - Using spring-test for Spring Boot integration tests
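A minimal sketch of how delete operations could be recorded in MongoDB and exercised with mongo-java-server's in-memory backend, as used for the integration tests. The database, collection, and field names ("state", "deletedProducts", "productId") are assumptions for illustration, not the project's actual schema.

```java
import java.net.InetSocketAddress;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import de.bwaldvogel.mongo.MongoServer;
import de.bwaldvogel.mongo.backend.memory.MemoryBackend;

public class DeleteStateStoreTest {

    public static void main(String[] args) {
        // In-memory MongoDB provided by mongo-java-server, no real database needed
        MongoServer server = new MongoServer(new MemoryBackend());
        InetSocketAddress address = server.bind();

        try (MongoClient client = MongoClients.create(
                "mongodb://" + address.getHostName() + ":" + address.getPort())) {

            // Hypothetical collection that records delete operations for delta processing
            MongoCollection<Document> deletes =
                    client.getDatabase("state").getCollection("deletedProducts");

            // Record that a product was deleted in the current run
            deletes.insertOne(new Document("productId", "SKU-123")
                    .append("deletedAt", System.currentTimeMillis()));

            // Downstream delta processing can later read the recorded deletes
            System.out.println("stored deletes: " + deletes.countDocuments());
        } finally {
            server.shutdown();
        }
    }
}
```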
- Implement a streaming pipeline using AWS Lambdas to support continuous incremental changes
  - As an intermediate step, building on the existing batch system
  - Using S3, bucket notifications, SQS, SNS, SQS subscriptions, and message filtering based on tags (see the sketch below)
  - Using Python and Java for the AWS Lambdas
  - Using Quarkus as the framework, and as the basis for building native executables for AWS Lambda
  - Deployment to AWS using CloudFormation
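A minimal sketch of a Java Lambda triggered by an SQS queue in this kind of setup. The handler name is hypothetical, and the exact body format depends on the SNS subscription settings (raw message delivery or SNS envelope); the sketch only logs the notification.

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.SQSEvent;

/**
 * Lambda handler triggered by an SQS queue that is subscribed to the pipeline's
 * SNS topic; each message carries an S3 event notification.
 */
public class ProductUpdateHandler implements RequestHandler<SQSEvent, Void> {

    @Override
    public Void handleRequest(SQSEvent event, Context context) {
        for (SQSEvent.SQSMessage message : event.getRecords()) {
            // In the real pipeline the body would be parsed and the referenced
            // S3 object fetched and processed; here it is only logged.
            context.getLogger().log("received notification: " + message.getBody());
        }
        return null;
    }
}
```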
- Store results in Solr incrementally
  - To provide continuous updates, so that changes become visible as fast as possible
  - Solr is used as the full-text search engine
  - Handle out-of-order events: ignore old updates using conditional updates of a document based on its version (document-based versioning, see the sketch below)
  - Optimize performance by adjusting the auto-commit settings
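A minimal SolrJ sketch of such a versioned update, assuming the collection's update chain is configured with Solr's DocBasedVersionConstraintsProcessorFactory. The endpoint, collection name, and the version field name product_version_l are assumptions.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class VersionedSolrUpdate {

    public static void main(String[] args) throws Exception {
        // Hypothetical Solr endpoint and collection
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/products").build()) {

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "SKU-123");
            doc.addField("price", 19.99);
            // User-defined version field: with DocBasedVersionConstraintsProcessorFactory
            // on the update chain, Solr ignores updates whose version is older than the
            // version already stored for this document
            doc.addField("product_version_l", 42L);

            solr.add(doc);
            // No explicit commit here; rely on the tuned autoCommit/autoSoftCommit settings
        }
    }
}
```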
- Aggregate XML files into larger chunks
  - Suitable for Spark
  - To be able to analyze processed messages using PySpark/spark-xml (see the sketch below)
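The analysis itself was done with PySpark/spark-xml; the following is only an equivalent sketch in Java showing how the aggregated chunks could be read with spark-xml. The S3 path, the rowTag, and the eventType column are placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ProcessedMessagesAnalysis {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("processed-messages-analysis")
                .master("local[*]")              // local run for ad-hoc analysis
                .getOrCreate();

        // Read the aggregated XML chunks; rowTag and path are assumptions
        Dataset<Row> messages = spark.read()
                .format("xml")                   // provided by the spark-xml package
                .option("rowTag", "product")
                .load("s3a://analytics-bucket/aggregated/*.xml");

        // Example statistic: number of processed messages per event type
        // ("eventType" is a placeholder column name)
        messages.groupBy("eventType").count().show();

        spark.stop();
    }
}
```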
- Implement a proof-of-concept using Kafka Streams
  - To check design options
  - To evaluate stream processing performance, including RocksDB
  - Set up Managed Streaming for Apache Kafka (MSK) on AWS
- Implement a proof-of-concept using Redis as a possible alternative to RocksDB
  - To evaluate the pros and cons of a non-distributed key/value store (see the sketch below)
  - Set up AWS ElastiCache for Redis
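The project description does not name a Redis client; the sketch below assumes Lettuce and a local Redis (or an ElastiCache endpoint) to show the kind of key/value access that was compared against a RocksDB state store. Key names are illustrative.

```java
import io.lettuce.core.RedisClient;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.api.sync.RedisCommands;

public class RedisStateStorePoc {

    public static void main(String[] args) {
        // Hypothetical endpoint; for ElastiCache, use the cluster's primary endpoint URI
        RedisClient client = RedisClient.create("redis://localhost:6379");

        try (StatefulRedisConnection<String, String> connection = client.connect()) {
            RedisCommands<String, String> redis = connection.sync();

            // Store and read back intermediate state, mimicking what a RocksDB
            // state store would hold inside a Kafka Streams instance
            redis.set("product:SKU-123:lastVersion", "42");
            String lastVersion = redis.get("product:SKU-123:lastVersion");
            System.out.println("last seen version: " + lastVersion);
        } finally {
            client.shutdown();
        }
    }
}
```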
- Design and implement a streaming ETL pipeline
- Extract data from the upstream systems
  - Using AWS Lambdas implemented in Java (see the sketch below)
  - Using Micronaut and Reactor as frameworks
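A minimal sketch of such an extraction Lambda using a Reactor pipeline. It uses the plain AWS RequestHandler interface for brevity; in the actual project Micronaut provided configuration management and dependency injection. The handler name and the upstream call are placeholders.

```java
import java.time.Duration;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import reactor.core.publisher.Mono;

public class ProductExtractHandler implements RequestHandler<String, String> {

    @Override
    public String handleRequest(String productId, Context context) {
        return Mono.fromCallable(() -> fetchFromUpstream(productId)) // wrap the upstream call
                .timeout(Duration.ofSeconds(10))                     // fail fast if the upstream system is slow
                .doOnNext(payload -> context.getLogger().log("extracted " + productId))
                .block();                                            // Lambda handlers are synchronous at the boundary
    }

    private String fetchFromUpstream(String productId) {
        // Placeholder for the real upstream call (e.g. an HTTP request)
        return "{\"productId\":\"" + productId + "\"}";
    }
}
```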
- Transform data using Kafka Streams (see the topology sketch below)
  - Migrate code from the batch system
  - Validate, filter, and enrich incoming data
  - Using RocksDB as the store for distributed state management
  - Using Micronaut for configuration management, dependency injection, declarative Kafka producers/consumers, etc.
  - Using Jackson to read and write both XML and JSON
  - Provide Docker images for installation
  - Deploy to AWS using EC2, later migrated to AWS Fargate
  - Using Kafka Streams Test Utilities to implement unit tests
  - Using Testcontainers and WireMock to implement integration tests
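A minimal Kafka Streams topology sketch for this stage: validate and enrich product updates, and materialize the latest state per product in a persistent (RocksDB-backed) store. Topic names, serdes, and the validation/enrichment logic are assumptions standing in for the code migrated from the batch system.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class ProductTransformTopology {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Raw product updates keyed by product id; topic names are assumptions
        KStream<String, String> updates = builder.stream(
                "product-updates-raw", Consumed.with(Serdes.String(), Serdes.String()));

        updates
                .filter((productId, payload) -> payload != null && !payload.isBlank()) // validation placeholder
                .mapValues(ProductTransformTopology::enrich)                           // enrichment placeholder
                .to("product-updates-enriched", Produced.with(Serdes.String(), Serdes.String()));

        // Keep the latest state per product in a persistent store (RocksDB by default)
        builder.table(
                "product-updates-enriched",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.as("product-state-store"));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "product-transform");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // MSK brokers in the real setup

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }

    private static String enrich(String payload) {
        // Placeholder for the validation/enrichment logic migrated from the batch system
        return payload;
    }
}
```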
- Load data into Solr using Kafka Connect
  - To update data in Solr using the Kafka Connect Solr sink
- Write metrics to CloudWatch
  - To provide statistics and to analyze streaming performance (see the sketch below)
  - Event types like new, update, delete
  - Event categories like price and availability
  - Using Grafana dashboards as an additional user interface
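A minimal sketch of publishing such a metric with the AWS SDK for Java v2, using event type and event category as dimensions. The namespace, metric name, dimension names, and values are assumptions.

```java
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.MetricDatum;
import software.amazon.awssdk.services.cloudwatch.model.PutMetricDataRequest;
import software.amazon.awssdk.services.cloudwatch.model.StandardUnit;

public class PipelineMetrics {

    public static void main(String[] args) {
        try (CloudWatchClient cloudWatch = CloudWatchClient.create()) {

            // One data point: processed "update" events in the "price" category
            MetricDatum datum = MetricDatum.builder()
                    .metricName("ProcessedEvents")
                    .unit(StandardUnit.COUNT)
                    .value(125.0)
                    .dimensions(
                            Dimension.builder().name("EventType").value("update").build(),
                            Dimension.builder().name("EventCategory").value("price").build())
                    .build();

            cloudWatch.putMetricData(PutMetricDataRequest.builder()
                    .namespace("ProductPipeline")
                    .metricData(datum)
                    .build());
        }
    }
}
```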
- Implement a proof-of-concept for integration tests using LocalStack (see the sketch below)
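One way such a proof-of-concept can be wired up is with the Testcontainers LocalStack module, pointing the AWS SDK at the emulated S3 endpoint instead of real AWS. The image tag and bucket name below are assumptions.

```java
import org.testcontainers.containers.localstack.LocalStackContainer;
import org.testcontainers.containers.localstack.LocalStackContainer.Service;
import org.testcontainers.utility.DockerImageName;

import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public class LocalStackS3Poc {

    public static void main(String[] args) {
        // Start LocalStack with only the S3 service enabled; image tag is an assumption
        try (LocalStackContainer localstack = new LocalStackContainer(
                DockerImageName.parse("localstack/localstack:3.0"))
                .withServices(Service.S3)) {

            localstack.start();

            // Point the AWS SDK at the LocalStack endpoint instead of real AWS
            S3Client s3 = S3Client.builder()
                    .endpointOverride(localstack.getEndpointOverride(Service.S3))
                    .credentialsProvider(StaticCredentialsProvider.create(
                            AwsBasicCredentials.create(localstack.getAccessKey(), localstack.getSecretKey())))
                    .region(Region.of(localstack.getRegion()))
                    .build();

            s3.createBucket(b -> b.bucket("pipeline-test-bucket"));
            System.out.println("buckets: " + s3.listBuckets().buckets());
        }
    }
}
```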
- Run load tests to find bottlenecks
  - Apply performance optimizations
- Use Spark to analyze streaming data
  - To check the processed data
  - To get statistics for system design and performance optimization
Roles
- System Architect, Data Engineer
- Design, development, integration, testing, performance tuning
Industry, industrial sector
- Retail
- E-Commerce
- Marketplace