Project Brief: Secondary Sales Data Engineering
1. Background
Our company sells its products through a network of channel partners (Super Stockists) across various Indian states, who in turn supply a downstream distribution network. Because these Super Stockists use diverse software, the product names and SKUs in their reports often diverge from our standardized format, hindering the collection and analysis of secondary sales data. The data also arrives in a variety of formats (CSV, Excel, PDF), which further complicates the process.
2. Project Objective
To address this challenge, we propose a data engineering project focused on transforming and homogenizing secondary sales data received from Super Stockists. This project aims to achieve the following:
Standardize product names and SKUs: Map non-uniform names and SKUs used by Super Stockists to our standardized format.
Transform data format: Convert data from various formats (CSV, Excel, PDF) into a single, unified format.
Clean and validate data: Identify and correct any inconsistencies or errors within the data.
Aggregate and structure data: Organize the transformed data into a readily analyzable structure for downstream reporting and analysis.
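As an illustration of the name-standardization step, the sketch below maps a stockist's free-form product name to the closest entry in a canonical SKU catalogue using fuzzy matching. The catalogue entries, the similarity cutoff, and the use of Python are assumptions for illustration only; the same matching logic would port to whatever stack is finally chosen.

```python
import difflib
from typing import Optional

# Hypothetical canonical catalogue (standardized SKU -> standardized name);
# the real catalogue would come from our master product data.
CANONICAL_SKUS = {
    "CHOC-BAR-50G": "Choco Bar 50g",
    "CHOC-BAR-100G": "Choco Bar 100g",
    "MILK-BISC-200G": "Milk Biscuit 200g",
}

def standardize_product_name(raw_name: str, cutoff: float = 0.6) -> Optional[str]:
    """Map a stockist's free-form product name to the closest canonical SKU.

    Returns None when nothing in the catalogue is similar enough, so the
    row can be routed to manual review instead of being silently mis-mapped.
    """
    by_name = {name.lower(): sku for sku, name in CANONICAL_SKUS.items()}
    match = difflib.get_close_matches(raw_name.lower(), by_name, n=1, cutoff=cutoff)
    return by_name[match[0]] if match else None
```

For example, `standardize_product_name("Choco Bar 50 g")` resolves to `"CHOC-BAR-50G"` despite the extra space, while an unrecognized name returns `None` for manual review.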
3. Expected Deliverables
Data pipeline: An automated pipeline for ingesting, transforming, and cleaning secondary sales data.
Standardized dataset: A clean and consistent dataset with uniform product names, SKUs, and format.
Data quality report: A detailed report outlining the data cleaning process, identified issues, and applied corrections.
Documentation: Comprehensive documentation outlining the data pipeline, data transformation steps, and data format specifications.
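A minimal sketch of the ingestion side of such a pipeline, dispatching on file extension to produce uniform row dictionaries. Only the CSV branch is implemented (standard library); the Excel and PDF branches are stubs marking where format-specific loaders would plug in — openpyxl and pdfplumber are named only as plausible choices, not committed tooling.

```python
import csv
from pathlib import Path

def load_rows(path: str) -> list[dict]:
    """Read one stockist file into a list of uniform row dicts.

    The dispatcher keeps the rest of the pipeline independent of the
    source format: every branch must yield the same row shape.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        # utf-8-sig tolerates the BOM that Excel-exported CSVs often carry.
        with open(path, newline="", encoding="utf-8-sig") as f:
            return list(csv.DictReader(f))
    if suffix in (".xls", ".xlsx"):
        raise NotImplementedError("Excel ingestion (e.g. via openpyxl)")
    if suffix == ".pdf":
        raise NotImplementedError("PDF table extraction (e.g. via pdfplumber)")
    raise ValueError(f"Unsupported source format: {suffix!r}")
```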
4. Key Success Factors
Accuracy: The standardized dataset must reflect accurate and consistent product information.
Completeness: The pipeline should capture and process all secondary sales data received from Super Stockists.
Efficiency: The pipeline should operate efficiently to minimize processing time and resource consumption.
Scalability: The solution should be scalable to accommodate future growth in data volume.
Maintainability: The pipeline and code should be well-documented and easy to maintain for future updates and modifications.
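The accuracy and completeness criteria above could be enforced with simple row-level validation checks that feed the data quality report. The column names used here (sku, quantity, sale_date) are hypothetical placeholders for whatever schema the data source review settles on.

```python
def validate_row(row: dict) -> list[str]:
    """Return a list of data-quality issues for one sales record (empty = clean)."""
    issues = []
    # Completeness: required fields must be present and non-blank.
    for col in ("sku", "quantity", "sale_date"):
        if not str(row.get(col, "")).strip():
            issues.append(f"missing {col}")
    # Accuracy: quantity must be a non-negative number.
    try:
        if float(row.get("quantity", 0)) < 0:
            issues.append("negative quantity")
    except ValueError:
        issues.append("non-numeric quantity")
    return issues
```

Rows with a non-empty issue list would be logged for the data quality report and routed to correction rather than loaded into the standardized dataset.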
5. Next Steps
Detailed project proposal: Prepare a detailed project proposal outlining the proposed methodology, resources required, timelines, and project costs.
Data source review: Conduct a comprehensive review of the data sources (formats, content, etc.) from Super Stockists.
Data quality assessment: Evaluate the initial data quality and identify potential challenges and cleaning requirements.
Prototype development: Develop and test a prototype of the data pipeline to demonstrate feasibility and address any technical hurdles.
Project kickoff meeting: Convene a kickoff meeting with key stakeholders to finalize project scope, deliverables, and timeline.
Our organization primarily uses Microsoft technologies, so a Microsoft-based stack would be preferred for the solution.
Project ID: #37527200