
ETL pipeline

Scenario:

Each day we receive data from a collaborating hospital about patients' blood glucose levels. A patient has their level measured three times, and those readings are averaged to determine whether the patient's blood sugar level is normal, prediabetic, or diabetic: a level below 140 mg/dL (7.8 mmol/L) is normal, a level between 140 and 199 mg/dL (7.8 and 11.0 mmol/L) indicates prediabetes, and a level of 200 mg/dL (11.1 mmol/L) or more after two hours indicates diabetes. Typically a file will contain all three readings for a patient, but occasionally the hospital's lab information system is out of sync and we will receive some readings for a patient at a later date.
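
The thresholds above could be applied roughly as in the sketch below. The function name and label strings are hypothetical, and two choices are assumptions: an average of exactly 200 mg/dL counts as diabetes, and a patient with fewer than three readings is left undetermined.

```python
from typing import Optional, Sequence, Tuple

NORMAL_BELOW = 140.0    # mg/dL, from the scenario text
DIABETES_FROM = 200.0   # mg/dL; assumption: exactly 200 counts as diabetes


def classify(readings: Sequence[float]) -> Tuple[Optional[float], str]:
    """Return (average, label) for one patient's glucose readings in mg/dL."""
    # Assumption: without all three readings the status cannot be determined.
    if len(readings) < 3:
        return None, "undetermined"
    average = sum(readings) / len(readings)
    if average < NORMAL_BELOW:
        return average, "normal"
    if average < DIABETES_FROM:
        # Fractional averages between 199 and 200 are treated as prediabetes.
        return average, "prediabetes"
    return average, "diabetes"
```

For example, classify([100, 110, 120]) yields an average of 110.0 with the label "normal", while classify([150, 160]) yields no average and the label "undetermined".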

The data we receive is in CSV format, and each file is named after the date it was transferred (2020-10-28 in the example attached). The files are uploaded each morning to the same directory in a shared S3 bucket. The files contain protected health information (PHI), which we are not allowed to store. PHI includes names, addresses, hospital identification numbers, and anything else that could be used to personally identify the patient.
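
A minimal sketch of locating the day's file and dropping PHI follows. The column names, the "incoming" prefix, and the ".csv" extension are all hypothetical, since the attached example file is not shown here. Replacing the hospital ID with a keyed hash, so that late readings can still be linked to the same patient, is an assumption; whether such a pseudonym is acceptable under the PHI rules would need to be confirmed.

```python
import hashlib
import hmac
from datetime import date
from typing import Dict

# Hypothetical column names; the real file layout is not known here.
PHI_COLUMNS = {"first_name", "last_name", "address", "hospital_id"}
ID_COLUMN = "hospital_id"


def object_key(transfer_date: date, prefix: str = "incoming") -> str:
    """Build the assumed S3 key for the file transferred on a given date."""
    return f"{prefix}/{transfer_date.isoformat()}.csv"


def pseudonymize(value: str, secret: bytes) -> str:
    """Keyed SHA-256 hash: stable for the same input, not reversible without the key."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


def strip_phi(row: Dict[str, str], secret: bytes) -> Dict[str, str]:
    """Drop PHI columns and add a pseudonymous patient key in their place."""
    clean = {name: value for name, value in row.items() if name not in PHI_COLUMNS}
    clean["patient_key"] = pseudonymize(row[ID_COLUMN], secret)
    return clean
```

The secret would have to be kept outside the stored data, otherwise the hash offers little protection.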

Goal:

Design an ETL application to run each morning that ingests the new CSV file and persists the data in a database or the data file format of your choice. Assume that the volume of data will eventually grow to multiple TB and design your application accordingly.
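
One way to make the morning run safe to repeat is to record which files have already been ingested. The sketch below uses SQLite from the Python standard library purely as a stand-in; the table and function names are hypothetical, and at multi-TB scale the manifest would more plausibly live alongside the chosen database or file store.

```python
import sqlite3
from typing import Iterable, List


def open_manifest(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the table that records ingested file names."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_files ("
        " file_name TEXT PRIMARY KEY,"
        " processed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def claim_file(conn: sqlite3.Connection, file_name: str) -> bool:
    """Return True if the file is new; False if an earlier run already took it."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO processed_files (file_name) VALUES (?)",
        (file_name,),
    )
    conn.commit()
    return cur.rowcount == 1


def unprocessed(conn: sqlite3.Connection, listed: Iterable[str]) -> List[str]:
    """Filter a bucket listing down to files that have not been ingested yet."""
    done = {row[0] for row in conn.execute("SELECT file_name FROM processed_files")}
    return sorted(name for name in listed if name not in done)
```

Comparing the bucket listing against the manifest, rather than assuming only today's file is new, also picks up a file that arrived on a day the job did not run.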

Steps:

1. Where things are unclear, make assumptions and justify them with comments in the code

2. Write tests to ensure that your code and the data are correct

3. Remove protected health information (PHI)

4. Remove any invalid values and normalize where reasonable

5. Add a column that calculates the average of all three glucose measurements (if present)

6. Add a column that indicates whether the patient's glucose levels are normal, prediabetes, diabetes, or unable to be determined

7. Account for late data (for example, if we receive two readings in one day's CSV file and the third reading in the next day's file)
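
Steps 4, 5, and 7 could be sketched together as below, again with SQLite as a stand-in store. Several things here are assumptions rather than facts from the posting: each row is taken to carry a pseudonymous patient key and a reading number from 1 to 3, values are in mg/dL, and the plausibility range of 10 to 1000 mg/dL is a hypothetical cut-off for invalid values.

```python
import sqlite3
from typing import Dict, Iterable, Optional, Tuple

MIN_MG_DL, MAX_MG_DL = 10.0, 1000.0  # hypothetical plausibility bounds


def parse_reading(raw: object) -> Optional[float]:
    """Normalize a raw cell to a float in mg/dL, or None if it is invalid."""
    try:
        value = float(str(raw).strip())
    except ValueError:
        return None
    # NaN and infinities fail this range check and are rejected as well.
    return value if MIN_MG_DL <= value <= MAX_MG_DL else None


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings ("
        " patient_key TEXT NOT NULL, reading_no INTEGER NOT NULL,"
        " glucose_mg_dl REAL NOT NULL, source_file TEXT NOT NULL,"
        " PRIMARY KEY (patient_key, reading_no))"
    )
    return conn


def load_rows(conn: sqlite3.Connection,
              rows: Iterable[Tuple[str, int, object]],
              source_file: str) -> int:
    """Insert valid readings; duplicates of an existing reading are ignored."""
    inserted = 0
    for patient_key, reading_no, raw in rows:
        value = parse_reading(raw)
        if value is None:
            continue
        cur = conn.execute(
            "INSERT OR IGNORE INTO readings VALUES (?, ?, ?, ?)",
            (patient_key, reading_no, value, source_file),
        )
        inserted += cur.rowcount
    conn.commit()
    return inserted


def patient_averages(conn: sqlite3.Connection) -> Dict[str, Tuple[int, Optional[float]]]:
    """Map patient key to (reading count, average); average is None until 3 readings exist."""
    cur = conn.execute(
        "SELECT patient_key, COUNT(*),"
        " CASE WHEN COUNT(*) >= 3 THEN AVG(glucose_mg_dl) END"
        " FROM readings GROUP BY patient_key"
    )
    return {key: (count, avg) for key, count, avg in cur}
```

Because the average is derived from the stored readings each time, a third reading that arrives in the next day's file completes the patient's result without any special handling for late data.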

Skills: Python, SQL, ETL, Spark

About the Employer:
(0 reviews) Brooklyn, United States

Project ID: #30550065

8 freelancers are bidding on average $247 for this job
