
Closed · Paid on delivery
I have a Hadoop cluster holding several large data sets, and I need a seasoned PySpark developer who also writes rock-solid SQL. The immediate aim is to connect to the cluster (YARN/HDFS with Hive metastore), develop or refine PySpark jobs, optimise the accompanying SQL, and make sure everything runs smoothly end-to-end. You’ll receive access to a staging namespace plus a sample of the data. Once the logic checks out, we’ll promote the code to the full environment.

Deliverables
• A clean, well-commented PySpark notebook or .py job that executes successfully on the cluster
• The corresponding SQL script or view definitions ready for Hive or spark-sql
• A concise README detailing execution steps, parameters, and expected outputs

Acceptance criteria
• Jobs finish error-free on a 100 GB test slice
• Performance meets the runtime target we agree on before scaling up
• Output matches the sample I provide during onboarding

If you know Spark 3.x, HiveQL, and the nuances of tuning workloads on Hadoop, I look forward to working with you.
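For orientation, a minimal sketch of what connecting to such a cluster could look like in PySpark. It assumes the job is launched via spark-submit against YARN, and the database and table names are hypothetical placeholders, not part of the actual project:

```python
# Minimal sketch, assuming spark-submit with --master yarn;
# staging_db.events_sample is a hypothetical Hive table.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("staging-etl")     # hypothetical job name
    .enableHiveSupport()        # read table metadata from the Hive metastore
    .getOrCreate()
)

# Read a Hive-registered table, then run the same logic via spark-sql/HiveQL.
df = spark.table("staging_db.events_sample")
daily = spark.sql("""
    SELECT event_date, COUNT(*) AS n_events
    FROM staging_db.events_sample
    GROUP BY event_date
""")
daily.show()
```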
Project ID: 40227186
11 proposals
Remote project
Active 15 days ago
11 freelancers are bidding on average ₹7,141 INR for this job

Having spent over 7 years as a full-stack developer, working with notable companies such as MetLife GOSC and DXC Technologies, I have accumulated extensive experience in data processing and data analysis, particularly SQL, and can comfortably handle large-scale databases on platforms like Hadoop. In addition to being a proficient PySpark coder and a seasoned SQL expert, my skills in web scraping will also come in handy during the project. Since the Hadoop cluster contains massive data sets, distributed PySpark jobs will extract value from it far more efficiently than the conventional single-machine SQL approaches many developers default to.

I can guarantee a clean, well-commented PySpark notebook or .py job, along with the corresponding refined SQL script or view definitions needed for Hive or spark-sql. Finally, I understand that the stakes are high in data processing, since even minor inconsistencies can have significant downstream effects. Your project description demonstrates an appreciation for attention to detail and accuracy; those are values I share wholeheartedly. That's why I offer a comprehensive free four-day support service in case issues arise with my work after delivery. With my dedication to client satisfaction demonstrated on every job I perform, I am confident I'm exactly what you need for this project.
₹12,000 INR in 5 days
8.3

Hi there, I’ve reviewed your project and understand you need a PySpark developer experienced with Hadoop clusters, YARN/HDFS, and the Hive metastore. The goal is to develop and optimise PySpark jobs, refine SQL, and ensure smooth end-to-end execution on large datasets.

I can create a clean, well-commented PySpark notebook or .py job that runs successfully on your cluster, along with corresponding SQL scripts or Hive views. I’ll also provide a concise README detailing execution steps, parameters, and expected outputs, ensuring the jobs finish error-free on a 100 GB test slice and meet the agreed performance targets. My approach focuses on performance tuning, modular code, and reliable SQL logic to make scaling seamless.

I’m familiar with Spark 3.x, HiveQL, and workload optimisation on Hadoop, and I’ll validate all outputs against your sample data before promotion to the full environment. I’m ready to get started immediately and deliver a solution that runs efficiently, is easy to maintain, and is fully documented.

Best regards,
Muhammad Adil
Portfolio: https://www.freelancer.com/u/webmasters486
₹9,000 INR in 2 days
5.8

Hello, I can help you with this PySpark and SQL project. I have experience working with Python, Spark DataFrames, and SQL-based data transformations for ETL and analytics tasks. I’ve built data processing workflows using Spark, cleaned and transformed datasets, and generated structured outputs for analysis and reporting. I’m comfortable working in both local and cloud-based environments, and I focus on writing clear, efficient, and well-documented code. I’d be happy to review the requirements and start with a small test task if needed. Best regards.
₹7,000 INR in 7 days
2.6

With 7 years of experience in data engineering and analytics, I am the best fit for this requirement. I have the relevant skills to work with PySpark and SQL in a Hadoop environment.

How I will complete this project:
1. Connect to the Hadoop cluster (YARN/HDFS with Hive metastore)
2. Develop or refine PySpark jobs
3. Optimise SQL queries for better performance
4. Ensure smooth end-to-end execution
5. Create a clean, well-commented PySpark notebook or .py job
6. Develop corresponding SQL scripts or view definitions for Hive or spark-sql
7. Provide a concise README with execution steps, parameters, and expected outputs

Tech stack I will use:
- PySpark
- SQL
- HiveQL
- Hadoop

I have worked on similar solutions in the past and have experience with Spark 3.x and tuning workloads on Hadoop. I am confident in delivering error-free jobs on a 100 GB test slice and meeting performance targets. Looking forward to collaborating with you on this project.
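As an illustration of the "SQL scripts or view definitions" deliverable mentioned in step 6, a minimal sketch that runs from PySpark and works equally well pasted into spark-sql; all database, view, table, and column names are hypothetical:

```python
# Minimal sketch; staging_db, daily_totals, and transactions are
# hypothetical names, not taken from the actual project.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE OR REPLACE VIEW staging_db.daily_totals AS
    SELECT event_date, customer_id, SUM(amount) AS total_amount
    FROM staging_db.transactions
    GROUP BY event_date, customer_id
""")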
₹1,650 INR in 7 days
0.0

Hi, I have 4+ years of experience working with Hadoop (CDH/CDP), PySpark, Hive, and large-scale banking datasets (30M+ records/day). I have:
• Optimized Spark jobs (executor memory, shuffle tuning)
• Refactored MapReduce jobs to Tez
• Built and tuned PySpark pipelines on distributed clusters
• Delivered production jobs handling high-volume data

For your 100 GB workload, I will:
✔ Optimize the partitioning and shuffle strategy
✔ Ensure efficient SQL execution plans
✔ Deliver clean, production-ready code with documentation

Estimated delivery: 6 days
Bid: ₹15,000 INR

Looking forward to discussing your cluster configuration and runtime targets.
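For context, executor-memory and shuffle settings of the kind this bid mentions are typically set at session start. A minimal sketch with illustrative values only; real settings depend on the cluster, and on YARN these are usually supplied as spark-submit flags rather than hard-coded:

```python
from pyspark.sql import SparkSession

# Illustrative starting points only, not recommendations for this cluster.
spark = (
    SparkSession.builder
    .appName("tuned-job")                           # hypothetical name
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .config("spark.sql.shuffle.partitions", "400")  # tune for the 100 GB slice
    .enableHiveSupport()
    .getOrCreate()
)
```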
₹5,000 INR in 6 days
0.0

As a passionate and experienced Data Scientist, I have developed a deep understanding of the technologies you require for this project: PySpark, HiveQL, and Hadoop. My proficiency in Python matches your requirement for well-commented, functional PySpark notebooks or .py jobs. Moreover, I have firsthand SQL experience with different databases, including PostgreSQL and Oracle, which aligns well with your needs.

One aspect that sets me apart is my curious nature. I have an innate drive to learn new things and keep myself updated with the latest advancements. This has led me to acquire certificates such as Introduction to Deep Learning, Natural Language Processing, and Sequence Models; they won't be used directly on this project, but they reflect my commitment to self-improvement.

Most importantly, I'm not just about theory; I apply my knowledge in real-world situations. You can rely on me to design, develop, test, and optimize your PySpark jobs end-to-end. Whether it is connecting to your Hadoop cluster, running performance checks on data slices, or scaling up while maintaining efficiency, I'm ready for the challenge! My data processing skills, combined with an ability to deliver error-free results, align well with your project goals. Let's improve your data capabilities together!
₹7,000 INR in 7 days
0.0

Hi there! I am a Big Data Engineer specializing in Apache Spark 3.x and the Hadoop ecosystem. I read your requirements for connecting to a Hive metastore and optimizing SQL/PySpark jobs over a 100 GB dataset.

My technical approach:
• Connection: configure the SparkSession with the correct Hive support and YARN resource-manager settings (executors/memory) to ensure stability.
• Optimization: partitioning strategies to prevent data skew; broadcast joins for smaller lookup tables to speed up SQL queries; caching intermediate DataFrames (persist(MEMORY_AND_DISK)) to avoid re-computation.
• Deliverables: a clean Jupyter notebook and a standard SQL script compatible with spark-sql.

Why me: I have fresh, hands-on experience with the latest Spark 3.x features (Adaptive Query Execution), I understand the nuances of HDFS and YARN resource allocation, and I am ready to access your staging namespace and run a test on the sample data immediately.

Best,
Ilyas
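A minimal sketch of the broadcast-join and caching pattern this bid describes; the table and column names are hypothetical:

```python
# Broadcast join plus persist, as outlined above; all names are placeholders.
from pyspark import StorageLevel
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

facts = spark.table("staging_db.facts")       # large fact table (hypothetical)
lookup = spark.table("staging_db.dim_small")  # small lookup table (hypothetical)

# Broadcasting the small side replaces a shuffle join with a map-side join.
joined = facts.join(broadcast(lookup), "key_id")

# Persist an intermediate result that downstream steps reuse.
joined.persist(StorageLevel.MEMORY_AND_DISK)
```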
₹6,500 INR in 4 days
0.0

Hello, I’m a seasoned PySpark and SQL developer with hands-on experience working on Hadoop clusters (YARN, HDFS, Hive metastore) and delivering production-grade Spark 3.x pipelines. I’ve built and optimized PySpark jobs handling 100 GB+ datasets, focusing on performance tuning (partitioning strategies, broadcast joins, adaptive query execution, memory/executor tuning) and rock-solid HiveQL / Spark SQL. I’m comfortable moving seamlessly from staging to full-scale environments while ensuring data correctness and stability.

What I’ll deliver:
• A clean, well-documented PySpark notebook or .py job that runs reliably on your cluster
• Optimized SQL scripts / Hive views ready for Hive or spark-sql
• A concise README covering execution steps, parameters, dependencies, and expected outputs

How I work:
• Validate logic thoroughly on the staging namespace and sample data
• Ensure jobs run error-free on a 100 GB test slice
• Profile and tune performance to meet the agreed runtime target before scaling
• Cross-check outputs against your onboarding samples for accuracy

I’ve worked extensively with Spark 3.x, Hive, and the Hadoop ecosystem, and I understand the operational nuances that matter in real-world big data environments. Looking forward to collaborating and getting this running smoothly end-to-end.

Best regards,
Narendra
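For reference, the adaptive-query-execution tuning mentioned here maps onto standard Spark 3.x configuration flags. A minimal sketch, values illustrative only (AQE is already on by default in recent Spark 3.x releases):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .enableHiveSupport()
    # Spark 3.x AQE: re-optimises shuffle partitioning and joins at runtime.
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)
```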
₹9,500 INR in 6 days
0.0

I am a Data Engineer with experience in building scalable ETL/ELT pipelines, optimizing warehouse performance, and designing data models for analytics. I specialize in:
• Snowflake architecture & warehouse setup
• SQL transformation & query optimization
• Streams & Tasks automation
• Performance tuning & cost control
• Data migration & integration

I focus on delivering high-performance, cost-efficient, and reliable data solutions.
₹7,000 INR in 7 days
0.0

I’m very interested in supporting your Hadoop cluster project. I have strong experience with Spark 3.x, PySpark, HiveQL, YARN, and HDFS, and I’ve worked on large-scale distributed environments handling multi-terabyte datasets. I specialize in developing optimized PySpark jobs and writing efficient SQL that performs reliably in production.

For your staging setup, I will connect to the cluster, validate configurations, and profile the sample data. I’ll develop a clean, well-structured PySpark job (notebook or .py file as required), ensuring proper partitioning, join optimization, predicate pushdown, and shuffle tuning. The accompanying SQL scripts or Hive views will be optimized for performance and compatibility with Hive or spark-sql.

Before scaling, I’ll benchmark the job on the 100 GB test slice, tune Spark configurations (AQE, shuffle partitions, executor sizing), and ensure the runtime meets the agreed target. Output validation will be performed to guarantee consistency with your onboarding sample.

Deliverables will include production-ready PySpark code, optimized SQL scripts, and a concise README covering execution steps, parameters, and expected outputs. I look forward to collaborating and delivering a scalable, high-performance solution for your Hadoop environment.
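To illustrate the predicate-pushdown point this bid raises: filtering on a partition column before any join lets Spark prune whole HDFS partitions at the scan. A minimal sketch with hypothetical table and column names; event_date is assumed to be the Hive partition column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# event_date is assumed to be the table's partition column (hypothetical).
events = (
    spark.table("staging_db.events")
    .where("event_date >= '2024-01-01'")            # pruned at the scan
    .select("event_date", "customer_id", "amount")  # project only needed columns
)
events.explain()  # check PartitionFilters / PushedFilters in the physical plan
```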
₹7,000 INR in 4 days
0.0

Salem, India
Member since Feb 13, 2026