Expertise in designing and deploying Hadoop clusters and related analytical tools, including Pig, Hive, HBase, Sqoop, Kafka, and Spark, on the Cloudera distribution.
Working on a live 20-node Hadoop cluster running CDH 4.4.
Working with highly unstructured and semi-structured data of 40 TB (120 TB with a replication factor of 3).
Managing external tables in Hive for optimized performance.
Very good understanding of partitioning and bucketing in Hive.
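For illustration, a minimal HiveQL sketch of an external table that combines partitioning and bucketing; the table, columns, and location are hypothetical:

```sql
-- External table partitioned by load date and bucketed by user id.
-- Partitioning prunes directories at query time; bucketing clusters
-- rows into a fixed number of files for efficient joins and sampling.
CREATE EXTERNAL TABLE IF NOT EXISTS web_events (
  user_id BIGINT,
  url     STRING,
  ts      TIMESTAMP
)
PARTITIONED BY (load_date STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC
LOCATION '/data/warehouse/web_events';
```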
Developed Spark scripts in Scala per requirements on the Spark 1.5 framework.
Using Spark APIs over Cloudera Hadoop YARN to perform analytics on Hive data stored in HDFS.
Developed Scala scripts and UDFs using both DataFrames/Spark SQL and RDDs in Spark for data aggregation, queries, and writing data back to HDFS.
Exploring Spark to improve the performance of existing Hadoop algorithms using SparkContext, Spark DataFrames, pair RDDs, double RDDs, and YARN.
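As a minimal sketch of the pair-RDD aggregation pattern used in this work (the semantics of Spark's `reduceByKey`), emulated here in plain Python so it runs without a cluster; names and data are hypothetical:

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, fn):
    """Emulate Spark's reduceByKey on a list of (key, value) pairs:
    group values by key, then fold each group with the given function."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(fn, values) for key, values in groups.items()}

# Word-count-style aggregation, the canonical pair-RDD example.
events = [("hive", 1), ("spark", 1), ("hive", 1), ("hdfs", 1), ("spark", 1)]
counts = reduce_by_key(events, lambda a, b: a + b)
# counts == {"hive": 2, "spark": 2, "hdfs": 1}
```

In the actual job the same fold runs distributed, combining partial sums per partition before shuffling.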
Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
Experience in ingesting data from various sources into HDFS and facilitating report building on top of it per business requirements.
Performed transformation, cleaning, standardization, and filtering of data using Spark with Scala/Python and loaded the final data to HDFS.
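The cleaning and filtering step can be sketched as pure functions of the kind a Spark job would apply via `map`/`filter`; the field names and rules below are hypothetical:

```python
def standardize(record):
    """Trim whitespace and normalize casing so downstream joins
    and aggregations see consistent keys."""
    return {
        "user_id": record.get("user_id", "").strip(),
        "country": record.get("country", "").strip().upper(),
        "url": record.get("url", "").strip().lower(),
    }

def is_valid(record):
    """Drop records missing a user id, the join key downstream."""
    return bool(record["user_id"])

raw = [
    {"user_id": " 42 ", "country": "us", "url": "HTTP://EXAMPLE.COM"},
    {"user_id": "",     "country": "de", "url": "http://example.org"},
]
# In the Spark job this would be rdd.map(standardize).filter(is_valid);
# shown here on a plain list for illustration.
clean = [r for r in map(standardize, raw) if is_valid(r)]
# clean == [{"user_id": "42", "country": "US", "url": "http://example.com"}]
```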
Loaded data into Spark's immutable RDDs and performed in-memory computation for faster response times.
Analyzing how data currently processed by Informatica can be processed more effectively using Spark and its APIs.