- Developed the architecture document, process documentation, server diagrams and requisition documents.
- Developed ETL data pipelines using Spark, Spark Streaming and Scala (see the streaming ingestion sketch below).
- Loaded data from RDBMS to Hadoop using Sqoop
- Worked collaboratively to manage build-outs of large data clusters and real-time streaming with Spark.
- Responsible for loading data pipelines from web servers using Sqoop, Kafka and the Spark Streaming API.
- Developed Kafka producers, broker partitions and consumer groups (see the Kafka producer sketch below).
- Used Spark for interactive queries, processing of streaming data and integration with a popular NoSQL database for huge volumes of data.
- Developed batch jobs in Scala on the Spark framework to fetch data from AWS S3 storage and apply the required transformations (see the S3 batch sketch below).
- Implemented Spark jobs using Scala and Spark SQL for faster testing and processing of data.
- Processed data using MapReduce and YARN.
- Worked on Kafka as a proof of concept for log processing.
- Tested Apache Tez, an extensible framework for building high-performance batch and interactive data processing applications, on Pig and Hive jobs.
- Monitored the Hive metastore and the cluster nodes using Hue.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
- Created AWS EC2 instances and used JIT servers.
- Developed various UDFs in MapReduce and Python for Pig and Hive.
- Handled data integrity checks using Hive queries, Hadoop and Spark.
- Performed transformations and actions on RDDs and Spark Streaming data with Scala.
- Implemented machine learning algorithms using Spark with Python.
- Defined job flows and developed simple to complex MapReduce jobs as per requirements.
- Optimized MapReduce jobs to use HDFS efficiently by applying various compression mechanisms.
- Responsible for handling streaming data from web server console logs.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs.
- Developed Pig Latin scripts for the analysis of semi-structured data.
- Created Hive tables and was involved in data loading and writing Hive UDFs.
- Used Sqoop to import data into HDFS and Hive from other data systems.
- Installed and configured Apache Hadoop to test the maintenance of log files in the Hadoop cluster.
- Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
- Worked on developing ETL processes (Data Stage Open Studio) to load data from multiple data sources to HDFS using Flume and Sqoop, and performed structural modifications using MapReduce and Hive.
- Involved in NoSQL database design, integration and implementation
- Loaded data into the NoSQL database HBase (see the HBase client sketch below).
- Developed Kafka producers and consumers, HBase clients, and Spark and Hadoop MapReduce jobs, along with components on HDFS and Hive.
- Very good understanding of partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance (see the Hive table sketch below).
Environment: Spark, Spark Streaming, Apache Kafka, Hive, Tez, AWS, ETL, Pig, UNIX, Linux, Tableau, Teradata, Sqoop, HDFS, MapReduce, Flume, Informatica 9.1/8.1/7.1/6.1, Oracle 11g, Hadoop 2.x, NoSQL, Flat files, Eclipse
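
A minimal sketch of the Kafka-to-Spark Streaming ingestion described in the bullets above, assuming the spark-streaming-kafka-0-10 integration; the weblogs topic, broker address, consumer group and combined-format access logs are illustrative assumptions, not the actual project configuration:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object WebLogStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WebLogStream")
    val ssc = new StreamingContext(conf, Seconds(10))   // 10-second micro-batches

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",            // illustrative broker address
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "weblog-consumers",                 // illustrative consumer group
      "auto.offset.reset" -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("weblogs"), kafkaParams))

    // Transformations and actions on the DStream: count requests per HTTP status code
    stream.map(_.value)
      .filter(_.nonEmpty)
      .map(line => (line.split(" ")(8), 1L))            // field 9 is the status in combined log format
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```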
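
A minimal sketch of a Kafka producer of the kind mentioned above, using the standard Kafka Java client from Scala; the broker address, topic name and log file path are illustrative assumptions:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object LogProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")   // illustrative broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("acks", "all")

    val producer = new KafkaProducer[String, String](props)
    try {
      // Keying by client host spreads records deterministically across the topic's partitions
      scala.io.Source.fromFile("/var/log/httpd/access_log").getLines().foreach { line =>
        producer.send(new ProducerRecord[String, String]("weblogs", line.split(" ")(0), line))
      }
    } finally {
      producer.close()
    }
  }
}
```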
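
A minimal sketch of a Scala batch job that pulls data from AWS S3 and applies transformations with Spark, as referenced above; the bucket, schema and output path are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object S3BatchTransform {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("S3BatchTransform").getOrCreate()

    // Illustrative bucket/prefix; S3 credentials are assumed to come from the
    // instance profile or core-site.xml
    val raw = spark.read
      .option("header", "true")
      .csv("s3a://example-bucket/incoming/orders/*.csv")

    // Typical cleansing transformations before the data is exposed to Hive
    val cleaned = raw
      .filter(col("order_id").isNotNull)
      .withColumn("order_ts", to_timestamp(col("order_ts")))
      .dropDuplicates("order_id")

    cleaned.write.mode("overwrite").parquet("hdfs:///warehouse/staging/orders")
    spark.stop()
  }
}
```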
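
A minimal sketch of the managed/external table design with partitioning and bucketing mentioned above, assuming a Hive metastore is reachable from Spark; the table names, columns and HDFS location are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object HiveTableSetup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveTableSetup")
      .enableHiveSupport()                 // assumes a configured Hive metastore
      .getOrCreate()

    // External table: Hive owns only the schema; the data stays under the given HDFS path
    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS raw_orders (
        order_id STRING, customer_id STRING, amount DOUBLE)
      PARTITIONED BY (order_date STRING)
      STORED AS PARQUET
      LOCATION 'hdfs:///warehouse/staging/orders'
    """)

    // Managed table written through Spark, partitioned by date and bucketed by customer_id
    // so joins and lookups on customer_id touch fewer files
    spark.table("raw_orders")
      .write
      .partitionBy("order_date")
      .bucketBy(32, "customer_id")
      .sortBy("customer_id")
      .format("parquet")
      .saveAsTable("curated_orders")

    spark.stop()
  }
}
```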
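
A minimal sketch of an HBase client write as referenced above, assuming an hbase-site.xml on the classpath pointing at the cluster's ZooKeeper quorum; the table name, column family and row contents are illustrative:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseLoader {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()                   // picks up hbase-site.xml
    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf("weblog_events"))  // illustrative table

    try {
      // Row key: host plus timestamp; single column family "d" holding the raw log line
      val put = new Put(Bytes.toBytes("10.0.0.1-1490000000"))
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("raw"),
        Bytes.toBytes("GET /index.html HTTP/1.1 200"))
      table.put(put)
    } finally {
      table.close()
      connection.close()
    }
  }
}
```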