Balaji
pb_2910@yahoo.com
206-604-2864
13809 NE 11th Street
Bellevue, WA 98005
Big Data Developer
12 years experience W2
Summary

  • 9.9 years of experience in IT, including analysis, design, and development of Big Data solutions using Hadoop and Spark, and databases including SQL Server, MySQL, Oracle, Teradata, IMS DB, and DB2.
  • Over 4.5 years of work experience in Big Data, with hands-on experience in Hadoop ecosystem components such as Hadoop MapReduce, HDFS, ZooKeeper, Hive, HBase, Sqoop, Apache NiFi, Oozie, Spark, Ctrl-M, and Kafka.
  • Good understanding of Hadoop architecture and hands-on experience with Hadoop components such as YARN, JobTracker, TaskTracker, NameNode, DataNode, Application Master, MapReduce concepts, and the HDFS framework.
  • Experience in using Apache Ambari for installation and management of single-node and multi-node Hadoop clusters (Ambari 2.5).
  • Experience in data load management and in importing and exporting data using Sqoop.
  • Experience in scheduling and monitoring jobs through Oozie, ZooKeeper, and Ctrl-M.
  • Experience in writing MapReduce programs and Hive UDFs in Java.
  • Experience in integrating Hive and HBase for effective operations.
  • Experience in writing end-to-end Spark (2.3) data processing as part of the Azure cost usage project.
  • Experience in developing Scala programs that implement Spark Streaming integrated with Kafka and triggered via NiFi (a brief sketch follows this summary).
  • Good knowledge of writing scripts using the Bash shell on Linux.
  • Experience in understanding and querying databases such as Teradata, Oracle, MySQL, and SQL Server and integrating them with Hadoop HDFS storage.
  • Worked on different file formats (ORC, Avro, text) and different compression codecs (GZIP, Snappy).
  • Strong understanding of data warehouse concepts and ETL, with data modeling experience covering normalization, business process analysis, reengineering, dimensional data modeling, and physical and logical data modeling.
  • Experience in Java concepts such as OOP, collections, multithreading, and JDBC, with strong analytical and problem-solving skills and the ability to follow through with projects from inception to completion.
  • Utilized the Apache Hadoop environment provided by Hortonworks.
  • Knowledge of the software development life cycle (requirements analysis, design, development, testing, deployment, and support).
  • Implemented an end-to-end cloud data processing framework by creating an Azure HDInsight Spark cluster and Azure ADF pipelines and integrating them with SQL Server at the source and Azure Blob Storage accounts at the destination.
  • Good interpersonal communication and problem-solving skills; explores and adapts to new technologies with ease; a good team player.
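
The following is a minimal, illustrative sketch of the Spark-plus-Kafka streaming pattern mentioned above, shown here with Structured Streaming; the broker address, topic name, and output paths are placeholder assumptions, not values from any project described below.

    // Minimal Spark Structured Streaming sketch: read events from Kafka and land them on HDFS.
    // Requires the spark-sql-kafka package on the classpath; broker, topic, and paths are placeholders.
    import org.apache.spark.sql.SparkSession

    object KafkaStreamSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("KafkaStreamSketch").getOrCreate()

        val events = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  // placeholder broker
          .option("subscribe", "events")                      // placeholder topic
          .load()
          .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

        events.writeStream
          .format("parquet")
          .option("path", "/data/streaming/events")           // placeholder output path
          .option("checkpointLocation", "/data/streaming/_checkpoints")
          .start()
          .awaitTermination()
      }
    }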

TECHNICAL PROFICIENCY – BIGDATA

  • Operating Systems: Linux (CentOS, Ubuntu, Red Hat), Unix, Windows
  • Hadoop Ecosystem: Hadoop (HDFS and MapReduce), YARN, ZooKeeper, Hive, Oozie
  • Processing Frameworks: MapReduce, Apache Spark
  • Data Ingestion: Sqoop, Kafka
  • Other: Apache NiFi
  • Languages/Scripting: Shell script (Bash), Java, Scala, Python
  • Cluster Management: Hortonworks, Ambari
  • Development Tools: Maven, Jenkins, GitHub, Gerrit, Git Bash
  • Cloud Technologies: AWS EC2, EMR, S3; Azure HDInsight, Azure Data Factory, Azure Blob Storage

Experience
Big Data Developer
Information Technology
Dec 2019 - Feb 2020
Seattle, WA

PROJECT : Azure Cost Management and Cost Optimization

  • Azure Cost Management enables customers and enterprise organizations to analyze costs, track resource usage, create and manage budgets across all clouds with a rich, unified view, and access operational and financial insights to make informed decisions.

Role & Responsibilities:

  • Implemented a Spark-Scala data processing module that pulls usage data for Azure consumers, cross-references it with per-hour price sheet details for Azure components, computes the cost, and loads the results to Azure Blob Storage, where they are consumed by BI tools for real-time display (a minimal sketch follows this list).
  • Migrated the Azure Spark cluster from 2.1 to 2.3, with the necessary code changes, in order to enable the Azure HDInsight autoscale feature.
  • The autoscale feature lets a cluster be configured with a minimum and maximum number of nodes at creation time; the cluster runs with the minimum number of nodes when idle and scales automatically at runtime based on load and the Spark configuration.
  • Created Azure Data Factory pipelines (JSON modules) to orchestrate and schedule the functional flow between components.
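
A minimal sketch of the cost-computation step described above; the DataFrame names, column names, and storage paths are illustrative assumptions, not the production schema.

    // Illustrative sketch: join hourly usage with the price sheet, compute cost, and write
    // the result to Azure Blob Storage for BI consumption. All names and paths are assumed.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object CostComputeSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("AzureCostUsageSketch").getOrCreate()

        val usageDf = spark.read.parquet("wasbs://usage@STORAGEACCOUNT.blob.core.windows.net/curated/usage")    // hourly resource usage
        val priceDf = spark.read.parquet("wasbs://pricing@STORAGEACCOUNT.blob.core.windows.net/curated/prices") // per-hour price sheet

        val costDf = usageDf
          .join(priceDf, Seq("meterId"))                           // cross-reference usage with the price sheet
          .withColumn("cost", col("quantity") * col("unitPrice"))  // cost per usage record
          .groupBy("subscriptionId", "resourceGroup", "usageDate")
          .agg(sum("cost").alias("dailyCost"))

        costDf.write.mode("overwrite")
          .parquet("wasbs://reports@STORAGEACCOUNT.blob.core.windows.net/cost/daily")  // consumed by BI tools
      }
    }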

Tools and Technologies:

  • Hadoop Distribution : HDInsight (3.6 & 4.0), Ambari 2.6
  • Big Data Stack : Spark
  • Languages : Scala, shell script
  • IDE : Eclipse
  • DevOps : Git, Jenkins pipeline
  • Cloud : Azure HDInsight cluster, Azure Blob Storage, Azure Data Factory pipeline, Azure SQL Server
Skills: Ambari, Big Data, DevOps, Eclipse, Git, Hadoop, Jenkins, MS Azure, Spark, SQL, SQL Server
Big Data Developer
Information Technology
Nov 2017 - Nov 2019
Seattle, WA

PROJECT : Integrated Data Warehouse

  • The Integrated Data Warehouse (IDW) is the enterprise-wide data management framework built to perform sourcing, staging, and reporting functions for data from various source systems.
  • The IDW framework is designed to deliver high-quality, integrated, and timely data to meet regulatory requirements and Risk and Finance management reporting and analytics.

Role & Responsibilities:

  • Worked on designing an integrated data warehouse framework that receives data from various source systems and stores it in Hive/HBase tables. Stored data is finally exported using Sqoop to Teradata core tables, which are available to the business.
  • Implemented the ETL process of data extraction, data cleaning, file and data validation, and data transformation, with a final load to HDFS storage. Reconciliation is performed to ensure there is no loss of data.
  • Created modules to perform cleaning and loading to the enterprise data warehouse by transforming the data to business needs.
  • Implemented slowly changing dimensions as per the requirements.
  • Created Teradata BTEQ stored procedures and triggered them via Unix scripts.
  • Utilized Hive's highly compressed ORC storage format, which improves data storage utilization (an illustrative sketch follows this list).
  • Implemented Oozie workflows whose jobs are scheduled and monitored by Ctrl-M.
  • The IDW framework is optimized by integrating Informatica ETL workflows with the Hadoop platform, which eases large-scale enterprise data storage and processing.
  • Encryption of customers' sensitive data is performed by incorporating Voltage frameworks to ensure data security for the information stored in the data lake.
  • Designed a monitoring framework using Tripwire that monitors changes deployed to production and triggers an email report listing authorized and unauthorized components.
  • Implemented high availability for the Hive Metastore and Resource Manager. Identified modules that had hardcoded Hive Metastore and Resource Manager endpoints and replaced them with generic parameters using a Python script, which eases the high-availability implementation.
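
An illustrative sketch of the partitioned ORC table pattern referred to above. The original work used Hive directly; Spark SQL is used here only to keep all examples in one language, and the table and column names are assumptions, not the IDW schema.

    // Illustrative only: a partitioned, compressed ORC Hive table and a dynamic-partition load.
    // Table and column names are assumed for the sketch.
    import org.apache.spark.sql.SparkSession

    object OrcTableSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("OrcTableSketch")
          .enableHiveSupport()
          .getOrCreate()

        // ORC with Snappy compression keeps storage utilization low for large warehouse tables.
        spark.sql("""
          CREATE TABLE IF NOT EXISTS idw.core_transactions (
            txn_id     STRING,
            account_id STRING,
            amount     DECIMAL(18,2)
          )
          PARTITIONED BY (load_date STRING)
          STORED AS ORC
          TBLPROPERTIES ('orc.compress' = 'SNAPPY')
        """)

        // Dynamic-partition load from an assumed staging table.
        spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
        spark.sql("""
          INSERT OVERWRITE TABLE idw.core_transactions PARTITION (load_date)
          SELECT txn_id, account_id, amount, load_date FROM idw.stg_transactions
        """)
      }
    }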

Tools and Technologies:

  • Hadoop Distribution : Hortonworks Data Platform – HDP 2.3.4, Ambari 2.5
  • Big Data Stack : MapReduce, Hive, HBase, Oozie, Sqoop, Ctrl-M, Kafka
  • Languages : Java, Bash shell (Red Hat), Scala
  • IDE : Eclipse
  • DevOps : Git, Gerrit, Jenkins pipeline
  • Tools : Tripwire Enterprise
Skills: Big Data, Ambari, Data Validation, Data Warehousing, DevOps, Eclipse, ETL, Git, Hadoop, HBase, HDFS, Hive, Informatica, Java, Jenkins, MapReduce, Oozie, Python, Sqoop, Stored Procedure, Teradata, UNIX
Hadoop Spark Developer
Information Technology
Apr 2016 - May 2017

ORGANISATION: Capgemini India Pvt Ltd

PROJECT : Scudee – Data Ingestion Framework

  • The Enterprise Data Management Platform (EDMp) is the enterprise-wide information management platform built to perform sourcing, staging, and reporting functions for data from the various TP systems.
  • It is the strategic data management platform designed to deliver high-quality, integrated, and timely data to meet regulatory requirements and Risk and Finance management reporting and analytics.
  • The objective of EDM Data Sourcing is to source comprehensive data across all levels in the TP system data model: all tables and attributes available in the TPs are to be fed into Hadoop.

Role & Responsibilities:

  • Worked with the business analyst team on gathering requirements for the analysis and design phases.
  • Involved in the process of data acquisition, data pre-processing, and data exploration.
  • Designed a NiFi workflow that drives data ingestion and injects data into Hive tables in both the staging and processing layers.
  • Analyzed the current system and provided performance improvement solutions.
  • Created Avro, ORC, and HQL Hive DDL from the schema XML and primary-key files provided by the CDC system, with proper datatype mapping.
  • Created NiFi custom processors to perform file-level and data-level validation and store data in the Hive Avro staging layer.
  • Applied a deduplication process in the staging layer.
  • Wrote Java/Spark modules to perform data processing, filter records by type (insert, before-update, after-update, and delete), and store data from staging Avro tables into snapshot ORC tables (a minimal sketch follows this list).
  • Wrote a reconciliation report generator based on row counts and checksums for all tables of a TP system; reports are sent on a daily basis.
  • Used the necessary Java collections and OOP concepts to process the data returned from HDFS/Hive.
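
A minimal sketch of the CDC snapshot step described above: keep the latest change per key from the Avro staging table and write the result as ORC. The table, column, and operation-code names are assumptions, not the project schema.

    // Illustrative sketch of staging-to-snapshot CDC processing in Spark/Scala.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    object CdcSnapshotSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("CdcSnapshotSketch")
          .enableHiveSupport()
          .getOrCreate()

        val staging = spark.table("edm_stg.customer_avro")   // assumed Avro staging table

        // Keep only the latest record per business key, then keep insert and after-update images;
        // deletes and before-update images are dropped from the snapshot.
        val latest = staging
          .withColumn("rn", row_number().over(
            Window.partitionBy("customer_id").orderBy(col("change_ts").desc)))
          .filter(col("rn") === 1)
          .filter(col("op_code").isin("I", "UA"))             // assumed operation codes
          .drop("rn")

        latest.write.mode("overwrite")
          .format("orc")
          .saveAsTable("edm_snap.customer_orc")               // assumed ORC snapshot table
      }
    }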

Tools and Technologies:

  • Hadoop Distribution : Hortonworks Data Platform – HDP 2.3.4, Ambari 2.5
  • Big Data Stack : Spark, MapReduce, Hive, Oozie, HDF (NiFi)
  • Languages : Java
  • IDE : Eclipse
  • SCM, DevOps : Git, Stash, Maven, Jenkins
Skills: Ambari, Data Validation, Data Warehousing, DevOps, Eclipse, ETL, Git, Hadoop, HBase, HDFS, Hive, Java, Jenkins, MapReduce, Oozie, Python, Spark, Sqoop, Stored Procedure, Teradata, UNIX, Maven, XML
Hadoop MapReduce Developer
Insurance
Nov 2014 - Mar 2016

ORGANISATION: Cognizant Technology Solutions

PROJECT : Risk Analytics

  • Worked with clients to understand business needs and performed requirement gathering, analysis, and design.
  • Worked with the source team to configure the data received from source systems onto the Hortonworks cluster nodes.
  • Received data is picked up and validated by MapReduce modules.
  • Validated data is then placed in HDFS and made available to MapReduce modules for data processing.
  • Processed data is then inserted into ORC-formatted Hive tables, which are available for end clients to query.
  • Predicted the insurer's personality so it can be used as a key term for actuarial calculations, using Hadoop (HDFS, Hive).
  • Batched actuarial data using Hadoop so that processing time is reduced from one day to 2 hours.

POC:

  • The POC comprises data migration and includes modules for data design, data ingestion, and data processing.

Data Design:

  • This module involves extracting the necessary fields from the source MySQL database, followed by data cleaning, which detects and corrects corrupted or inaccurate records from the source database.
  • Cleaned data is transformed into a structured dataset.

Data Ingestion:

  • Loading the structured healthcare insurers dataset and healthcare plan dataset into new MySQL tables.
  • Loading the newly created tables from MySQL into HBase using Sqoop.
  • Created both external and managed Hive tables (version 0.11) for the data stored in HBase.

Data Processing:

  • Hive queries are performed over the data loaded into HBase. Performance tuning is done using bucketing and partitioning; query processing time is greatly reduced, and random data access is much faster compared with MySQL queries.
  • Semi-structured data, such as log datasets in HDFS, is processed using Pig scripts.
  • Data ingestion and data processing operations are automated and scheduled using the Oozie scheduler.

Role & Activities:

  • Worked on a live multi-node Hadoop cluster running Apache Hadoop 1.2.1.
  • Worked with highly structured and semi-structured data of 5 to 6 TB in size (15 to 18 TB with a replication factor of 3).
  • Extracted the data from MySQL into HDFS using Sqoop.
  • Created and ran Sqoop (version 1.4.3) jobs with incremental load to populate Hive external tables.
  • Extensive experience in writing Pig (version 0.13) scripts to transform raw data from several data sources into baseline data.
  • Developed Hive (version 0.11) scripts for end-user/analyst requirements to perform ad hoc analysis.
  • Very good understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance (an illustrative sketch follows this list).
  • Solved performance issues in Hive with an understanding of joins, grouping, and aggregation and how they translate to MapReduce jobs.
  • Experience in using SequenceFile and RCFile file formats.
  • Developed Oozie workflows for scheduling and orchestrating the ETL process.
  • Good working knowledge of Amazon Web Services components such as EC2.
  • Very good experience with both MapReduce 1 (JobTracker) and MapReduce 2 (YARN) setups.
  • Very good experience in monitoring and managing the Hadoop cluster.
  • Used the necessary Java collections and OOP concepts to process the data returned from HDFS.
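
An illustrative sketch of the partitioned-and-bucketed Hive table pattern described above. The original work used HiveQL directly on Hive 0.11; Spark SQL is used here only to keep every example in one language, and the table and column names are assumptions.

    // Illustrative only: partition by year for directory pruning, bucket by the join/sampling key.
    import org.apache.spark.sql.SparkSession

    object HiveTuningSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("HiveTuningSketch")
          .enableHiveSupport()
          .getOrCreate()

        // Partition by policy year so queries prune whole directories; bucket by insurer_id
        // so joins and sampling on that key touch a bounded number of files.
        spark.sql("""
          CREATE EXTERNAL TABLE IF NOT EXISTS risk.policies (
            policy_id  STRING,
            insurer_id STRING,
            premium    DOUBLE
          )
          PARTITIONED BY (policy_year INT)
          CLUSTERED BY (insurer_id) INTO 32 BUCKETS
          STORED AS ORC
          LOCATION '/data/risk/policies'
        """)
      }
    }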

Tools and Technologies:

  • Hadoop Distribution : Hortonworks Data Platform – HDP 2.3.4
  • Big Data Stack : MapReduce, Hive, HBase, Sqoop, Pig, Oozie
  • Languages : Java
Skills: DevOps, Git, Hadoop, Jenkins, MapReduce, Maven, Apache, Data Cleansing, Data Migration, ETL, HBase, HDFS, Hive, Java, MySQL, node.js, Oozie, Pig, Sqoop
Mainframe Developer
Information Technology
Feb 2013 - Oct 2014

Organisation: IBM INDIA PVT LIMITED, Chennai, India

About the Project:

  • My areas of work are the applications POL, VAR, BIPER, and VTAE, where POL and VAR are mainly for creating letters to customers, BIPER involves pension policies, and VTAE consists of condition-check routine modules that display an error message to the business department if any condition check fails.

Role & Activities:

  • Developing new modules or making amendments to existing ones.
  • Preparation of test data, i.e., test cases.
  • ATS testing, which serves as first-level unit testing.
  • BTS online testing whenever a code change impacts an online screen.
  • ALF, which moves the code to the next stage for system testing and user testing.
  • Any bugs identified in system testing and user testing are fixed and staged again.
  • Production failures, if any, are treated as high priority and supported immediately.
Skills: Apache, ETL, Java, MySQL, node.js
Mainframe Developer
Information Technology
Sep 2009 - Jan 2013

Organisation: STERIA INDIA LIMITED, Chennai, India

About the Project:

  • Full-cycle insurance product development, servicing, and maintenance take place in the Nova & Phase3 (Life Insurance) system and the Pulsar (Pensions) system.

Role & Activities:

  • Amended online screen changes to include the benefits.
  • Developed new modules to integrate changes in the online screens with the database.
  • Developed new modules to update and validate new changes entered by users on the online screens.
  • Test data setup by loading new policies into the online database, followed by unit testing and system testing.
  • Code and data fixes during user acceptance testing.
Education
Bachelor's degree
Mahalingam College of Engineering and Technology, Pollachi, 2009
Skills
Skill                 Last Used   Years of Experience
ETL                   2019        6
Java                  2019        6
DevOps                2020        4
Git                   2020        4
Hadoop                2020        4
HBase                 2019        4
HDFS                  2019        4
Hive                  2019        4
Jenkins               2020        4
MapReduce             2019        4
Oozie                 2019        4
Sqoop                 2019        4
Ambari                2020        3
Data Validation       2019        3
Data Warehousing      2019        3
Eclipse               2020        3
Python                2019        3
Stored Procedure      2019        3
Teradata              2019        3
UNIX                  2019        3
Apache                2016        2
Big Data              2020        2
Informatica           2019        2
Maven                 2017        2
MySQL                 2016        2
node.js               2016        2
Data Cleansing        2016        1
Data Migration        2016        1
Pig                   2016        1
Spark                 2020        1
XML                   2017        1
AWS                   n/a         1
Bash                  n/a         1
CentOS                n/a         1
Data Modeling         n/a         1
JDBC                  n/a         1
Linux                 n/a         1
MS Azure              2020        1
Oracle                n/a         1
Red Hat               n/a         1
SQL                   2020        1
SQL Server            2020        1
Ubuntu                n/a         1
Windows               n/a         1