ORGANISATION: Capgemini India Pvt Ltd
PROJECT : Scudee – Data Ingestion Framework
- The Enterprise Data Management Platform (EDMp) is the enterprise-wide information management platform built to source, stage, and report on data from the various TP systems.
- It is the strategic data management platform, designed to deliver high-quality, integrated, and timely data for regulatory requirements, Risk and Finance management reporting, and analytics.
- The objective of EDM Data Sourcing is to source comprehensive data across all levels of the TP System Data Model: all tables and attributes available in the TP systems are fed into Hadoop.
Role & Responsibilities:
- Worked with the business analyst team to gather requirements during the analysis and design phase.
- Took part in data acquisition, data pre-processing, and data exploration.
- Designed a NiFi workflow that drives data ingestion and loads data into Hive tables in both the staging and processing layers.
- Analyzed the current system and proposed performance improvements.
- Created Avro, ORC, and HQL Hive DDL from the schema XML and primary key files provided by the CDC system, with proper data type mapping.
- Created custom NiFi processors to perform file-level and data-level validation and to store data in the Hive Avro staging layer.
- Applied a deduplication process in the staging layer.
- Wrote Java/Spark modules to process data, filter records by type (insert, before update, after update, delete), and load data from the staging Avro tables into the snapshot ORC tables.
- Wrote a reconciliation report generator based on row count and checksum for all tables of a TP system; reports are sent daily.
- Used Java collections and OOP concepts to process the data returned from HDFS/Hive.
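Illustrative sketch (not the project code): the row count and checksum reconciliation listed above could look roughly like the following in plain Java. The class name ReconciliationSketch, the pipe-delimited row strings, and the use of CRC32 summed per row are all assumptions made for this example; the actual program read from Hive/HDFS and may have used a different checksum.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical sketch: compares a source and a target table by
// row count and by an order-independent checksum.
public class ReconciliationSketch {

    /** Row count plus checksum for one table (names are illustrative). */
    static final class TableSummary {
        final long rowCount;
        final long checksum;

        TableSummary(long rowCount, long checksum) {
            this.rowCount = rowCount;
            this.checksum = checksum;
        }
    }

    /**
     * Summarizes rows given as delimited strings. Each row's CRC32 is
     * added to a running sum, so the result does not depend on row order.
     */
    static TableSummary summarize(List<String> rows) {
        long checksum = 0L;
        for (String row : rows) {
            CRC32 crc = new CRC32();
            crc.update(row.getBytes(StandardCharsets.UTF_8));
            checksum += crc.getValue();
        }
        return new TableSummary(rows.size(), checksum);
    }

    /** A table reconciles when both the row count and the checksum match. */
    static boolean reconciles(TableSummary source, TableSummary target) {
        return source.rowCount == target.rowCount
                && source.checksum == target.checksum;
    }

    public static void main(String[] args) {
        // Assumed sample data: same rows, different order.
        List<String> source = Arrays.asList("1|alice|100", "2|bob|250");
        List<String> target = Arrays.asList("2|bob|250", "1|alice|100");

        TableSummary s = summarize(source);
        TableSummary t = summarize(target);
        System.out.println("rows=" + s.rowCount + " match=" + reconciles(s, t));
    }
}
```

Summing per-row checksums, rather than hashing the whole table as one stream, keeps the comparison independent of the order in which rows arrive from the source and from Hive.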
Tools and Technologies:
- Hadoop Distribution : Hortonworks Data Platform – HDP 2.3.4, Ambari 2.5
- Big Data Stack : Spark, MapReduce, Hive, Oozie, HDF (NiFi)
- Languages : Java
- IDE : Eclipse
- SCM, DevOps : Git, Stash, Maven, Jenkins