Thangarajan

Data Engineer

Download Resume

About Thanga

I am Thangarajan, Thanga for short, a data engineer with 8 years of experience in big data and building data products. I have implemented data engineering projects across domains such as Retail, Market Research, Finance, and Healthcare, and have deep knowledge of Hadoop, Spark, Flume, Hive, NiFi, Kafka, Solr, the ELK stack (Elasticsearch, Logstash, Kibana), MongoDB, and Neo4j, along with programming skills in Python and SQL. I have hands-on experience taking raw, dirty data all the way to curated datasets.

Experience

TVS Credit Service

Senior Data Engineer

Accumulate large, complex data sets that meet functional and non-functional business requirements. Build the infrastructure required for optimal extraction, transformation, and loading of data from a wide variety of sources using the AWS stack. Identify, design, and implement internal process improvements: automating manual processes, optimizing data delivery, and re-designing infrastructure for greater scalability.

Pathfinder Global FZCO

Senior Data Engineer and R&D Lead

Building a data lake for our retail customers using the AWS stack and providing curated data to the data science team based on their use-case requirements. Also handling all research and development projects with a team of 5-8 members, and mentoring junior data engineers through regular training sessions.

AC Nielsen

Senior Executive

Understanding clients' requirements and providing suitable analytics to bridge the gap between manufacturers and customers. Creating dashboards and reports on sales and customer data for targeted marketing using SQL and Python.

Hexaware Technologies

Senior Software Engineer

Experience in developing insights from customer behavior data, captured from web logs using Python and SQL. Played a key role in the implementation of big data initiatives.

Education

Madurai Kamaraj University

June 2010 - May 2013

Master of Computer Applications

N.M.S.S.V.N. College, Madurai

June 2007 - May 2010

Bachelor of Science in Computer Science

Web Profiles

Projects

Enterprise Data Lake

Built a centralized data repository on the AWS tech stack by sourcing data from OLTP systems, FTP feeds, CRM, and application logs to provide a unified platform for data analytics. Built curated datasets for ML engineers to train their models in the cloud. Managed AWS accounts and services across the organization, and built a high-level API for data access so teams throughout the organization can leverage big data and the cloud.
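
A minimal sketch of what such a high-level access layer could look like, assuming curated datasets are stored as Parquet on S3 and read through pandas with pyarrow/s3fs installed; the bucket, prefix, and dataset names are hypothetical placeholders, not the actual data lake layout.

import pandas as pd

CURATED_BUCKET = "s3://example-datalake-curated"   # hypothetical bucket name

def read_dataset(name: str, columns=None, filters=None) -> pd.DataFrame:
    """Return a curated dataset as a DataFrame, hiding storage details from consumers."""
    path = f"{CURATED_BUCKET}/{name}/"
    # pandas delegates to pyarrow + s3fs; columns/filters push work down to the Parquet reader
    return pd.read_parquet(path, columns=columns, filters=filters)

# Example: ML engineers pull only the columns they need for a model
orders = read_dataset("orders", columns=["order_id", "customer_id", "amount"])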

MLOPS Framework

Built a Python framework for deploying and monitoring machine learning models, including a data pipeline that feeds data to models in real time. Provided a dashboard for ML engineers to measure data drift and model drift over time, so that ML teams can generate business value by swiftly and reliably building, testing, scaling, and deploying models into production.
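
As an illustration of the kind of data-drift check such a framework can run (not the framework's actual implementation), here is a small Population Stability Index computation between a training sample and a live scoring sample of one feature.

import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a reference (training) sample and a live sample of one feature."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # clip to avoid log(0) when a bin is empty in either sample
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

# A common rule of thumb treats PSI above roughly 0.2 as drift worth investigating.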

Retail Index and Benchmarking

Providing sales benchmarks across demographics such as location and trade category. Collected sales data from all clients and created Spark jobs for data transformation and aggregation, plus jobs that derive summary tables from large datasets to reduce the workload of report generation. Performed data cleansing and pre-processing for the data science team, and created KPIs so clients can easily see how to improve their business strategy.
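
A sketch of the kind of Spark aggregation job that pre-computes a summary table; the paths, columns, and measures here are illustrative, not the actual schema.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("retail-benchmark").getOrCreate()

sales = spark.read.parquet("/data/curated/sales")   # hypothetical curated dataset

summary = (
    sales.groupBy("location", "trade_category")
         .agg(F.sum("sales_amount").alias("total_sales"),
              F.countDistinct("store_id").alias("store_count"),
              F.avg("sales_amount").alias("avg_ticket"))
)

# Persist a small summary table so report generation never scans the raw data
summary.write.mode("overwrite").parquet("/data/summary/benchmark_by_location")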

OData 360° Catchment Analysis – New store location recommendation engine

Recommend whether a new store location is likely to be profitable. As a data engineer, identified trusted data sources for the recommendation engine and developed data pipelines in Python to extract relevant data from open data sources and the Google Places API, as well as from many websites via web scraping. The main challenge was building relationships between heterogeneous open data and turning them into business insights. Created a distance matrix between the proposed store location and nearby locations to generate the scores.
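
The core of the distance-matrix step can be sketched with the haversine formula; the coordinates below are made-up examples, not project data.

import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

candidate = (9.9252, 78.1198)                                  # hypothetical candidate store
nearby = {"school": (9.9312, 78.1212), "mall": (9.9175, 78.1290)}
distances = {name: haversine_km(*candidate, *pt) for name, pt in nearby.items()}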

Planning and Route Optimization

Create a travel schedule for auditors to visit nearby stores based on the auditor's location. As a team member on this project, provided an open-source solution for route optimization by setting up OSRM (Open Source Routing Machine) on CentOS to find the fastest path between two geo-points. Provided ad-hoc routes to auditors in real time with low latency, using OSRM instead of a commercial product as a cost saving to the business.
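
A minimal sketch of querying a self-hosted OSRM backend's route service for the fastest path between two points; the host/port and coordinates are placeholders for whatever the deployed instance used.

import requests

def fastest_route(src, dst, host="http://localhost:5000"):
    """src/dst are (lat, lon); OSRM expects lon,lat order in the request URL."""
    coords = f"{src[1]},{src[0]};{dst[1]},{dst[0]}"
    url = f"{host}/route/v1/driving/{coords}"
    resp = requests.get(url, params={"overview": "false"}, timeout=5)
    resp.raise_for_status()
    route = resp.json()["routes"][0]
    return route["duration"], route["distance"]   # seconds, metres

duration_s, distance_m = fastest_route((9.9252, 78.1198), (9.9312, 78.1212))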

Match and Merge on Global Store Data

Enables GSD (Global Store Data) to accept data from multiple sources and perform matching in such a way that only one golden copy of each store is retained. Performs matching across various attributes and produces a match score for every candidate pair displayed. Also built a data migration framework in Python to copy data from Oracle to Apache Solr.
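
To give a feel for attribute-level matching with a weighted score, here is a toy sketch using only the Python standard library; the attributes, weights, and threshold are illustrative and not the actual GSD matching rules.

from difflib import SequenceMatcher

WEIGHTS = {"store_name": 0.5, "address": 0.3, "city": 0.2}   # hypothetical weights

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_score(candidate: dict, golden: dict) -> float:
    """Weighted similarity across attributes; 1.0 means an exact match."""
    return sum(w * similarity(candidate[f], golden[f]) for f, w in WEIGHTS.items())

score = match_score(
    {"store_name": "Big Bazaar", "address": "12 Anna Salai", "city": "Chennai"},
    {"store_name": "Big Bazar",  "address": "12, Anna Salai", "city": "Chennai"},
)
# Candidates above a chosen threshold (e.g. 0.9) are merged into the single golden copy.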

Text analytics project for leading international educational firm

Detect plagiarism in assignments submitted by students online. Set up an Apache Hadoop environment for development and Hortonworks for production, handling a massive volume of unstructured data. Built a content extractor for DOC, DOCX, and PDF files using Apache Tika and loaded the output into HDFS. Leveraged the Jaccard coefficient algorithm to score each submission against all documents in the repository. Created Hive external tables and queries for reporting, with performance improvement through Apache Tez. Created a file watcher that moves student assignments into HDFS using Apache Flume, and moved structured data to MySQL using Sqoop for reporting.
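
The Jaccard-coefficient comparison at the heart of the scoring can be sketched in a few lines; this toy version works on word shingles, whereas the production comparison ran at scale on the Hadoop cluster.

def shingles(text: str, k: int = 3) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(doc_a: str, doc_b: str, k: int = 3) -> float:
    """Jaccard coefficient: |A ∩ B| / |A ∪ B| over the two shingle sets."""
    a, b = shingles(doc_a, k), shingles(doc_b, k)
    return len(a & b) / len(a | b) if a | b else 0.0

# Each new submission is scored against every document already in the repository,
# and pairs above a plagiarism threshold are flagged for review.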

Twitter Real-time Named Entity Extraction

The purpose of this project was to build an application that pulls streaming Twitter data into the big data environment (HDFS and Elasticsearch), with a dashboard visualized in Kibana. Named-entity extraction is then performed in Spark as a batch job and the results are stored in Neo4j, from which an interactive network visualization was developed using Vis.js.
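
As a standalone illustration of the entity-extraction step, the sketch below uses spaCy as a stand-in NER library; it is not necessarily the library used inside the actual Spark job.

import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed

def extract_entities(tweet_text: str):
    """Return (entity_text, entity_label) pairs, e.g. ('Chennai', 'GPE')."""
    return [(ent.text, ent.label_) for ent in nlp(tweet_text).ents]

# Each (tweet, entity) pair can then be written to Neo4j as nodes and relationships
# to drive the Vis.js network visualization.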

Detect and Report Web-Traffic Anomalies in Near Real-Time

Configured Flume agents with Hadoop and Elasticsearch sinks to capture web server logs in real time, and extracted useful information from each log entry, such as the user, timestamp, page accessed, and IP address; the web logs are generated at a volume of at least 20 GB per day. Created Hive external tables and queries for reporting in the Hive data warehouse, improved performance through Apache Tez, and provided a Kibana GUI for exploring the web log data in Elasticsearch.
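
A small sketch of the field-extraction step, assuming the web server writes combined-format access logs; the sample line and field names are illustrative, and the real pipeline did this downstream of Flume before loading Hive and Elasticsearch.

import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_log_line(line: str):
    """Return a dict of fields for one access-log line, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

sample = '10.0.0.1 - alice [10/Oct/2023:13:55:36 +0530] "GET /index.html HTTP/1.1" 200 2326'
record = parse_log_line(sample)   # {'ip': '10.0.0.1', 'user': 'alice', ...}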

Fraud detection for Auto Insurance Company

In an insurance company, claims are submitted, a few of which may be fraudulent. The job involved gathering requirements, designing the data flow, creating ETL mappings, and loading data into the database for report generation. The goal is to find fraudulent claims and analyze the factors that typically mark a claim as fraudulent, such as a high claim amount or a large number of damaged parts.
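
A toy example of the kind of rule-based flag that can be applied after the ETL load; the thresholds, column names, and figures are purely illustrative.

import pandas as pd

claims = pd.DataFrame({
    "claim_id": [1, 2, 3],
    "claim_amount": [1200, 98000, 4500],
    "parts_damaged": [1, 9, 2],
})

# Flag claims with an unusually high amount or an unusually high number of damaged parts
claims["suspicious"] = (claims["claim_amount"] > 50000) | (claims["parts_damaged"] > 5)
# Suspicious claims are routed to analysts for manual review rather than auto-approved.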

Skills

Get in Touch