I'm Thanga (as I prefer to be called), with 8 years of experience in big data and building data products. I have implemented data engineering projects across domains such as Retail, Market Research, Finance, and Healthcare. Deep knowledge of Hadoop, Spark, Flume, Hive, NiFi, Kafka, Solr, ELK (Elasticsearch, Logstash, Kibana), MongoDB, and Neo4j, with programming skills in Python and SQL. Hands-on experience taking raw, dirty data all the way to curated datasets.
Accumulate massive, complex data sets that meet functional and non-functional business requirements. Build the infrastructure required for optimal extraction, transformation, and loading of data from a wide variety of data sources using the AWS stack. Identify, design, and implement internal process improvements: automating manual processes, optimizing data delivery, re-designing infrastructure for greater scalability, etc.
Building a data lake for our retail customers on the AWS stack and providing curated data to the data science team based on their use-case requirements. Also handling all Research and Development projects with teams of 5-8 members. Mentoring junior data engineers through training sessions.
Understanding clients' requirements and providing suitable analytics that bridge the gap between manufacturers and customers. Creating dashboards and reports on sales and customer data for targeted marketing using SQL and Python.
Experience in developing insights from customer behavior data, capturing behavior from weblogs using Python and SQL. Playing a key role in the implementation of Big Data initiatives.
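As an illustration of the kind of weblog-based behavior work involved, here is a minimal sessionization sketch in pandas; the 30-minute inactivity gap and the column names are assumptions, not the project's actual values.

```python
import pandas as pd

# Minimal sketch: derive browsing sessions from parsed weblog events, assuming
# a 30-minute inactivity gap starts a new session. Data and columns are illustrative.
events = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "timestamp": pd.to_datetime([
        "2021-10-10 13:00", "2021-10-10 13:10",
        "2021-10-10 14:30", "2021-10-10 13:05",
    ]),
    "page": ["/home", "/product/42", "/home", "/cart"],
})

events = events.sort_values(["user_id", "timestamp"])
# A new session starts whenever the gap since the user's previous event exceeds 30 minutes.
gap = events.groupby("user_id")["timestamp"].diff() > pd.Timedelta(minutes=30)
events["session_id"] = gap.groupby(events["user_id"]).cumsum()
print(events)
```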
Built a centralized data repository on the AWS tech stack, sourcing data from OLTP systems, FTP, CRM, and application logs to provide a unified platform for data analytics. Building curated datasets for ML engineers to train their models in the cloud. Managing AWS accounts and services throughout the organization. Built a high-level API for data access across the organization, making big data and cloud easy to leverage internally.
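A minimal sketch of what such a data access layer can look like, assuming curated datasets live under an S3 prefix and AWS credentials are already configured; the bucket name and layout are hypothetical.

```python
import boto3  # assumes AWS credentials are configured in the environment

# Hypothetical bucket for the curated zone of the data repository.
CURATED_BUCKET = "org-data-lake-curated"

def list_datasets(prefix: str) -> list:
    """List curated dataset objects under a prefix (e.g. 'sales/2021/')."""
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=CURATED_BUCKET, Prefix=prefix)
    return [obj["Key"] for obj in resp.get("Contents", [])]

def fetch_dataset(key: str, local_path: str) -> str:
    """Download a single curated file for downstream analytics or ML work."""
    s3 = boto3.client("s3")
    s3.download_file(CURATED_BUCKET, key, local_path)
    return local_path
```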
Built a Python framework for deploying and monitoring machine learning models, along with a data pipeline that feeds the models in real time. Provided a dashboard for ML engineers to measure data drift and model drift over time, supporting model enhancement so that ML can generate business value by swiftly and reliably building, testing, scaling, and deploying models into production.
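One common way to quantify data drift for such a dashboard is the Population Stability Index; the sketch below is illustrative rather than the framework's actual implementation, and the drift threshold mentioned in the comment is an assumption.

```python
import numpy as np

def population_stability_index(baseline, current, bins=10):
    """Population Stability Index between a training-time baseline feature
    distribution and the live data currently fed to the model.
    Rule of thumb (assumption): PSI > 0.2 suggests significant drift."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    base_cnt, _ = np.histogram(baseline, bins=edges)
    curr_cnt, _ = np.histogram(current, bins=edges)
    # Convert counts to proportions, guarding against empty bins.
    base_pct = np.clip(base_cnt / base_cnt.sum(), 1e-6, None)
    curr_pct = np.clip(curr_cnt / curr_cnt.sum(), 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

# Example: drift score for one numeric feature on synthetic data.
rng = np.random.default_rng(0)
print(population_stability_index(rng.normal(0, 1, 5000), rng.normal(0.5, 1, 5000)))
```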
Providing sales benchmarks across demographics such as location and trade category. Collecting sales data from all clients and creating Spark jobs for data transformation and aggregation, including jobs that produce summary tables from large datasets to reduce the workload of report generation. Performing data cleansing and pre-processing for the data science team and creating KPIs for clients so they can easily see how to improve their business strategy.
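A simplified PySpark sketch of this pre-aggregation pattern; the input path, column names, and metrics are illustrative, not the project's actual schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-benchmarks").getOrCreate()

# Hypothetical schema: one row per transaction with location and trade category.
sales = spark.read.parquet("s3://bucket/curated/sales/")  # path is illustrative

# Pre-aggregate once so report generation reads a small summary table
# instead of scanning the full transaction history every time.
summary = (
    sales.groupBy("location", "trade_category")
         .agg(F.sum("amount").alias("total_sales"),
              F.countDistinct("store_id").alias("store_count"),
              F.avg("amount").alias("avg_ticket"))
)
summary.write.mode("overwrite").parquet("s3://bucket/curated/sales_summary/")
```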
Recommending whether a proposed new store location is likely to be profitable. As a data engineer, identifying trusted data sources for the store location recommendation engine. Developing data pipelines in Python to extract relevant data from open data sources and the Google Places API, and scraping data from many websites. The main challenges are relating heterogeneous open data sources to each other and turning them into business insights. Creating a distance matrix between the proposed store location and nearby locations to generate the scores.
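A small sketch of the distance-matrix step using the haversine formula; the coordinates and the nearby points of interest shown here are hypothetical inputs to the scoring stage.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

# Hypothetical candidate store location vs. nearby points scraped earlier.
candidate = (13.0827, 80.2707)
nearby = [(13.0674, 80.2376), (13.0878, 80.2785), (13.0475, 80.2824)]
distances = [haversine_km(*candidate, lat, lon) for lat, lon in nearby]
print(distances)  # these distances feed the location scoring step
```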
Creating a travel schedule for auditors to visit nearby stores based on the auditor's location. As a team member of this project, provided an open-source solution for route optimization. Set up OSRM (Open Source Routing Machine) on CentOS to find the fastest path between two geo-points. Providing ad-hoc routes to auditors in real time with low latency; implementing OSRM instead of a commercial product delivered a cost saving to the business.
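A minimal sketch of querying a self-hosted OSRM instance for the fastest path between two geo-points, assuming the default route service on port 5000; the host and coordinates are placeholders.

```python
import requests

# Assumes a self-hosted OSRM instance (set up on CentOS) listening on port 5000.
OSRM_URL = "http://localhost:5000"

def fastest_route(origin, destination):
    """Query OSRM's route service for the fastest driving path between two
    geo-points. OSRM expects coordinates as lon,lat."""
    coords = f"{origin[1]},{origin[0]};{destination[1]},{destination[0]}"
    resp = requests.get(f"{OSRM_URL}/route/v1/driving/{coords}",
                        params={"overview": "false"}, timeout=5)
    resp.raise_for_status()
    route = resp.json()["routes"][0]
    return route["duration"], route["distance"]  # seconds, metres

# Example call (placeholder coordinates):
# duration, distance = fastest_route((13.0827, 80.2707), (13.0475, 80.2824))
```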
Enables GSD (Global Store Data) to accept data from multiple sources and perform matching so that only one golden copy of each store is retained. The matching runs across various attributes and produces a match score for every candidate displayed. Built a data migration framework in Python to copy data from Oracle to Apache Solr.
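An illustrative take on attribute-level matching with a weighted score; the attributes, weights, and string-similarity measure below are assumptions rather than the actual GSD matching rules.

```python
from difflib import SequenceMatcher

# Illustrative attribute weights for store matching; the real weights and
# attribute set would come from the GSD matching rules.
WEIGHTS = {"store_name": 0.5, "address": 0.3, "city": 0.2}

def attribute_similarity(a: str, b: str) -> float:
    """Simple string similarity in [0, 1] for one attribute."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_score(incoming: dict, golden: dict) -> float:
    """Weighted match score across attributes between an incoming store
    record and an existing golden copy."""
    return sum(w * attribute_similarity(incoming[attr], golden[attr])
               for attr, w in WEIGHTS.items())

print(match_score(
    {"store_name": "ABC Super Mart", "address": "12 MG Road", "city": "Chennai"},
    {"store_name": "A.B.C. Supermart", "address": "12, M G Road", "city": "Chennai"},
))
```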
Detecting plagiarism in assignments submitted by students online. Set up an Apache Hadoop environment for development and Hortonworks for production, handling a massive volume of unstructured data. Built a content extractor for DOC, DOCX, and PDF files using Apache Tika and loaded the text into HDFS. Used the Jaccard coefficient to score similarity by comparing all documents in the repository. Created external Hive tables and Hive queries for reporting, with performance improvements through Apache Tez. Created a file watcher that moves student assignments to HDFS using Apache Flume, and moved structured data to MySQL using Sqoop for reporting.
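The Jaccard coefficient itself is simple to state; below is a small self-contained sketch over word shingles, with the shingle size as an assumption (in the project, scores were computed pairwise over documents whose text had already been extracted with Tika).

```python
def jaccard(doc_a: str, doc_b: str) -> float:
    """Jaccard coefficient over word shingles of two extracted documents."""
    def shingles(text, k=3):
        words = text.lower().split()
        return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}
    a, b = shingles(doc_a), shingles(doc_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# A high pairwise score flags a likely plagiarism candidate for review.
print(jaccard("the quick brown fox jumps over the lazy dog",
              "a quick brown fox jumped over the lazy dog"))
```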
The purpose of this project is to build an application that pulls streaming Twitter data into the big data environment (HDFS and Elasticsearch). The dashboard is visualized with Kibana. Named entity extraction is then done with Spark in batch and the results are stored in Neo4j; on top of the Neo4j data, an interactive network visualization was developed using Vis.js.
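A simplified, single-node sketch of the batch entity-extraction step using spaCy and the Neo4j Python driver; in the project this ran as a Spark job, and the connection details and model name here are placeholders.

```python
import spacy
from neo4j import GraphDatabase

# Assumes the en_core_web_sm model is installed; credentials are placeholders.
nlp = spacy.load("en_core_web_sm")
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load_entities(tweet_id: str, text: str) -> None:
    """Extract named entities from one tweet and link them in Neo4j."""
    with driver.session() as session:
        for ent in nlp(text).ents:
            session.run(
                "MERGE (t:Tweet {id: $tid}) "
                "MERGE (e:Entity {name: $name, label: $label}) "
                "MERGE (t)-[:MENTIONS]->(e)",
                tid=tweet_id, name=ent.text, label=ent.label_,
            )

load_entities("1001", "Apple is opening a new store in Chennai next year.")
```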
Configuring Flume agents with Hadoop and Elasticsearch sinks to capture web server logs in real time. Extracting useful information from the weblogs, such as user info, timestamps, the web page accessed, and IP address; the weblogs are generated in large volumes of at least 20 GB per day. Creating Hive external tables and queries for reporting in the Hive data warehouse, improving performance through Apache Tez, and providing a Kibana GUI on Elasticsearch for exploring the weblog data.
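A minimal sketch of extracting those fields from one weblog line, assuming the standard Apache combined log format; in the project the parsing sat inside the Flume/Hive pipeline rather than a standalone script.

```python
import re
from typing import Optional

# Combined log format pattern (assumption: standard Apache access logs).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) \S+" (?P<status>\d{3}) \S+'
)

def parse_line(line: str) -> Optional[dict]:
    """Pull IP, user, timestamp, and page out of one weblog line;
    lines that don't match the pattern are skipped upstream."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

sample = '10.0.0.7 - alice [10/Oct/2021:13:55:36 +0000] "GET /cart HTTP/1.1" 200 2326'
print(parse_line(sample))
```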
In an insurance company, claims are submitted, and a few of them can be fraudulent. The job involves gathering requirements and designing the dataflow, then creating ETL mappings and loading data into the database for report generation. The goal of the work is to find fraudulent claims and analyze the factors that typically make a claim fraudulent, such as a high claim amount or the number of parts damaged.
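A toy rule-based screen illustrating how such factors can flag suspicious claims for review; the thresholds, columns, and data are assumptions, not the project's actual logic.

```python
import pandas as pd

# Illustrative claims data; in practice this comes from the loaded database.
claims = pd.DataFrame({
    "claim_id": [101, 102, 103],
    "claim_amount": [1200.0, 98000.0, 4500.0],
    "parts_damaged": [1, 9, 2],
})

# Flag claims matching simple risk factors (thresholds are assumptions).
claims["fraud_flag"] = (
    (claims["claim_amount"] > 50000)        # unusually high claim amount
    | (claims["parts_damaged"] >= 8)        # unusually many damaged parts
)
print(claims[claims["fraud_flag"]])
```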