Career Summary

I am a Data Scientist at KPMG US - LightHouse. I have worked on projects involving cybersecurity and NLP. I am interested in building end-to-end machine learning pipelines to provide data-driven solutions and actionable insights.

My past endeavors have also involved:
  • Assembling parasite genomes and analyzing them to identify mutations that correlate with experimental outcomes.
  • Building 3D models of proteins using X-ray crystallography to enable rational drug design.

Data Projects

The ultimate goal is to classify top news articles from popular media as positive or negative in sentiment. People today rely on digital media for their daily news more than ever, so competition among news outlets for readership is high. Competition is healthy, but the unfortunate downside here is that news articles are becoming increasingly polarized. Polarized articles tend to excite more emotion and therefore draw more likes (or dislikes), which in turn leads to more attention. The hypothesis is that we should be able to detect a pattern in each outlet's underlying leanings on certain issues. For example, left-leaning and right-leaning news outlets might cover the same issue in completely contrasting lights. This project uses NLP models to classify current news articles by sentiment.

Key Features:

  • Downloads the latest news on the fly.
  • Uses an LSTM to automatically classify the sentiment conveyed by news articles.
  • Trained on a database of Amazon reviews.

Technologies/Techniques used:

  • LSTM
  • TensorFlow
  • Keras
  • R
  • API
  • ggplot2

This project builds a prediction algorithm that signals when to buy, sell, or hold a stock. The task is set up so that the algorithm makes one buy/sell/hold prediction each day. The data are the historical prices of Netflix stock dating back to 2002. A sequential neural network is built using Keras and TensorFlow, with Pandas and NumPy used for data handling and computation.

Key Features:

  • Uses historical price movements.
  • Integrates S&P500 trends to incorporate overall market health.
  • Feature engineering extracts signal from the noise of market fluctuations.
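
As a rough illustration of how a daily buy/sell/hold target and a simple smoothing feature could be derived from closing prices, here is a minimal sketch. The function names, the 1% return threshold, and the sample prices are all hypothetical and are not taken from the project itself, which may define its labels and features differently.

```python
from typing import List


def label_days(closes: List[float], threshold: float = 0.01) -> List[str]:
    """Label each day (except the last) from the next day's return.

    The threshold is an illustrative assumption, not the project's value.
    """
    labels = []
    for today, tomorrow in zip(closes, closes[1:]):
        next_return = (tomorrow - today) / today
        if next_return > threshold:
            labels.append("buy")
        elif next_return < -threshold:
            labels.append("sell")
        else:
            labels.append("hold")
    return labels


def moving_average(values: List[float], window: int) -> List[float]:
    """Trailing simple moving average; one value per full window."""
    return [
        sum(values[i - window:i]) / window
        for i in range(window, len(values) + 1)
    ]


# Made-up closing prices, for demonstration only.
closes = [100.0, 102.0, 101.5, 99.0, 99.5]
print(label_days(closes))
print(moving_average(closes, 3))
```

Labels built this way look one day ahead, so the final day has no label. A moving average is one common way to smooth out day-to-day noise before feeding prices to a model.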

Technologies/Techniques used:

  • Python
  • Pandas
  • NumPy
  • TensorFlow
  • Keras

This project classifies images from the CIFAR-10 dataset. The dataset consists of airplanes, dogs, cats, and other objects. The algorithm preprocesses the images, then trains a convolutional neural network on all the samples.

Key Features:

  • Uses a convolutional neural network with fully connected layers.
  • Utilizes MaxPooling for down-sampling.
  • Trained on GPUs using AWS instances.
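
For readers unfamiliar with max pooling, the sketch below shows the down-sampling idea on a single small feature map in plain Python. It is only an illustration: the project itself uses TensorFlow, and the function name and sample values here are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def max_pool_2x2(image: Matrix) -> Matrix:
    """Down-sample a 2-D grid by keeping the maximum of each
    non-overlapping 2x2 window (a trailing odd row/column is dropped)."""
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - 1, 2):
        pooled_row = []
        for c in range(0, cols - 1, 2):
            window = (
                image[r][c], image[r][c + 1],
                image[r + 1][c], image[r + 1][c + 1],
            )
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# A made-up 4x4 feature map; pooling halves each dimension.
feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 7],
    [1, 1, 3, 5],
]
print(max_pool_2x2(feature_map))  # [[6, 2], [2, 9]]
```

Each pooling step keeps the strongest activation in a neighborhood, which shrinks the feature map and makes the network less sensitive to small shifts in the image.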

Technologies/Techniques used:

  • CNN
  • TensorFlow
  • AWS

The R interface to Keras and TensorFlow is used to predict the biological activity of molecules. Numerical descriptors of each molecule's chemical structure serve as features for the prediction. The data came from an old Kaggle competition. The predictions obtained here are comparable to those of the top finishers in the original competition. One caveat: the test data used to evaluate this model are not the same as those used in the original competition. The model was trained and tested on one molecular activity from the dataset, but the method is generally applicable to the other molecular activities as well.

Key Features:

  • Utilizes a high-dimensional dataset.
  • Leverages PCA for feature projection.

Technologies/Techniques used:

  • R
  • Caret
  • TensorFlow
  • Keras

Work Experience

Data Scientist - Sr. Associate

KPMG US - LightHouse
2019 - Present

Build end-to-end production-ready machine learning pipelines.

Technologies/Techniques used:

  • Python
  • scikit-learn
  • Airflow
  • MLflow
  • Kubernetes
  • Keras

Data Science Fellow

Insight Data Science
2018 - 2019

For a company that wants to increase sales, it is far more critical to keep existing customers than to find new ones. There are two reasons: 1) it is far cheaper to keep customers than to look for new ones, and 2) in most cases, customers with a prior order history have a lower barrier to ordering again than those who have never ordered on the platform. Add the other metrics the company prioritizes, such as the sustainability of the products consumers purchase, and the problem becomes even more complicated. Imagine you are the head of a company that wants to increase sales and also wants its customers to order more sustainable products. For my Insight project, I consulted for a company that faced this very problem. The company cares deeply about the sustainability of the products it sells. Without losing that focus, it also wants, like any other company, to increase sales and grow profits. I therefore built a dashboard to help the company track the KPIs it cares about through data visualizations. The dashboard also included a recommendation engine that provides personalized suggestions to customers and introduces them to relevant products.

Key Achievements:

  • Developed and delivered a recommendation engine and dashboard to a consulting company.
  • Utilized PCA and KNN to identify similar customers based on order history.
  • Created a dashboard in Bokeh for in-depth visualizations.
  • Deployed data visualizations using Flask, Bootstrap, and CSS.
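
The idea of finding similar customers from order history can be sketched as a nearest-neighbor lookup over per-customer order vectors. The example below is a simplified illustration under stated assumptions: it skips the PCA projection step, it uses cosine similarity (the similarity measure used in the actual project is not stated here), and the customer names and order counts are invented.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two order-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a customer with no orders is similar to no one
    return dot / (norm_a * norm_b)


def nearest_customers(
    target: str, orders: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k customers whose order vectors are most similar to target's."""
    scores = [
        (name, cosine_similarity(orders[target], vector))
        for name, vector in orders.items()
        if name != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Hypothetical data: each position counts orders in one product category.
orders = {
    "customer_a": [3, 0, 1, 0],
    "customer_b": [2, 0, 1, 0],
    "customer_c": [0, 4, 0, 2],
    "customer_d": [1, 1, 1, 1],
}
print(nearest_customers("customer_a", orders))
```

Products bought by a customer's nearest neighbors, but not yet by the customer, are natural candidates for personalized suggestions.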

Technologies/Techniques used:

  • Python
  • PCA
  • KNN
  • scikit-learn
  • Bokeh
  • Flask

Research Associate

University of Washington
2011 - Present

My research aim is to eradicate malaria. To achieve this bold goal, I used hundreds of malaria genomes from all over the world (>1 TB of data) to unravel, for the first time, the surprising structure within parasite populations. This revealed that Indian parasites are more diverse than those of the whole of Africa or Asia. To make this computationally intensive task possible, I created a highly parallel analysis pipeline and deployed a network of nodes run entirely on Amazon Web Services. Additionally, I used biomolecular X-ray crystallography to watch novel drugs in action at atomic resolution. This involved shining high-energy X-ray beams from a synchrotron, such as the Advanced Photon Source, on a protein crystal to record a diffraction pattern. The pattern in which the X-rays scatter can then be used to infer the 3-D structure of the drug bound to its protein target. I have used this approach to solve the first-ever atomic-resolution structures of two high-value drug targets (Topoisomerase II and OPRTase).

Key Achievements:

  • Independently migrated data storage and warehousing completely to the cloud.
  • Created a fault-tolerant analysis platform with AWS spot instances, which cut costs by 80%.
  • Cut analysis times by more than 100X by creating parallelized code and SGE architecture in the cloud, with no increase in compute costs.
  • Processed terabyte-scale data to discover the surprising extent of parasite diversity in India.
  • Determined atomic-resolution protein structures of anti-malarial drug targets using X-ray crystallography and cryo-electron microscopy.

Technologies/Techniques used:

  • R
  • AWS
  • Next-Gen Sequencing
  • X-Ray Crystallography
  • Cryo-Electron Microscopy

Graduate Student

National University of Singapore and Van Andel Research Institute
2006 - 2011

Several physiological functions in the immune, nervous, endocrine, and muscular systems are controlled by pituitary adenylate cyclase activating polypeptide (PACAP). It acts by binding to its receptor, PAC1R, a member of the class B G-protein coupled receptors (GPCRs). Crystal structures of a number of class B GPCR extracellular domains (ECDs) bound to their respective peptide hormones have revealed a consensus mechanism of hormone binding. However, the last piece of the puzzle remained elusive, because the mechanism by which PACAP binds to its receptor was still controversial. The controversy stemmed from an NMR structure of the PAC1R ECD/PACAP complex that showed a distinct topology of the ECD and an aberrant mode of ligand recognition. In this study, we determined a high-resolution (1.9 Å) crystal structure of the PAC1R ECD, which adopts the same fold commonly observed for other members of the class B GPCR family. Biochemical binding studies and cell-based assays with serial mutant scanning of peptides and receptor support a model in which PAC1R uses the same conserved class B GPCR ECD fold for PACAP binding, thus unifying the consensus mechanism of hormone binding for this family of receptors. This research provided the crucial missing link that enabled a unified theory of class B GPCR activation.

Key Achievements:

  • Solved the first atomic-resolution structure of a human cell receptor.
  • Formalized the activation mechanism of a human cell receptor involved in cancer.
  • Analyzed protein-protein interactions using hydrogen-deuterium exchange mass spectrometry.

Technologies/Techniques used:

  • X-Ray Crystallography
  • Hydrogen-Deuterium Exchange Mass-Spectrometry
  • Cell Culture

Publications

  • Kumar S, Pioszak A, Zhang C, Swaminathan K, Xu HE. Crystal structure of the PAC1R extracellular domain unifies a consensus fold for hormone recognition by class B G-protein coupled receptors. PLoS One. 2011;6(5):e19682. doi: 10.1371/journal.pone.0019682. Epub 2011 May 19. PubMed PMID: 21625560; PubMed Central PMCID: PMC3098264.

  • Pal K, Kumar S, Sharma S, Garg SK, Alam MS, Xu HE, Agrawal P, Swaminathan K. Crystal structure of full-length Mycobacterium tuberculosis H37Rv glycogen branching enzyme: insights of N-terminal beta-sandwich in substrate specificity and enzymatic activity. J Biol Chem. 2010 Jul 2;285(27):20897-903. doi: 10.1074/jbc.M110.121707. Epub 2010 May 5. PubMed PMID: 20444687; PubMed Central PMCID: PMC2898361.

  • Kumar S, Badireddy S, Pal K, Sharma S, Arora C, Garg SK, Alam MS, Agrawal P, Anand GS, Swaminathan K. Interaction of Mycobacterium tuberculosis RshA and SigH is mediated by salt bridges. PLoS One. 2012;7(8):e43676. doi: 10.1371/journal.pone.0043676. Epub 2012 Aug 24. PubMed PMID: 22937074; PubMed Central PMCID: PMC3427169.

  • Kumar S, Krishnamoorthy K, Mudeppa DG, Rathod PK. Structure of Plasmodium falciparum orotate phosphoribosyltransferase with autologous inhibitory protein-protein interactions. Acta Crystallogr F Struct Biol Commun. 2015 May;71(Pt 5):600-8. doi: 10.1107/S2053230X1500549X. Epub 2015 Apr 21. PubMed PMID: 25945715; PubMed Central PMCID: PMC4427171.

  • Mudeppa DG, Kumar S, Kokkonda S, White J, Rathod PK. Topoisomerase II from Human Malaria Parasites: EXPRESSION, PURIFICATION, AND SELECTIVE INHIBITION. J Biol Chem. 2015 Aug 14;290(33):20313-24. doi: 10.1074/jbc.M115.639039. Epub 2015 Jun 8. PubMed PMID: 26055707; PubMed Central PMCID: PMC4536438.

  • Chery L, Maki JN, Mascarenhas A, Walke JT, Gawas P, Almeida A, Fernandes M, Vaz M, Ramanan R, Shirodkar D, Bernabeu M, Manoharan SK, Pereira L, Dash R, Sharma A, Shaik RB, Chakrabarti R, Babar P, White J 3rd, Mudeppa DG, Kumar S, Zuo W, Skillman KM, Kanjee U, Lim C, Shaw-Saliba K, Kumar A, Valecha N, Jindal VN, Khandeparkar A, Naik P, Amonkar S, Duraisingh MT, Tuljapurkar S, Smith JD, Dubhashi N, Pinto RG, Silveria M, Gomes E, Rathod PK. Demographic and clinical profiles of Plasmodium falciparum and Plasmodium vivax patients at a tertiary care centre in southwestern India. Malar J. 2016 Nov 25;15(1):569. PubMed PMID: 27884146; PubMed Central PMCID: PMC5123287.

  • Kumar S, Mudeppa DG, Sharma A, Mascarenhas A, Dash R, Pereira L, Shaik RB, Maki JN, White J 3rd, Zuo W, Tuljapurkar S, Duraisingh MT, Gomes E, Chery L, Rathod PK. Distinct genomic architecture of Plasmodium falciparum populations from South Asia. Mol Biochem Parasitol. 2016 Nov - Dec;210(1-2):1-4. doi: 10.1016/j.molbiopara.2016.07.005. Epub 2016 Jul 22. PubMed PMID: 27457272; PubMed Central PMCID: PMC5249597.

Skills & Tools

Python

  • NumPy
  • Pandas
  • Scikit-learn
  • Jupyter notebook
  • Keras
  • TensorFlow
  • Bokeh
  • Flask

R

  • ggplot2
  • Caret
  • BiocParallel
  • Lubridate
  • Knitr
  • Stringr
  • RColorBrewer

Machine Learning

  • Linear regression
  • Logistic regression
  • Gradient boosting
  • Random forests
  • SVM
  • Unsupervised learning
  • Neural networks
  • DC-GAN
  • LSTM

Web

  • Shiny web apps
  • Markdown
  • Jekyll
  • Elastic Beanstalk (AWS)
  • Bootstrap

Systems

  • Ubuntu Linux
  • EC2 (AWS)
  • Docker
  • Git
  • SGE clusters

Others

  • Protein crystallography
  • Cryo-electron microscopy
  • Evolution and population structure
  • Biochemistry
  • Gene cloning
  • Mass-spectrometry

Education

  • Ph.D. Biophysics and Structural Biology
    National University of Singapore
    2006 - 2011
  • B. Tech. Bioinformatics
    Vellore Institute of Technology
    2002 - 2006

Certification

  • Nanodegree in Deep Learning
    Udacity
    2017

Awards

  • Graduate Research Scholarship

Language

  • English (Native)
  • Hindi (Native)

Interests

  • Cricket
  • Tennis
  • Non-fiction books