Education

M.S. in Statistical Practice, Carnegie Mellon University, May 2021
B.S. in Statistics & Machine Learning, Carnegie Mellon University, Dec 2019

Skills

Computing: Python (TensorFlow, pandas, PyTorch, Keras, NumPy, Jupyter, spaCy, NLTK, scikit-learn, Matplotlib),
R, SQL/KQL, PySpark, C/C++, Java, Standard ML (SML), Visual Molecular Dynamics (VMD)

AI/ML Skills: Large Language Models, Deep Learning, Natural Language Processing, Machine Learning, Data Mining, Modern Regression, Statistical Inference, Probability Theory, Imperative/Functional Programming, Time Series Analysis, Causal Analysis, A/B Testing

Languages

English, Chinese, Japanese, French, Latin

Work Experience

Microsoft

Data & Applied Scientist II

July 2022 - Present

• Employ GPT-3 to streamline on-call engineering operations by distilling customer queries and summarizing prior actions, shortening response times during handoffs
• Develop an automated monitoring pipeline using REST APIs to continuously evaluate system performance and trigger alerts on performance degradation, enabling rapid issue identification and resolution
• Conduct A/B testing to compare Time to Mitigation (TTM) between two groups, confirming that transitioning to the enhanced plan significantly reduced TTM and improved incident response and operational efficiency
• Leverage BERT to automate and accelerate tagging of incoming customer incident tickets through keyword extraction and feedback classification, reaching 82% precision and saving hundreds of hours of manual tagging per year
• Query and run aggregated analysis in R on past incident tickets to develop a time series model that forecasts incident volumes by region and product offering, helping prepare internal resources for incoming customer escalations
• Design and implement a scoring system in PySpark and SQL that rates the content of customer email replies, creating a new customer satisfaction metric and suggesting best practices for support engineers

TuSimple

Data Scientist

August 2021 - July 2022

• Created pipelines to synchronize raw data sources into clean datasets ready for statistical analysis and modeling
• Classified driving patterns with supervised and unsupervised learning to provide a baseline for modeling
• Built a stochastic process model for event aggregation, increasing accuracy by 32%
• Initiated a study group within the data science team on internal toolkits

Pittsburgh Penguins

Analytics Consultant

January - May 2021

Hockey prospects traditionally reach the NHL by first playing in “development leagues” such as the USHL, NCAA and AHL. This project investigated the causal effect of players’ development paths on their performance and success in the NHL.

• Web-scraped and aggregated biographical and performance information for 15,786 players from https://www.eliteprospects.com/
• Selected 1,830 forwards and 996 defensemen who played in a development league in development year 0
• Ranked predictors by importance and reduced the set to 7 (Games, Goals, Assists, PenaltyMinutes, Position, Height (cm), Weight (kg))
• Applied Bayesian Additive Regression Trees (BART) to model performance by additively combining many small (weak-learner) trees to reduce bias
• Estimated projected performance across development paths using the conditional average treatment effect (CATE)
• Found that players developing through the NCAA are projected to score 13.36 more points (forwards) and 14.65 more points (defensemen) than players developing through the USHL

Little Moochi

Machine Learning Scientist

Dec 2019 - August 2020

• Implemented a back-end algorithm at student startup Little Moochi to expand the product, which gained 500+ users after the app's launch in February
• Constructed a convolutional neural network in Python that detects ingredients in users’ food images
• Built an NLP model with skip-thought vectors to match recipes with corresponding images for ingredient analysis
• Developed a nutrition database for food plate classification to provide users with food suggestions based on eating style
• Collaborated with front-end developers and graphic designers to understand users’ needs and enhance user experience through functional, cohesive code

Pittsburgh Supercomputing Center

Deep Learning Research Intern

May 2019 - May 2020

• Automated the detection of axons in zebrafish microscopy data with convolutional neural networks in TensorFlow
• Experimented with network structure changes, cell rotation and parameter tuning, improving accuracy to 95%
• Presented at a symposium to 40+ professors and students at Pittsburgh Supercomputing Center
• Submitted an 8-page research paper to PEARC ’20 and facilitated a virtual presentation

Ogata Lab, Institute for Chemical Research, Kyoto University

Amgen Scholar

June - Aug 2018

Project: Insight into the role of glycosyltransferases encoded in PkV-RF01 using sequence-based comparison with other viruses
• Selected as one of 23 Amgen Scholars at Kyoto University from over 400 applicants
• Manually processed CAZydb, a database of carbohydrate-active enzyme annotations drawn from experiments and literature, to compare PkV-RF01 against all known carbohydrate-active enzymes
• Employed HMMER to annotate CAZyme domain boundaries according to the dbCAN CAZyme domain HMM database
• Extracted protein sequences of viruses of interest, including PkV-RF01, from Virus-Host Database with Python to perform a homology search
• Employed dbCAN to search protein sequences in PkV-RF01 against glycosyltransferase HMM profiles with e-value = 10^-5 and coverage = 0.35 to find significant hits
• Mapped protein sequences from previous steps and filtered output to obtain only glycosyltransferase-related genes
• Searched genes against known viruses to infer functions of glycosyltransferase-related genes in PkV-RF01
• Delivered a poster and an oral presentation at a symposium to Amgen Scholars and professors from Kyoto University and the University of Tokyo

Genewiz

Bioinformatics Intern

May - Aug 2017

• Developed bioinformatics software, algorithms and statistical methods to detect how non-coding RNAs are involved in disease development
• Derived insights from multi-level data through statistical analysis in R, including PCA and SVM
• Collaborated with biologists, NGS technologists and software developers across the company

Projects

Wikipedia Article Question & Answer System

Class: Natural Language Processing

• Implemented a question type detector based on the top-level nonterminal node of the question's parse tree using spaCy and NLTK
• Built and trained a Recurrent Neural Network for sentiment analysis using PyTorch
• Used mean-pooled GloVe embeddings for retrieving semantically relevant sentences
• Used spaCy for tokenization, POS tagging, and dependency parsing
• Created a sentence relevance ranker with TF-IDF

Classifying Asians Using Face Recognition by Fine-Grained Deep Learning

Class: Deep Learning

• Cleaned 54,852 prelabeled profile pictures of Twitter users from China, Japan and Korea and resized images with OpenCV to capture important facial structures and features for an efficient training process
• Built a Convolutional Neural Network with Keras containing a pretrained 16-layer VGG with 3 fully connected layers
• Achieved 62.3% classification accuracy versus 33.33% by chance and 49% human accuracy

What Makes Shakespeare Distinctive? A Textual Analysis of Oxford Dictionary of National Biography

• Converted words to numerical vectors as word embeddings for quantitative analysis in Python
• Compared word usage in Shakespeare with that of other time periods using linear transformations of word vectors
• Analyzed word context using Principal Component Analysis and K-Means Clustering to determine distinct properties of Shakespeare’s language

Detecting Alternate Start Codons in RNA

• Used Python to identify the positions of the main annotated start codons in over 15,000 genes and look for features shared among them
• Applied Lasso Regression to select the most important features of upstream start codons
• Trained a machine learning model in R to predict alternative start codons from large ribosome profiling datasets
• Halved the runtime of the previous algorithm and achieved 93% accuracy

Predicting Airbnb Price and Review Score

Class: Data Mining

• Performed EDA to clean data and remove outliers on 5,201 listings
• Converted geographic coordinates (latitude-longitude) into counts of nearby landmarks within a 1-mile radius and quantified amenities per listing for feature engineering
• Tuned parameters on Random Forest models with 5-fold cross-validation for price prediction
• Reduced training and validation MSE in price prediction from 14,288 and 14,574 (baseline linear regression) to 1,372.49 and 5,951.704, respectively
• Used XGBoost for variable selection and hyperparameter search to predict review score (binary: 0 or 1)
• Improved accuracy in review prediction from 0.862 (baseline linear regression) to 0.9206