 
									
									Data & Applied Scientist II
 Microsoft 
 July 2022 - Present
							M.S. in Statistical Practice, Carnegie Mellon University, May 2021
							B.S. in Statistics & Machine Learning, Carnegie Mellon University, Dec 2019
 
Computing: Python (TensorFlow, pandas, PyTorch, Keras, NumPy, Jupyter, spaCy, NLTK, scikit-learn, Matplotlib),
 R, SQL/KQL, PySpark, C/C++, Java, Standard Meta Language (SML), Visual Molecular Dynamics (VMD)
AI/ML Skills: Large Language Models, Deep Learning, Natural Language Processing, Machine Learning, Data Mining, Modern Regression, Statistical Inference, Probability Theory, Imperative/Functional Programming, Time Series Analysis, Causal Analysis, A/B Testing
 
						English, Chinese, Japanese, French, Latin
 
						 
									
									 Microsoft 
 July 2022 - Present
 
									
									 TuSimple 
 Aug 2021 - July 2022
 Pittsburgh Penguins 
 January - May 2021
 Little Moochi 
 Dec 2019 - August 2020
 
									
									Pittsburgh Supercomputing Center 
 May 2019 - May 2020
 
									
Ogata Lab, Kyoto University
 May - Aug 2018
 
									
									Genewiz 
 May - Aug 2017
• Employ GPT-3 to streamline on-call engineering operations by distilling customer queries and summarizing prior actions for optimized response times during handoffs
												• Develop an automated monitoring pipeline utilizing REST APIs to continuously evaluate system performance, proactively triggering alerts in the event of performance degradation, ensuring rapid issue identification and resolution
												• Conduct A/B testing to compare Time to Mitigation (TTM) metrics between two groups, confirming that transitioning to the enhanced plan resulted in a significant reduction in TTM, leading to enhanced incident response and operational efficiency
• Leverage BERT to automate and accelerate the tagging of incoming customer incident tickets through keyword extraction and feedback classification, reaching 82% precision and saving hundreds of hours of manual tagging per year
												• Query and run aggregated analysis using R on past incident tickets to develop a time series model capable of forecasting future incident volumes based on region and product offering to prepare internal resources for incoming customer escalations
												• Design and implement a scoring system using PySpark and SQL on customer email replies to create a new customer satisfaction metric by rating email content and suggesting best practices for support engineers
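The A/B comparison of Time to Mitigation (TTM) described above can be illustrated with a two-sided permutation test on the difference in group means. This is only a minimal sketch: the source does not say which statistical test was used, and the function name `permutation_test` and the TTM values in the usage note are hypothetical.

```python
import random
from statistics import mean


def permutation_test(control, treatment, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in mean TTM.

    Returns (observed_difference, estimated_p_value), where the difference
    is mean(control) - mean(treatment). Assumption: group labels are
    exchangeable under the null hypothesis of no difference.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = mean(control) - mean(treatment)
    pooled = list(control) + list(treatment)
    n_control = len(control)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign observations to the two groups.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_control]) - mean(pooled[n_control:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_permutations + 1)
```

For example, with hypothetical TTM values in minutes, `permutation_test([120, 135, 150, 142], [90, 85, 100, 95])` returns a positive difference when the treatment group mitigates incidents faster, along with an estimated p-value for that difference.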
• Created pipelines to synchronize raw data sources into clean datasets ready for statistical analysis and modeling
• Classified driving patterns with supervised and unsupervised learning to provide a baseline for modeling
 
												• Built a stochastic process model for event aggregation to increase accuracy by 32% 
 
• Initiated a study group within the data science team on internal toolkits
There are multiple traditional paths hockey prospects can take to reach the NHL: they play in "development leagues" such as the USHL, NCAA, and AHL before moving on to the NHL. This project investigates the causal effect of players' development paths on their performance and success in the NHL.
 
• Scraped and aggregated biological and performance information for 15,786 players from https://www.eliteprospects.com/
 
											• Selected 1830 forward and 996 backward players who played in some developmental leagues in development year 0  
 
• Ranked predictors by importance and reduced the number of predictors to 7 (Games, Goals, Assists, PenaltyMinutes, Position, Height (cm), Weight (kg))
 
• Applied Bayesian Additive Regression Trees (BART) to model performance, additively combining many small (weak-learner) trees to reduce bias
 
• Calculated projected performance across different development paths using the conditional average treatment effect (CATE)
• Found that players who developed through the NCAA are projected to score 13.36 more points (forward players) and 14.65 more points (backward players) than players who developed through the USHL
• Implemented back-end algorithms at a student startup, Little Moochi, to expand the product, which gained 500+ users after the app's launch in February
• Constructed a convolutional neural network in Python that detects ingredients in users' food images
• Built an NLP model with skip-thought vectors to match recipes with corresponding images for ingredient analysis
• Developed a nutrition database for food plate classification to provide users with food suggestions based on eating style
• Collaborated with front-end developers and graphic designers to understand users' needs and enhance the user experience through functional and cohesive code
• Automated the detection of axons in zebrafish microscopy data with convolutional neural networks in TensorFlow 
• Experimented with network structure changes, cell rotation, and parameter tuning to improve accuracy to 95%
• Presented at a symposium to 40+ professors and students at Pittsburgh Supercomputing Center
• Submitted an 8-page research paper to PEARC '20 and facilitated a virtual presentation
Project: Insight into the role of glycosyltransferases encoded in PkV-RF01 using sequence-based comparison with other viruses
 
• Selected as one of 23 Amgen Scholars at Kyoto University from over 400 applicants;
 
• Manually worked with CAZydb, a database dedicated to annotating Carbohydrate-Active Enzymes from experiments and literature, to compare PkV-RF01 against all known carbohydrate-active enzymes;
 
• Employed HMMER to annotate CAZyme domain boundaries according to the dbCAN CAZyme domain HMM database;
• Extracted protein sequences of viruses of interest, including PkV-RF01, from Virus-Host Database with Python to perform a homology search
 
• Employed dbCAN to search protein sequences in PkV-RF01 against glycosyltransferase HMM profiles with e-value = 1e-5 and coverage = 0.35 to find significant hits;
 
• Mapped protein sequences from previous steps and filtered the output to obtain only glycosyltransferase-related genes;
 
• Searched genes against known viruses to infer the functions of glycosyltransferase-related genes in PkV-RF01;
 
• Gave a poster and an oral presentation at a symposium to Amgen Scholars and professors from Kyoto University and the University of Tokyo
• Developed bioinformatics software, algorithms, and statistical methods to detect how non-coding RNAs are involved in disease development
											• Derived insight through statistical analysis including PCA and SVM of multi-level data with R
											• Collaborated with biologists, NGS technologists and software developers across the company
• Implemented a question type detector based on the top-level nonterminal node of each question's parse tree using spaCy and NLTK
• Built and trained a Recurrent Neural Network for sentiment analysis using PyTorch 
• Used mean-pooled GloVe embeddings for retrieving semantically relevant sentences
• Used spaCy for tokenization, POS tagging, and dependency parsing
• Created a sentence relevance ranker with TF-IDF
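A sentence relevance ranker of the kind listed above can be sketched with plain TF-IDF vectors and cosine similarity. This is an illustrative sketch only, not the project's actual code: the tokenizer, the smoothed IDF formula, and the names `tokenize` and `rank_sentences` are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_sentences(question, sentences):
    """Return (score, sentence) pairs sorted by TF-IDF cosine similarity to the question."""
    docs = [tokenize(s) for s in sentences]
    n_docs = len(docs)
    # Document frequency: the number of sentences in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))

    def idf(term):
        # Smoothed IDF, so question terms unseen in the sentences stay finite.
        return math.log((1 + n_docs) / (1 + df[term])) + 1.0

    def vectorize(tokens):
        # Sparse TF-IDF vector stored as {term: weight}.
        return {t: c * idf(t) for t, c in Counter(tokens).items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    q_vec = vectorize(tokenize(question))
    scored = [(cosine(q_vec, vectorize(doc)), s) for doc, s in zip(docs, sentences)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Calling `rank_sentences(question, sentences)` returns the candidate sentences ordered from most to least similar to the question, so the first pair holds the best match.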
• Cleaned 54,852 prelabeled profile pictures of Twitter users from China, Japan and Korea and resized images with OpenCV to capture important facial structures and features for an efficient training process 
• Built a Convolutional Neural Network with Keras containing a pretrained 16-layer VGG with 3 fully connected layers 
• Achieved 62.3% classification accuracy versus 33.33% by chance and 49% human accuracy
• Converted words to numerical vectors as word embeddings for quantitative analysis in Python 
• Compared words and usages in Shakespeare with different time periods using linear transformation of word vectors 
• Analyzed word context using Principal Component Analysis and K-Means Clustering to determine distinct properties of Shakespeare’s language
• Used Python to identify positional information for the main annotated start codons of over 15,000 genes and look for features shared among them
• Applied Lasso Regression to select the most important features of upstream start codons 
• Trained machine learning model in R to predict alternative start codons from large ribosome profiling datasets 
• Improved the efficiency of the previous algorithm by halving its runtime and achieved 93% accuracy
• Performed EDA to clean data and remove outliers on 5,201 listings. 
• Converted geographic information (latitude-longitude) into the number of nearby landmarks within a 1-mile radius and quantified amenities per listing for feature engineering
• Tuned parameters on Random Forest models with 5-fold cross validation for price prediction.
• Reduced training and validation MSE in price prediction from 14288 and 14574 (baseline linear regression model) to 1372.49 and 5951.704, respectively.
• Used XGBoost for variable selection and hyperparameter search to predict review score (2 levels: 0 or 1).
• Improved accuracy in review prediction from 0.862 (baseline linear regression model) to 0.9206.
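The landmark-count feature mentioned above (landmarks within a 1-mile radius of each listing) can be computed from latitude-longitude pairs with the haversine great-circle distance. The following is a sketch under assumptions: the original project's distance method is not stated, and the function names and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def count_nearby_landmarks(listing, landmarks, radius_miles=1.0):
    """Count landmarks within radius_miles of a listing.

    listing is a (lat, lon) tuple; landmarks is an iterable of (lat, lon) tuples.
    """
    lat, lon = listing
    return sum(
        1
        for l_lat, l_lon in landmarks
        if haversine_miles(lat, lon, l_lat, l_lon) <= radius_miles
    )
```

The resulting count can be added as one numeric column per listing before model fitting. For large datasets a spatial index would be faster than this pairwise loop, but the loop keeps the idea easy to follow.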
								
								Applied Linear Models
 
								Algorithms & Advanced Data Analysis
 
								Data Mining  
 
								Deep Learning  
 
								Functional Programming 
 
								Fundamentals of Programming and Computer Science 
  Great Ideas in Theoretical Computer Science 
 
								Machine Learning (Graduate Level)
								Modern Regression 
 
								Natural Language Processing
 
								Principles of Imperative Computation
 
								Probability Theory and Random Processes 
								Statistical Computing
 
								Statistical Machine Learning
 
								Statistical Methods in Epidemiology
 
								Statistical Inference