I graduated from Santa Clara University with a degree in Accounting & Information Systems and began pursuing a career in IT Risk Management. After teaching myself Data Science and Machine Learning (check out my projects below!), I fell in love with the field and left to study it full-time in UCLA Anderson's M.S. in Business Analytics program!
In the short term, I aspire to become a Data Scientist working with experiments to improve products, analytics to support decision making, and machine learning to drive business outcomes. In the long term, I wish to become a data-driven leader in technology, working closely with product teams to develop products and services that consumers and businesses love. I'm always open to connecting with other people, so please feel free to say "hello" and we can share stories!
I'm proficient in Python, R, machine learning, data visualization, and causal inference techniques. In addition to my DS and ML experience, I also have experience with Backend Development (building RESTful APIs) and building solutions around AWS services.
I like building models and applications, but I recently found that actually deploying them and thinking of how to architect them can be just as fun and interesting!
Hover for more...
Distributed Processing | Spark ML
Sparkify is an imaginary music app company. I used a small subset (128MB) of their user activity data to predict churn in a Jupyter notebook, then applied the same workflow to a larger dataset (12GB) on a 4-node AWS EMR cluster.
I engineered various user-level metrics from event-level data using PySpark, and trained a classifier with PySpark's ML module to predict churn with about 88% accuracy on test data!
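As a small-scale illustration of that event-to-user-level rollup, here is a sketch in pandas on made-up toy events (the actual project did this with PySpark DataFrames on the Sparkify logs; the column names and values below are invented):

```python
import pandas as pd

# Hypothetical event-level log: one row per user action
events = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 3],
    "page":   ["NextSong", "Thumbs Down", "Cancellation Confirmation",
               "NextSong", "NextSong", "Home"],
})

# Roll event-level rows up to user-level features
features = events.groupby("userId").agg(
    n_events=("page", "size"),
    n_thumbs_down=("page", lambda s: (s == "Thumbs Down").sum()),
)

# Label: a user churned if they ever hit the cancellation page
features["churn"] = (
    events.groupby("userId")["page"]
    .apply(lambda s: int((s == "Cancellation Confirmation").any()))
)
print(features)
```

The resulting one-row-per-user table is what a classifier can train on.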
I published an article on Towards Data Science explaining my findings. Click the button below to read it! Also, click here for a link to the Jupyter Notebook and here for the GitHub Repo.
Deep Learning | Computer Vision
This is a Python command-line application that lets a user specify a Convolutional Neural Network architecture to train on a set of images and save the trained model checkpoint.
The user can also load the saved model to guess the class of an image and save the predictions in a local .csv file. Check out a brief demo here.
For more details on how the model was built, click here for the GitHub repo and the button below for a notebook that explains the preprocessing and training process.
Machine Learning | Data Cleaning
This is a blog post, checklist, and exercise notebook (.ipynb) that anyone can use to practice and guide their machine learning projects.
It covers everything from data exploration and cleaning to preparation and ML modeling. Check out the checklist here!
Gradient Boosting | Hyperopt Optimization
An entry in the IEEE-CIS Fraud Detection Kaggle competition, working with two datasets: 1) credit card transactions and 2) anonymized user information, combined to create a dataset with over 400 features.
Used feature engineering, feature selection, Extreme Gradient Boosting (XGBoost), and Hyperopt optimization to fine-tune the algorithm.
Achieved a 0.896 ROC AUC on test data.
Matrix Factorization | Recommendation Systems
A notebook analysis exploring the following recommendation-system techniques, using data on IBM Watson's articles and user interactions...
Knowledge-Based: based on popularity
Collaborative Filtering: based on the most popular articles read by similar users
Content-Based Filtering: based on Euclidean distance between TF-IDF vectors of article descriptions
Matrix Factorization: SVD used to predict whether a user will interact with an item.
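The SVD step can be sketched in a few lines of NumPy on a made-up interaction matrix (the real analysis used the IBM Watson data; everything below is toy data):

```python
import numpy as np

# Hypothetical user-item matrix: 1 means the user interacted with the article
interactions = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
], dtype=float)

# Factorize, then reconstruct with only the top-k latent factors
U, s, Vt = np.linalg.svd(interactions, full_matrices=False)
k = 2
preds = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
# preds[u, i] scores how likely user u is to interact with article i
```

Keeping only the top k factors is what turns exact reconstruction into a generalizing prediction.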
GitHub repo here.
Data Analysis | Model Interpretation
One of the most important steps in a data science/machine learning project is communicating results.
This analysis aims to answer 3 questions...
1. What are the strongest predictors for listing price?
2. Do hosts with many properties give better or worse service than hosts with only one?
3. Do reviews matter when considering price?
Check out my Jupyter Notebook for the code and the Medium post below explaining my findings.
Also, GitHub repo here.
NLP Preprocessing | Text Classification
It's important to find people in need in times of crisis. If messages sent during a natural disaster, attack, or other emergency can be quickly identified as important, lives can be saved.
This project uses basic NLP preprocessing techniques with a Bernoulli Naive Bayes classifier to predict whether a given message falls under one or more of 36 possible disaster-related categories.
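A minimal sketch of that kind of pipeline in scikit-learn, with made-up messages and a single binary label standing in for the 36 categories (CountVectorizer with binary=True produces the presence/absence features Bernoulli Naive Bayes expects):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages and one binary "disaster-related" label;
# the real project predicts membership in 36 categories
messages = ["we need water and food", "the bridge is flooded",
            "happy birthday to my friend", "send drinking water please"]
labels = [1, 1, 0, 1]

# binary=True yields 0/1 word-presence features for the Bernoulli model
model = make_pipeline(CountVectorizer(binary=True), BernoulliNB())
model.fit(messages, labels)
print(model.predict(["no water here, please help"]))
```

For the multi-label case, scikit-learn's MultiOutputClassifier can wrap the same pipeline, one Bernoulli NB per category.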
Click the button below for a dashboard that you can use to interact with a trained model. More details are on my GitHub here.
Dimensionality Reduction | Clustering
In this project, I used unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany.
These segments can then be used to direct marketing campaigns toward audiences with the highest expected rate of return.
The data that was used was provided by Udacity's partners at Bertelsmann Arvato Analytics.
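A minimal sketch of the reduce-then-cluster pattern with scikit-learn on synthetic data (the actual project used the Arvato demographic features; the blob data below is invented purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic stand-in for the demographics table: two separable groups
rng = np.random.default_rng(0)
population = np.vstack([rng.normal(0, 1, (100, 10)),
                        rng.normal(5, 1, (100, 10))])

# Reduce dimensionality first, then cluster in the reduced space
reduced = PCA(n_components=3).fit_transform(population)
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
```

Reducing first keeps the clustering from being dominated by noisy or redundant columns, which matters when the raw data has hundreds of features.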
GitHub repo here.
Model Evaluation | Model Tuning
This is a supervised learning notebook analysis created to identify potential donors toward whom a fictitious organization can direct its marketing efforts. It was originally developed for a workshop I gave at UCLA.
It incorporates a data preprocessing pipeline, a simple model selection process, and hyperparameter tuning with Randomized Search Cross-Validation.
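The tuning step can be sketched with scikit-learn's RandomizedSearchCV on toy data (the model, parameter ranges, and dataset below are illustrative stand-ins, not the workshop's):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the donor dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Sample random parameter combinations and score each with 3-fold CV
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(50, 200),
                         "max_depth": randint(2, 10)},
    n_iter=5, cv=3, random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

Unlike an exhaustive grid search, the randomized version caps the number of fits at n_iter, which keeps tuning tractable on larger spaces.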
GitHub repo here.
Data Visualization | Tableau
Game of Thrones Deaths
A dashboard visualizing various deaths, methods, and houses from the Game of Thrones franchise.
A high-level dashboard meant for strategic business decision-making.
Two dashboards and a story using YouTube video and tag data.
Data Visualization | Data Munging
One of the biggest problems my team at PwC faced was monitoring the progress of the Associates' and Managers' IT Audit testing.
I built and deployed this interactive dashboard to summarize project testing metrics. You can slice data by person, team, testing period, and more.
Click on the button below to interact with it using fake data!