The secrets to building a robust portfolio from scratch - with descriptions and data sources for projects.
While educational credentials matter, the key to landing your first data science job is a robust portfolio that showcases the full-stack data science skills you’ll need to succeed as a professional data scientist. If you’re job searching as a new grad or career switcher, independent projects are even more essential to ensure your resume doesn’t sail straight into the reject pile.
As you build your portfolio, choose projects that show off a variety of skills -- from data visualization to statistical analysis and programming. Explain your methodology in detail -- where you sourced the data, how you cleaned the data, and what machine learning algorithms you selected -- as this shows hiring managers how you think and helps you stand out.
If you’ve recently switched careers, use it to your advantage. Create projects that show off your domain expertise, whether that’s healthcare, education, or sales. As business needs grow more complex, hiring managers increasingly look to hire candidates who have deep industry knowledge.
Here are five data science skills that will look good on your resume, with corresponding project ideas to help you show off each.
Also known as “opinion mining,” sentiment analysis uses natural language processing to analyze text and speech and identify the emotions expressed in them. The same techniques have even been applied to look for biomarkers associated with various psychiatric diseases.
Luckily, there is plenty of publicly available data to pull from for a sentiment analysis project -- social media being one of them. You can aggregate tweets or Facebook posts to analyze how customers feel about a particular brand, or what voters think of a political candidate or trending issue such as climate change.
Aspect-based sentiment analysis involves mining the data to understand sentiment around specific keywords or topics, such as the customer service at a certain restaurant or the response to the Black Lives Matter movement. Try to narrow down your analysis and cross-reference variables like demographic or geographic data. For example, how do gender, political affiliation and household income correlate with support for racial justice campaigns?
Ideally, your project should go beyond simply categorizing sentiment as positive or negative. Fine-grained sentiment analysis recognizes gradations of emotion (e.g., “negative” versus “very negative”), and more complex data science models can even make sophisticated inferences about intentions or urgency.
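To make the idea of graded sentiment concrete, here is a minimal sketch of a lexicon-based scorer that maps text onto five grades instead of two. The word list, the polarity values, and the grade cutoffs are all made up for illustration; a real project would use a trained model or an established lexicon.

```python
# Toy lexicon: word -> polarity in [-2, 2]. Values are invented for this example.
LEXICON = {
    "love": 2, "excellent": 2, "good": 1, "fine": 1,
    "slow": -1, "bad": -1, "terrible": -2, "hate": -2,
}
NEGATORS = {"not", "never", "no"}


def score(text: str) -> float:
    """Average polarity of lexicon words, flipping the sign right after a negator."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    total, hits, negate = 0, 0, False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
            continue
        if tok in LEXICON:
            total += -LEXICON[tok] if negate else LEXICON[tok]
            hits += 1
        negate = False  # a negator only affects the token that follows it
    return total / hits if hits else 0.0


def label(text: str) -> str:
    """Map the average score onto five sentiment grades (cutoffs are assumptions)."""
    s = score(text)
    if s <= -1.5:
        return "very negative"
    if s < 0:
        return "negative"
    if s == 0:
        return "neutral"
    if s < 1.5:
        return "positive"
    return "very positive"
```

Calling `label("I hate this terrible app")` falls into the strongest negative grade, while `label("The service was slow.")` is only mildly negative. The simple negation rule is one of many places where a toy like this breaks down, which is a good thing to discuss in your write-up.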
Add a data visualization aspect to your project by showing how sentiment changes over time, or by cross-referencing some other variable, such as geographic location. If you’re performing a sentiment analysis of a brand’s reputation, pay attention to how customer sentiment changes in response to specific news events. For example, how did riders feel about Uber and Lyft when Prop 22 passed in California, safeguarding the rideshare companies from having to classify their drivers as employees?
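Before you plot anything, you need sentiment aggregated by time. The sketch below assumes you already have one score per post (from whatever model you chose) and shows one possible way to compute daily averages and a before/after comparison around an event. The data points and the event date are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical (date, sentiment score) pairs, e.g. produced by a sentiment model.
posts = [
    (date(2020, 11, 1), 0.2), (date(2020, 11, 1), -0.1),
    (date(2020, 11, 2), 0.1),
    (date(2020, 11, 4), -0.6), (date(2020, 11, 4), -0.4),
    (date(2020, 11, 5), -0.3),
]
event = date(2020, 11, 3)  # hypothetical event date chosen for the example


def daily_mean(rows):
    """Average sentiment per calendar day, ordered by date."""
    by_day = defaultdict(list)
    for day, s in rows:
        by_day[day].append(s)
    return {day: mean(scores) for day, scores in sorted(by_day.items())}


def before_after(rows, event_day):
    """Mean sentiment strictly before vs. on or after the event day."""
    before = [s for d, s in rows if d < event_day]
    after = [s for d, s in rows if d >= event_day]
    return mean(before), mean(after)
```

The dictionary returned by `daily_mean(posts)` can be passed straight to a charting library as x and y values, and `before_after(posts, event)` gives you a single headline comparison to annotate on the chart.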
Businesses use sentiment analysis to detect sentiment in emails, survey responses, online reviews, phone calls and online chats. While most of this data is proprietary, you can scrape public review sites like Yelp, or forums like Quora and Reddit, to obtain subjective data. Other sources include Amazon product reviews, news articles, the comments section on YouTube, posts on Facebook and Twitter, and movie reviews on IMDb. Don't forget to check what's available on Kaggle.
Recommendation engines are one of the most valuable applications of machine learning -- they’re integral to Netflix’s business model, news feeds on Facebook and Instagram, playlists on music streaming sites like Spotify and Pandora, and ecommerce sites like Amazon.
Building a recommender involves training a model to segment a user into a particular persona based on actions they’ve taken or choices they’ve made, then recommending items that correspond to that persona. Sometimes, data on user behavior is cross-referenced with other variables, such as product ratings. Use your project to show hiring managers you know how to use Python or R libraries, tune a model’s parameters to improve its performance, and perform exploratory data analysis. (Quick note: if you're not fluent in Python, it's OK -- use any language you prefer.)
Interested in fitness or the health sciences? Build an app that makes lifestyle recommendations based on metrics like sleep time and level of physical activity. Recommendations for music, movies, and books are more obvious -- and there are many publicly available datasets to support these projects.
Another popular project is building a movie recommendation engine, using simple classifiers to solve a real-world need, like finding the best movie to watch on a Sunday.
Since you probably don’t have access to data on user behavior, ratings are a good place to start when you’re building a recommender. If you’re building a recommender for music, movies, video games, or books, you can pull ratings data from sources such as Last.fm, MovieLens or Goodreads. There's also some great data on Kaggle.
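Here is a minimal sketch of how a ratings-only recommender can work, using user-based collaborative filtering with cosine similarity. It is one common approach among several, not the only way to build a recommender; the users, titles, and ratings are hypothetical.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "ana":  {"Alien": 5, "Heat": 4, "Up": 1},
    "ben":  {"Alien": 4, "Heat": 5, "Coco": 2},
    "cara": {"Up": 5, "Coco": 4, "Alien": 1},
    "dev":  {"Up": 4, "Coco": 5},
}


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(user: str, k: int = 2) -> list:
    """Rank items the user has not rated by similarity-weighted average rating."""
    seen = ratings[user]
    scores, weights = {}, {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(seen, their)
        if sim <= 0:
            continue  # ignore users with no overlap
        for item, r in their.items():
            if item not in seen:
                scores[item] = scores.get(item, 0.0) + sim * r
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(scores, key=lambda i: scores[i] / weights[i], reverse=True)
    return ranked[:k]
```

With this toy data, `recommend("ana")` returns `["Coco"]`, the only title she has not rated. On a dataset this small the rankings are noisy, so treat the output as a demonstration of the mechanics rather than a meaningful recommendation.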
Decision trees represent how computers make decisions using a tree-like graph, where each branch leads to a node with a different attribute. Tree models represent a set of sequential, hierarchical decisions culminating in one or more final results.
Training data enables the algorithms to make predictions (e.g., determining the likelihood that a passenger on the Titanic would have survived based on certain factors) or recommendations (e.g., where to go on vacation or whether or not to go to grad school). This is done using one of two methods: classification trees, which predict a category, or regression trees, which predict a numeric value.
Two of the most mainstream applications of decision trees are credit card fraud detection and financial risk assessments, such as evaluating the probability that a borrower will default on a loan. Building a decision tree shows you know how to help businesses make decisions in the face of uncertainty, which is the fundamental reason why companies hire data scientists in the first place.
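To show what happens at a single node of a decision tree, the sketch below searches for the threshold on one feature that best separates borrowers who defaulted from those who did not, using Gini impurity as the splitting criterion. Gini is one common choice (entropy is another), and the borrower data is hypothetical. A full tree repeats this search recursively on each side of the split.

```python
def gini(labels: list) -> float:
    """Gini impurity of a list of 0/1 labels (0 = pure, 0.5 = evenly mixed)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_threshold(values: list, labels: list):
    """Return (threshold, weighted Gini) for the best split on one numeric feature."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:  # splitting at the max leaves one side empty
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (t, weighted)
    return best


# Hypothetical borrowers: debt-to-income ratio and whether they defaulted (1) or not (0).
dti = [0.10, 0.15, 0.22, 0.35, 0.40, 0.55]
defaulted = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(dti, defaulted)
```

On this toy data the best split is at a ratio of 0.22, which separates the two groups perfectly (impurity 0.0). Real data is never this clean, which is why trees keep splitting and why you need to guard against overfitting.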
If you aren’t interested in financial data, you can perform different types of risk assessments. For example, you can use health examination data from the UCI Machine Learning Repository to predict a patient’s risk of heart disease or diabetes based on the most common predictors of each.
If you’re performing a risk assessment, government datasets like Data.gov are a good place to start. You can use this data to assess the risks and impacts of potential policy decisions or explore specific socioeconomic issues. For weather and climate science-related data, the National Centers for Environmental Information have detailed datasets on local meteorology. Finally, if you’re interested in epidemiology (an especially relevant issue!), look no further than the Centers for Disease Control and Prevention data repository.
Time series prediction involves studying the behavior of metrics over time. Examples include sales forecasting, weather forecasting and predictions regarding crop yield or the stock market. A time series model predicts future values based on previously observed values. One of the most commonly used methods for time series forecasting is the Autoregressive Integrated Moving Average (ARIMA) model, which differences the series to remove trends and then predicts each value as a linear combination of past values and past forecast errors.
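As a minimal sketch of predicting future values from previously observed ones, the code below fits only the autoregressive piece of such a model: each value is regressed on the one before it (equivalent to ARIMA with p=1, d=0, q=0). It uses ordinary least squares for simplicity, whereas a statistics library would typically handle differencing, moving-average terms, and estimation for you. The sales figures are hypothetical.

```python
def fit_ar1(series: list):
    """Least-squares fit of y[t] = c + phi * y[t-1]; returns (c, phi)."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    phi = cov / var
    return mean_y - phi * mean_x, phi


def forecast(series: list, steps: int) -> list:
    """Roll the fitted equation forward from the last observed value."""
    c, phi = fit_ar1(series)
    out, last = [], series[-1]
    for _ in range(steps):
        last = c + phi * last
        out.append(last)
    return out


# Hypothetical monthly sales generated exactly by y[t] = 10 + 0.5 * y[t-1].
sales = [100.0, 60.0, 40.0, 30.0, 25.0, 22.5]
```

Because the toy series follows the equation exactly, `fit_ar1(sales)` recovers a constant of about 10 and a coefficient of about 0.5, and `forecast(sales, 2)` continues the decay toward the long-run level of 20. Real series include noise, trend, and seasonality, which is where the rest of ARIMA comes in.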
If statistical analysis is your strong suit, a time series project will allow your skills to shine. To earn extra brownie points with a hiring manager, create an interactive data visualization that enables the user to compare outcomes over time while adjusting for certain variables. Projects involving customer segmentation and logistic regression are viewed favorably too.
If you’re interested in making predictions about stock prices, you can pull financial data from Yahoo Finance or Bloomberg. Visit the Google Trends database to find inspiration from Google Search data, such as coronavirus search trends or the top gifts during the holiday season.
Epidemiology is another field where time series predictions are extremely valuable -- predicting the spread of a disease such as COVID-19 helps policymakers make key decisions about public health. For this, you can refer to government datasets that show infection rates on a state-by-state basis.
Computer vision is one of the most popular applications for machine learning technology, and involves teaching computers to interpret the content of digital images such as photos and videos. Facial recognition, one of its best-known uses, is useful for preventing retail crime, finding missing persons and unlocking mobile devices.
You can train your machine learning model to recognize specific objects (object detection) using training data that shows images of the same object in different angles and surroundings. The model grows more accurate the more varied and numerous the training data. For example, you can train a model to “read” handwriting or identify a specific landmark.
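The role of varied training examples can be illustrated with something far simpler than a real object detector. The sketch below is a toy nearest-neighbor image classifier: it labels a new image with the label of the most similar training image, measured by the number of differing pixels. The 3x3 "images" and labels are hypothetical, and production systems would use neural networks on real photos instead.

```python
# Hypothetical 3x3 binary "images", flattened row by row; 1 = ink, 0 = blank.
training = {
    "vertical": [
        [0, 1, 0, 0, 1, 0, 0, 1, 0],
        [1, 0, 0, 1, 0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0, 1, 0, 0, 1],
    ],
    "horizontal": [
        [0, 0, 0, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 1, 1],
    ],
}


def hamming(a: list, b: list) -> int:
    """Number of pixel positions where two images differ."""
    return sum(p != q for p, q in zip(a, b))


def classify(image: list) -> str:
    """Return the label of the closest training image (1-nearest neighbor)."""
    best_label, best_dist = None, float("inf")
    for label, examples in training.items():
        for example in examples:
            d = hamming(image, example)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label
```

An incomplete stroke such as `[0, 1, 0, 0, 1, 0, 0, 0, 0]` is still classified as `"vertical"` because the training set includes a vertical line in that column. If the training data only showed lines in one position, shifted inputs would be much easier to misclassify.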
A more advanced computer vision project is neural style transfer. In its original formulation, the technique uses a pretrained convolutional neural network to separate the content of one image from the style of another; related image-to-image translation methods use Generative Adversarial Networks instead. Neural style transfer involves taking two images -- a content image and a style reference image -- and putting them together such that the output image transforms the content image in the style of the reference image. For example, when you combine a photo of the Eiffel Tower with a painting by Monet, you’ll create a rendering of the Eiffel Tower “painted” in Monet’s style.
Images are widely available through Google Images and photo sharing sites like Flickr or Pinterest, but you’ll have to label and annotate them yourself. Datasets like LabelMe, created by the MIT Computer Science and Artificial Intelligence Laboratory, and the Columbia University Image Library contain thousands of images that have already been annotated for machine learning purposes.
Any of your learning projects can become part of your data science portfolio. With open-source data and courses on different programming languages available for free, all you really need is some good ideas and the time to make them work. If you consistently complete one project per month and increase the complexity as you go, you will steadily build the kind of portfolio that helps you earn an offer from a top company. We hope these data science project ideas were helpful to get you started!