Predicting if OkCupid Users Want Children

By: Malcom Calivar

Published: July 9, 2021

Project description

Having children is something that people usually consider very carefully. Circumstances can push us to one side of the fence or the other, but for those who don't have the responsibilities of parenthood thrust upon them, there are many factors that can influence which way they'll lean. Additionally, what some people consider important before having children may not be important to others.

This project seeks to use features that are commonly found in dating profiles to make predictions on whether or not a user would want children.

Necessary libraries

pandas is used for data exploration and manipulation, and Matplotlib and Seaborn are used to create visualizations. scikit-learn is used to train machine learning models on our data. For a full list of libraries, please view the notebook included in the source code.

Exploring the Data

The first step is visualizing some of our data in order to see what our future analysis and observations will include. After all, it's all well and good to jump right into training machine learning models and have accurate predictions, but who are we making predictions about? Does our dataset include mostly men aged 50+ in Europe? Let's explore the data and find out.

One interesting fact that we notice right away by exploring the data is that the column indicating whether someone wants children or not has about 35,000 missing values. If our machine learning model ends up being effective at predicting, we could use it to fill in those missing rows! Our model could tell a potential partner whether that person is likely to want children, even if they haven't filled that information in.
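To make the missing-value check concrete, here is a minimal sketch of how it could be done with pandas. The toy table and the column name `wants_kids` are assumptions made up for illustration; the real dataset and the notebook may use different names and values.

```python
import pandas as pd

# Toy stand-in for the profiles table; "wants_kids" is a hypothetical column name.
profiles = pd.DataFrame({
    "age": [24, 31, 45, 29, 38],
    "wants_kids": ["yes", None, "no", None, "yes"],
})

# Count the rows where the answer is missing and the rows where it is known.
n_missing = int(profiles["wants_kids"].isna().sum())
n_labeled = int(profiles["wants_kids"].notna().sum())

# Rows with a known answer can train a model;
# rows without one are the candidates for prediction later on.
labeled = profiles[profiles["wants_kids"].notna()]
unlabeled = profiles[profiles["wants_kids"].isna()]

print(f"missing: {n_missing}, labeled: {n_labeled}")
```

Splitting the table this way early on keeps the rows we can learn from separate from the rows we would eventually like to fill in.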

Additionally, the data required some cleaning. More details are included in the notebook.

Visualizing our data

This right-skewed histogram shows that most people on OkCupid are younger than 40. 

Most people in this dataset are straight.

Surprisingly, most people in this dataset identify as agnostic, "other," or atheist. 

Most people appear to be located in the United States, and particularly, in California. 

Identifying features

Which columns could we use to make our prediction? Sometimes people want to wait until they're at a certain point in their education or career before having children. Maybe at a certain age the desire for having children increases or falls. Could religion influence this decision? 

In the end, these are the features I decided to use to train a logistic regression model and K-Nearest Neighbor classifier on:

  • Education
  • Job
  • Whether the person drinks or not
  • Whether the person smokes or not
  • Whether the person uses drugs or not
  • Orientation
  • Religion
  • Age

Findings

The logistic regression model had about 79% accuracy predicting whether someone would want a child or not based on these features. The KNN classifier's accuracy was about 72%. That's pretty good! I believe we could use the logistic regression model to fill in a lot of the missing values in the rest of the dataset. 
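Filling in the missing answers with a trained model could look something like the sketch below. The column names (`wants_kids`, `wants_kids_filled`, `was_imputed`) and the tiny table are hypothetical, and only two features are used to keep the example short. Predictions go into a separate column so that self-reported answers and model guesses never get mixed up.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy table with two missing answers; all names and values are illustrative.
profiles = pd.DataFrame({
    "education": ["phd", "college", "dropout", "dropout",
                  "phd", "dropout", "college", "college"],
    "drugs": ["never", "never", "often", "often",
              "never", "often", "never", "often"],
    "wants_kids": ["yes", "yes", "no", "no", None, None, "yes", "no"],
})
features = ["education", "drugs"]
known = profiles["wants_kids"].notna()

# Train only on the rows where the answer was actually reported.
model = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                      LogisticRegression())
model.fit(profiles.loc[known, features], profiles.loc[known, "wants_kids"])

# Copy the reported answers, then fill the gaps with predictions.
profiles["wants_kids_filled"] = profiles["wants_kids"]
profiles.loc[~known, "wants_kids_filled"] = model.predict(
    profiles.loc[~known, features]
)
# Flag imputed rows so they can be told apart from real answers later.
profiles["was_imputed"] = ~known

print(profiles)
```

Keeping the `was_imputed` flag is a design choice: a predicted answer is a guess, and anything built on top of this column should be able to tell the two apart.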

What features affected this potential decision? All the coefficients are available in the notebook if you'd like to check them each individually. In short, people who graduated from high-level education programs such as a PhD or med school were more likely to want children. Religion and little to no drug use were also contributing factors. Higher age, drug use, and dropping out of university programs meant the person was less likely to want children.

Further research

Naturally, these models can be improved. There are factors outside of personal choice (such as family pressure or culture) that can influence whether someone wants a child or not. Additionally, this dataset mostly represents individuals in the United States. How would these factors apply to people in other countries? Could religion be a more important factor in Latin American countries? Could education and job be less of a concern in countries with socialized health care programs?

The most direct way to keep improving these classifiers and their predictions is to find more data, and more varied data!