Scraping Celebrity Marriage Data

By: Malcom Calivar

Published: Sept. 2, 2021

Project description

This project uses the BeautifulSoup Python library to scrape data from celebrity pages on Wikipedia. A set of functions then extracts specific information about each celebrity. The data is collected for use in a separate project, which will analyze marriage duration and divorce rates across different types of celebrities. The functions scrape the following information:

  • The person's full marriage data (past partners, current partner)
  • The person's occupation (actor, politician, musician)
  • Each partner's Wikipedia page (if it exists), to obtain that partner's information
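
As a rough illustration of this kind of extraction, here is a minimal sketch using BeautifulSoup on a hand-written HTML snippet. The markup, the class names, and the helper get_infobox_cell are assumptions made for the example; they are not taken from the project's code, and real Wikipedia infoboxes are messier.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down infobox markup (not copied from a real page).
SAMPLE_HTML = """
<table class="infobox">
  <tr><th class="infobox-label">Occupation</th>
      <td class="infobox-data">Actor, producer</td></tr>
  <tr><th class="infobox-label">Spouse(s)</th>
      <td class="infobox-data"><a href="/wiki/Jane_Doe">Jane Doe</a> (m. 1999; div. 2007)</td></tr>
</table>
"""


def get_infobox_cell(soup, label):
    """Return the data cell whose row header starts with `label`, or None."""
    infobox = soup.find("table", class_="infobox")
    if infobox is None:
        return None
    for row in infobox.find_all("tr"):
        header, cell = row.find("th"), row.find("td")
        if header and cell and header.get_text(strip=True).startswith(label):
            return cell
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
spouse_cell = get_infobox_cell(soup, "Spouse")
occupation_cell = get_infobox_cell(soup, "Occupation")

# Raw marriage text, to be parsed further in a later step.
spouse_text = spouse_cell.get_text(" ", strip=True)
# Links to partner pages, if any, so the partner's own page can be scraped.
partner_links = [a["href"] for a in spouse_cell.find_all("a", href=True)]
occupation_text = occupation_cell.get_text(" ", strip=True)
```

Matching the header with startswith lets the same lookup handle both "Spouse" and "Spouse(s)" labels, and a missing field simply comes back as None.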

Once the data has been scraped, it's organized by a different set of functions into a marriage dictionary. Finally, all the information is moved to a DataFrame so it can be easily exported into a .CSV file.

Included celebrities

Naturally, it would be an enormous undertaking to include every celebrity or famous person ever. To narrow the pool of candidates to scrape, a few restrictions were applied.

These restrictions matter because they keep the focus on contemporary celebrities and politicians. We want to know whether trends can be detected in a pool of current politicians and actors/artists who are alive today.

Organizing the dataframe

Each row represents one observation, that is, one marriage. The features created after parsing the data are as follows:

  • The person's name
  • The person's occupation (Actor/Actress, Politician, Musician)
  • The marriage number
  • The partner's name
  • Duration of the marriage in years ("0" indicates the marriage lasted less than one year)
  • Whether the couple is currently married
  • The partner's occupation (if Wikipedia page is found)
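
To make the row layout concrete, here is a minimal sketch that flattens a nested marriage dictionary into one row per marriage and exports it with pandas. The dictionary shape, the column names, the sample people, and the whole-year duration rule are all assumptions for illustration; the project's actual structures may differ.

```python
import pandas as pd

# Made-up marriage dictionary; names and years are not from the dataset.
marriages = {
    "Alex Example": {
        "occupation": "Actor/Actress",
        "marriages": [
            {"partner": "Jane Doe", "start": 1999, "end": 2007,
             "partner_occupation": "Musician"},
            {"partner": "Sam Roe", "start": 2010, "end": None,
             "partner_occupation": None},  # no Wikipedia page found
        ],
    },
}


def to_rows(marriage_dict, current_year):
    """Flatten the nested dictionary into one row (dict) per marriage."""
    rows = []
    for name, info in marriage_dict.items():
        for number, m in enumerate(info["marriages"], start=1):
            ongoing = m["end"] is None
            end = current_year if ongoing else m["end"]
            rows.append({
                "name": name,
                "occupation": info["occupation"],
                "marriage_number": number,
                "partner": m["partner"],
                # Whole-year difference (assumed rule): 0 means under a year.
                "duration_years": end - m["start"],
                "currently_married": ongoing,
                "partner_occupation": m["partner_occupation"],
            })
    return rows


df = pd.DataFrame(to_rows(marriages, current_year=2021))
csv_text = df.to_csv(index=False)  # pass a file path instead to write to disk
```

Building a list of plain dictionaries first keeps the scraping logic independent of pandas; the DataFrame is only created at the end, right before export.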

Results and observations

The end result is a .CSV file with 940 rows, which will be explored and visualized in a separate project.

A few interesting observations came up while scraping this data. Wikipedia's biographical information is fairly consistent, but some pages had to be parsed slightly differently. For instance, many people's marriage information follows one particular format, Name (m. 1999; div. 2007), while on other pages the format differed enough to cause issues (a comma or a slash as the separator instead of a semicolon, the word "divorced" or "separated" instead of the abbreviation, etc.).
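
One way to tolerate these variations is a single regular expression that accepts several separators and end-of-marriage keywords. The sketch below is an assumption about how this could be handled, not the project's actual parser; the pattern, the function parse_marriages, and the sample names are hypothetical, and it only covers the variants mentioned above.

```python
import re

MARRIAGE_RE = re.compile(
    r"""
    (?P<name>[^()]+?)\s*                       # partner name before the "("
    \(\s*(?:m\.|married)\s*(?P<start>\d{4})    # "m. 1999" or "married 1999"
    (?:\s*[;,/]\s*                             # separator: ; or , or /
       (?:div\.|divorced|separated)\s*         # abbreviation or full word
       (?P<end>\d{4}))?                        # end year is optional
    \s*\)
    """,
    re.VERBOSE | re.IGNORECASE,
)


def parse_marriages(text):
    """Return one dict per marriage found in an infobox spouse string."""
    results = []
    for match in MARRIAGE_RE.finditer(text):
        end = match.group("end")
        results.append({
            "partner": match.group("name").strip(" ,;"),
            "start": int(match.group("start")),
            "end": int(end) if end else None,  # None: no end year listed
        })
    return results


examples = [
    "Jane Doe (m. 1999; div. 2007)",
    "Jane Doe (m. 1999, divorced 2007)",
    "Jane Doe (m. 1999 / separated 2007)",
    "Sam Roe (m. 2010)",
]
parsed = [parse_marriages(text) for text in examples]
```

Entries that match none of the accepted variants are silently skipped here; logging them instead would make it easier to spot formats the pattern does not yet cover.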

Occupation data posed a similar problem. Some people's occupations appear as a single list item separated by commas, while others have each occupation as a separate list item.
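
Both layouts can be reduced to the same flat list by splitting every list item on separators. This is a minimal sketch under the assumption that the list items have already been extracted as strings; normalize_occupations is a hypothetical helper, not a function from the project.

```python
import re


def normalize_occupations(list_items):
    """Flatten occupation list items into one de-duplicated, lowercase list."""
    occupations = []
    for item in list_items:
        # A single item may hold several occupations separated by , or ;
        for part in re.split(r"[,;]", item):
            cleaned = part.strip().lower()
            if cleaned and cleaned not in occupations:
                occupations.append(cleaned)
    return occupations


# One comma-separated list item vs. one item per occupation.
single_item = normalize_occupations(["Actor, producer, director"])
separate_items = normalize_occupations(["Actor", "Producer", "Director"])
```

With this normalization in place, the code that maps occupations to categories only has to handle one shape of input.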

Another interesting observation involved the rare case in which a celebrity's or politician's partner has a name similar enough to that of an unrelated celebrity. For example, politician Nancy Mace was married to a man named Curtis Jackson. As a result, the script looked for a "Curtis Jackson" on Wikipedia, found the artist 50 Cent, and listed Mace's former partner as a "Musician." However, the Curtis Jackson in question has no relation to the artist of the same name. There may be other undetected cases of this behavior.

Lessons learned

If I had to do it over, I might change a few things. I might organize the entire biographical data in a dictionary from the get-go (instead of just the marriage data), using Wikipedia's labels as keys and placing the values in a list. That could make cleaning the data and iterating through it a little easier. It could also help prevent some of the issues I faced with inconsistencies between pages.
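
A minimal sketch of what that dictionary-first approach could look like, again on a hand-written snippet. The markup and the function infobox_to_dict are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical infobox markup (not copied from a real page).
SAMPLE_HTML = """
<table class="infobox">
  <tr><th>Born</th><td>January 1, 1970</td></tr>
  <tr><th>Occupation</th><td><ul><li>Actor</li><li>Producer</li></ul></td></tr>
  <tr><th>Spouse(s)</th><td>Jane Doe (m. 1999; div. 2007)<br>Sam Roe (m. 2010)</td></tr>
</table>
"""


def infobox_to_dict(html):
    """Map each infobox label to a list of the text values in its cell."""
    soup = BeautifulSoup(html, "html.parser")
    infobox = soup.find("table", class_="infobox")
    data = {}
    if infobox is None:
        return data
    for row in infobox.find_all("tr"):
        header, cell = row.find("th"), row.find("td")
        if header is None or cell is None:
            continue  # skip title rows, images, and other non-data rows
        label = header.get_text(" ", strip=True)
        # stripped_strings yields each text fragment separately, so list
        # items and <br>-separated entries become individual values.
        data[label] = list(cell.stripped_strings)
    return data


bio = infobox_to_dict(SAMPLE_HTML)
```

Every field then has the same shape (a list of strings), whether the page used a bulleted list or line breaks. One caveat: a cell that mixes links and plain text yields several fragments per entry, so some rejoining would still be needed.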

Overall, I greatly enjoyed the project and I can't wait to start actually exploring and visualizing the data!