Cleaning US Census Data

By: Malcom Calivar

Published: June 29, 2021

Project information

This project takes US Census data spread across ten CSV files, combines it into a single dataframe, and cleans it. Much of the data in this dataset is easy for a person to read because it includes percent signs, dollar signs, and other notation alongside the numbers, indicating what each column represents.

Unfortunately, this formatting makes it difficult to perform any kind of data analysis or visualization that requires aggregate functions or summary statistics.

Necessary libraries

Along with our standard analysis fare, we will be using Python's glob module. Glob lets us collect multiple files into a list we can iterate through to build one dataframe containing all the data.
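A minimal sketch of that step (the file-name pattern and column names here are assumptions for illustration, and the sample CSVs are generated on the fly so the snippet is self-contained):

```python
import glob
import os
import tempfile

import pandas as pd

# Write a few small sample CSV files to stand in for the census data.
tmp = tempfile.mkdtemp()
for i in range(3):
    pd.DataFrame({"State": [f"State{i}"], "TotalPop": [1000 + i]}).to_csv(
        os.path.join(tmp, f"states{i}.csv")
    )

# Gather every matching file into a list, then concatenate them
# into one dataframe with a fresh, continuous index.
files = glob.glob(os.path.join(tmp, "states*.csv"))
us_census = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

print(len(us_census))  # one row per sample file
```

In the real project, the pattern would point at the ten census CSVs instead of a temp directory.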

Problematic columns

  • The race-breakdown columns contain a percent sign that must be removed to create numeric values (currently object type)
  • Income contains a dollar sign that must be removed to create numeric values (currently object type)
  • Gender population data is all in one column; it can be separated and then further cleaned to create numeric values
  • An extra column, 'Unnamed: 0', serves no real purpose after resetting the index earlier
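The stray index column is the easiest of the four to deal with. A sketch, assuming the combined dataframe is named us_census:

```python
import pandas as pd

# Sample frame with the leftover index column that pandas creates
# when a CSV was saved with its index (values are illustrative).
us_census = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "State": ["Alabama", "Alaska"],
})

# Drop the 'Unnamed: 0' column; it duplicates the index.
us_census = us_census.drop(columns=["Unnamed: 0"])

print(list(us_census.columns))  # the stray column is gone
```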

How the data was cleaned

Primarily, regex was used to remove problematic characters such as dollar and percent signs. The GenderPop column is a special case: it holds two numbers separated by an underscore, giving the male and female populations (e.g. 300000M_300000F). That string was split at the underscore before removing the final character from each half.
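These steps can be sketched as follows (column names and sample values are assumptions for illustration; only one race column is shown, but the same replace applies to each):

```python
import pandas as pd

# Sample rows mimicking the formats described above.
us_census = pd.DataFrame({
    "Income": ["$43,296.36", "$70,354.74"],
    "Hispanic": ["3.75%", "5.91%"],
    "GenderPop": ["2341093M_2489527F", "353563M_349215F"],
})

# Strip the dollar sign (and thousands separators) from Income.
us_census["Income"] = us_census["Income"].replace(r"[\$,]", "", regex=True)

# Strip the percent sign from a race-breakdown column.
us_census["Hispanic"] = us_census["Hispanic"].replace("%", "", regex=True)

# Split GenderPop at the underscore into two columns, then drop
# the trailing M/F letter from each half.
gender_split = us_census["GenderPop"].str.split("_", expand=True)
us_census["Men"] = gender_split[0].str[:-1]
us_census["Women"] = gender_split[1].str[:-1]

print(us_census[["Income", "Hispanic", "Men", "Women"]].iloc[0].tolist())
```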

Finally, since the entire purpose of cleaning the data was to make it usable for calculations, every cleaned column had its data type changed to a numeric type. Originally, these were string (or "object" in pandas) columns.
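The conversion itself is a one-liner per column with pd.to_numeric (column names here are assumptions for illustration):

```python
import pandas as pd

# Cleaned string columns, still object dtype.
us_census = pd.DataFrame({
    "Income": ["43296.36", "70354.74"],
    "Men": ["2341093", "353563"],
})

# Convert each cleaned column from object (string) to a numeric dtype;
# pd.to_numeric infers integer or float as appropriate.
for col in ["Income", "Men"]:
    us_census[col] = pd.to_numeric(us_census[col])

print(us_census.dtypes)
```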

Final steps

In total, seven rows were missing information. Three rows were missing the female population, which we calculated by subtracting the male population from the total population. The remaining four missing observations were in the Pacific column; for those, the percentages for all the other races were summed and subtracted from 100.
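Both fills can be expressed with fillna, which only touches the rows that are actually missing (column names and sample values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Sample rows containing both kinds of gap described above.
us_census = pd.DataFrame({
    "TotalPop": [100.0, 200.0],
    "Men":      [60.0,  90.0],
    "Women":    [40.0,  np.nan],
    "Hispanic": [10.0,  20.0],
    "White":    [50.0,  40.0],
    "Black":    [20.0,  15.0],
    "Native":   [5.0,   5.0],
    "Asian":    [5.0,   10.0],
    "Pacific":  [10.0,  np.nan],
})

# Missing female population = total population minus male population.
us_census["Women"] = us_census["Women"].fillna(
    us_census["TotalPop"] - us_census["Men"]
)

# Missing Pacific share = 100 minus the sum of the other race columns.
other = ["Hispanic", "White", "Black", "Native", "Asian"]
us_census["Pacific"] = us_census["Pacific"].fillna(
    100 - us_census[other].sum(axis=1)
)

print(us_census[["Women", "Pacific"]].iloc[1].tolist())
```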

In the included notebook, several visualizations were created using the now-numeric data.

Data cleaning can often be seen as a chore, and data analysts and scientists often want to jump straight into what the data can tell them and how it can be used. We may even be tempted to work exclusively with data that's already tidy or easy to clean, but that removes the opportunity to work with a lot of potentially useful data. In the real world, most data requires some sort of processing or cleaning, making this an important skill that can't be overlooked.