Data Science Dojo
Copyright (c) 2016 - 2019


Level: Intermediate
Recommended Use: Classification Models
Domain: Social

Census Income Data Set

Estimate whether a person’s income exceeds $50K/year:

This intermediate-level data set was extracted from the Census Bureau database. It contains 48,842 instances (train = 32,561, test = 16,281) with a mix of continuous and discrete attributes. There are 15 attributes, including age, sex, education level, and other relevant details about a person. The data set will help you improve your skills in Exploratory Data Analysis, Data Wrangling, Data Visualization, and Classification Models. Feel free to explore it with multiple supervised and unsupervised learning techniques. The following data dictionary gives more details on this data set:

Data Dictionary:

Column Position | Attribute Name | Definition | Data Type | Example | % Null
1 | age | Age (years) | Quantitative | 38, 42, 71 | 0
2 | workclass | Workclass, 8 categories: (Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked) | Qualitative | "Private", "Local-gov", "Never-worked" | 6
3 | fnlwgt | Final Weight* | Quantitative | 83311, 338409 | 0
4 | education | Education: (Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool) | Qualitative | "Bachelors", "9th", "Preschool" | 0
5 | education-num | Years of education | Quantitative | 13, 9, 7 | 0
6 | marital-status | Marital Status: (Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse) | Qualitative | "Divorced", "Separated", "Widowed" | 0
7 | occupation | Occupation: (Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces) | Qualitative | "Tech-support", "Armed-Forces", "Sales" | 6
8 | relationship | Relationship: (Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried) | Qualitative | "Wife", "Unmarried", "Own-child" | 0
9 | race | Race: (White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black) | Qualitative | "White", "Asian-Pac-Islander", "Other" | 0
10 | sex | Sex: (Male, Female) | Qualitative | "Male", "Female" | 0
11 | capital-gain | Amount of capital gained | Quantitative | 14084, 0, 5178 | 0
12 | capital-loss | Amount of capital lost | Quantitative | 0, 2042, 1902 | 0
13 | hours-per-week | Number of hours worked per week | Quantitative | 40, 50, 70 | 0
14 | native-country | Native country: (United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands) | Qualitative | "China", "Italy", "Vietnam" | 2
15 | income | Whether the income is greater than $50,000 or less than or equal to $50,000: (>50K, <=50K) | Qualitative | ">50K", "<=50K" | 0
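
The data dictionary above can be applied when reading the raw file. The following is a minimal sketch, not an official loader: it assumes comma-separated rows in the column order listed above and assumes missing values are coded as "?" (check your copy of the file). The helper names load_rows and null_percentages are hypothetical, and the three sample rows are made up for illustration, not real records.

```python
import csv
import io

# Column names in the order given by the data dictionary.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]
# Columns the data dictionary marks as Quantitative.
NUMERIC = {"age", "fnlwgt", "education-num",
           "capital-gain", "capital-loss", "hours-per-week"}
MISSING = "?"  # assumption: marker used for missing values in the raw file

# Made-up rows for illustration only (not real records).
SAMPLE = """\
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
42, ?, 120000, Bachelors, 13, Married-civ-spouse, ?, Husband, White, Male, 5178, 0, 50, ?, >50K
71, Local-gov, 338409, Masters, 14, Widowed, Prof-specialty, Unmarried, Black, Female, 0, 2042, 40, Cuba, >50K
"""


def load_rows(text_file):
    """Parse rows into dicts, converting numbers and mapping '?' to None."""
    rows = []
    for fields in csv.reader(text_file, skipinitialspace=True):
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row = {}
        for name, value in zip(COLUMNS, fields):
            value = value.strip()
            if value == MISSING:
                row[name] = None
            elif name in NUMERIC:
                row[name] = int(value)
            else:
                row[name] = value
        rows.append(row)
    return rows


def null_percentages(rows):
    """Return the percentage of missing values per column."""
    total = len(rows)
    return {
        name: 100.0 * sum(r[name] is None for r in rows) / total
        for name in COLUMNS
    }


rows = load_rows(io.StringIO(SAMPLE))
pct = null_percentages(rows)
```

To use this on the real data, pass an open file handle to load_rows in place of the io.StringIO sample; the resulting percentages can then be compared with the % Null column of the data dictionary.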

*Description of fnlwgt (final weight):

The weights on the CPS files are controlled to independent estimates of the civilian noninstitutional population of the US. These are prepared monthly for us by Population Division here at the Census Bureau. We use 3 sets of controls.

These are:

  1. A single cell estimate of the population 16+ for each state.
  2. Controls for Hispanic Origin by age and sex.
  3. Controls by Race, age and sex.

We use all three sets of controls in our weighting program and "rake" through them 6 times so that by the end we come back to all the controls we used.
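
The "raking" step described above can be illustrated with a simplified sketch. This is not the Census Bureau's weighting program: it uses only two sets of controls (row and column totals of a small table) instead of three, the function name rake is hypothetical, and all numbers are made up. Each pass rescales the weights so that they match one set of control totals, then the other; repeating the passes brings the table close to both sets at once.

```python
def rake(weights, row_targets, col_targets, passes=6):
    """Iteratively rescale a 2-D table of weights toward row and column totals.

    Simplified illustration with two sets of controls; passes=6 mirrors the
    six passes mentioned in the text.
    """
    w = [row[:] for row in weights]  # work on a copy
    n_rows = len(w)
    for _ in range(passes):
        # Step 1: scale each row so it sums to its control total.
        for i, target in enumerate(row_targets):
            total = sum(w[i])
            if total > 0:
                factor = target / total
                w[i] = [x * factor for x in w[i]]
        # Step 2: scale each column so it sums to its control total.
        for j, target in enumerate(col_targets):
            total = sum(w[i][j] for i in range(n_rows))
            if total > 0:
                factor = target / total
                for i in range(n_rows):
                    w[i][j] *= factor
    return w


# Made-up starting weights and control totals (both sets sum to 100).
start = [[10.0, 20.0],
         [30.0, 40.0]]
row_targets = [40.0, 60.0]
col_targets = [45.0, 55.0]

raked = rake(start, row_targets, col_targets)
row_sums = [sum(row) for row in raked]
col_sums = [sum(raked[i][j] for i in range(2)) for j in range(2)]
```

Because the column step runs last, the column sums match their targets exactly after every pass, while the row sums approach theirs as the passes repeat.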

The term estimate refers to population totals derived from CPS by creating "weighted tallies" of any specified socio-economic characteristics of the population.
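
A "weighted tally" can be sketched as a sum of fnlwgt over the records that share a characteristic. The example below is a minimal illustration under stated assumptions: the helper name weighted_tally is hypothetical, and the three records are made up rather than taken from the data set.

```python
def weighted_tally(records, predicate, weight_key="fnlwgt"):
    """Sum the weights of all records for which predicate(record) is true."""
    return sum(r[weight_key] for r in records if predicate(r))


# Made-up records for illustration only.
records = [
    {"sex": "Female", "income": ">50K", "fnlwgt": 120000},
    {"sex": "Male", "income": "<=50K", "fnlwgt": 83311},
    {"sex": "Female", "income": "<=50K", "fnlwgt": 338409},
]

# Estimated population total for one characteristic.
female_total = weighted_tally(records, lambda r: r["sex"] == "Female")

# Weighted share of records with income above $50K.
all_total = weighted_tally(records, lambda r: True)
high_income_share = (
    weighted_tally(records, lambda r: r["income"] == ">50K") / all_total
)
```

An unweighted count would treat every record equally; the weighted version lets each record stand in for the number of people it is taken to represent.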

People with similar demographic characteristics should have similar weights. There is one important caveat to remember about this statement: because the CPS sample is actually a collection of 51 state samples, each with its own probability of selection, the statement applies only within a state.

Acknowledgement:

This data set was sourced from the Census Income Data Set in the Machine Learning Repository of the University of California, Irvine (UC Irvine). The UCI page cites the US Census Bureau as the original source of the data set.