Commit cb2e71b5 by Rahim Rasool

Update README.md

parent 84a4cc34
# Census Income Data Set
### Introduction:
This data was extracted from the Census Bureau database found at http://www.census.gov/ftp/pub/DES/www/welcome.html. It is also known as the Adult dataset.
The dataset is primarily used to predict whether a person's income exceeds $50K/yr based on census data. The income variable, which is the variable to be predicted, has 2 categories: '>50K' or '<=50K'.
The dataset has 48842 instances with a mix of continuous and discrete attributes (train=32561, test=16281). Removing instances with unknown values leaves 45222 instances (train=30162, test=15060). There are 6 duplicate or conflicting instances.
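
As a minimal sketch of how instances with unknown values could be removed, the snippet below parses a few illustrative rows and drops any row containing the '?' placeholder. The file layout (no header row, fields separated by a comma and a space) is an assumption, the sample rows are invented for illustration and are not real records, and `read_rows` and `drop_unknown` are hypothetical helper names.

```python
import csv
import io

# Column names taken from the data dictionary in this README.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

# Illustrative rows only (not real records), in the assumed layout:
# no header line, fields separated by ", ".
SAMPLE = """\
38, Private, 83311, Bachelors, 13, Divorced, Sales, Unmarried, White, Female, 0, 0, 40, China, <=50K
42, ?, 338409, 9th, 9, Widowed, ?, Own-child, Other, Male, 0, 2042, 50, Italy, >50K
71, Local-gov, 83311, Preschool, 7, Separated, Tech-support, Wife, White, Female, 5178, 0, 70, Vietnam, <=50K
"""


def read_rows(text):
    """Parse comma-separated records into dicts keyed by column name."""
    # skipinitialspace drops the blank that follows each comma.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(COLUMNS, row)) for row in reader if row]


def drop_unknown(rows, marker="?"):
    """Keep only rows in which no field equals the unknown-value marker."""
    return [row for row in rows if marker not in row.values()]


rows = read_rows(SAMPLE)
complete = drop_unknown(rows)
print(len(rows), len(complete))
```

Applied to the full train and test files, the same filter would be expected to reproduce the reduced counts quoted above, provided the files follow the assumed layout.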
### Data Dictionary:
| Column Position | Attribute Name | Definition | Data Type | Example | Null Ratio (%) |
|---|---|---|---|---|---|
| 1 | age | Age of the person. It is a continuous variable. | Quantitative | 38, 42, 71 | 0 |
| 2 | workclass | The workclass attribute has 8 categories: [Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked]. A large number of instances have '?' in this variable, which indicates a null value. | Qualitative | "Private", "Local-gov", "Never-worked" | 6 |
| 3 | fnlwgt | This is the Final Weight attribute, which is continuous. Its description follows. The weights on the CPS files are controlled to independent estimates of the civilian noninstitutional population of the US. These are prepared monthly for us by Population Division here at the Census Bureau. We use 3 sets of controls. These are: 1. A single cell estimate of the population 16+ for each state. 2. Controls for Hispanic Origin by age and sex. 3. Controls by Race, age and sex. We use all three sets of controls in our weighting program and "rake" through them 6 times so that by the end we come back to all the controls we used. The term estimate refers to population totals derived from CPS by creating "weighted tallies" of any specified socio-economic characteristics of the population. People with similar demographic characteristics should have similar weights. There is one important caveat to remember about this statement. That is that since the CPS sample is actually a collection of 51 state samples, each with its own probability of selection, the statement only applies within state. | Quantitative | 83311, 338409 | 0 |
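| 4 | education | The highest level of education attained by each individual. It has 16 categories: [Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool] | Qualitative | "Bachelors", "9th", "Preschool" | 0 |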
| 5 | education-num | Continuous variable which describes the number of years of education acquired by each individual. | Quantitative | 13, 9, 7 | 0 |
| 6 | marital-status | The marital status of each individual. It contains the following 7 categories: [Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse] | Qualitative | "Divorced", "Separated", "Widowed" | 0 |
| 7 | occupation | The occupation attribute has 14 unique categories: [Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces]. It also has null values marked as '?'. | Qualitative | "Tech-support", "Armed-Forces", "Sales" | 6 |
| 8 | relationship | The relationship attribute lists the 6 kinds of relationship a person may have: [Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried] | Qualitative | "Wife", "Unmarried", "Own-child" | 0 |
| 9 | race | The race attribute has 5 unique values: [White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black] | Qualitative | "White", "Asian-Pac-Islander", "Other" | 0 |
| 10 | sex | The sex attribute has 2 unique values: [Male, Female] | Qualitative | "Male", "Female" | 0 |
| 11 | capital-gain | A continuous variable describing the amount of capital gained | Quantitative | 14084, 0, 5178 | 0 |
| 12 | capital-loss | A continuous variable describing the amount of capital lost | Quantitative | 0, 2042, 1902 | 0 |
| 13 | hours-per-week | A continuous variable describing the number of hours worked per week by each person. | Quantitative | 40, 50, 70 | 0 |
| 14 | native-country | This attribute contains the native country of the person. It has 41 unique values, listed here as they are spelled in the data: [United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands]. There are also null values marked as '?'. | Qualitative | "China", "Italy", "Vietnam" | 2 |
| 15 | income | This variable has only 2 possible values: ">50K" if the income is greater than $50,000, or "<=50K" if it is less than or equal to $50,000. It is the target variable to be predicted from the other attributes. | Qualitative | ">50K", "<=50K" | 0 |
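
The null ratios in the last column and the two-valued target can be illustrated with a short sketch. It assumes records have already been parsed into dictionaries keyed by attribute name. The records below are invented for illustration and show only three of the 15 columns, and `null_ratios` and `encode_income` are hypothetical helper names.

```python
# Hypothetical parsed records (not real data); only three columns are shown.
records = [
    {"workclass": "Private", "occupation": "Sales", "income": ">50K"},
    {"workclass": "?", "occupation": "?", "income": "<=50K"},
    {"workclass": "Local-gov", "occupation": "Tech-support", "income": "<=50K"},
    {"workclass": "Private", "occupation": "?", "income": ">50K"},
]


def null_ratios(rows, marker="?"):
    """Return, per column, the percentage of rows whose value equals marker."""
    columns = rows[0].keys()
    return {
        col: 100.0 * sum(row[col] == marker for row in rows) / len(rows)
        for col in columns
    }


def encode_income(value):
    """Map the income label to 1 for '>50K' and 0 for '<=50K'."""
    labels = {">50K": 1, "<=50K": 0}
    return labels[value.strip()]


ratios = null_ratios(records)
targets = [encode_income(row["income"]) for row in records]
print(ratios)
print(targets)
```

Encoding the target as 0/1 is one common choice for binary classifiers. The original string labels can be kept instead if the modelling library accepts them.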
### Source:
Asuncion, A. & Newman, D.J. (2007). UCI Machine Learning Repository [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, School of Information and Computer Science.