Data Science Dojo
Copyright (c) 2016 - 2019


Level: Beginner
Recommended Use: Regression/Classification Models
Domain: Healthcare/Life

Fertility Data Set

Predict seminal quality of an individual


This beginner-level data set has 100 rows and 10 columns. It includes semen samples from 100 volunteers, analyzed according to the WHO 2010 criteria. The data set can be used to determine whether a diagnosis can be reached without a laboratory approach, which involves expensive tests that are sometimes uncomfortable for patients. Sperm concentration is related to socio-demographic data, environmental factors, health status, and life habits. These attributes can be collected easily using a questionnaire.

This data set is recommended for learning and practicing your skills in exploratory data analysis, data visualization, and regression/classification modelling techniques. Feel free to explore the data set with multiple supervised and unsupervised learning techniques. The following data dictionary gives more details on this data set:


Data Dictionary

Column Position | Attribute Name | Definition | Data Type | Example | % Null Ratios
1 | Season | Season in which the analysis was performed (-1: winter, -0.33: spring, 0.33: summer, 1: fall) | Quantitative | 1, -1, -0.33 | 0
2 | Age | Age at the time of analysis. Age is between 18 and 36, scaled from 0 to 1 | Quantitative | 0.64, 0.78, 1 | 0
3 | Childish Diseases | Childhood diseases, i.e. chicken pox, measles, mumps, polio (0: no, 1: yes) | Quantitative | 1, 0 | 0
4 | Accident or serious trauma | Accident or serious trauma (0: no, 1: yes) | Quantitative | 1, 0 | 0
5 | Surgical intervention | Surgical intervention (0: no, 1: yes) | Quantitative | 1, 0 | 0
6 | High fevers in last year | High fevers in the last year (-1: less than 3 months ago, 0: more than 3 months ago, 1: no fever) | Quantitative | 0, 1, -1 | 0
7 | Frequency of alcohol consumption | Frequency of alcohol consumption in 5 categories, scaled from 0 to 1. The categories, in order: 1) several times a day, 2) every day, 3) several times a week, 4) once a week, 5) hardly ever or never | Quantitative | 0.2, 0.6, 1 | 0
8 | Smoking Habit | Smoking habit (-1: never, 0: occasional, 1: daily) | Quantitative | 0, 1, -1 | 0
9 | Number of hours spent sitting per day | Number of hours spent sitting per day. Between 0 and 16, scaled from 0 to 1 | Quantitative | 0.32, 0.83, 1 | 0
10 | Output | Result of diagnosis (N: Normal, O: Altered) | Qualitative | N, O | 0
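
The coded values in the data dictionary can be mapped back to readable labels before exploration. The following is a minimal Python sketch, not part of the original data set: the function names, the dictionary keys, and the example row are hypothetical, and it assumes that alcohol category k of 5 is stored as k / 5 and that sitting hours were scaled by dividing by 16. Age is left in its scaled form because the dictionary does not state which linear mapping was applied.

```python
# Lookup tables copied from the data dictionary above.
SEASON = {-1.0: "winter", -0.33: "spring", 0.33: "summer", 1.0: "fall"}
YES_NO = {0: "no", 1: "yes"}
FEVER = {-1: "less than 3 months ago", 0: "more than 3 months ago", 1: "no fever"}
SMOKING = {-1: "never", 0: "occasional", 1: "daily"}
ALCOHOL = [
    "several times a day",
    "every day",
    "several times a week",
    "once a week",
    "hardly ever or never",
]
DIAGNOSIS = {"N": "normal", "O": "altered"}


def decode_alcohol(value):
    """Map a scaled alcohol value to its category.

    Assumption: category k (1..5) is encoded as k / 5, i.e. 0.2, 0.4, ..., 1.0.
    """
    index = round(float(value) * 5) - 1
    return ALCOHOL[index]


def decode_row(row):
    """Decode one record given as 10 values in data-dictionary column order."""
    (season, age, disease, accident, surgery,
     fever, alcohol, smoking, sitting, output) = row
    return {
        "season": SEASON[float(season)],
        # Age is kept scaled: the exact rescaling is not given in the dictionary.
        "age_scaled": float(age),
        "childish_diseases": YES_NO[int(float(disease))],
        "accident_or_trauma": YES_NO[int(float(accident))],
        "surgical_intervention": YES_NO[int(float(surgery))],
        "high_fevers": FEVER[int(float(fever))],
        "alcohol": decode_alcohol(alcohol),
        "smoking": SMOKING[int(float(smoking))],
        # Assumption: hours in [0, 16] were divided by 16 to get the scaled value.
        "sitting_hours": round(float(sitting) * 16, 1),
        "diagnosis": DIAGNOSIS[str(output).strip()],
    }


# Hypothetical example row (not taken from the data set), as strings from a CSV reader.
example = ["-0.33", "0.64", "1", "0", "1", "0", "0.6", "-1", "0.5", "N"]
decoded = decode_row(example)
print(decoded["season"], decoded["alcohol"], decoded["sitting_hours"], decoded["diagnosis"])
```

Keeping the lookup tables separate from the decoding function makes it easy to correct a mapping if the scaling assumptions turn out not to match the file you are working with.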

Acknowledgement

This data set has been sourced from the Fertility Data Set in the Machine Learning Repository of the University of California, Irvine (UC Irvine). The UCI page mentions the following publication as the original source of the data set:

David Gil, Jose Luis Girela, Joaquin De Juan, M. Jose Gomez-Torres, and Magnus Johnsson. Predicting seminal quality with artificial intelligence methods.