---
title: "Data Exploration, Visualization, and Feature Engineering using R"
author: "Yuhui Zhang and Raja Iqbal"
mode: standalone
output: pdf_document
framework: flowtime
url:
  lib: /home/yuhui/Copy/YDSDojo/bootcamp/slidify/slidifyExamples/libraries
---

```{r, echo=FALSE}
library(knitr)
hook1 <- function(x){ gsub("```\n*```r*\n*", "", x) }
hook2 <- function(x){ gsub("```\n+```\n", "", x) }
## knit_hooks$set(document = hook2)
```

# Basic plotting systems

1. Base graphics: constructed piecemeal. Conceptually simpler and allows plotting to mirror the thought process.
2. Lattice graphics: entire plots created in a single function call.
3. ggplot2 graphics: an implementation of the Grammar of Graphics by Leland Wilkinson. Combines concepts from both base and lattice graphics. (Requires installing the ggplot2 package.)
4. Fancier and more telling ones. A list of interactive visualizations in R can be found at: http://ouzor.github.io/blog/2014/11/21/interactive-visualizations.html

---

## Base plotting system

```{r, fig.width=6, fig.height=5}
library(datasets)
## scatter plot
plot(x = airquality$Temp, y = airquality$Ozone)
```

***

## Base plotting system

```{r, fig.width=15, fig.height=4.5}
## par() sets global graphics parameters that affect all plots in an R session.
## Type ?par to see all parameters.
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))
with(airquality, {
    plot(Wind, Ozone, main="Ozone and Wind")
    plot(Temp, Ozone, main="Ozone and Temperature")
    mtext("Ozone and Weather in New York City", outer=TRUE)})
```

***

## Plotting functions (high level)

**PHASE ONE: Mount a canvas panel on the easel, and draw the draft.** (Initialize a plot.)

* plot(): one of the most frequently used plotting functions in R.
* boxplot(): a boxplot shows the distribution of a vector. It is very useful for examining the distribution of different variables.
* barplot(): create a bar plot with vertical or horizontal bars.
* hist(): compute a histogram of the given data values.
* pie(): draw a pie chart.

Remember to use ?plot or str(plot), etc. to check the arguments when you want to make more personalized plots.

A more detailed tutorial on the base plotting system: http://bcb.dfci.harvard.edu/~aedin/courses/BiocDec2011/2.Plotting.pdf

***

## Plotting functions (low level)

**PHASE TWO: Add more details on your canvas, and make an artwork.** (Add more to an existing plot.)

* lines: adds lines to a plot, given a vector of x values and a corresponding vector of y values
* points: adds points to the plot
* text: adds text labels to a plot at the specified x,y coordinates
* title: adds annotations: x,y axis labels, title, subtitle, outer margin
* mtext: adds arbitrary text to the margins (inner or outer) of the plot
* axis: specifies axis ticks

***

## Save your artwork

R can generate graphics (of varying levels of quality) on almost any type of display or printing device, for example:

* postscript(): for printing on PostScript printers, or creating PostScript graphics files.
* pdf(): produces a PDF file, which can also be included in PDF documents.
* jpeg(): produces a bitmap JPEG file, best used for image plots.

See help(Devices) for a list of them all.

Simple example:

```{r}
## png(filename = 'plot1.png', width = 480, height = 480, units = 'px')
## plot(x, y)
## dev.off()
```

***

## Example: boxplot and histogram

```{r, fig.width=8, fig.height=4.5}
## the layout
par(mfrow = c(2, 1), mar = c(2, 0, 2, 0), oma = c(0, 0, 0, 0))
## histogram at the top
hist(airquality$Ozone, breaks=12, main = "Histogram of Ozone")
## box plot below for comparison
boxplot(airquality$Ozone, horizontal=TRUE, main = "Box plot of Ozone")
```

---

## Lattice plotting system

```{r, fig.width=15, fig.height=4.5}
library(lattice) # need to load the lattice library
set.seed(10) # set the seed so our plots are the same
x <- rnorm(100)
f <- rep(1:4, each = 25) # first 25 elements are 1, second 25 elements are 2, ...
y <- x + f - f * x + rnorm(100, sd = 0.5)
# first 25 elements are in Group 1, second 25 elements are in Group 2, ...
f <- factor(f, labels = c("Group 1", "Group 2", "Group 3", "Group 4"))
xyplot(y ~ x | f)
```

***

## Lattice plotting system

Want more on the plot? Customize the panel function:

```{r, fig.keep = 'none'}
xyplot(y ~ x | f, panel = function(x, y, ...) {
    # call the default panel function for xyplot
    panel.xyplot(x, y, ...)
    # adds a horizontal line at the median
    panel.abline(h = median(y), lty = 2)
    # overlays a simple linear regression line
    panel.lmline(x, y, col = 2)
})
```

***

## Lattice plotting system

```{r, echo=FALSE}
xyplot(y ~ x | f, panel = function(x, y, ...) {
    # call the default panel function for xyplot
    panel.xyplot(x, y, ...)
    # adds a horizontal line at the median
    panel.abline(h = median(y), lty = 2)
    # overlays a simple linear regression line
    panel.lmline(x, y, col = 2)
})
```

***

## Lattice plotting system

Plotting functions

* xyplot(): main function for creating scatterplots
* bwplot(): box and whiskers plots (box plots)
* histogram(): histograms
* stripplot(): box plot with actual points
* dotplot(): plot dots on "violin strings"
* splom(): scatterplot matrix (like pairs() in base plotting system)
* levelplot()/contourplot(): plotting image data

***

## Very useful when we want a lot...
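
splom(), listed among the lattice functions, draws a whole scatterplot matrix in one call. The following is only a minimal sketch: the choice of the four numeric `iris` columns and the grouping by `Species` are assumptions made for illustration.

```r
library(lattice)

# Scatterplot matrix of the four numeric iris columns, one color per species.
# The column selection and grouping are illustrative choices.
p <- splom(~ iris[1:4], groups = Species, data = iris, auto.key = TRUE)

# lattice functions return a "trellis" object; printing it draws the plot.
print(p)
```

Because lattice returns an object instead of drawing immediately, the plot can be stored, modified with update(), and printed later.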
```{r}
pairs(iris) ## iris is a data set in R
```

---

## ggplot2

* An implementation of the Grammar of Graphics by Leland Wilkinson
* Written by Hadley Wickham (while he was a graduate student at Iowa State)
* A "third" graphics system for R (along with base and lattice)
    * Available from CRAN via install.packages()
    * Web site: http://ggplot2.org (better documentation)
* Grammar of graphics represents the abstraction of graphics ideas/objects
    * Think "verb", "noun", "adjective" for graphics
    * "Shorten" the distance from mind to page
* Two main functions:
    * **qplot()** hides what goes on underneath, which is okay for most operations
    * **ggplot()** is the core function and very flexible for doing things qplot() cannot do

***

## qplot function

* The qplot() function is the analog to plot() but with many built-in features
* Syntax somewhere in between base/lattice
* Difficult to customize (don't bother, use full ggplot2 power in that case)

```{r, fig.width=8, fig.height=3}
library(ggplot2) ## need to install and load this library
qplot(displ, hwy, data = mpg, facets = .~drv)
```

***

## ggplot function

When building plots with ggplot() (rather than qplot()), the "artist's palette" model may be the closest analogy: plots are built up in layers.

* Step I: Input the data

**noun**: the data

```{r}
library(ggplot2) ## need to install and load this library
g <- ggplot(iris, aes(Sepal.Length, Sepal.Width))
## this alone does not draw anything; layers are added next
```

***

## ggplot function

* Step II: Add layers

**adjective**: describe the type of plot you will produce.

```{r, fig.width=12, fig.height=4.5}
g + geom_point() + geom_smooth(method = "lm") + facet_grid(. ~ Species)
```

***

## ggplot function

* Step III: Add metadata and annotation

**adjective**: control the mapping between data and aesthetics.

```{r, fig.width=12, fig.height=4.5}
g <- g + geom_point() + geom_smooth(method = "lm") + facet_grid(. ~ Species)
g + ggtitle("Sepal length vs. width for different species") + theme_bw() ## verb
```

***

## Great documentation

Great **documentation** of ggplot, covering all functions in **step II** and **III**, with demos: http://docs.ggplot2.org/current/

---

# Titanic tragedy data

---

## Reading RAW training data

* Download the data set "Titanic_train.csv" from https://raw.githubusercontent.com/datasciencedojo/datasets/master/Titanic_train.csv
* Set the working directory of R to the directory of the file using setwd()

```{r}
titanic = read.csv('Titanic_train.csv')
```

***

## Look at the first few rows

What would be some good features to consider here?

```{r}
options(width = 110)
head(titanic)
```

***

## What is the data type of each column?

```{r}
sapply(titanic,class)
```

***

## Converting class label to a factor

```{r}
titanic$Survived = factor(titanic$Survived, labels=c("died", "survived"))
titanic$Embarked = factor(titanic$Embarked, labels=c("unknown", "Cherbourg", "Queenstown", "Southampton"))
sapply(titanic,class)
str(titanic$Survived)
str(titanic$Sex)
```

---

## Class distribution - PIE Charts

```{r, fig.width=3, fig.height=3}
survivedTable = table(titanic$Survived)
survivedTable
par(mar = c(0, 0, 0, 0), oma = c(0, 0, 0, 0))
pie(survivedTable,labels=c("Died","Survived"))
```

***

## Is Sex a good predictor?

```{r, fig.width=14, fig.height=4.5}
male = titanic[titanic$Sex=="male",]
female = titanic[titanic$Sex=="female",]
par(mfrow = c(1, 2), mar = c(0, 0, 2, 0), oma = c(0, 1, 0, 1))
pie(table(male$Survived),labels=c("Dead","Survived"), main="Survival Portion Among Men")
pie(table(female$Survived),labels=c("Dead","Survived"), main="Survival Portion Among Women")
```

---

## Is Age a good predictor?
```{r}
Age <- titanic$Age
summary(Age)
```

How about a summary segmented by **survival**?

```{r}
summary(titanic[titanic$Survived=="died",]$Age)
summary(titanic[titanic$Survived=="survived",]$Age)
```

***

## Age distribution by Survival and Sex

```{r, fig.width=14, fig.height=6}
par(mfrow = c(1, 2), mar = c(4, 4, 2, 2), oma = c(1, 1, 1, 1))
boxplot(titanic$Age~titanic$Sex, main="Age Distribution By Gender",col=c("red","green"))
boxplot(titanic$Age~titanic$Survived, main="Age Distribution By Survival",col=c("red","green"), xlab="Survival",ylab="Age")
```

***

## Histogram of Age

```{r, fig.width=6, fig.height=6}
hist(Age, col="blue", xlab="Age", ylab="Frequency", main = "Distribution of Passenger Ages on Titanic")
```

***

## Kernel density plot of age

```{r, fig.width=6, fig.height=5.5}
d = density(na.omit(Age)) # density(Age) won't work, need to omit all NAs
plot(d, main = "kernel density of Ages of Titanic Passengers")
polygon(d, col="red", border="blue")
```

***

## Comparison of density plots of Age by Sex

```{r, echo=FALSE}
titanic_na_removed = na.omit(titanic)
# levels() below needs a factor; read.csv() returns character columns in R >= 4.0
titanic_na_removed$Sex = factor(titanic_na_removed$Sex)
library(sm) # you may need to install the sm package first
sm.density.compare(titanic_na_removed$Age, titanic_na_removed$Sex,xlab="Age of Passenger")
title(main="Kernel Density Plot of Ages By Sex")
colfill<-c(2:(2+ length(levels(titanic_na_removed$Sex))))
legend("topright", legend=levels(titanic_na_removed$Sex), fill=colfill)
```

***

## Did Age have an impact on survival?
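
As a rough numeric cross-check before looking at the densities, the sketch below uses R's built-in `Titanic` table. This is an assumption made to keep the example self-contained: it is an aggregated table of everyone on board and records Age only as Child or Adult, so it is not the same data as `Titanic_train.csv`.

```r
library(datasets)

# Titanic is a 4-way contingency table: Class x Sex x Age x Survived.
# Collapse Class and Sex to get counts by Age group and survival.
age_by_survival <- margin.table(Titanic, margin = c(3, 4))

# Convert the counts to row-wise proportions (each Age group sums to 1).
survival_rate <- prop.table(age_by_survival, margin = 1)

print(age_by_survival)
print(round(survival_rate, 2))
```

The same two calls, table() followed by prop.table(), should work on the CSV data once a categorical age column exists.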
```{r, echo=FALSE, fig.width=23, fig.height=8}
library(sm) # you may need to install the sm package first
par(mfrow = c(1, 3), mar = c(4, 4, 5, 2), oma = c(1, 1, 2, 1))
plot(d, main = "kernel density of Ages of Titanic Passengers", cex.main=3)
polygon(d, col="red", border="blue")
sm.density.compare(titanic_na_removed$Age, titanic_na_removed$Sex,xlab="Age of Passenger")
title(main="Kernel Density Plot of Ages By Sex", cex.main=3)
colfill<-c(2:(2+ length(levels(titanic_na_removed$Sex))))
legend("topright", legend=levels(titanic_na_removed$Sex), fill=colfill)
sm.density.compare(titanic_na_removed$Age, titanic_na_removed$Survived,xlab="Age of Passenger")
title(main="Kernel Density Plot of Ages By Survival", cex.main=3)
colfill<-c(2:(2+ length(levels(titanic_na_removed$Survived))))
legend("topright", legend=levels(titanic_na_removed$Survived), fill=colfill)
```

***

## Create categorical groupings: Adult vs Child

An example of **feature engineering**!

```{r}
## Multi dimensional comparison
Child <- titanic$Age # Isolating age.
## Now we need to create categories: NA = Unknown, 1 = Child, 2 = Adult
## Every age below 13 (exclusive) is classified into age group 1
Child[Child<13] <- 1
## Every age 13 or above is classified into age group 2
Child[Child>=13] <- 2
```

```{r}
# Use labels instead of 1's and 2's
Child[Child==1] <- "Child"
Child[Child==2] <- "Adult"
# Appends the new column to the titanic dataset
titanic_with_child_column <- cbind(titanic, Child)
# Removes rows where age is NA
titanic_with_child_column <- titanic_with_child_column[!is.na(titanic_with_child_column$Child),]
```

---

## Fare matters?

```{r, echo=FALSE, fig.width=8, fig.height=6.5}
library(ggplot2)
## Plot may differ depending on your definition of a child
ggplot(titanic_with_child_column, aes(y=Fare, x=Survived)) + geom_boxplot() + facet_grid(Sex~Child)
```

***

## How about fare, ship class, port of embarkation?
```{r, echo=FALSE, fig.width=17, fig.height=5}
library(ggplot2)
titanic$Pclass = as.factor(titanic$Pclass)
ggplot(titanic, aes(y=Fare, x=Pclass)) + geom_boxplot() + facet_grid(~Embarked)
```

---

# Diamond data

---

## Overview of the diamond data

```{r}
data(diamonds) # loading diamonds data set
head(diamonds, 16) # first 16 rows of diamond data set
```

***

## Histogram of carat

```{r, fig.width=8, fig.height=5}
library(ggplot2)
ggplot(data=diamonds) + geom_histogram(aes(x=carat))
```

***

## Density plot of carat

```{r, fig.width=8, fig.height=5}
ggplot(data=diamonds) + geom_density(aes(x=carat),fill="gray50")
```

***

## Scatter plots (carat vs. price)

```{r, fig.width=9, fig.height=6}
ggplot(diamonds, aes(x=carat,y=price)) + geom_point()
```

***

## Carat with colors

```{r, fig.width=9, fig.height=6}
g = ggplot(diamonds, aes(x=carat, y=price)) # saving first layer as variable
g + geom_point(aes(color=color)) # rendering first layer and adding another layer
```

***

## Carat with colors (more details)

```{r, fig.width=10, fig.height=7}
g + geom_point(aes(color=color)) + facet_wrap(~color)
```

***

## Let's consider cut and clarity

```{r, fig.width=15, fig.height=8, echo=FALSE}
g + geom_point(aes(color=color)) + facet_grid(cut~clarity)
```

***

## Your turn!

What have you learned about diamond prices after exploring this data?

---

# Interactive visualization in R - rCharts

* What is rCharts?

    rCharts is an R package to create, customize and publish interactive JavaScript visualizations from R using a familiar lattice-style plotting interface.

* What can rCharts make, and how?

    Quick start at: http://ramnathv.github.io/rCharts/

* A list of interactive visualizations in R can be found at: http://ouzor.github.io/blog/2014/11/21/interactive-visualizations.html

---

# Tell your story - R Markdown

* R Markdown is an authoring format that enables easy creation of dynamic documents, presentations, and reports from R.
* It combines the core syntax of markdown (an easy-to-write plain text format) with embedded R code chunks that are run so their output can be included in the final document.
* Many output formats are available, including HTML, PDF, and MS Word
* **Installation**
    * In RStudio: already installed
    * Outside of RStudio: install.packages("rmarkdown"). A recent version of pandoc (>= 1.12.3) is also required. See https://github.com/rstudio/rmarkdown/blob/master/PANDOC.MD to install it.

***

## Check out Markdown first

> Markdown is a markup language with plain text formatting syntax designed so that it can be converted to HTML and many other formats using a tool by the same name.

It takes about a minute to get the point, and the cheat sheet is always there to check: https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet#lists

***

## Then, R Markdown sample code

Download the template: https://github.com/datasciencedojo/datasets/blob/master/rmarkdownd_template.Rmd

## R Markdown

* YAML header
* Edit Markdown, and R chunks
* Run!
    * RStudio: Knit button
    * Command line: rmarkdown::render("file.Rmd")

Cheat sheet of rmarkdown: http://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf

---

# Present your story of Titanic!

Use

* Titanic data
* Plotting functions in R
* R Markdown template
* **The heart of a data explorer**

to write your story of Titanic...

***

## Hope this is inspiring :)

[Titanic](https://vimeo.com/21941048)
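
***

## R Markdown from the command line: a sketch

The R Markdown workflow described earlier (YAML header, Markdown text, R chunks, then render) can be tried end to end with a few lines of R. This is only a minimal sketch: the file name `first_report.Rmd` and its contents are hypothetical, and rendering is skipped when the rmarkdown package or pandoc is not available.

```r
# Hypothetical file name and contents, for illustration only.
rmd_file <- file.path(tempdir(), "first_report.Rmd")
fence <- strrep("`", 3)  # chunk delimiter, built here to avoid nesting fences

writeLines(c(
  "---",                        # YAML header
  'title: "My first report"',
  "output: html_document",
  "---",
  "",
  "## Ozone at a glance",       # Markdown text
  "",
  paste0(fence, "{r}"),         # R chunk
  "summary(airquality$Ozone)",
  fence
), rmd_file)

# Render only when both rmarkdown and pandoc are installed.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_file, quiet = TRUE)
  print(out_file)  # path of the generated HTML file
}
```

Swapping the chunk body for the Titanic plots from the earlier slides gives a starting point for the exercise.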