Commit 08f5fb8e authored 6 years ago by Arham Akheel, committed by GitHub 6 years ago
Files for webinar/tutorial
parent ca999c34
Showing 4 changed files with 252 additions and 0 deletions
Data Visualization with ggplot2.R (+252 -0)
Data Visualization with ggplot2.pdf (+0 -0)
H1B Metadata 2018.pdf (+0 -0)
Vislualizations - A thought Starter.jpg (+0 -0)
Introduction to Data Visualization with R and ggplot2/Data Visualization with ggplot2.R
0 → 100644
# Copyright 2017 Data Science Dojo
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
#
# This R source code file corresponds to the Data Science Dojo webinar
# titled "An Introduction to Data Visualization with R and ggplot2"
#
setwd("C:/Users/Arham/Desktop/Data Visualization with ggplot2")
install.packages("ggplot2")
install.packages("dplyr")
library(dplyr)
library(ggplot2)
# Load Titanic data for analysis. Open in spreadsheet view.
titanic <- read.csv("titanic.csv", stringsAsFactors = FALSE)
View(titanic)
# Set up factors.
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Survived <- as.factor(titanic$Survived)
titanic$Sex <- as.factor(titanic$Sex)
titanic$Embarked <- as.factor(titanic$Embarked)
#
# We'll start our visual analysis of the data focusing on questions
# related to survival rates. Specifically, these questions will use
# the factor (i.e., categorical) variables in the data. Factor data
# is very common in the business context and ggplot2 offers many
# powerful features for visualizing factor data.
#
#
# First question - What was the survival rate?
#
# As Survived is a factor (i.e., categorical) variable, a bar chart
# is a great visualization to use.
#
ggplot(titanic, aes(x = Survived)) +
  geom_bar()
# If you really want percentages.
prop.table(table(titanic$Survived))
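# One possible way to put those percentages on the chart itself. This is a
# sketch, not part of the webinar code: toy_survival is a made-up data frame
# standing in for titanic, and after_stat() assumes ggplot2 3.3.0 or later.
library(ggplot2)

toy_survival <- data.frame(Survived = factor(c(0, 0, 0, 1, 1)))

# Share of rows in each Survived level, as prop.table(table(...)) reports it.
toy_props <- prop.table(table(toy_survival$Survived))

# Map the bar height to count / total so the y-axis reads as a share.
pct_plot <- ggplot(toy_survival, aes(x = Survived)) +
  theme_bw() +
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  scale_y_continuous(labels = function(p) paste0(100 * p, "%")) +
  labs(y = "Share of Passengers",
       title = "Toy Survival Rates (illustrative data)")
pct_plot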
# Add some customization for labels and theme.
ggplot(titanic, aes(x = Survived)) +
  theme_bw() +
  geom_bar() +
  labs(y = "Passenger Count",
       title = "Titanic Survival Rates")
#
# Second question - What was the survival rate by gender?
#
# We can use color to look at two aspects (i.e., dimensions)
# of the data simultaneously.
#
ggplot(titanic, aes(x = Sex, fill = Survived)) +
  theme_bw() +
  geom_bar() +
  labs(y = "Passenger Count",
       title = "Titanic Survival Rates by Sex")
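# Stacked counts can make rates hard to compare when group sizes differ. One
# option, sketched here on made-up data (toy_sex is hypothetical, not the
# Titanic file), is position = "fill", which rescales every bar to height 1
# so the fill colors show within-group proportions.
library(ggplot2)

toy_sex <- data.frame(
  Sex      = factor(c(rep("female", 4), rep("male", 6))),
  Survived = factor(c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0))
)

# Row-wise proportions: the survival share within each sex.
rate_by_sex <- prop.table(table(toy_sex$Sex, toy_sex$Survived), margin = 1)
rate_by_sex

fill_plot <- ggplot(toy_sex, aes(x = Sex, fill = Survived)) +
  theme_bw() +
  geom_bar(position = "fill") +
  labs(y = "Proportion of Passengers",
       title = "Toy Survival Proportions by Sex (illustrative data)")
fill_plot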
#
# Third question - What was the survival rate by class of ticket?
#
ggplot(titanic, aes(x = Pclass, fill = Survived)) +
  theme_bw() +
  geom_bar() +
  labs(y = "Passenger Count",
       title = "Titanic Survival Rates by Pclass")
#
# Fourth question - What was the survival rate by class of ticket
# and gender?
#
# We can leverage facets to further segment the data and enable
# "visual drill-down" into the data.
#
ggplot(titanic, aes(x = Sex, fill = Survived)) +
  theme_bw() +
  facet_wrap(~ Pclass) +
  geom_bar() +
  labs(y = "Passenger Count",
       title = "Titanic Survival Rates by Pclass and Sex")
#
# Next, we'll move on to visualizing continuous (i.e., numeric)
# data using ggplot2. We'll explore visualizations of single
# numeric variables (i.e., columns) and also illustrate how
# ggplot2 enables visual drill-down on numeric data.
#
#
# Fifth Question - What is the distribution of passenger ages?
#
# The histogram is a staple of visualizing numeric data as it very
# powerfully communicates the distribution of a variable (i.e., column).
#
ggplot(titanic, aes(x = Age)) +
  theme_bw() +
  geom_histogram(binwidth = 5) +
  labs(y = "Passenger Count",
       x = "Age (binwidth = 5)",
       title = "Titanic Age Distribution")
#
# Sixth Question - What are the survival rates by age?
#
ggplot(titanic, aes(x = Age, fill = Survived)) +
  theme_bw() +
  geom_histogram(binwidth = 5) +
  labs(y = "Passenger Count",
       x = "Age (binwidth = 5)",
       title = "Titanic Survival Rates by Age")
# Another great visualization for this question is the box-and-whisker
# plot.
ggplot(titanic, aes(x = Survived, y = Age)) +
  theme_bw() +
  geom_boxplot() +
  labs(y = "Age",
       x = "Survived",
       title = "Titanic Survival Rates by Age")
#
# Seventh Question - What are the survival rates by age when segmented
# by gender and class of ticket?
#
# A related visualization to the histogram is a density plot. Think of
# a density plot as a smoothed version of the histogram. Using ggplot2
# we can use facets to allow for visual drill-down via density plots.
#
ggplot(titanic, aes(x = Age, fill = Survived)) +
  theme_bw() +
  facet_wrap(Sex ~ Pclass) +
  geom_density(alpha = 0.5) +
  labs(y = "Density",
       x = "Age",
       title = "Titanic Survival Rates by Age, Pclass and Sex")
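# A rough check of the "smoothed histogram" idea, using base R on simulated
# ages (sim_ages is made-up data, not the Titanic file). Both the density-
# scaled histogram and the kernel density estimate describe a distribution
# whose total area is 1; the density curve simply has no bin edges.
set.seed(42)
sim_ages <- rnorm(500, mean = 30, sd = 12)

# Histogram with 5-unit-wide bins covering the full range of the data.
bin_edges <- seq(floor(min(sim_ages) / 5) * 5,
                 ceiling(max(sim_ages) / 5) * 5,
                 by = 5)
age_hist <- hist(sim_ages, breaks = bin_edges, plot = FALSE)
hist_area <- sum(age_hist$density * diff(age_hist$breaks))

# Kernel density estimate; the area is approximated with a Riemann sum.
age_dens <- density(sim_ages)
dens_area <- sum(age_dens$y) * (age_dens$x[2] - age_dens$x[1])

c(histogram = hist_area, density = dens_area)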
# If you prefer histograms, no problem!
ggplot(titanic, aes(x = Age, fill = Survived)) +
  theme_bw() +
  facet_wrap(Sex ~ Pclass) +
  geom_histogram(binwidth = 5) +
  labs(y = "Passenger Count",
       x = "Age (binwidth = 5)",
       title = "Titanic Survival Rates by Age, Pclass and Sex")
# Load H1B data for analysis. Open in spreadsheet view.
h1b <- read.csv("H-1B_FY2018.csv", stringsAsFactors = FALSE, encoding = 'UTF-8')
View(h1b)
# Set up factors.
h1b$EMPLOYER_NAME <- as.factor(h1b$EMPLOYER_NAME)
h1b$EMPLOYER_CITY <- as.factor(h1b$EMPLOYER_CITY)
h1b$EMPLOYER_STATE <- as.factor(h1b$EMPLOYER_STATE)
h1b$SOC_NAME <- as.factor(h1b$SOC_NAME)
h1b$WORKSITE_CITY <- as.factor(h1b$WORKSITE_CITY)
h1b$WORKSITE_STATE <- as.factor(h1b$WORKSITE_STATE)
h1b$CASE_STATUS <- as.factor(h1b$CASE_STATUS)
h1b$PW_WAGE_LEVEL <- as.factor(h1b$PW_WAGE_LEVEL)
h1b$JOB_TITLE <- as.factor(h1b$JOB_TITLE)
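# Convert the prevailing wage to numeric; entries that cannot be parsed become NA.
h1b$PREVAILING_WAGE <- as.numeric(h1b$PREVAILING_WAGE)
# Replace missing wages with the rounded mean of the available wages.
h1b$PREVAILING_WAGE[is.na(h1b$PREVAILING_WAGE)] <- round(mean(h1b$PREVAILING_WAGE, na.rm = TRUE))
head(h1b)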
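# Why the is.na() step above is needed: as.numeric() turns any string it
# cannot parse into NA. A small sketch on made-up values; toy_wages is a
# hypothetical vector, not taken from the H-1B file.
toy_wages <- c("60000", "85000.5", "N/A", "")
toy_wages_num <- suppressWarnings(as.numeric(toy_wages))  # "N/A" and "" become NA

# Mean imputation, mirroring the PREVAILING_WAGE step: every NA is replaced
# by the rounded mean of the values that did parse.
toy_wages_num[is.na(toy_wages_num)] <- round(mean(toy_wages_num, na.rm = TRUE))
toy_wages_num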
# We can use color to look at two aspects (i.e., dimensions)
# of the data simultaneously.
ggplot(h1b, aes(x = EMPLOYER_STATE, fill = CASE_STATUS)) +
  theme_bw() +
  geom_bar() +
  labs(y = "No. of Applications",
       x = "Employer State",
       title = "Distribution by Employer State")
# Subsetting the data to keep only "CERTIFIED" H1B cases
certified_h1b <- h1b %>%
  filter(CASE_STATUS == "CERTIFIED")
# Function to return the top N employers that have the most H1B workers
top_N_employers <- function(num_emp) {
  certified_h1b %>%
    group_by(EMPLOYER_NAME) %>%
    summarise(num_apps = n()) %>%
    arrange(desc(num_apps)) %>%
    slice(1:num_emp)
}
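# The same group / count / sort / slice pattern on a tiny made-up table, so
# the result can be checked by eye. toy_apps and its employer names are
# hypothetical. count(sort = TRUE) is used here as a shorthand for
# group_by() + summarise(n()) + arrange(desc()); it is an alternative
# sketch, not what the webinar code does.
library(dplyr)

toy_apps <- data.frame(
  EMPLOYER_NAME = c("A", "B", "A", "C", "A", "B"),
  stringsAsFactors = FALSE
)

# Count applications per employer, largest first, and keep the top two.
top_toy <- toy_apps %>%
  count(EMPLOYER_NAME, name = "num_apps", sort = TRUE) %>%
  slice(1:2)
top_toy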
# Bar plot to show the top 10 employers who filed the most h1b visa applications
ggplot(top_N_employers(10),
       aes(x = reorder(EMPLOYER_NAME, num_apps), y = num_apps)) +
  geom_bar(stat = "identity", alpha = 0.9, fill = "green", width = 0.7) +
  coord_flip() +
  # Tick marks every 1,000 applications.
  scale_y_continuous(limits = c(0, 11000), breaks = seq(0, 11000, by = 1000)) +
  geom_text(aes(label = num_apps), hjust = -0.2, size = 2) +
  ggtitle("Top 10 Employers with most applications") +
  theme_bw() +
  labs(x = "Employer Name", y = "No. of Applications")
# Function to return top N occupations that have the most H1B applicants
top_N_SOC <- function(num) {
  certified_h1b %>%
    filter(!is.na(SOC_NAME)) %>%
    group_by(SOC_NAME) %>%
    summarise(num_apps = n()) %>%
    arrange(desc(num_apps)) %>%
    slice(1:num)
}
# Bar plot to show the top 10 H1B occupations
ggplot(top_N_SOC(10),
       aes(x = reorder(SOC_NAME, num_apps), y = num_apps)) +
  geom_bar(stat = "identity", alpha = 0.9, fill = "blue", width = 0.7) +
  coord_flip() +
  scale_y_continuous() +
  geom_text(aes(label = num_apps), hjust = -0.2, size = 2) +
  ggtitle("Top 10 occupations with most H1B petitions") +
  theme(plot.title = element_text(size = rel(1)),
        axis.text.y = element_text(size = rel(0.8))) +
  labs(x = "SOC Name", y = "No. of Applications")
Introduction to Data Visualization with R and ggplot2/Data Visualization with ggplot2.pdf
0 → 100644
File added
Introduction to Data Visualization with R and ggplot2/H1B Metadata 2018.pdf
0 → 100644
File added
Introduction to Data Visualization with R and ggplot2/Vislualizations - A thought Starter.jpg
0 → 100644
486 KB