Commit 15a2fbed by Arham Akheel


Migrating meetup, datasets, web_scraping_r, IntroDataVisualizationWithRAndGgplot2 to tutorials repository
parent 22f44079
Col1
the
and
you
for.
that
have
but
just
with
get
not
day
was
now
this
can
work
all
out
are
http
today
your
too
time
what
got
thank
back
want
from
one
know
will
see
feel
com
think
about
don
realli
had
how
some
there
night
amp
make
watch
need
new
still
they
come
home
when
look
here
off
more
much
quot
twitter
morn
last
tomorrow
then
has
been
wait
sleep
again
her
onli
week
tri
whi
tonight
would
she
thing
way
did
say
follow
veri
bit
though
take
gonna
them
over
should
yeah
bed
even
start
tweet
could
school
hour
peopl
show
twitpic
didn
guy
hey
after
him
next.
weekend
play
down
final
let
cant
use
yes
were
who
soon
never
dont
life
girl
littl
everyon
year
rain
wanna
movi
first
find
where
call
done
sure
head
our
keep
ani
than
alway
his
leav
lot
talk
alreadi
won
man
readi
someth
made
anoth
live
read
eat
becaus
yet
yay
phone
ever
hous
went
song
befor
sound
thought
mayb
summer
someon
tell
give
guess
babi
check
mean
other
end
game
into
hear
listen
later
doesn
noth
while.
actual
happen
same
pic
stuff
birthday
mom
saw
weather
car
two
doe
put
stay
yesterday
world
those
run
also
might
until
gotta
meet
said
around
post
exam
monday
friday
seem
sinc
sunday
job
must
mani
updat
myself
found
haven
video
gone
such
famili
book
most
www
aww
month
their
boy
shop
move
least
dinner
total
woke
may
anyth
lunch
studi
pictur
hair
isn
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# Please determine the required text preprocessing steps using the following flags
replace_special_chars <- TRUE
remove_duplicate_chars <- TRUE
replace_numbers <- TRUE
convert_to_lower_case <- TRUE
remove_default_stopWords <- TRUE
remove_given_stopWords <- TRUE
stem_words <- TRUE
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# Map 1-based optional input ports to variables
dataset1 <- maml.mapInputPort(1) # class: data.frame
# get the label and text columns from the input data set
text_column <- dataset1[["tweet_text"]]
#label_column <- dataset1[["label_column"]]
stopword_list <- NULL
result <- tryCatch({
dataset2 <- maml.mapInputPort(2) # class: data.frame
# get the stopword list from the second input data set
stopword_list <- dataset2[[1]]
}, warning = function(war) {
# warning handler
print(paste("WARNING: ", war))
}, error = function(err) {
# error handler
print(paste("ERROR: ", err))
stopword_list <- NULL
}, finally = {})
# Load the R script from the Zip port in ./src/
source("src/text.preprocessing.R");
text_column <- preprocessText(text_column,
replace_special_chars,
remove_duplicate_chars,
replace_numbers,
convert_to_lower_case,
remove_default_stopWords,
remove_given_stopWords,
stem_words,
stopword_list)
Sentiment <- dataset1[["sentiment_label"]]
data.set <- data.frame(
Sentiment,
text_column,
stringsAsFactors = FALSE
)
# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("data.set")
<?xml version="1.0" encoding="utf-8"?>
<configuration>
<startup>
<supportedRuntime version="v4.0" sku=".NETFramework,Version=v4.5"/>
</startup>
<system.serviceModel>
<extensions>
<!-- In this extension section we are introducing all known service bus extensions. User can remove the ones they don't need. -->
<behaviorExtensions>
<add name="connectionStatusBehavior"
type="Microsoft.ServiceBus.Configuration.ConnectionStatusElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
<add name="transportClientEndpointBehavior"
type="Microsoft.ServiceBus.Configuration.TransportClientEndpointBehaviorElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
<add name="serviceRegistrySettings"
type="Microsoft.ServiceBus.Configuration.ServiceRegistrySettingsElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
</behaviorExtensions>
<bindingElementExtensions>
<add name="netMessagingTransport"
type="Microsoft.ServiceBus.Messaging.Configuration.NetMessagingTransportExtensionElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
<add name="tcpRelayTransport"
type="Microsoft.ServiceBus.Configuration.TcpRelayTransportElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
<add name="httpRelayTransport"
type="Microsoft.ServiceBus.Configuration.HttpRelayTransportElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
<add name="httpsRelayTransport"
type="Microsoft.ServiceBus.Configuration.HttpsRelayTransportElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
<add name="onewayRelayTransport"
type="Microsoft.ServiceBus.Configuration.RelayedOnewayTransportElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
</bindingElementExtensions>
<bindingExtensions>
<add name="basicHttpRelayBinding"
type="Microsoft.ServiceBus.Configuration.BasicHttpRelayBindingCollectionElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
<add name="webHttpRelayBinding"
type="Microsoft.ServiceBus.Configuration.WebHttpRelayBindingCollectionElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
<add name="ws2007HttpRelayBinding"
type="Microsoft.ServiceBus.Configuration.WS2007HttpRelayBindingCollectionElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
<add name="netTcpRelayBinding"
type="Microsoft.ServiceBus.Configuration.NetTcpRelayBindingCollectionElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
<add name="netOnewayRelayBinding"
type="Microsoft.ServiceBus.Configuration.NetOnewayRelayBindingCollectionElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
<add name="netEventRelayBinding"
type="Microsoft.ServiceBus.Configuration.NetEventRelayBindingCollectionElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
<add name="netMessagingBinding"
type="Microsoft.ServiceBus.Messaging.Configuration.NetMessagingBindingCollectionElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
</bindingExtensions>
</extensions>
</system.serviceModel>
<appSettings>
<!-- Service Bus specific app settings for messaging connections -->
<add key="Microsoft.ServiceBus.ConnectionString"
value="Endpoint=sb://tolltest.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=V93mgRhRp0d1FkslcsyjOZNLjo5iSZ730wJuWbZIbS8="/>
<add key="storageAccountName"
value="dojodemo"/>
<add key="storageAccountKey"
value="QPALUJTeuleyZLwLQ45uT5gLIe6KcrKtpO4VpDsRs/8blwphpkySk7FQwHO4lbgp633uNEG5UFePj/p+6bDmnw=="/>
</appSettings>
</configuration>
<?xml version="1.0" encoding="utf-8"?>
<configuration>
<startup>
<supportedRuntime version="v4.0" sku=".NETFramework,Version=v4.5"/>
</startup>
<system.serviceModel>
<extensions>
<!-- In this extension section we are introducing all known service bus extensions. User can remove the ones they don't need. -->
<behaviorExtensions>
<add name="connectionStatusBehavior"
type="Microsoft.ServiceBus.Configuration.ConnectionStatusElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
<add name="transportClientEndpointBehavior"
type="Microsoft.ServiceBus.Configuration.TransportClientEndpointBehaviorElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
<add name="serviceRegistrySettings"
type="Microsoft.ServiceBus.Configuration.ServiceRegistrySettingsElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
</behaviorExtensions>
<bindingElementExtensions>
<add name="netMessagingTransport"
type="Microsoft.ServiceBus.Messaging.Configuration.NetMessagingTransportExtensionElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
<add name="tcpRelayTransport"
type="Microsoft.ServiceBus.Configuration.TcpRelayTransportElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
<add name="httpRelayTransport"
type="Microsoft.ServiceBus.Configuration.HttpRelayTransportElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
<add name="httpsRelayTransport"
type="Microsoft.ServiceBus.Configuration.HttpsRelayTransportElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
<add name="onewayRelayTransport"
type="Microsoft.ServiceBus.Configuration.RelayedOnewayTransportElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
</bindingElementExtensions>
<bindingExtensions>
<add name="basicHttpRelayBinding"
type="Microsoft.ServiceBus.Configuration.BasicHttpRelayBindingCollectionElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
<add name="webHttpRelayBinding"
type="Microsoft.ServiceBus.Configuration.WebHttpRelayBindingCollectionElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
<add name="ws2007HttpRelayBinding"
type="Microsoft.ServiceBus.Configuration.WS2007HttpRelayBindingCollectionElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
<add name="netTcpRelayBinding"
type="Microsoft.ServiceBus.Configuration.NetTcpRelayBindingCollectionElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
<add name="netOnewayRelayBinding"
type="Microsoft.ServiceBus.Configuration.NetOnewayRelayBindingCollectionElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
<add name="netEventRelayBinding"
type="Microsoft.ServiceBus.Configuration.NetEventRelayBindingCollectionElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
<add name="netMessagingBinding"
type="Microsoft.ServiceBus.Messaging.Configuration.NetMessagingBindingCollectionElement, Microsoft.ServiceBus, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
</bindingExtensions>
</extensions>
</system.serviceModel>
<appSettings>
<!-- Service Bus specific app settings for messaging connections -->
<add key="Microsoft.ServiceBus.ConnectionString"
value="Endpoint=sb://tolltest.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=V93mgRhRp0d1FkslcsyjOZNLjo5iSZ730wJuWbZIbS8="/>
<add key="storageAccountName"
value="dojoeventhubs"/>
<add key="storageAccountKey"
value="lrrS7WkjginKovVFS9E3J8JmYJRnEj6bsz7hGymEqwfqmbt31h5GmQwE9+SiVSC3NPQZ+FhYLtkbTkJxOBbTrg=="/>
</appSettings>
</configuration>
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
<assemblyIdentity version="1.0.0.0" name="MyApplication.app"/>
<trustInfo xmlns="urn:schemas-microsoft-com:asm.v2">
<security>
<requestedPrivileges xmlns="urn:schemas-microsoft-com:asm.v3">
<requestedExecutionLevel level="asInvoker" uiAccess="false"/>
</requestedPrivileges>
</security>
</trustInfo>
</assembly>
<?xml version="1.0" encoding="utf-8"?>
<doc>
<assembly>
<name>Microsoft.ServiceBus.Messaging.EventProcessorHost</name>
</assembly>
<members>
<member name="T:Microsoft.ServiceBus.Messaging.EventProcessorHost">
<summary>Represents a host for processing Event Hubs event data.</summary>
</member>
<member name="M:Microsoft.ServiceBus.Messaging.EventProcessorHost.#ctor(System.String,System.String,System.String,System.String,System.String)">
<summary>Initializes a new instance of the <see cref="T:Microsoft.ServiceBus.Messaging.EventProcessorHost" /> class.</summary>
<param name="hostName">The name of the <see cref="T:Microsoft.ServiceBus.Messaging.EventProcessorHost" /> instance. This name must be unique for each instance of the host.</param>
<param name="eventHubPath">The path to the Event Hub from which to start receiving event data.</param>
<param name="consumerGroupName">The name of the Event Hubs consumer group from which to start receiving event data.</param>
<param name="eventHubConnectionString">The connection string for the Event Hub.</param>
<param name="storageConnectionString">The connection string for the Azure Blob storage account to use for partition distribution.</param>
</member>
<member name="M:Microsoft.ServiceBus.Messaging.EventProcessorHost.#ctor(System.String,System.String,System.String,System.String,System.String,System.String)">
<summary>Initializes a new instance of the <see cref="T:Microsoft.ServiceBus.Messaging.EventProcessorHost" /> class.</summary>
<param name="hostName">The name of the <see cref="T:Microsoft.ServiceBus.Messaging.EventProcessorHost" /> instance. This name must be unique for each instance of the host.</param>
<param name="eventHubPath">The path to the Event Hub from which to start receiving event data.</param>
<param name="consumerGroupName">The name of the Event Hubs consumer group from which to start receiving event data.</param>
<param name="eventHubConnectionString">The connection string for the Event Hub.</param>
<param name="storageConnectionString">The connection string for the Azure Blob storage account to use for partition distribution.</param>
<param name="leaseContainerName">The name of the Azure Blob container in which all lease blobs are created. If this parameter is not supplied, then the Event Hubs path is used as the name of the Azure Blob container.</param>
</member>
<member name="P:Microsoft.ServiceBus.Messaging.EventProcessorHost.HostName">
<summary>Gets the host name, which is a unique name for the <see cref="T:Microsoft.ServiceBus.Messaging.EventProcessorHost" /> instance.</summary>
<returns>The host name.</returns>
</member>
<member name="P:Microsoft.ServiceBus.Messaging.EventProcessorHost.PartitionManagerOptions">
<summary>Gets or sets the <see cref="T:Microsoft.ServiceBus.Messaging.PartitionManagerOptions" /> instance used by the <see cref="T:Microsoft.ServiceBus.Messaging.EventProcessorHost" /> object.</summary>
<returns>The <see cref="T:Microsoft.ServiceBus.Messaging.PartitionManagerOptions" /> instance.</returns>
</member>
<member name="M:Microsoft.ServiceBus.Messaging.EventProcessorHost.RegisterEventProcessorAsync``1">
<summary>Asynchronously registers the <see cref="T:Microsoft.ServiceBus.Messaging.IEventProcessor" /> interface implementation with the host using the <see cref="T:Microsoft.ServiceBus.Messaging.DefaultEventProcessorFactory`1" /> factory. This method also starts the host and enables it to start participating in the partition distribution process.</summary>
<returns>A task indicating that the <see cref="T:Microsoft.ServiceBus.Messaging.EventProcessorHost" /> instance has started.</returns>
<typeparam name="T">Implementation of your application-specific <see cref="T:Microsoft.ServiceBus.Messaging.IEventProcessor" />.</typeparam>
</member>
<member name="M:Microsoft.ServiceBus.Messaging.EventProcessorHost.RegisterEventProcessorAsync``1(Microsoft.ServiceBus.Messaging.EventProcessorOptions)">
<summary>Asynchronously registers the <see cref="T:Microsoft.ServiceBus.Messaging.IEventProcessor" /> interface implementation with the host using the <see cref="T:Microsoft.ServiceBus.Messaging.DefaultEventProcessorFactory`1" /> factory. This method also starts the host and enables it to start participating in the partition distribution process.</summary>
<returns>A task indicating that the <see cref="T:Microsoft.ServiceBus.Messaging.EventProcessorHost" /> instance has started.</returns>
<param name="processorOptions">An <see cref="T:Microsoft.ServiceBus.Messaging.EventProcessorOptions" /> object that controls various aspects of the event pump created when ownership is acquired for a given Event Hubs partition.</param>
<typeparam name="T">Implementation of your application-specific <see cref="T:Microsoft.ServiceBus.Messaging.IEventProcessor" />.</typeparam>
</member>
<member name="M:Microsoft.ServiceBus.Messaging.EventProcessorHost.RegisterEventProcessorFactoryAsync(Microsoft.ServiceBus.Messaging.IEventProcessorFactory)">
<summary>Asynchronously registers the event processor factory.</summary>
<returns>The task representing the asynchronous operation.</returns>
<param name="factory">The factory to register.</param>
</member>
<member name="M:Microsoft.ServiceBus.Messaging.EventProcessorHost.RegisterEventProcessorFactoryAsync(Microsoft.ServiceBus.Messaging.IEventProcessorFactory,Microsoft.ServiceBus.Messaging.EventProcessorOptions)">
<summary>Asynchronously registers the event processor factory.</summary>
<returns>Returns <see cref="T:System.Threading.Tasks.Task" />.</returns>
<param name="factory">The factory to register.</param>
<param name="processorOptions">An <see cref="T:Microsoft.ServiceBus.Messaging.EventProcessorOptions" /> object that controls various aspects of the event pump created when ownership is acquired for a given Event Hubs partition.</param>
</member>
<member name="M:Microsoft.ServiceBus.Messaging.EventProcessorHost.UnregisterEventProcessorAsync">
<summary>Asynchronously shuts down the <see cref="T:Microsoft.ServiceBus.Messaging.EventProcessorHost" /> instance. This method maintains the leases on all partitions currently held, and enables each <see cref="T:Microsoft.ServiceBus.Messaging.IEventProcessor" /> instance to shut down cleanly by invoking the <see cref="M:Microsoft.ServiceBus.Messaging.IEventProcessor.CloseAsync(Microsoft.ServiceBus.Messaging.PartitionContext,Microsoft.ServiceBus.Messaging.CloseReason)" /> method with a <see cref="F:Microsoft.ServiceBus.Messaging.CloseReason.Shutdown" /> object.</summary>
<returns>A task that indicates the <see cref="T:Microsoft.ServiceBus.Messaging.EventProcessorHost" /> instance has stopped.</returns>
</member>
<member name="T:Microsoft.ServiceBus.Messaging.PartitionManagerOptions">
<summary>Represents the options that control various aspects of partition distribution that occur within the <see cref="T:Microsoft.ServiceBus.Messaging.EventProcessorHost" /> instance.</summary>
</member>
<member name="M:Microsoft.ServiceBus.Messaging.PartitionManagerOptions.#ctor">
<summary>Initializes a new instance of the <see cref="T:Microsoft.ServiceBus.Messaging.PartitionManagerOptions" /> class.</summary>
</member>
<member name="P:Microsoft.ServiceBus.Messaging.PartitionManagerOptions.AcquireInterval">
<summary>Gets or sets the interval at which the <see cref="T:Microsoft.ServiceBus.Messaging.EventProcessorHost" /> instance begins a task to determine whether partitions are distributed evenly among known host instances.</summary>
<returns>The acquire interval of the partition.</returns>
</member>
<member name="P:Microsoft.ServiceBus.Messaging.PartitionManagerOptions.DefaultOptions">
<summary>Creates an instance of <see cref="P:Microsoft.ServiceBus.Messaging.EventProcessorHost.PartitionManagerOptions" /> with the following default values:<see cref="P:Microsoft.ServiceBus.Messaging.PartitionManagerOptions.RenewInterval" />: 10 seconds.<see cref="P:Microsoft.ServiceBus.Messaging.PartitionManagerOptions.AcquireInterval" />: 10 seconds.<see cref="P:Microsoft.ServiceBus.Messaging.PartitionManagerOptions.LeaseInterval" />: 30 seconds. </summary>
<returns>The default partition manager options.</returns>
</member>
<member name="P:Microsoft.ServiceBus.Messaging.PartitionManagerOptions.LeaseInterval">
<summary>Gets or sets the interval at which the lease is created on an Azure Blob representing an Event Hubs partition. If the lease is not renewed within this interval, it expires, and ownership of the partition passes to another <see cref="T:Microsoft.ServiceBus.Messaging.EventProcessorHost" /> instance.</summary>
<returns>Returns <see cref="T:System.TimeSpan" />.</returns>
</member>
<member name="P:Microsoft.ServiceBus.Messaging.PartitionManagerOptions.MaxReceiveClients"></member>
<member name="P:Microsoft.ServiceBus.Messaging.PartitionManagerOptions.RenewInterval">
<summary>Gets or sets the renewal interval for all leases for partitions currently held by the <see cref="T:Microsoft.ServiceBus.Messaging.EventProcessorHost" /> instance.</summary>
<returns>The interval to renew the partition.</returns>
</member>
</members>
</doc>
# Building a Real-time Sentiment Pipeline for Live Tweets using Python, R, & Azure
## Requirements
* Twitter Account + Twitter App setup (https://apps.twitter.com/)
* Anaconda 3.5 or Python 3.5 Installed
* Azure subscription or free trial account
* [30 day free trial](https://azure.microsoft.com/en-us/pricing/free-trial/)
* Azure Machine Learning Studio workspace
* Text editor (I'll be using Sublime Text 3)
* GitHub.com account (to get the code)
* PowerBI.com account (for the Dashboard portion)
* .NET (up to date) + Windows (for the testing portion)
## Cloning the Repo for Code & Materials
```
git clone https://www.github.com/datasciencedojo/meetup.git
```
Folder: Building a Real-time Sentiment Pipeline for Live Tweets using Python, R, & Azure
## The Predictive Model
### Supervised Twitter Dataset
* Azure ML Reader Module:
* Data source: Azure Blob Storage
* Authentication type: PublicOrSAS
* URI: http://azuremlsampleexperiments.blob.core.windows.net/datasets/Sentiment140.tenPercent.sample.tweets.tsv
* File format: TSV
* URI has header row: Checked
* Import and save the dataset (a minimal R sketch for reading the same file locally follows this list)
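If you want to explore the same file outside of the Azure ML Reader module, here is a minimal R sketch for reading it locally, assuming the public URI above is still reachable and that the columns are named `sentiment_label` and `tweet_text` as in the experiment:

```
# Read the Sentiment140 10% sample straight from the public blob URI
uri <- "http://azuremlsampleexperiments.blob.core.windows.net/datasets/Sentiment140.tenPercent.sample.tweets.tsv"
tweets <- read.delim(uri, sep = "\t", quote = "", header = TRUE,
                     stringsAsFactors = FALSE)
str(tweets)
```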
### Preprocessing & Cleaning
* Azure ML Metadata Editor: Cast categorical sentiment_label
* Azure ML Group Categorical Values: Casting '0' as Negative, '4' as Positive (an equivalent R recode is sketched after this list)
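For reference, the experiment does this with the two modules above, but the same cast/recode can be sketched in plain R, continuing the local `tweets` sketch and assuming the raw labels are 0 and 4 as in Sentiment140:

```
# Cast the 0/4 label to a categorical Negative/Positive factor
tweets$sentiment_label <- factor(tweets$sentiment_label,
                                 levels = c(0, 4),
                                 labels = c("Negative", "Positive"))
table(tweets$sentiment_label)
```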
### Text Processing
* Filtering using R (a local R sketch of these steps appears at the end of this section)
* Removing stop words (using a stop word list)
* Removing special characters
* Replacing numbers
* Converting all text to lower case
* Stemming and lemmatization
[Example of Cleansing Stop Words](http://demos.datasciencedojo.com/demo/stopwords/)
* Create a term frequency matrix for English words
* Azure ML's [Feature Hashing Module](https://msdn.microsoft.com/library/azure/c9a82660-2d9c-411d-8122-4d9e0b3ce92a)
* Drop the tweet_text column, since it is no longer needed
* Azure ML's Project Columns module
* Feature Selection & Filtering
* Pick only the X most relevant columns/words to train on.
* Using Azure ML's [Filter Based Feature Selection](https://msdn.microsoft.com/library/azure/818b356b-045c-412b-aa12-94a1d2dad90f) module, set to Pearson's correlation, select the top 5000 most correlated columns
* Normalize the Term Frequency Matrix
* A text processing best practice, though it does not matter too much for tweets
* Normalize Data Module: Min/Max for all numeric columns
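A rough local sketch of the text processing steps above, using the `tm` and `SnowballC` packages and continuing the `tweets` sketch. This is only an illustration; the experiment itself uses `src/text.preprocessing.R` plus the Feature Hashing, Filter Based Feature Selection, and Normalize Data modules:

```
# install.packages(c("tm", "SnowballC"))
library(tm)
library(SnowballC)

# Clean the tweet text
corpus <- VCorpus(VectorSource(tweets$tweet_text))
corpus <- tm_map(corpus, content_transformer(tolower))       # lower case
corpus <- tm_map(corpus, removePunctuation)                   # special characters
corpus <- tm_map(corpus, removeNumbers)                       # numbers
corpus <- tm_map(corpus, removeWords, stopwords("english"))   # default stop words
corpus <- tm_map(corpus, stemDocument)                        # stemming
corpus <- tm_map(corpus, stripWhitespace)

# Term frequency matrix, trimmed to reasonably frequent terms
dtm <- removeSparseTerms(DocumentTermMatrix(corpus), 0.999)
tf  <- as.data.frame(as.matrix(dtm))

# Min/max normalize every numeric column
tf <- as.data.frame(lapply(tf, function(x) {
  if (max(x) == min(x)) x else (x - min(x)) / (max(x) - min(x))
}))
```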
### Algorithm Selection
* [Algorithm Cheat Sheet](https://azure.microsoft.com/en-us/documentation/articles/machine-learning-algorithm-cheat-sheet/)
* [Beginner's Guide to Choosing Algorithms](https://azure.microsoft.com/en-us/documentation/articles/machine-learning-algorithm-choice/)
* [Azure ML's Support Vector Machines](https://msdn.microsoft.com/en-us/library/azure/dn905835.aspx)
* [Support Vector Machines in General](https://en.wikipedia.org/wiki/Support_vector_machine)
### Model Building
* Train the model
* Score the trained model against a validation set
* Evaluate the performance, maximizing accuracy in this case (a small accuracy sketch in R follows this list)
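As a reference point for the evaluation step, accuracy can be computed from a scored validation set in a couple of lines of R. This is only a sketch: `predicted_labels` and `actual_labels` are hypothetical vectors standing in for the scored output.

```
# Hypothetical predicted/actual label vectors from a scored validation set
conf.mat <- table(predicted = predicted_labels, actual = actual_labels)
accuracy <- sum(diag(conf.mat)) / sum(conf.mat)
accuracy
```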
### Twitter App
* [Creating a Twitter Account](https://www.hashtags.org/platforms/twitter/how-to-create-a-twitter-account/)
* [Creating a Twitter App](http://www.ning.com/help/?p=4955)
* Get your [Twitter app's](https://apps.twitter.com/) OAuth keys and tokens.
### Twitter API with Python
* [Twitter API for all languages](https://dev.twitter.com/overview/api/twitter-libraries)
* [Tweepy Python Package](https://github.com/tweepy/tweepy)
* [Streaming with Tweepy](http://tweepy.readthedocs.org/en/v3.2.0/streaming_how_to.html?highlight=stream)
### Azure Event Hub
* Create a Service Bus Namespace
* Create an Azure Event Hub
* Create a send key (to push data to)
* Create a manage key (stream processor)
* Create a listen key (to subscribe to)
* [Pushing to Azure Event Hub](http://azure-sdk-for-python.readthedocs.org/en/latest/servicebus.html)
* [Viewing inside of an Azure Event Hub](https://azure.microsoft.com/en-us/documentation/articles/event-hubs-csharp-ephcs-getstarted/)
## Deploy the Model
## Hook up Stream Processors
import tweepy
# import json
# my keys
consumer_token = ''
consumer_secret = ''
key = ''
secret = ''
auth = tweepy.OAuthHandler(consumer_token, consumer_secret)
auth.set_access_token(key, secret)
api = tweepy.API(auth)
api.verify_credentials()
class MyStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.text)

    def on_data(self, twitter_data):
        print(twitter_data)
        # tweetJSON = json.loads(twitter_data)
        # print(tweetJSON['text'].encode("utf-8"))

myStreamListener = MyStreamListener()
myStream = tweepy.Stream(auth=api.auth, listener=myStreamListener)
myStream.sample(async=False, languages=['en'])
# Intro to Business Data Analysis with Excel
GitHub Repository for the 03/08/2017 Meetup titled "[Business Data Analysis with Excel](https://www.meetup.com/data-science-dojo/events/236198327/)".
These materials make extensive use of the examples documented in the book "[Making Sense of Data](https://www.amazon.com/Making-Sense-Data-Donald-Wheeler/dp/0945320728/)" by Donald J. Wheeler. This book is highly recommended to all Data/Business Analysts interested in expanding the rigor of their analyses.
# datasets
A public repo of datasets
---
title: "Work and fun in Data Science Dojo"
author: your name
date:
output:
pdf_document:
toc: true
---
[linked phrase](http://datasciencedojo.com/)
# My story of Titanic tragedy
## Obtain the data
<!-- You may want to load data here -->
## Overview of the data
<!-- You may want to do the preliminary exploration of the data, using str(), summary(), head(), class(), etc. -->
<!-- Also write down your impressions of the data -->
## Modification of the original data
<!-- You can revise the data you got. -->
<!-- For example: if you feel the feature Survived would be better as a factor, you can do something like: titanic$Survived = factor(titanic$Survived, labels=c("died", "survived")) -->
## First plot of Titanic data
<!-- Make your first plot of Titanic data, and write down what you see from the plot. -->
<!-- Feel free to revise the headers to make this storybook nicer. -->
## Second plot of Titanic data
<!-- Make the 2nd, 3rd, and 4th plots from here. They don't need to be many, but try to make every single one telling. -->
## Your summary of the Titanic data (story of Titanic tragedy)
* First...
* Second...
* Third...
* Fourth...
# Another course in Data Science Dojo
<!-- Keep adding your note, code and thoughts during the bootcamp! -->
# Important contacts in DSD bootcamp
* Raja Iqbal (Instructor)
[email protected]
* Jasmine Wilkerson (Instructor)
[email protected]
* Phuc Duong (Instructor)
[email protected]
* Yuhui Zhang (Instructor)
[email protected]
* Lisa Nicholson
[email protected]
#=======================================================================================
#
# File: CustomerQuery.R
# Author: Dave Langer
# Description: This code illustrates querying a SQL Server database via the RODBC
# package for the "Introduction to R Visualization with Power BI " Meetup
# dated 03/15/2017. More details on the Meetup are available at:
#
# https://www.meetup.com/Data-Science-Dojo-Toronto/events/237952698/
#
# The code in this file leverages data from Microsoft's Wide World
# Importers sample database available at:
#
# https://github.com/Microsoft/sql-server-samples/releases/tag/wide-world-importers-v1.0
#
# NOTE - This file is provided "As-Is" and no warranty regarding its contents is
# offered nor implied. USE AT YOUR OWN RISK!
#
#=======================================================================================
# Uncomment and run these lines of code to install required packages
#install.packages("RODBC")
library(RODBC)
# Open connection using Windows ODBC DSN
dbhandle <- odbcConnect("RConnection")
# Query database for a denormalized view of customer sales data
dataset <- sqlQuery(dbhandle,
"SELECT [C].[CustomerID]
,[C].[CustomerName]
,[C].[BuyingGroupID]
,[C].[DeliveryMethodID]
,[C].[DeliveryCityID]
,[C].[DeliveryAddressLine1]
,[C].[DeliveryAddressLine2]
,[CITY].[CityName]
,[P].[StateProvinceCode]
,[C].[DeliveryPostalCode]
,[CC].[CustomerCategoryName]
,[BG].[BuyingGroupName]
,[O].[OrderID]
,[O].[OrderDate]
,[OL].[OrderLineID]
,[OL].[Quantity]
,[OL].[UnitPrice]
,[OL].[Quantity] * [OL].[UnitPrice] AS [LineTotal]
,[SC].[SupplierCategoryName]
FROM [WideWorldImporters].[Sales].[Customers] C
INNER JOIN [WideWorldImporters].[Sales].[CustomerCategories] CC ON ([C].[CustomerCategoryID] = [CC].[CustomerCategoryID])
LEFT OUTER JOIN [WideWorldImporters].[Sales].[BuyingGroups] BG ON ([C].[BuyingGroupID] = [BG].[BuyingGroupID])
INNER JOIN [WideWorldImporters].[Sales].[Orders] O ON ([C].[CustomerID] = [O].[CustomerID])
INNER JOIN [WideWorldImporters].[Sales].[OrderLines] OL ON ([O].[OrderID] = [OL].[OrderID])
INNER JOIN [WideWorldImporters].[Warehouse].[StockItems] SI ON ([OL].[StockItemID] = [SI].[StockItemID])
INNER JOIN [WideWorldImporters].[Purchasing].[Suppliers] S ON ([SI].[SupplierID] = [S].[SupplierID])
INNER JOIN [WideWorldImporters].[Purchasing].[SupplierCategories] SC ON ([S].[SupplierCategoryID] = [SC].[SupplierCategoryID])
INNER JOIN [WideWorldImporters].[Application].[Cities] CITY ON ([C].[DeliveryCityID] = [CITY].[CityID])
INNER JOIN [WideWorldImporters].[Application].[StateProvinces] P ON ([CITY].[StateProvinceID] = [P].[StateProvinceID])",
stringsAsFactors = FALSE)
#Close DB connection
odbcClose(dbhandle)
# Save off data frame in .RData binary format
save(dataset, file = "CustomerData.RData")
#=======================================================================================
#
# File: CustomerVisualizations.R
# Author: Dave Langer
# Description: This code illustrates R visualizations used in the "Introduction to R
# Visualization with Power BI" Meetup dated 03/15/2017. More details on
# the Meetup are available at:
#
# https://www.meetup.com/Data-Science-Dojo-Toronto/events/237952698/
#
# The code in this file leverages data from Microsoft's Wide World
# Importers sample Data Warehouse available at:
#
# https://github.com/Microsoft/sql-server-samples/releases/tag/wide-world-importers-v1.0
#
# NOTE - This file is provided "As-Is" and no warranty regarding its contents is
# offered nor implied. USE AT YOUR OWN RISK!
#
#=======================================================================================
# Uncomment and run these lines of code to install required packages
#install.packages("dplyr")
#install.packages("lubridate")
#install.packages("ggplot2")
#install.packages("scales")
#install.packages("qcc")
# NOTE - Change your working directory as needed
load("CustomerData.RData")
# Preprocessing to make dataset look like Power BI
library(dplyr)
library(lubridate)
dataset <- dataset %>%
mutate(Year = year(dataset$OrderDate),
Month = month(dataset$OrderDate, label = TRUE))
#=============================================================================
#
# Visualization #1 - Aggregated dynamic bar charts by Customer Category
#
#=============================================================================
library(dplyr)
library(ggplot2)
library(scales)
# Get total revenue by Buying Group, Supplier Category and Customer Category
customer.categories <- dataset %>%
group_by(BuyingGroupName, SupplierCategoryName, CustomerCategoryName) %>%
summarize(TotalRevenue = sum(LineTotal))
# Aggregate data across all supplier categories
all.suppliers <- dataset %>%
group_by(BuyingGroupName, CustomerCategoryName) %>%
summarize(TotalRevenue = sum(LineTotal))
all.suppliers$SupplierCategoryName <- "All Suppliers"
# Add aggregated data
customer.categories <- rbind(customer.categories,
all.suppliers)
# Format visualization title string dynamically
title.str.1 <- paste("Total Revenue for",
dataset$Year[1],
"by Buying Group and Supplier/Customer Categories for",
nrow(dataset),
"Rows of Data",
sep = " ")
# Plot
ggplot(customer.categories, aes(x = CustomerCategoryName, y = TotalRevenue, fill = BuyingGroupName)) +
theme_bw() +
coord_flip() +
facet_grid(BuyingGroupName ~ SupplierCategoryName) +
geom_bar(stat = "identity") +
scale_y_continuous(labels = comma) +
theme(text = element_text(size = 18),
axis.text.x = element_text(size = 12, angle=90, hjust=1)) +
labs(x = "Customer Category",
y = "Total Revenue",
fill = "Buying Group",
title = title.str.1)
#=============================================================================
#
# Visualization #2 - Aggregated Process Behavior Charts
#
#=============================================================================
# Add artificial filtering for example
dataset <- dataset %>%
filter(is.na(BuyingGroupName) &
(Year == 2013 | Year == 2014))
# Power BI code starts here
library(dplyr)
library(qcc)
# Grab year variables
Year1 <- min(dataset$Year)
Year2 <- max(dataset$Year)
# Accumulate totals
totals <- dataset %>%
filter(Year == Year1| Year == Year2 ) %>%
mutate(Month = substr(Month, 1, 3),
MonthNum = match(Month, month.abb)) %>%
group_by(Year, MonthNum, Month) %>%
summarize(TotalRevenue = sum(LineTotal)) %>%
mutate(Label = paste(Month, Year, sep = "-")) %>%
arrange(Year, MonthNum)
# Make labels pretty with dummy vars
Revenue.Group.1 <- totals$TotalRevenue[1:12]
Revenue.Group.2 <- totals$TotalRevenue[13:24]
title.str <- paste("Process Behavior Chart - ", Year1, " and ", Year2, " ",
dataset$CustomerCategoryName[1], " Total Revenue for Buying Group '",
dataset$BuyingGroupName[1], "'", sep = "")
# Plot
blank.super.qcc <- qcc(Revenue.Group.1, type = "xbar.one",
newdata = Revenue.Group.2,
labels = totals$Label[1:12],
newlabels = totals$Label[13:24],
title = title.str,
ylab = "Total Revenue", xlab = "Month-Year")
# Introduction to R Visualizations in Microsoft Power BI
GitHub Repository for the 03/15/2017 and 04/05/2017 Meetups titled "Introduction to R Visualizations in Microsoft Power BI". First held in [Toronto](https://www.meetup.com/Data-Science-Dojo-Toronto/events/237952698/) and subsequently in [Redmond](https://www.meetup.com/data-science-dojo/events/237941790/).
These materials make extensive use of Microsoft's [Wide World Importers](https://github.com/Microsoft/sql-server-samples/releases/tag/wide-world-importers-v1.0) SQL Server 2016 sample database.
Additionally, the following are required to use the files for the Meetup:
* [Power BI Desktop](https://www.microsoft.com/en-us/download/details.aspx?id=45331)
* [The R programming language](https://cran.rstudio.com/)
* The [dplyr](https://cran.r-project.org/web/packages/dplyr/index.html), [lubridate](https://cran.r-project.org/web/packages/lubridate/index.html), [ggplot2](https://cran.r-project.org/web/packages/ggplot2/index.html), [scales](https://cran.r-project.org/web/packages/scales/index.html), and [qcc](https://cran.r-project.org/web/packages/qcc/index.html) packages.
While not required, [RStudio](https://www.rstudio.com/products/rstudio/download/) is highly recommended.
#=======================================================================================
#
# File: IntroToMachineLearning.R
# Author: Dave Langer
# Description: This code illustrates the usage of the caret package for the "An
# Introduction to Machine Learning with R and Caret" Meetup dated
# 06/07/2017. More details on the Meetup are available at:
#
# https://www.meetup.com/data-science-dojo/events/239730653/
#
# NOTE - This file is provided "As-Is" and no warranty regarding its contents is
# offered nor implied. USE AT YOUR OWN RISK!
#
#=======================================================================================
#install.packages(c("e1071", "caret", "doSNOW", "ipred", "xgboost"))
library(caret)
library(doSNOW)
#=================================================================
# Load Data
#=================================================================
train <- read.csv("train.csv", stringsAsFactors = FALSE)
View(train)
#=================================================================
# Data Wrangling
#=================================================================
# Replace missing embarked values with mode.
table(train$Embarked)
train$Embarked[train$Embarked == ""] <- "S"
# Add a feature for tracking missing ages.
summary(train$Age)
train$MissingAge <- ifelse(is.na(train$Age),
"Y", "N")
# Add a feature for family size.
train$FamilySize <- 1 + train$SibSp + train$Parch
# Set up factors.
train$Survived <- as.factor(train$Survived)
train$Pclass <- as.factor(train$Pclass)
train$Sex <- as.factor(train$Sex)
train$Embarked <- as.factor(train$Embarked)
train$MissingAge <- as.factor(train$MissingAge)
# Subset data to features we wish to keep/use.
features <- c("Survived", "Pclass", "Sex", "Age", "SibSp",
"Parch", "Fare", "Embarked", "MissingAge",
"FamilySize")
train <- train[, features]
str(train)
#=================================================================
# Impute Missing Ages
#=================================================================
# Caret supports a number of mechanisms for imputing (i.e.,
# predicting) missing values. Leverage bagged decision trees
# to impute missing values for the Age feature.
# First, transform all features to dummy variables.
dummy.vars <- dummyVars(~ ., data = train[, -1])
train.dummy <- predict(dummy.vars, train[, -1])
View(train.dummy)
# Now, impute!
pre.process <- preProcess(train.dummy, method = "bagImpute")
imputed.data <- predict(pre.process, train.dummy)
View(imputed.data)
train$Age <- imputed.data[, 6]
View(train)
#=================================================================
# Split Data
#=================================================================
# Use caret to create a 70%/30% split of the training data,
# keeping the proportions of the Survived class label the
# same across splits.
set.seed(54321)
indexes <- createDataPartition(train$Survived,
times = 1,
p = 0.7,
list = FALSE)
titanic.train <- train[indexes,]
titanic.test <- train[-indexes,]
# Examine the proportions of the Survived class label across
# the datasets.
prop.table(table(train$Survived))
prop.table(table(titanic.train$Survived))
prop.table(table(titanic.test$Survived))
#=================================================================
# Train Model
#=================================================================
# Set up caret to perform 10-fold cross validation repeated 3
# times and to use a grid search for optimal model hyperparameter
# values.
train.control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
search = "grid")
# Leverage a grid search of hyperparameters for xgboost. See
# the following presentation for more information:
# https://www.slideshare.net/odsc/owen-zhangopen-sourcetoolsanddscompetitions1
tune.grid <- expand.grid(eta = c(0.05, 0.075, 0.1),
nrounds = c(50, 75, 100),
max_depth = 6:8,
min_child_weight = c(2.0, 2.25, 2.5),
colsample_bytree = c(0.3, 0.4, 0.5),
gamma = 0,
subsample = 1)
View(tune.grid)
# Use the doSNOW package to enable caret to train in parallel.
# While there are many package options in this space, doSNOW
# has the advantage of working on both Windows and Mac OS X.
#
# Create a socket cluster using 10 processes.
#
# NOTE - Tune this number based on the number of cores/threads
# available on your machine!!!
#
cl <- makeCluster(10, type = "SOCK")
# Register cluster so that caret will know to train in parallel.
registerDoSNOW(cl)
# Train the xgboost model using 10-fold CV repeated 3 times
# and a hyperparameter grid search to train the optimal model.
caret.cv <- train(Survived ~ .,
data = titanic.train,
method = "xgbTree",
tuneGrid = tune.grid,
trControl = train.control)
stopCluster(cl)
# Examine caret's processing results
caret.cv
# Make predictions on the test set using an xgboost model
# trained on all 625 rows of the training set with the
# optimal hyperparameter values found by the grid search.
preds <- predict(caret.cv, titanic.test)
# Use caret's confusionMatrix() function to estimate the
# effectiveness of this model on unseen, new data.
confusionMatrix(preds, titanic.test$Survived)
# An Introduction to Machine Learning with R and caret
GitHub Repository for the 06/07/2017 Meetup titled "An Introduction to Machine Learning with R and caret". First held in [Redmond, WA](https://www.meetup.com/data-science-dojo/events/239730653/).
These materials make use of the data from Kaggle's [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic) competition.
Additionally, the following are required to use the files for the Meetup:
* [The R programming language](https://cran.rstudio.com/)
* While not required, [RStudio](https://www.rstudio.com/products/rstudio/download/) is highly recommended.
* The [e1071](https://cran.r-project.org/web/packages/e1071/index.html), [caret](https://cran.r-project.org/web/packages/caret/index.html), [doSNOW](https://cran.r-project.org/web/packages/doSNOW/index.html), [ipred](https://cran.r-project.org/web/packages/ipred/index.html), and [xgboost](https://cran.r-project.org/web/packages/xgboost/index.html) packages.
#Set working directory
setwd("C:\\Users\\user4\\Documents\\Mithun")
# Create new image and Save it
save.image("./randomforest4.Rdata")
# Install packages (if necessary, install using the 'Packages' tab)
install.packages("randomForest")
install.packages("caret")
install.packages("rpart")
# clear all the variables
#rm(list=ls())
train=read.csv("./data/train.csv")
test=read.csv("./data/test.csv")
head(test)
# Add "Survived" column to test, to help combine with train data
test$Survived=NA
# Combine train and test
combi=rbind(train,test)
# Convert names to character
combi$Name<-as.character(combi$Name)
# Split 'Name' to isolate a person's title using strsplit
strsplit(combi$Name[1],split='[,.]')
# test of how strsplit works
strsplit(combi$Name[1],split='[,.]')[[1]]
strsplit(combi$Name[1],split='[,.]')[[1]][2]
# apply function to dataset
# This will isolate title for all rows
combi$Title <- sapply(combi$Name, FUN=function(x){strsplit(x,split='[,.]')[[1]][2]})
# remove empty spaces from the 'Title' field
combi$Title=gsub(' ','',combi$Title)
# Review contents of 'Title' field
table(combi$Title)
# Reduce 'Title' contents into fewer categories
combi$Title[combi$Title %in% c('Mme', 'Mlle')] <- 'Mlle'
combi$Title[combi$Title %in% c('Capt', 'Don', 'Major', 'Sir')] <- 'Sir'
combi$Title[combi$Title %in% c('Dona', 'Lady', 'the Countess', 'Jonkheer')] <- 'Lady'
# Change Title to a factor
combi$Title <- factor(combi$Title)
# Combine sibling and parent/child variables into FamilySize variable
combi$FamilySize <- combi$SibSp + combi$Parch + 1
# Identifying families by combining last name and family size
# # identify surname
combi$Surname <- sapply(combi$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][1]})
# # combine with family size
combi$FamilyID <- paste(as.character(combi$FamilySize), combi$Surname, sep="")
# Categorize family sizes of 2 or less as 'Small'
combi$FamilyID[combi$FamilySize <= 2] <- 'Small'
# Review results
table(combi$FamilyID)
# Further consolidate results (some families may have different last names)
famIDs <- data.frame(table(combi$FamilyID))
famIDs <- famIDs[famIDs$Freq <= 2,]
combi$FamilyID[combi$FamilyID %in% famIDs$Var1] <- 'Small'
combi$FamilyID <- factor(combi$FamilyID)
# Splitting this new dataset back into train and test datasets
train <- combi[1:891,]
test <- combi[892:1309,]
# PRESENTATION START
# PRESENTATION START
# The "Age" variable has a few missing values
# To use the randomForest package in R, there should be no missing values
# A quick way of dealing with missing values is to replace them with either the mean or the median of the non-missing values for the variable
# In this example, we are replacing the missing values with a prediction, using decision trees.
library(rpart)
Agefit <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + FamilySize,data=combi[!is.na(combi$Age),], method="anova")
combi$Age[is.na(combi$Age)] <- predict(Agefit, combi[is.na(combi$Age),])
# Check for other missing variables
## 'Embarked' has two blank values
## They are identified using the "which" command
which(combi$Embarked == '')
## rows 62 and 830 have the blank values for Embarked
## they are replaced with the mode of all the values for Embarked, which is 'S'
combi$Embarked[c(62,830)] = "S"
## convert 'Embarked' to a factor
combi$Embarked <- factor(combi$Embarked)
##'Fare' has one NA value
which(is.na(combi$Fare))
#Replace with median
combi$Fare[1044] <- median(combi$Fare, na.rm=TRUE)
## All missing values are taken care of now
# Random Forests in R can only digest factors with up to 32 levels
# If any factor variable has more than 32 levels, the levels need to be redefined to be <= 32 or the variable needs to be converted into a continuous one
# This example will redefine the levels
str(combi$FamilyID)
##increase the definition of Small from 2 to 3
combi$FamilyID2 <- combi$FamilyID
combi$FamilyID2 <- as.character(combi$FamilyID2)
combi$FamilyID2[combi$FamilySize <= 3] <- 'Small'
combi$FamilyID2 <- factor(combi$FamilyID2)
# split the dataset into training and test
train <- combi[1:891,]
test <- combi[892:1309,]
# installing the package
#install.packages('randomForest')
library(randomForest)
# to ensure reproducible results, use the set.seed function
# this will give you the same results every time you run the code
# the number inside is not important
set.seed(415)
fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize +FamilyID2, data=train, importance=TRUE, ntree=2000)
## 'importance=TRUE' allows us to inspect variable importance
## ntree enables specifying how many trees we want to grow
### ALSO LOOK AT NODESIZE AND SAMPSIZE TO SIMPLIFY THE TREE, IN ORDER TO REDUCE COMPLEXITY
# Look at what variables are important
varImpPlot(fit)
## define accuracy and gini
### MeanDecreaseAccuracy: tells us how much accuracy decreases when the variable on the Y-axis is removed.
### 'Title' causes the largest decrease and is therefore the most predictive in nature
### MeanDecreaseGini: measures how pure the terminal nodes are.
### Again, the plot shows the decrease in the Gini value after removing each variable.
### Variable with the highest value has the highest predictive power
### "Title" variable is top for both measures
# Performance Evaluation
## Confusion Matrix
## The 'fit' object contains several components
names(fit)
## to review the confusion matrix
fit[5]
fit$confusion
## Confidence Interval for Accuracy
set.seed(121)
library(caret)
confusionMatrix(fit$predicted,train$Survived)
## Area under the curve
###roc(train$Survived,as.integer(fit$predicted),plot = TRUE,smooth=TRUE)
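## A minimal AUC sketch using the pROC package (an assumption: pROC must be
## installed separately; fit$votes holds the out-of-bag vote fractions per class)
# install.packages("pROC")
library(pROC)
roc.obj <- roc(response = train$Survived, predictor = fit$votes[, 2])
auc(roc.obj)
plot(roc.obj)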
# Tuning the Model
# creating a new data.frame to contain just the predictors necessary and not all the columns
# in the original training dataset
train1=data.frame(Pclass = train$Pclass,Survived =train$Survived, Sex=train$Sex,Age=train$Age,SibSp=train$SibSp,Parch=train$Parch,
Fare =train$Fare, Embarked=train$Embarked,Title=train$Title,FamilySize=train$FamilySize,FamilyID2=train$FamilyID2)
# tune to get best value of mtry
set.seed(121)
tunefit=train(as.factor(Survived)~ ., data=train1,method="rf",metric="Accuracy",tuneGrid=data.frame(mtry=c(2,3,4)))
tunefit
# Prediction
prediction=predict(tunefit, newdata=test)
head(prediction)
save.image("./randomforest4.Rdata")
# install.packages("rvest")
library(rvest)
library(stringr)
#################################################################################
# ingress
#################################################################################
# scrape date, now
now <- Sys.time()
# url to scrape, then download page
url <- "https://www.newegg.com/Desktop-Graphics-Cards/SubCategory/ID-48"
webpage <- read_html(url)
#################################################################################
# parsing elements
#################################################################################
############
# feature: card name
############
card_name <- webpage %>% html_nodes(".item-title") %>% html_text()
################
# feature: current price
################
cur_price <- webpage %>% html_nodes(".price-current strong") %>% html_text()
################
# feature: brand
################
brand <- webpage %>% html_nodes(".item-brand img") %>% html_attr("title")
################
# feature: shipping
################
shipping <- webpage %>% html_nodes(".price-ship") %>% html_text(trim=TRUE)
shipping <- str_replace_all(string = shipping, pattern = " Shipping", replacement = "")
#################################################################################
# data binding
#################################################################################
graphics_cards <- as.data.frame(card_name)
graphics_cards$scrape_date <- now
graphics_cards$cur_price <- cur_price
graphics_cards$brand <- brand
graphics_cards$shipping <- shipping
#################################################################################
# egress
#################################################################################
# change this to your own working folder
setwd("C:/Users/Phuc H Duong/Downloads/newegg")
# write file out as a csv
write.csv(
x = graphics_cards,
file = "graphics_card_report.csv",
row.names = FALSE
)
# History files
.Rhistory
.Rapp.history
# Session Data files
.RData
# Example code in package build process
*-Ex.R
# Output files from R CMD build
/*.tar.gz
# Output files from R CMD check
/*.Rcheck/
# RStudio files
.Rproj.user/
# produced vignettes
vignettes/*.html
vignettes/*.pdf
# OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3
.httr-oauth
# knitr and R markdown default cache directories
/*_cache/
/cache/
# Temporary files created by R markdown
*.utf8.md
*.knit.md
# web_scraping_r
Web scraping in R
# install.packages("rvest")
library(rvest)
library(stringr)
#################################################################################
# ingress
#################################################################################
# scrape date, now
now <- Sys.time()
# url to scrape, then download page
url <- "https://www.newegg.com/Video-Cards-Video-Devices/Category/ID-38"
webpage <- read_html(url)
#################################################################################
# web scraping
#################################################################################
############
# feature: card name
############
card_name <- webpage %>% html_nodes(".item-title") %>% html_text()
################
# feature: current price
################
cur_price <- webpage %>% html_nodes(".price-current strong") %>% html_text()
################
# feature: original price
################
org_price <- webpage %>% html_nodes(".price-was") %>% html_text(trim=TRUE)
# substring search for price, using regular expression.
needle <- "\\d{1,}\\.\\d{1,}"
indexes <- str_locate(string = org_price, pattern = needle)
indexes <- as.data.frame(indexes)
org_price <- str_sub(string=org_price, start = indexes$start, end = indexes$end)
################
# feature: rating
################
# problem: not every graphics card has a rating
# solution: build a table of product id and ratings
# then join with the main table by the same product id
# product id
rate.pid <- webpage %>% html_nodes(".item-rating") %>% html_attr("href")
# format: <url><"Item='><pid><'$'><stuff>
rate.pid.split <- str_split_fixed(rate.pid, pattern = "Item=", n=2)
# result: [1] [2]
# <url> <pid><'&'><stuff>
rate.pid.split <- str_split_fixed(rate.pid.split[,2], pattern="&", n=2)
# result: [1] [2]
# <pid> <stuff>
rate.pid <- rate.pid.split[,1]
# rating
rating <- webpage %>% html_nodes(".item-rating") %>% html_attr("title")
# result: <string><+\s><rating>
rating <- str_split_fixed(string = rating, pattern="\\+\\s", n = 2)[,2]
# result: [1] [2]
# <string\s> <rating>
rating_df <- as.data.frame(cbind(rate.pid, rating))
# combine: the ratings in rating_df are meant to be joined back to the main table by product id
#################################################################################
# data binding
#################################################################################
graphics_cards <- as.data.frame(card_name)
graphics_cards$scrape_date <- now
graphics_cards$cur_price <- cur_price
graphics_cards$org_price <- org_price
graphics_cards$rating <- rating
#######################
# feature: sales price
#######################
# logic: original price - current price = sales discount
# pseudo code: replace NAs in the original price with the current price
# query org missing prices <- query cur prices of org missing prices
na.org_price <- is.na(graphics_cards$org_price)
graphics_cards[na.org_price,"org_price"] <- graphics_cards[na.org_price,"cur_price"]
# cast into numeric
graphics_cards$org_price <- as.numeric(graphics_cards$org_price)
graphics_cards$cur_price <- as.numeric(graphics_cards$cur_price)
# original price - current price = sales discount
graphics_cards$sales_amt <- graphics_cards$org_price - graphics_cards$cur_price
#######################
# feature: discount %
#######################
# logic: divide sales amount by original price
graphics_cards$discount <- graphics_cards$sales_amt / graphics_cards$org_price
#######################
# feature: on_sale
#######################
# logic: if the discount as a percentage of the original price is higher than
# a certain threshold, mark the card as being on sale
# key: 0 = not on sale
# 1 = on sale
threshold <- 0.03
graphics_cards$on_sale <- 0
graphics_cards[graphics_cards$discount > threshold, "on_sale"] <- 1
"card_name","scrape_date","cur_price","brand","shipping"
"EVGA GeForce GTX 1050 FTW GAMING ACX 3.0, 02G-P4-6157-KR, 2GB GDDR5, DX12 OSD Support (PXOC)",2017-06-27 08:31:03,"139","EVGA","$3.99"
"GIGABYTE GeForce GTX 1050 DirectX 12 GV-N1050OC-2GD 2GB 128-Bit GDDR5 PCI Express 3.0 x16 ATX Video Card",2017-06-27 08:31:03,"119","GIGABYTE","$3.99"
"GIGABYTE GeForce GTX 1050 Ti DirectX 12 GV-N105TWF2OC-4GD 4GB 128-Bit GDDR5 PCI Express 3.0 x16 ATX Video Card",2017-06-27 08:31:03,"159","GIGABYTE","$4.99"
"GIGABYTE GeForce GTX 1050 Ti DirectX 12 GV-N105TD5-4GD 4GB 128-Bit GDDR5 PCI Express 3.0 x16 ATX Video Cards",2017-06-27 08:31:03,"139","GIGABYTE","$4.99"
"EVGA GeForce GTX 1080 Ti SC2 HYBRID GAMING, 11G-P4-6598-KR, 11GB GDDR5X, HYBRID & LED, iCX Technology - 9 Thermal Sensors",2017-06-27 08:31:03,"809","EVGA","$4.99"
"MSI GeForce GTX 1050 DirectX 12 GTX 1050 2G OC 2GB 128-Bit GDDR5 PCI Express 3.0 x16 HDCP Ready ATX Video Card",2017-06-27 08:31:03,"103","MSI","$4.99"
"GIGABYTE Radeon RX 460 WINDFORCE OC 2GB GV-RX460WF2OC-2GD",2017-06-27 08:31:03,"109","GIGABYTE","$3.99"
"MSI GeForce GTX 1080 Ti FE DirectX 12 GTX 1080 Ti Founders Edition 11GB 352-Bit GDDR5X PCI Express 3.0 x16 HDCP Ready SLI Support Video Card",2017-06-27 08:31:03,"699","MSI","$5.92"
"SAPPHIRE Radeon RX 460 DirectX 12 100409-2GOC-2L 2GB 128-Bit GDDR5 PCI Express 3.0 CrossFireX Support Video Cards",2017-06-27 08:31:03,"99","Sapphire Tech","$3.99"
"SAPPHIRE Radeon RX 560 DirectX 12 100413P2GOCL 2GB 128-Bit GDDR5 Video Card",2017-06-27 08:31:03,"109","Sapphire Tech","$3.99"
"GIGABYTE Radeon RX 550 DirectX 12 GV-RX550D5-2GD 2GB 128-Bit GDDR5 PCI Express 3.0 x16 ATX Video Card",2017-06-27 08:31:03,"84","GIGABYTE","$3.99"
"VisionTek Radeon RX 560 DirectX 12 900962 2GB 128-Bit GDDR5 PCI Express x16 Video Card",2017-06-27 08:31:03,"119","VisionTek","$3.99"
"EVGA GeForce 8400 GS DirectX 10 512-P3-1301-KR 512MB 32-Bit DDR3 PCI Express 2.0 x16 HDCP Ready Low Profile Ready Video Card",2017-06-27 08:31:03,"29","EVGA","$3.99"
"PNY GeForce GTX 1050 Ti DirectX 12 VCGGTX1050T4PB 4GB 128-Bit GDDR5 PCI Express 3.0 x16 HDCP Ready Video Card",2017-06-27 08:31:03,"154","PNY Technologies, Inc.","$4.99"
"GIGABYTE GeForce GTX 750Ti 4GB WINDFORCE 2X OC EDITION",2017-06-27 08:31:03,"114","GIGABYTE","$3.99"
"PNY GeForce GT 730 DirectX 12 (feature 11_0) VCGGT7301D5LXPB 1GB 64-Bit GDDR5 PCI Express 2.0 Low Profile Ready Video Card",2017-06-27 08:31:03,"64","PNY Technologies, Inc.","$2.99"
"PNY GeForce GTX 950 Graphic Card - 1.02 GHz Core - 1.19 GHz Boost Clock - 2 GB GDDR5 - PCI Express 3.0 x16",2017-06-27 08:31:03,"109","PNY Technologies, Inc.","$3.99"
"EVGA GeForce GT 1030 SC, 02G-P4-6333-KR, 2GB GDDR5, Low Profile",2017-06-27 08:31:03,"79","EVGA","$3.99"
"XFX Radeon R7 240 R7-240A-2TS2 2GB 128-Bit DDR3 PCI Express 3.0 Video Cards",2017-06-27 08:31:03,"59","XFX","$3.99"
"GIGABYTE GeForce GTX 1050 OC Low Profile 2GB Video Card",2017-06-27 08:31:03,"119","GIGABYTE","$3.99"
"EVGA GeForce 8400 GS DirectX 10 01G-P3-1302-LR 1GB 64-Bit DDR3 PCI Express 2.0 x16 HDCP Ready Low Profile Ready Video Card",2017-06-27 08:31:03,"31","EVGA","$2.99"
"GIGABYTE Ultra Durable 2 GeForce GT 740 DirectX 12 GV-N740D5OC-2GI (rev. 3.0) 2GB 128-Bit GDDR5 PCI Express 3.0 x16 ATX Video Card",2017-06-27 08:31:03,"89","GIGABYTE","$3.99"
"EVGA GeForce GT 730 DirectX 12 04G-P3-2739-KR 4GB 128-Bit DDR3 PCI Express 2.0 Video Card",2017-06-27 08:31:03,"77","EVGA","$2.99"
"GIGABYTE Ultra Durable 2 Series GeForce GT 730 DirectX 12 GV-N730-2GI (rev. 1.0) 2GB 128-Bit DDR3 PCI Express 2.0 HDCP Ready ATX Video Card",2017-06-27 08:31:03,"59","GIGABYTE","Free"
"GIGABYTE GeForce GTX 1050 Ti DirectX 12 GV-N105TG1 GAMING-4GD 4GB 128-Bit GDDR5 PCI Express 3.0 x16 ATX Video Card",2017-06-27 08:31:03,"169","GIGABYTE","$4.99"
"PNY GeForce GTX 1080 Ti DirectX 12 VCGGTX1080T11PB-CG2 11GB 352-Bit GDDR5X PCI Express 3.0 x16 Video Card",2017-06-27 08:31:03,"699","PNY Technologies, Inc.","$4.99"
"GIGABYTE Radeon R7 360 DirectX 12 GV-R736OC-2GD (rev. 3.0) 2GB 128-Bit GDDR5 PCI Express 3.0 ATX Video Card",2017-06-27 08:31:03,"93","GIGABYTE","$3.99"
"EVGA GeForce GTX 1050 SSC GAMING ACX 3.0, 02G-P4-6154-KR, 2GB GDDR5, DX12 OSD Support (PXOC)",2017-06-27 08:31:03,"129","EVGA","$3.99"
"SAPPHIRE Radeon RX 550 DirectX 12 100414P2GL 2GB 128-Bit GDDR5 Video Card",2017-06-27 08:31:03,"82","Sapphire Tech","$3.99"
"EVGA GeForce GTX 1050 Ti FTW GAMING ACX 3.0, 04G-P4-6258-KR, 4GB GDDR5, DX12 OSD Support (PXOC)",2017-06-27 08:31:03,"169","EVGA","$3.99"
"PowerColor RED DRAGON Radeon RX 560 DirectX 12 AXRX 560 2GBD5-DHV2/OC 2GB 128-Bit GDDR5 CrossFireX Support ATX Video Card",2017-06-27 08:31:03,"119","PowerColor","$3.99"
"EVGA GeForce GTX 1050 Ti GAMING, 04G-P4-6251-KR, 4GB GDDR5, DX12 OSD Support (PXOC)",2017-06-27 08:31:03,"139","EVGA","$4.99"
"PowerColor RED DRAGON Radeon RX 550 DirectX 12 AXRX 550 2GBD5-DH/OC 2GB 128-Bit GDDR5 PCI Express 3.0 CrossFireX Support ATX Video Card",2017-06-27 08:31:03,"89","PowerColor","$3.99"
"GIGABYTE GeForce GT 1030 Low Profile 2G",2017-06-27 08:31:03,"69","GIGABYTE","$3.99"
"EVGA GeForce GTX 1050 Ti SC GAMING, 04G-P4-6253-KR, 4GB GDDR5, DX12 OSD Support (PXOC)",2017-06-27 08:31:03,"140","EVGA","$4.99"
"PowerColor Radeon R5 230 DirectX 11 AXR5 230 2GBK3-LHE 2GB 64-Bit DDR3 PCI Express 2.1 HDCP Ready CrossFireX Support Low Profile Video Cards",2017-06-27 08:31:03,"34","PowerColor","$3.99"
# Introduction to R Programming for Excel Users
GitHub Repository for the 05/03/2017 Meetup titled "Introduction to R Programming for Excel Users". First held in [Redmond, WA](https://www.meetup.com/data-science-dojo/events/239049571/).
These materials make extensive use of Kaggle's [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic) training dataset for data wrangling, analysis, and visualization examples.
Additionally, the following are required to use the files for the Meetup:
* [Microsoft Excel](https://www.microsoftstore.com/Excel)
* [The R programming language](https://cran.rstudio.com/)
* [RStudio](https://www.rstudio.com/products/rstudio/download/)
* The following R packages: [ggplot2](https://cran.r-project.org/web/packages/ggplot2/index.html) and [dplyr](https://cran.r-project.org/web/packages/dplyr/index.html).
#=========================================================================================
#
# File: titanic.R
# Author: Dave Langer
# Description: This code illustrates R coding used in the "Introduction to R Programming
# for Excel Users" Meetup dated 05/03/2017. More details on
# the Meetup are available at:
#
# https://www.meetup.com/data-science-dojo/events/239049571/
#
# The code in this file leverages data from Kaggle's "Titanic: Machine
# Learning from Disaster" introductory competition:
#
# https://www.kaggle.com/c/titanic
#
# NOTE - This file is provided "As-Is" and no warranty regarding its contents is
# offered nor implied. USE AT YOUR OWN RISK!
#
#=========================================================================================
# Load up Titanic data into a R data frame (i.e., R's version of an Excel table)
titanic <- read.csv("titanic.csv", header = TRUE)
# Add a new feature to the data frame for SurvivedLabel
titanic$SurvivedLabel <- ifelse(titanic$Survived == 1,
"Survived",
"Died")
# Add a new feature (i.e., column) to the data frame for FamilySize
titanic$FamilySize <- 1 + titanic$SibSp + titanic$Parch
View(titanic)
# Look at the data types (i.e., R's version of Excel data formatting for cells)
str(titanic)
# Apply a row filter to the Titanic data frame - return only males
males <- titanic[titanic$Sex == "male",]
# Create summary statistics for male fares
summary(males$Fare)
var(males$Fare)
sd(males$Fare)
sum(males$Fare)
length(males$Fare)
# Ranges work just like in Excel - pick the first 5 rows of data.
first.five <- titanic[1:5,]
# View the first five columns of the first five rows.
View(first.five[, 1:5])
# Use an R package (i.e., the Excel equivalent of an Add-in) to
# make creating powerful visualizations easy.
#install.packages("ggplot2")
library(ggplot2)
ggplot(titanic, aes(x = FamilySize, fill = SurvivedLabel)) +
theme_bw() +
facet_wrap(Sex ~ Pclass) +
geom_histogram(binwidth = 1)
# Use an R package (i.e., the Excel equivalent of an Add-in) to
# make building data pivots easy.
#install.packages("dplyr")
library(dplyr)
pivot <- titanic %>%
group_by(Pclass, Sex, SurvivedLabel) %>%
summarize(AvgFamilySize = mean(FamilySize),
PassengerCount = n()) %>%
arrange(Pclass, Sex, SurvivedLabel)
View(pivot)
Guest Cluster:
HalJordonCluster
SSH Endpoint for Edge Node:
R-Server.HalJordonCluster-ssh.azurehdinsight.net:22
Cluster Login Name:
admin
Cluster Login Password:
DojoGuest123$
SSH Username:
sshguest
SSH Password:
ThisIsATerriblePassword2