Building a Real-time Sentiment Pipeline for Live Tweets using Python, R, & Azure
Requirements
- Twitter Account + Twitter App setup (https://apps.twitter.com/)
- Anaconda 3.5 or Python 3.5 Installed
- Azure subscription or free trial account (30-day free trial)
- Azure Machine Learning Studio workspace
- Text editor (I'll be using Sublime Text 3)
- Github.com account (to receive code)
- PowerBI.com account (for Dashboard portion)
- Windows with an up-to-date .NET installation (for testing portion)
Cloning the Repo for Code & Materials
git clone https://www.github.com/datasciencedojo/meetup.git
Folder: Building a Real-time Sentiment Pipeline for Live Tweets using Python, R, & Azure
The Predictive Model
Supervised Twitter Dataset
- Azure ML Reader Module:
- Data source: Azure Blob Storage
- Authentication type: PublicOrSAS
- URI: http://azuremlsampleexperiments.blob.core.windows.net/datasets/Sentiment140.tenPercent.sample.tweets.tsv
- File format: TSV
- URI has header row: Checked
- Import and save dataset
Preprocessing & Cleaning
- Azure ML Metadata Editor: Cast sentiment_label as categorical
- Azure ML Group Categorical Values: Cast '0' as Negative and '4' as Positive
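The grouping step above can be sketched outside Azure ML in a few lines of Python. This is only an illustration of the mapping, not the module's implementation; the names `LABEL_NAMES` and `group_sentiment_label` and the sample rows are made up for the example.

```python
# Hypothetical stand-in for the Group Categorical Values step:
# raw label '0' becomes Negative, raw label '4' becomes Positive.
LABEL_NAMES = {"0": "Negative", "4": "Positive"}


def group_sentiment_label(raw_label):
    """Return the grouped category for a raw sentiment_label value."""
    key = str(raw_label).strip()
    if key not in LABEL_NAMES:
        raise ValueError(f"unexpected sentiment_label: {raw_label!r}")
    return LABEL_NAMES[key]


# Made-up sample rows in (sentiment_label, tweet_text) order.
rows = [("4", "loving this weather"), ("0", "my flight got cancelled")]
labeled = [(group_sentiment_label(label), text) for label, text in rows]
```

Raising on unknown labels makes it obvious if the dataset contains a value other than '0' or '4'.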
Text Processing
- Filtering using R
- Removing stop words (Stop words list)
- Removing special characters
- Replace numbers
- Globally conform to lower case
- Stemming and lemmatization
- Example of Cleansing Stop Words
- Create a term frequency matrix for English words
- Azure ML's Feature Hashing Module
- Drop the tweet_text column, since it is no longer needed
- Azure ML's Project Columns module
- Feature Selection & Filtering
- Pick only the X most relevant columns/words to train on.
- Using Azure ML's Filter Based Feature Selection module, set to Pearson's correlation, select the top 5000 most correlated columns
- Normalize the Term Frequency Matrix
- A text processing best practice, though it matters less for tweets
- Normalize Data Module: Min/Max for all numeric columns
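To make the text processing steps concrete, here is a minimal Python sketch of cleaning, feature hashing, and min/max normalization. It is an approximation under several assumptions: the pipeline itself does the filtering in R and uses Azure ML modules, the stop-word list here is a tiny made-up sample, `zlib.crc32` stands in for whatever hash the Feature Hashing module uses, and stemming, lemmatization, and the Pearson-based feature selection are left out.

```python
import re
import zlib

# Tiny illustrative stop-word list (assumption); a real list is much longer.
STOP_WORDS = {"a", "an", "and", "is", "the", "to", "of", "in", "it"}


def clean_tweet(text):
    """Lower-case, replace numbers, strip special characters, drop stop words."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)        # replace numbers with a space
    text = re.sub(r"[^a-z\s]", " ", text)   # remove special characters
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def hash_features(tokens, n_buckets=16):
    """Hashing trick: count each token into one of n_buckets columns."""
    vector = [0] * n_buckets
    for tok in tokens:
        vector[zlib.crc32(tok.encode("utf-8")) % n_buckets] += 1
    return vector


def min_max_normalize(matrix):
    """Rescale every column to [0, 1]; constant columns become 0.0."""
    columns = list(zip(*matrix))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in matrix
    ]


tweets = ["The 2 best days of 2016!!!", "It is raining and raining in Seattle"]
tokens = [clean_tweet(t) for t in tweets]
features = min_max_normalize([hash_features(t) for t in tokens])
```

Once the hashed columns exist, the raw tweet_text is no longer needed, which is why the outline drops it before training.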
Algorithm Selection
- Algorithm Cheat Sheet
- Beginner's Guide to Choosing Algorithms
- Azure ML's Support Vector Machines
- Support Vector Machines in General
Model Building
- Train the model
- Score the trained model against a validation set
- Evaluate the performance, maximizing accuracy in this case
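Since accuracy is the metric being maximized, it may help to see what the evaluation step computes. The sketch below is plain Python rather than the Azure ML Evaluate Model module; the function names and the four sample labels are made up for illustration.

```python
def accuracy(actual, predicted):
    """Fraction of rows where the predicted label equals the actual label."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)


def confusion_counts(actual, predicted, positive="Positive"):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


# Made-up validation labels and model scores.
actual = ["Positive", "Negative", "Positive", "Negative"]
predicted = ["Positive", "Negative", "Negative", "Negative"]

print(accuracy(actual, predicted))          # 0.75
print(confusion_counts(actual, predicted))  # {'tp': 1, 'fp': 0, 'tn': 2, 'fn': 1}
```

Accuracy is a reasonable target only when the two classes are roughly balanced; the confusion counts show where the errors fall.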
Twitter App
- Creating a Twitter Account
- Creating a Twitter App
- Get your Twitter app's OAuth keys and tokens.
Twitter API with Python
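As a rough sketch of the Python side, the snippet below turns one raw JSON line from a tweet stream into the slim record the rest of the pipeline needs. It assumes tweets arrive as JSON objects with `id_str`, `created_at`, and `text` fields; the function name `extract_tweet` and the sample payload are made up, and connecting to the stream with your OAuth keys is not shown.

```python
import json


def extract_tweet(raw_line):
    """Return a slim dict for one streamed tweet, or None for non-tweet lines."""
    raw_line = raw_line.strip()
    if not raw_line:              # blank keep-alive line
        return None
    payload = json.loads(raw_line)
    if "text" not in payload:     # notices that are not tweets
        return None
    return {
        "id": payload.get("id_str"),
        "created_at": payload.get("created_at"),
        "tweet_text": payload["text"],
    }


# Made-up sample payload for illustration.
sample = '{"id_str": "1", "created_at": "Fri Jan 01 00:00:00 +0000 2016", "text": "hello world"}'
record = extract_tweet(sample)
```

Keeping only a few fields keeps the messages pushed downstream small.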
Azure Event Hub
- Create a Service Bus Namespace
- Create an Azure Event Hub
- Create a send key (to push data to)
- Create a manage key (stream processor)
- Create a listen key (to subscribe to)
- Pushing to Azure Event Hub
- Viewing inside of an Azure Event Hub
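One way to push with the send key is the Event Hubs REST endpoint, which expects a Shared Access Signature token in the Authorization header. The sketch below builds that token and the request pieces using only the standard library. The namespace `ns`, hub name `hub`, policy name `send`, and key `secret` are placeholders, and the request is assembled but not sent; this is one possible approach, not necessarily the one used in the repo.

```python
import base64
import hashlib
import hmac
import json
import time
import urllib.parse


def make_sas_token(resource_uri, key_name, key_value, ttl_seconds=3600, now=None):
    """Build a Shared Access Signature token for an Event Hub resource URI."""
    start = time.time() if now is None else now
    expiry = int(start + ttl_seconds)
    encoded_uri = urllib.parse.quote_plus(resource_uri)
    # The string to sign is the URL-encoded URI, a newline, and the expiry time.
    to_sign = f"{encoded_uri}\n{expiry}".encode("utf-8")
    digest = hmac.new(key_value.encode("utf-8"), to_sign, hashlib.sha256).digest()
    signature = urllib.parse.quote_plus(base64.b64encode(digest))
    return (f"SharedAccessSignature sr={encoded_uri}"
            f"&sig={signature}&se={expiry}&skn={key_name}")


def build_send_request(namespace, hub_name, token, event):
    """Return (url, headers, body) for a REST send; nothing is sent here."""
    url = f"https://{namespace}.servicebus.windows.net/{hub_name}/messages"
    headers = {
        "Authorization": token,
        "Content-Type": "application/atom+xml;type=entry;charset=utf-8",
    }
    body = json.dumps(event).encode("utf-8")
    return url, headers, body


# Placeholder values (assumptions); substitute your own namespace, hub, and send key.
token = make_sas_token("https://ns.servicebus.windows.net/hub", "send", "secret", now=0)
url, headers, body = build_send_request("ns", "hub", token, {"tweet_text": "hello world"})
```

The three return values can be passed to `urllib.request.Request(url, data=body, headers=headers)` to perform the POST. Using the send key here, rather than the manage key, limits what a leaked credential can do.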
Deploy the Model
Hook up Stream Processors