Building a Real-time Sentiment Pipeline for Live Tweets using Python, R, & Azure

Requirements

  • Twitter Account + Twitter App setup (https://apps.twitter.com/)
  • Anaconda 3.5 or Python 3.5 installed
  • Azure subscription or free trial account
  • Text editor (I'll be using Sublime Text 3)
  • GitHub.com account (to receive code)
  • PowerBI.com account (for the dashboard portion)
  • Windows with an up-to-date .NET installation (for the testing portion)

Cloning the Repo for Code & Materials

git clone https://www.github.com/datasciencedojo/meetup.git

Folder: Building a Real-time Sentiment Pipeline for Live Tweets using Python, R, & Azure

The Predictive Model

Supervised Twitter Dataset

Preprocessing & Cleaning

  • Azure ML Metadata Editor: Cast sentiment_label as a categorical column
  • Azure ML Group Categorical Values: Map '0' to Negative and '4' to Positive
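
For anyone who wants to try the relabeling step outside Azure ML, here is a minimal Python sketch of the same idea. The sample rows and the (label, text) layout are assumptions made for illustration, not the actual dataset schema.

```python
# Hypothetical sample rows of (sentiment_label, tweet_text).
# The real dataset's column layout is an assumption here.
rows = [
    ("0", "worst day ever"),
    ("4", "loving this weather"),
    ("0", "so tired"),
]

# Raw codes mapped to named categories: '0' is Negative, '4' is Positive.
LABEL_NAMES = {"0": "Negative", "4": "Positive"}


def relabel(rows):
    """Return the rows with each raw sentiment code replaced by its category name."""
    return [(LABEL_NAMES[label], text) for label, text in rows]


labeled = relabel(rows)
print(labeled)
```

A plain dictionary lookup fails loudly with a KeyError if an unexpected code shows up, which is usually what you want while cleaning labels.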

Text Processing

  • Filtering using R
    • Removing stop words (Stop words list)
    • Removing special characters
    • Replace numbers
    • Globally conform to lower case
    • Stemming and lemmatization
    [Figure: Example of Cleansing Stop Words]
  • Create a term frequency matrix for English words
  • Drop the tweet_text column, since it is no longer needed
    • Azure ML's Project Columns module
  • Feature Selection & Filtering
    • Pick only the X most relevant columns/words to train on.
    • Using Azure ML's Filter Based Feature Selection module, set to Pearson correlation, select the 5000 columns most correlated with the label
  • Normalize the Term Frequency Matrix
    • A text processing best practice, though it matters less for tweets
    • Normalize Data module: MinMax for all numeric columns
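
The text processing steps above can be sketched end to end in plain Python. This is a stand-in for illustration only: the workshop does the filtering in R and the rest with Azure ML modules. The stop-word list, the sample tweets, and ranking by absolute correlation are all assumptions, and stemming/lemmatization is left out to keep the sketch short.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the workshop's list).
STOP_WORDS = {"the", "a", "is", "this", "i", "so", "to", "and"}


def clean(text):
    """Lower-case, replace numbers, strip special characters, drop stop words."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)       # replace numbers with a space
    text = re.sub(r"[^a-z\s]", " ", text)  # remove special characters
    return [w for w in text.split() if w not in STOP_WORDS]


def term_frequency_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts vocab[j] in docs[i]."""
    vocab = sorted({w for doc in docs for w in doc})
    counts = [Counter(doc) for doc in docs]
    return vocab, [[c[w] for w in vocab] for c in counts]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def top_k_columns(matrix, labels, k):
    """Indices of the k columns with the largest absolute correlation to the label."""
    n_cols = len(matrix[0])
    scores = [abs(pearson([row[j] for row in matrix], labels)) for j in range(n_cols)]
    return sorted(range(n_cols), key=lambda j: scores[j], reverse=True)[:k]


def min_max(matrix):
    """Rescale every column to [0, 1]; constant columns become 0.0."""
    cols = list(zip(*matrix))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in matrix
    ]


# Hypothetical tweets and labels (1 = Positive, 0 = Negative).
tweets = [
    "I LOVE this phone!!! 10/10",
    "This phone is the worst",
    "love love love it",
    "worst battery, so bad",
]
labels = [1, 0, 1, 0]

docs = [clean(t) for t in tweets]
vocab, tf = term_frequency_matrix(docs)
keep = top_k_columns(tf, labels, k=2)  # the workshop keeps 5000; 2 fits this toy data
selected = [[row[j] for j in keep] for row in tf]
normalized = min_max(selected)
print([vocab[j] for j in keep])
print(normalized)
```

Note that the tweet text itself is not carried forward: only the selected, normalized term counts go on to training, which mirrors dropping the tweet_text column.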

Algorithm Selection

Model Building

  • Train the model
  • Score the trained model against a validation set
  • Evaluate the performance, maximizing accuracy in this case
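
The train / score / evaluate loop above has the same shape whatever algorithm is chosen. Here is a minimal Python sketch of that shape. The "model" is a deliberately trivial majority-class baseline used as a stand-in (it is not the workshop's algorithm), and the data rows, function names, and 30% validation fraction are assumptions.

```python
import random
from collections import Counter


def split(rows, validation_fraction=0.3, seed=0):
    """Shuffle the rows and split them into (training, validation) lists."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = round(len(rows) * (1 - validation_fraction))
    return rows[:cut], rows[cut:]


def train_majority(train_rows):
    """Stand-in 'model': remember the most common training label."""
    return Counter(label for _, label in train_rows).most_common(1)[0][0]


def score(model, rows):
    """Predict a label for every row (the baseline always predicts the same one)."""
    return [model for _ in rows]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical (features, label) rows; the features are unused by this baseline.
rows = [([i], "Positive" if i % 3 else "Negative") for i in range(10)]

train_rows, validation_rows = split(rows)
model = train_majority(train_rows)
predicted = score(model, validation_rows)
actual = [label for _, label in validation_rows]
print("validation accuracy:", accuracy(predicted, actual))
```

A baseline like this is also a useful sanity check: a real model's validation accuracy should beat it.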

Twitter App

Twitter API with Python

Azure Event Hub

Deploy the Model

Hook up Stream Processors