Data Science Dojo
Copyright (c) 2019 - 2020


Level: Advanced
Recommended Use: Regression/Classification Models
Domain: Business/Web

Online News Popularity Data Set

Predict the number of shares in social networks


This advanced level data set has 39644 rows and 61 columns. This dataset summarizes a heterogeneous set of features about articles published by Mashable in a period of two years. This could be used to predict the number of shares of an article in social networks.

This data set is recommended for learning and practicing your skills in exploratory data analysis, data visualization, and regression/classification modelling techniques. It also allows you to practice with large number of features. Feel free to explore the data set with multiple supervised and unsupervised learning techniques. The Following data dictionary gives more details on this data set:


Data Dictionary

Column Position Atrribute Name Definition Data Type Example % Null Ratios
1 URL URL Of The Article (Non-Predictive) Qualitative "http://mashable.com/2013/01/07/amazon-instant-video-browser/" 0
2 Timedelta Timedelta: Days Between The Article Publication And The Dataset Acquisition (Non-Predictive) Quantitative 731 0
3 N_Tokens_Title N_Tokens_Title: Number Of Words In The Title Quantitative 12 0
4 N_Tokens_Content N_Tokens_Content: Number Of Words In The Content Quantitative 219 0
5 N_Unique_Tokens N_Unique_Tokens: Rate Of Unique Words In The Content Quantitative 0.663594467 0
6 N_Non_Stop_Words N_Non_Stop_Words: Rate Of Non-Stop Words In The Content Quantitative 0.999999992 0
7 N_Non_Stop_Unique_Tokens N_Non_Stop_Unique_Tokens: Rate Of Unique Non-Stop Words In The Content Quantitative 0.815384609 0
8 Num_Hrefs Num_Hrefs: Number Of Links Quantitative 4 0
9 Num_Self_Hrefs Num_Self_Hrefs: Number Of Links To Other Articles Published By Mashable Quantitative 2 0
10 Num_Imgs Num_Imgs: Number Of Images Quantitative 1 0
11 Num_Videos Num_Videos: Number Of Videos Quantitative 0 0
12 Average_Token_Length Average_Token_Length: Average Length Of The Words In The Content Quantitative 4.680365297 0
13 Num_Keywords Num_Keywords: Number Of Keywords In The Metadata Quantitative 5 0
14 Data_Channel_Is_Lifestyle Data_Channel_Is_Lifestyle: Is Data Channel 'Lifestyle'? Quantitative 0 0
15 Data_Channel_Is_Entertainment Data_Channel_Is_Entertainment: Is Data Channel 'Entertainment'? Quantitative 1 0
16 Data_Channel_Is_Bus Data_Channel_Is_Bus: Is Data Channel 'Business'? Quantitative 0 0
17 Data_Channel_Is_Socmed Data_Channel_Is_Socmed: Is Data Channel 'Social Media'? Quantitative 0 0
18 Data_Channel_Is_Tech Data_Channel_Is_Tech: Is Data Channel 'Tech'? Quantitative 0 0
19 Data_Channel_Is_World Data_Channel_Is_World: Is Data Channel 'World'? Quantitative 0 0
20 Kw_Min_Min Kw_Min_Min: Worst Keyword (Min. Shares) Quantitative 0 0
21 Kw_Max_Min Kw_Max_Min: Worst Keyword (Max. Shares) Quantitative 0 0
22 Kw_Avg_Min Kw_Avg_Min: Worst Keyword (Avg. Shares) Quantitative 0 0
23 Kw_Min_Max Kw_Min_Max: Best Keyword (Min. Shares) Quantitative 0 0
24 Kw_Max_Max Kw_Max_Max: Best Keyword (Max. Shares) Quantitative 0 0
25 Kw_Avg_Max Kw_Avg_Max: Best Keyword (Avg. Shares) Quantitative 0 0
26 Kw_Min_Avg Kw_Min_Avg: Avg. Keyword (Min. Shares) Quantitative 0 0
27 Kw_Max_Avg Kw_Max_Avg: Avg. Keyword (Max. Shares) Quantitative 0 0
28 Kw_Avg_Avg Kw_Avg_Avg: Avg. Keyword (Avg. Shares) Quantitative 0 0
29 Self_Reference_Min_Shares Self_Reference_Min_Shares: Min. Shares Of Referenced Articles In Mashable Quantitative 496 0
30 Self_Reference_Max_Shares Self_Reference_Max_Shares: Max. Shares Of Referenced Articles In Mashable Quantitative 496 0
31 Self_Reference_Avg_Sharess Self_Reference_Avg_Sharess: Avg. Shares Of Referenced Articles In Mashable Quantitative 496 0
32 Weekday_Is_Monday Weekday_Is_Monday: Was The Article Published On A Monday? Quantitative 1 0
33 Weekday_Is_Tuesday Weekday_Is_Tuesday: Was The Article Published On A Tuesday? Quantitative 0 0
34 Weekday_Is_Wednesday Weekday_Is_Wednesday: Was The Article Published On A Wednesday? Quantitative 0 0
35 Weekday_Is_Thursday Weekday_Is_Thursday: Was The Article Published On A Thursday? Quantitative 0 0
36 Weekday_Is_Friday Weekday_Is_Friday: Was The Article Published On A Friday? Quantitative 0 0
37 Weekday_Is_Saturday Weekday_Is_Saturday: Was The Article Published On A Saturday? Quantitative 0 0
38 Weekday_Is_Sunday Weekday_Is_Sunday: Was The Article Published On A Sunday? Quantitative 0 0
39 Is_Weekend Is_Weekend: Was The Article Published On The Weekend? Quantitative 0 0
40 Lda_00 Lda_00: Closeness To Lda Topic 0 Quantitative 0.500331204 0
41 Lda_01 Lda_01: Closeness To Lda Topic 1 Quantitative 0.37827893 0
42 Lda_02 Lda_02: Closeness To Lda Topic 2 Quantitative 0.040004675 0
43 Lda_03 Lda_03: Closeness To Lda Topic 3 Quantitative 0.041262648 0
44 Lda_04 Lda_04: Closeness To Lda Topic 4 Quantitative 0.040122544 0
45 Global_Subjectivity Global_Subjectivity: Text Subjectivity Quantitative 0.521617145 0
46 Global_Sentiment_Polarity Global_Sentiment_Polarity: Text Sentiment Polarity Quantitative 0.092561983 0
47 Global_Rate_Positive_Words Global_Rate_Positive_Words: Rate Of Positive Words In The Content Quantitative 0.0456621 0
48 Global_Rate_Negative_Words Global_Rate_Negative_Words: Rate Of Negative Words In The Content Quantitative 0.01369863 0
49 Rate_Positive_Words Rate_Positive_Words: Rate Of Positive Words Among Non-Neutral Tokens Quantitative 0.769230769 0
50 Rate_Negative_Words Rate_Negative_Words: Rate Of Negative Words Among Non-Neutral Tokens Quantitative 0.230769231 0
51 Avg_Positive_Polarity Avg_Positive_Polarity: Avg. Polarity Of Positive Words Quantitative 0.378636364 0
52 Min_Positive_Polarity Min_Positive_Polarity: Min. Polarity Of Positive Words Quantitative 0.1 0
53 Max_Positive_Polarity Max_Positive_Polarity: Max. Polarity Of Positive Words Quantitative 0.7 0
54 Avg_Negative_Polarity Avg_Negative_Polarity: Avg. Polarity Of Negative Words Quantitative -0.35 0
55 Min_Negative_Polarity Min_Negative_Polarity: Min. Polarity Of Negative Words Quantitative -0.6 0
56 Max_Negative_Polarity Max_Negative_Polarity: Max. Polarity Of Negative Words Quantitative -0.2 0
57 Title_Subjectivity Title_Subjectivity: Title Subjectivity Quantitative 0.5 0
58 Title_Sentiment_Polarity Title_Sentiment_Polarity: Title Polarity Quantitative -0.1875 0
59 Abs_Title_Subjectivity Abs_Title_Subjectivity: Absolute Subjectivity Level Quantitative 0 0
60 Abs_Title_Sentiment_Polarity Abs_Title_Sentiment_Polarity: Absolute Polarity Level Quantitative 0.1875 0
61 Shares Shares: Number Of Shares Quantitative 593 0

Acknowledgement

This data set has been sourced from the Machine Learning Repository of University of California, Irvine Online News Popularity Data Set (UC Irvine). The UCI page mentions the following publication as the original source of the data set:

K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal