Sentiment-Driven Market Analysis with Twitter and Reddit

By: Asher Gilani, Hasan Muhammad, Arjun Rajkumar

[Figure: twitter_v_reddit.png]

Objectives

The rise of social media has given us access to a huge volume of discussion on almost any topic imaginable. One such topic is company-related news, with social media discussion acting as a proxy for public sentiment towards a given brand. We want to use discussion on social media to try and predict the stock market performance of big companies. We will use the sentiments shared towards companies on Twitter as an initial benchmark to compare against stock market movement. Then, we will look at investing communities on Reddit to see how discussions on another platform compare. If a correlation is found, we will have identified a valuable source for companies to get instantaneous snapshots of how they are viewed by the public and how that image affects their monetary value.

Roadmap

  • First explore/clean up the Kaggle Twitter Dataset
  • Query the Yahoo Finance API for stock prices matching the Twitter dataset
  • Use NLP and VADER sentiment analysis to generate a sentiment score for each tweet
  • Use the Twitter API to collect more recent tweets and do the same NLP processing
  • Examine sentiment-stock correlations
  • Build linear regression models to predict stock prices from sentiment scores
  • Use the Reddit API to examine stock sentiment

Data Sources

In [1]:
# General data manipulation libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from datetime import datetime, timedelta

# API libraries/credentials
import config
import yfinance as yf
import tweepy
import praw

# NLP libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
nltk.download('punkt')
nltk.download('stopwords')

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# ML libraries
from sklearn import linear_model as lm
import statsmodels
import statsmodels.api
import statsmodels.formula.api  # used below for OLS p-values
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\asher\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\asher\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Exploration of Twitter Kaggle Dataset

This dataset contains financial tweets regarding stocks traded on the NYSE and NASDAQ, including companies in the S&P 500.

In [2]:
skiprows = [730, 2835, 3057, 3112, 3193, 3204, 3254, 3519, 4077, 4086, 4087, 4498]
tweet_df = pd.read_csv('./datasets/stockerbot-export.csv', skiprows=skiprows, parse_dates=['timestamp'])

# Drop last five rows since they seem to be in a different language
tweet_df = tweet_df[:-5]
tweet_df.head()
Out[2]:
id text timestamp source symbols company_names url verified
0 1019696670777503700 VIDEO: “I was in my office. I was minding my o... 2018-07-18 21:33:26+00:00 GoldmanSachs GS The Goldman Sachs https://twitter.com/i/web/status/1019696670777... True
1 1019709091038548000 The price of lumber $LB_F is down 22% since hi... 2018-07-18 22:22:47+00:00 StockTwits M Macy's https://twitter.com/i/web/status/1019709091038... True
2 1019711413798035500 Who says the American Dream is dead? https://t... 2018-07-18 22:32:01+00:00 TheStreet AIG American https://buff.ly/2L3kmc4 True
3 1019716662587740200 Barry Silbert is extremely optimistic on bitco... 2018-07-18 22:52:52+00:00 MarketWatch BTC Bitcoin https://twitter.com/i/web/status/1019716662587... True
4 1019718460287389700 How satellites avoid attacks and space junk wh... 2018-07-18 23:00:01+00:00 Forbes ORCL Oracle http://on.forbes.com/6013DqDDU True
In [3]:
ax = tweet_df.company_names.value_counts().sort_values(ascending=False).head(10).plot.barh(figsize=(10, 6))
ax.set_title('Most Commonly Tweeted Stocks')
ax.set_xlabel('Number of Times Tweeted')
Out[3]:
Text(0.5, 0, 'Number of Times Tweeted')

It is nice to see that this dataset has many tweets (around 100) for the most frequent companies. This will help us in determining the general sentiment around these companies over time to compare against stock prices. Many established big names, such as 21st Century Fox, Alphabet (parent company of Google), and Netflix, also appear.

In [4]:
fig, (ax1, ax2) = plt.subplots(1, 2)

tweet_df.source.value_counts().head(10).plot.barh(ax=ax1, figsize=(15, 10))
ax1.set_title('Most Common Tweet Sources')
ax1.set_xlabel('Number of Times Tweeted')

tweet_df.source.value_counts().reset_index().plot(ax=ax2)
ax2.set_title('Distribution of Number of Tweets Per Source')
ax2.set_ylabel('Number of Tweets')
ax2.set_xlabel('Twitter Users')
Out[4]:
Text(0.5, 0, 'Twitter Users')

This is also a nice feature of our data. The bar chart on the left suggested that our dataset might be dominated by a few users who tweet a lot, which would bias any correlations we find. But the distribution graph on the right shows that only a few users tweet very frequently; most users in this dataset tweet only a few times, which gives us a better sample. Later we will use the Twitter API to get an even broader collection of tweet sources.

In [5]:
ax = tweet_df.groupby([tweet_df.timestamp.dt.year, tweet_df.timestamp.dt.month])['id'].count().plot(kind='bar')
ax.set_title('Distribution of Times')
ax.set_xlabel('Year/Month')
ax.set_ylabel('Count')
Out[5]:
Text(0, 0.5, 'Count')

One unfortunate thing about this dataset is that it primarily includes tweets from July 2018, along with what appear to be a few tweets from February 2018. We will likely need to pull in more data using the Twitter API, but this dataset provides a good starting place.

Collecting Stock Data Through the Yahoo Finance API

We were originally going to compare this Twitter dataset against some Kaggle stocks datasets. Given that our Twitter dataset only covers one month and the Kaggle datasets are very large, we are instead going to use the yfinance Python module to collect data for this specific month.

In [6]:
# Lets identify the top 30 stocks (based on frequency) from our Twitter dataset and use that with yfinance
common_tickers = tweet_df.symbols.value_counts().sort_values(ascending=False).head(30)
common_tickers.head()
Out[6]:
NFLX    101
AMAT    100
MTB     100
HON     100
MOMO    100
Name: symbols, dtype: int64
In [7]:
# Queries Yahoo Finance API for these stocks on the month of July 2018
stocks_df = yf.download(tickers=list(common_tickers.index), start='2018-07-01', end='2018-07-31')
stocks_df.head()
[*********************100%***********************]  30 of 30 completed

3 Failed downloads:
- OMG: No data found for this date range, symbol may be delisted
- ARNC: Data doesn't exist for startDate = 1530417600, endDate = 1533009600
- FOX: Data doesn't exist for startDate = 1530417600, endDate = 1533009600
Out[7]:
Adj Close ... Volume
ABBV AMAT ARNC BLK D DFS ES ESS FOX GPS ... MOMO MTB NFLX NTNX OMG ON PBR SQQQ TEL UPS
Date
2018-07-02 00:00:00 80.599098 43.812244 NaN 465.989349 61.225769 66.760170 55.400024 220.294998 NaN 28.312382 ... 4815300 732600 8142500 2096200 NaN 4726700 14640200 749600 1671100 2252300
2018-07-03 00:00:00 81.031921 42.968220 NaN 458.311768 61.995003 66.187805 55.869362 220.294998 NaN 28.767450 ... 2118100 383100 5280300 1296300 NaN 2400500 15707700 509200 815400 1145700
2018-07-05 00:00:00 81.785034 43.582054 NaN 458.955444 62.379616 66.225349 56.057095 222.607773 NaN 28.196383 ... 3498100 680100 8448900 2332100 NaN 7132400 21271900 841100 1521900 2866800
2018-07-06 00:00:00 83.897171 44.243843 NaN 458.684906 62.630074 66.093971 56.610909 223.838806 NaN 27.893003 ... 4059300 829500 8629600 2291100 NaN 4202200 13076300 837000 986000 2022400
2018-07-09 00:00:00 84.001045 44.617893 NaN 470.261902 60.894825 67.576477 54.874374 222.589127 NaN 27.500397 ... 5586900 1108800 11127500 2527400 NaN 3749400 18007000 617600 1140300 2103600

5 rows × 180 columns

Cleaning Yahoo Finance Data

In [8]:
# Only want Open, Close, and Volume. Drop the rest
stocks_df.drop(['Adj Close', 'Low', 'High'], axis=1, inplace=True)

# Columns are multi-indexed, so flatten this
stocks_df.columns = ['_'.join(multi_col) for multi_col in stocks_df.columns]

# Drop FOX, OMG, and ARNC cols since we weren't able to get data for them
stocks_df.drop(list(stocks_df.filter(regex='(FOX|OMG|ARNC)$')), axis=1, inplace=True)

stocks_df.head()
Out[8]:
Close_ABBV Close_AMAT Close_BLK Close_D Close_DFS Close_ES Close_ESS Close_GPS Close_GRPN Close_HLT ... Volume_MHK Volume_MOMO Volume_MTB Volume_NFLX Volume_NTNX Volume_ON Volume_PBR Volume_SQQQ Volume_TEL Volume_UPS
Date
2018-07-02 00:00:00 93.110001 45.680000 499.519989 68.449997 71.150002 59.020000 236.220001 31.730000 86.199997 79.300003 ... 839600 4815300 732600 8142500 2096200 4726700 14640200 749600 1671100 2252300
2018-07-03 00:00:00 93.610001 44.799999 491.290009 69.309998 70.540001 59.520000 236.220001 32.240002 85.199997 79.449997 ... 405700 2118100 383100 5280300 1296300 2400500 15707700 509200 815400 1145700
2018-07-05 00:00:00 94.480003 45.439999 491.980011 69.739998 70.580002 59.720001 238.699997 31.600000 87.599998 79.690002 ... 414600 3498100 680100 8448900 2332100 7132400 21271900 841100 1521900 2866800
2018-07-06 00:00:00 96.919998 46.130001 491.690002 70.019997 70.440002 60.310001 240.020004 31.260000 87.199997 80.559998 ... 479100 4059300 829500 8629600 2291100 4202200 13076300 837000 986000 2022400
2018-07-09 00:00:00 97.040001 46.520000 504.100006 68.080002 72.019997 58.459999 238.679993 30.820000 96.599998 80.940002 ... 426000 5586900 1108800 11127500 2527400 3749400 18007000 617600 1140300 2103600

5 rows × 81 columns

This is a lot easier to work with. The data is not perfectly tidy, since the tickers are embedded in the column names, but every price is lined up by date, which makes graphing easier: we can use a regex to select the columns corresponding to a specific ticker, and the resulting subset is effectively tidy. Later, when parsing the Twitter API results, we transform the stock data into a properly tidy format.
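For example, a minimal sketch of that regex selection (illustrative usage only, assuming the flattened column names from the cell above):

# Grab every column belonging to a single ticker, e.g. NFLX
nflx_df = stocks_df.filter(regex='_NFLX$')
nflx_df.head()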

NLP and Sentiment Analysis on Tweets

Now that we have some tweets and stock price information for the corresponding time period, we can begin to do some analysis. We will start by doing some NLP on the Tweets and then we can compare them to changes in stock price over time.

We will be using VADER (Valence Aware Dictionary and Sentiment Reasoner) to figure out the sentiment. This gives us a reliable way of determining sentiment without labeling our own data and generating models. This rule-based analysis tool was explicitly designed for social media data which is perfect for our scenario.
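As a quick illustration of what VADER returns (a toy example, separate from the pipeline below):

# polarity_scores returns a dict with 'neg', 'neu', 'pos', and a 'compound' score in [-1, 1]
demo = SentimentIntensityAnalyzer()
print(demo.polarity_scores('this stock is doing great'))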

In [9]:
# First filter for Tweets talking about our common tickers
relevant_tickers = tweet_df[tweet_df.symbols.isin(common_tickers.index)].reset_index()

print(relevant_tickers.shape)
relevant_tickers.head()
(2906, 9)
Out[9]:
index id text timestamp source symbols company_names url verified
0 8 1019720723441635300 Senate wants emergency alerts to go out throug... 2018-07-18 23:09:00+00:00 TechCrunch NFLX Netflix https://tcrn.ch/2L8DsgT True
1 21 1019737727477174300 Templeton and BlackRock say now's the time to ... 2018-07-19 00:16:34+00:00 business BLK BlackRock https://bloom.bg/2NmXZLe True
2 46 1016109247048945700 In 2 days I doubled my account with the option... 2018-07-08 23:58:17+00:00 HarryPickens2 HSY The Hershey Company http://Optionsbypros.com False
3 82 1016153595522281500 Weekly Recap: $HSY price moved 4.28% since pos... 2018-07-09 02:54:31+00:00 aboutheoptions HSY The Hershey Company https://twitter.com/i/web/status/1016153595522... False
4 108 1016208393378283500 $AR $CGC $DAVE $GNC $HSY $ICON $KMB $M $PEP $P... 2018-07-09 06:32:15+00:00 SeekingAlpha HSY The Hershey Company https://seekingalpha.com/article/4185993-shari... False

Cleaning Up Tweets with NLTK

In [10]:
tt = TweetTokenizer()
stop_words = set(stopwords.words('english'))

def clean_tweet(s):
    # Remove punctuation and stop words (e.g. 'the', 'is', 'which');
    # the stop word set is built once above instead of once per word
    return ' '.join(wd for wd in tt.tokenize(s.lower()) if wd.isalnum() and wd not in stop_words)

# Sample usage
print(relevant_tickers.text[1])
print(clean_tweet(relevant_tickers.text[1]))
Templeton and BlackRock say now's the time to buy emerging markets https://t.co/h4z90nxqwH https://t.co/6FkVrZlu3j
templeton blackrock say time buy emerging markets
In [11]:
# Now apply this cleanup to every tweet
relevant_tickers.text = relevant_tickers.text.apply(clean_tweet)
relevant_tickers.head()
Out[11]:
index id text timestamp source symbols company_names url verified
0 8 1019720723441635300 senate wants emergency alerts go netflix spoti... 2018-07-18 23:09:00+00:00 TechCrunch NFLX Netflix https://tcrn.ch/2L8DsgT True
1 21 1019737727477174300 templeton blackrock say time buy emerging markets 2018-07-19 00:16:34+00:00 business BLK BlackRock https://bloom.bg/2NmXZLe True
2 46 1016109247048945700 2 days doubled account option trades xrx mrtx 2018-07-08 23:58:17+00:00 HarryPickens2 HSY The Hershey Company http://Optionsbypros.com False
3 82 1016153595522281500 weekly recap hsy price moved since post next s... 2018-07-09 02:54:31+00:00 aboutheoptions HSY The Hershey Company https://twitter.com/i/web/status/1016153595522... False
4 108 1016208393378283500 ar cgc dave gnc hsy icon kmb pep pg rrc smrt wtw 2018-07-09 06:32:15+00:00 SeekingAlpha HSY The Hershey Company https://seekingalpha.com/article/4185993-shari... False

Perform Sentiment Analysis with VADER

In [12]:
sent = SentimentIntensityAnalyzer()
relevant_tickers['sentiment'] = relevant_tickers.text.apply(lambda s: sent.polarity_scores(s)['compound'])

# a compound sentiment value is positive if >= 0.05, negative if <= -0.05, neutral if in between
relevant_tickers.head()
Out[12]:
index id text timestamp source symbols company_names url verified sentiment
0 8 1019720723441635300 senate wants emergency alerts go netflix spoti... 2018-07-18 23:09:00+00:00 TechCrunch NFLX Netflix https://tcrn.ch/2L8DsgT True -0.3818
1 21 1019737727477174300 templeton blackrock say time buy emerging markets 2018-07-19 00:16:34+00:00 business BLK BlackRock https://bloom.bg/2NmXZLe True 0.0000
2 46 1016109247048945700 2 days doubled account option trades xrx mrtx 2018-07-08 23:58:17+00:00 HarryPickens2 HSY The Hershey Company http://Optionsbypros.com False 0.0000
3 82 1016153595522281500 weekly recap hsy price moved since post next s... 2018-07-09 02:54:31+00:00 aboutheoptions HSY The Hershey Company https://twitter.com/i/web/status/1016153595522... False -0.2960
4 108 1016208393378283500 ar cgc dave gnc hsy icon kmb pep pg rrc smrt wtw 2018-07-09 06:32:15+00:00 SeekingAlpha HSY The Hershey Company https://seekingalpha.com/article/4185993-shari... False 0.0000

So now each tweet has a value ranging from -1 to 1 representing how positive or negative its sentiment is. Our current worry is that many of these tweets are factual in nature, meaning the sentiment score may not be indicative of stock performance. Later we will pull more opinionated tweets from the last week via the Twitter API and compare their sentiment to stock performance. For now, let's take a look at the sentiment distribution.
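For reference, the thresholds noted in the cell above can be turned into discrete labels with a small helper (a hypothetical convenience we do not use in the analysis below):

# Bucket a VADER compound score using the conventional cutoffs
def label_sentiment(compound):
    if compound >= 0.05:
        return 'positive'
    if compound <= -0.05:
        return 'negative'
    return 'neutral'

relevant_tickers['sentiment_label'] = relevant_tickers.sentiment.apply(label_sentiment)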

In [13]:
# Violin plot to show sentiment distribution
ticker_subset = relevant_tickers[relevant_tickers.symbols.isin(['NFLX', 'IP', 'SQQQ', 'DFS', 'UPS'])]
ax = sns.violinplot(x='symbols', y='sentiment', data=ticker_subset)
ax.set_title('Company Sentiment Distribution July 2018')
ax.set_xlabel('Companies')
ax.set_ylabel('Sentiment')
Out[13]:
Text(0, 0.5, 'Sentiment')

Looking at the sentiment distributions, we see that these companies generally have a neutral sentiment score, which will not be very interesting for finding correlations to stock prices. SQQQ in particular has a lot of neutral tweets, which makes sense since it is an ETF tied to an index rather than a single company, and it is harder to have an opinion about a basket of stocks. Even with all the neutral tweets, this data still offers unique insight into how people feel about these companies, since each company has a slightly different distribution. It is also interesting that while the companies do have clusters of positive tweets, there are relatively few negative ones in the dataset.

Sentiment vs Stock Price Performance

Now let's see how these sentiments over time correlate with stock performance.

In [14]:
fig, axs = plt.subplots(2, 3)
syms = relevant_tickers.symbols
fig.tight_layout(pad=0.25)

# Graph Sentiment vs Time for 3 companies
axs[0, 0].set_ylim([-0.8, 0.8]); axs[0, 1].set_ylim([-0.8, 0.8]); axs[0, 2].set_ylim([-0.8, 0.8])
sentiment_kwargs = {'x': 'timestamp', 'y': 'sentiment', 'figsize': (20, 16), 'ylabel': 'Sentiment', 'xlabel': 'Timestamp'}

relevant_tickers[syms == 'NFLX'].plot.line(ax=axs[0, 0], title='Netflix Sentiment', **sentiment_kwargs)
relevant_tickers[syms == 'MOMO'].plot.line(ax=axs[0, 1], title='Momo Sentiment', **sentiment_kwargs)
relevant_tickers[syms == 'HON'].plot.line(ax=axs[0, 2], title='Honeywell Intl Sentiment', **sentiment_kwargs)

# Graph stock performance vs Time for 3 companies
price_kwargs = {'xlabel': 'Timestamp', 'ylabel': 'Close Prices', 'use_index': True}

stocks_df.plot.line(y='Close_NFLX', ax=axs[1, 0], title='Netflix Close Prices', **price_kwargs)
stocks_df.plot.line(y='Close_MOMO', ax=axs[1, 1], title='Momo Close Prices', **price_kwargs)
stocks_df.plot.line(y='Close_HON', ax=axs[1, 2], title='Honeywell Intl Close Prices', **price_kwargs)
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x28ef263fa20>

We use close prices here since we want to allow time for the sentiment expressed in the tweets to be reflected in the company's stock price.

Unfortunately, it does not look like there is any real correlation between the sentiment and stock prices here. We hypothesize that the problem lies with our current Twitter dataset: these tweets do not express significant sentiment, yet we are still tagging them with a sentiment value. Now let's move on to a similar analysis with tweets pulled from the Twitter API to see if this remedies the issue.

Twitter API

The Kaggle Twitter dataset was interesting and provided a starting place for analysis. Since this data was already labeled with information such as the source and the stocks it mentioned, it was easy to compare against information pulled from the Yahoo Finance API. To try and find more insight into indicators of stock performance, we will use the Twitter API to get tweets that directly mention the companies we are looking at. Hopefully, this will produce more opinionated data, which will make better use of our sentiment analysis.

In [15]:
# Connect to the Twitter API using our dev account credentials
auth = tweepy.AppAuthHandler(config.api_key, config.api_secret)
api = tweepy.API(auth)
In [16]:
# This will be the data structure we build our dataframe from later
search_res = [["Company", "Symbol", "Date", "Tweet"]]

# Get today's date and all the companies we will want to search for on Twitter
date = datetime.today()
company_handle = ["Apple", "exxonmobil", "Walmart", "cvspharmacy", "CapitalOne"]
companies = ["Apple", "Exxon", "Walmart", "CVS", "Capital One"]
tickers = ["AAPL", "XOM", "WMT", "CVS", "COF"]

The five companies chosen here (Apple, Exxon Mobil, Walmart, CVS, and Capital One) were selected because they are among the biggest in the US and thus should have a lot of discussion around them, giving us a larger pool of tweets to look at. They also represent a diverse variety of business sectors so we aren't exclusively looking at, for example, tech or finance companies but instead the business world at large.

Query the Twitter API

In [17]:
for i in range(len(companies)):
    days_ago = 6
    
    while days_ago >= 0:
        # Construct our Twitter query for this exact day and tweets that mention the company
        query = "to:" + company_handle[i] + " since:" + (date - timedelta(days=days_ago)).strftime('%Y-%m-%d') + " until:" + (date - timedelta(days=days_ago-1)).strftime('%Y-%m-%d')
        tweets = api.search(q=query, lang="en", count=50)
        
        for twt in tweets:
            # Add to our dataframe proxy
            search_res.append([companies[i], tickers[i], (date - timedelta(days=days_ago)).strftime('%Y-%m-%d'), twt.text])
        
        days_ago -= 1
        
# Showing a slice of these tweets to avoid those with curse words
search_res[4:7]
Out[17]:
[['Apple',
  'AAPL',
  '2020-12-12',
  '@Apple bought my wife a new m1 MacBook pro. It froze all the time. Beach ball after Beach ball. Called support, end… https://t.co/1BH00mE8r4'],
 ['Apple',
  'AAPL',
  '2020-12-12',
  '@Apple  airplay Imsi catchers and UMTS designed to kill people. Illegal levels of electromagnetic radiation used to… https://t.co/wRyXbU3mE1'],
 ['Apple',
  'AAPL',
  '2020-12-12',
  '@apple please, please don’t kill #musicmemos it’s an AMAZING resource for creators']]

Due to restrictions placed on us by the Twitter API, we can only retrieve tweets from up to a week ago, so for each company we collect 50 tweets per day (again due to API limits) for each day we have access to. The tweets directly mention the company's Twitter account, so they should be more strongly opinionated than the press-release-style tweets we had in the Kaggle dataset.
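As an aside, if a longer collection run were to hit rate limits, tweepy can sleep through them automatically; a hedged sketch of that option (we did not need it here):

# Construct the client so it blocks until the rate-limit window resets instead of erroring
api = tweepy.API(auth, wait_on_rate_limit=True)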

Build Tweet Dataframe and Run NLP

In [18]:
# Construct the new Tweet dataframe
tweet_df = pd.DataFrame(search_res)
tweet_df.columns = tweet_df.iloc[0] # The column names are in the first entry of search_res
tweet_df = tweet_df.iloc[1:] # Remove the names row

# Clean up the Tweets with NLP and add the Vader sentiment score
tweet_df["Tweet"] = tweet_df["Tweet"].apply(clean_tweet)
tweet_df["Sentiment"] = tweet_df["Tweet"].apply(lambda s: sent.polarity_scores(s)['compound'])
tweet_df[4:7]
Out[18]:
Company Symbol Date Tweet Sentiment
5 Apple AAPL 2020-12-12 airplay imsi catchers umts designed kill peopl... -0.8519
6 Apple AAPL 2020-12-12 please please kill amazing resource creators 0.4019
7 Apple AAPL 2020-12-12 need beam scotty corona christmas gift 0.4404

At this point we have queried the Twitter API for tweets about relevant companies from the last week. We used our previous NLP code to strip the tweets of all the unimportant information, and then we tagged each tweet with a Vader sentiment score.

Now we will pull in stock price information using yfinance, lined up with the dates of these tweets.

Query Yahoo Finance Again

In [19]:
# Download stock price info for these companies for the past week
stocks_df = yf.download(tickers=tickers, start=((tweet_df.iloc[0])["Date"]), end=(date + timedelta(days=1)).strftime('%Y-%m-%d'))

# Drop unimportant columns
stocks_df.drop(['Adj Close', 'Low', 'High', 'Open', 'Volume'], axis=1, inplace=True)

# Columns are multi-indexed, so flatten this
stocks_df.columns = ['_'.join(multi_col) for multi_col in stocks_df.columns]
stocks_df["Date"] = stocks_df.index
stocks_df.index = list(range(len(stocks_df.index))) # Was originally indexed by the date of the price
stocks_df.head()
[*********************100%***********************]  5 of 5 completed
Out[19]:
Close_AAPL Close_COF Close_CVS Close_WMT Close_XOM Date
0 121.779999 90.419998 69.720001 145.649994 42.220001 2020-12-14
1 127.879997 92.680000 69.889999 145.580002 43.040001 2020-12-15
2 127.809998 93.589996 69.320000 145.429993 43.700001 2020-12-16
3 128.699997 94.080002 69.589996 146.100006 43.480000 2020-12-17

Now we want to actually tidy our stocks dataset since we also want to line it up with our tweets. We will use pd.melt to do so. Then we can merge this dataframe with our Tweets dataframe to look at close prices corresponding to the date of tweets about that company.

Merge Stocks and Tweets

In [20]:
# Tidy the stocks dataframe
stocks_df = pd.melt(stocks_df, id_vars=["Date"], value_vars=["Close_AAPL", "Close_COF", "Close_CVS", "Close_WMT", "Close_XOM"])
stocks_df.columns = ["Date", "Symbol", "Close"]

# Remove the "Close_" from what used to be column names
stocks_df["Symbol"] = stocks_df["Symbol"].str[6:]

# Write out the date as a string so we can merge on it
stocks_df["Date"] = stocks_df["Date"].dt.strftime('%Y-%m-%d')

# Merge the Tweet dataframe with the stocks dataframe
tweet_df = tweet_df.merge(stocks_df, how="left", on=["Date", "Symbol"])
tweet_df[4:7]
Out[20]:
Company Symbol Date Tweet Sentiment Close
4 Apple AAPL 2020-12-12 airplay imsi catchers umts designed kill peopl... -0.8519 NaN
5 Apple AAPL 2020-12-12 please please kill amazing resource creators 0.4019 NaN
6 Apple AAPL 2020-12-12 need beam scotty corona christmas gift 0.4404 NaN

This dataframe now contains a lot of interesting information. Each record tells us how people felt about a company on a certain date, along with a numeric sentiment score and the stock's closing price on that day.

Something interesting we noticed is that many tweets contained curse words, and these tweets had very low sentiment scores, which makes sense and gives us some assurance that our sentiment tagging is working. We chose not to show these tweets since they contain expletives.

Let's visualize this data!

  • We will first show a violin plot of sentiment for several companies
  • Then we will show sentiment vs stock prices for several companies
In [21]:
# Violin plot to show sentiment distribution
ax = sns.violinplot(x='Company', y='Sentiment', data=tweet_df)
ax.set_title('Company Sentiment Distribution')
ax.set_xlabel('Companies')
ax.set_ylabel('Sentiment')
Out[21]:
Text(0, 0.5, 'Sentiment')

These distributions tell us that the sentiment of most tweets is still neutral, but the sentiment is better spread than in our last dataset, which is a nice indicator, and there are far more negative tweets than before, giving us a fuller spectrum of opinions. Let's see how the sentiment of these tweets fares against our close price data.

In [22]:
fig, axs = plt.subplots(2, 3)
syms = tweet_df.Symbol
fig.tight_layout(pad=0.25)

# Graph Sentiment vs Time for 3 companies
axs[0, 0].set_ylim([-0.8, 0.8]); axs[0, 1].set_ylim([-0.8, 0.8]); axs[0, 2].set_ylim([-0.8, 0.8])
sentiment_kwargs = {'x': 'Date', 'y': 'Sentiment', 'figsize': (20, 16), 'ylabel': 'Sentiment', 'xlabel': 'Timestamp'}

tweet_df[syms == 'AAPL'].plot.line(ax=axs[0, 0], title='Apple Sentiment', **sentiment_kwargs)
tweet_df[syms == 'XOM'].plot.line(ax=axs[0, 1], title='Exxon Sentiment', **sentiment_kwargs)
tweet_df[syms == 'COF'].plot.line(ax=axs[0, 2], title='Capital One Sentiment', **sentiment_kwargs)

# Graph stock performance vs Time for 3 companies
price_kwargs = {'x': 'Date', 'y': 'Close', 'xlabel': 'Timestamp', 'ylabel': 'Close Prices'}

tweet_df[syms == 'AAPL'].plot.line(ax=axs[1, 0], title='Apple Close Prices', **price_kwargs)
tweet_df[syms == 'XOM'].plot.line(ax=axs[1, 1], title='Exxon Close Prices', **price_kwargs)
tweet_df[syms == 'COF'].plot.line(ax=axs[1, 2], title='Capital One Close Prices', **price_kwargs)
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x28ef298b7f0>

These sentiment graphs are definitely more extreme than before, which is an improvement on our previous Twitter dataset since we now have actual sentiment to track. Since the per-tweet sentiment is all over the place, we will need to aggregate it by day.

We do not have stock market data on every desired date since the market is not open on weekends.
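If we wanted weekend tweets to still line up with a price, one hedged workaround (not applied here) is to carry each symbol's most recent close forward:

# Sort by date within each symbol, then forward-fill the last trading day's close;
# weekend dates before the first trading day in the window stay NaN
tweet_df = tweet_df.sort_values(['Symbol', 'Date'])
tweet_df['Close'] = tweet_df.groupby('Symbol')['Close'].ffill()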

The Twitter API definitely performed better than the Kaggle Twitter dataset, since these tweets are much more emotional in nature, which makes the VADER sentiment tagging more valuable. We are limited by the fact that we can only look at 7 days of tweets, which means we cannot assess long-term trends.

Aggregating Twitter Sentiment

As noted above, the per-tweet Twitter sentiment was too noisy. Let's try aggregating it for each day and looking at those trends.

In [23]:
# For each company on each day, get the average of the sentiment
rel_cols = ['Symbol', 'Date', 'Sentiment', 'Close']
net_df = tweet_df[rel_cols].groupby(['Symbol', 'Date']).mean().reset_index()
net_df.head()
Out[23]:
Symbol Date Sentiment Close
0 AAPL 2020-12-12 0.062244 NaN
1 AAPL 2020-12-13 0.010432 NaN
2 AAPL 2020-12-14 -0.039218 121.779999
3 AAPL 2020-12-15 0.041872 127.879997
4 AAPL 2020-12-16 0.077726 127.809998

Let's visualize the sentiment and close prices now.

In [24]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

sns.lineplot(x='Date', y='Sentiment', hue='Symbol', data=net_df, ax=ax1)
ax1.set(title='Net Sentiment per Day', xlabel='Date', ylabel='Sentiment')

sns.lineplot(x='Date', y='Close', hue='Symbol', data=net_df, ax=ax2)
ax2.set(title='Close Price per Day', xlabel='Date', ylabel='Close Price')
Out[24]:
[Text(0, 0.5, 'Close Price'),
 Text(0.5, 0, 'Date'),
 Text(0.5, 1.0, 'Close Price per Day')]

The net sentiment is much easier to read now that it is aggregated by day. The close prices are not very interesting since they span only a week. With more Twitter data, this would likely yield more interesting results. Regardless, this serves as an interesting proof of concept.
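To put a number on this visual comparison, here is a hedged sketch of a per-symbol correlation between daily mean sentiment and close price (with only a week of data, this is illustrative at best):

# Pearson correlation of daily sentiment vs close price, per symbol
for sym, grp in net_df.dropna().groupby('Symbol'):
    print(sym, grp['Sentiment'].corr(grp['Close']))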

Linear Regression using Twitter Sentiment

Even though we are low on Twitter data, we thought it would be interesting to create linear regression models to try and predict close prices for each company solely based on sentiment. This is more of a proof of concept, since 7 days of data is likely not enough to make accurate predictions.

Let's train the Linear Regression models!

In [25]:
# Add each model to the dictionary, reg, after training
reg = {}
net_df.dropna(inplace=True)

for company in net_df.Symbol.unique():
    comp = net_df.loc[net_df.Symbol == company, ['Sentiment', 'Close']]
    reg[company] = lm.LinearRegression()
    reg[company].fit(comp[['Sentiment']], comp['Close'])

Let's view the p-values of these linear models. We will use statsmodels here, since scikit-learn does not report p-values.

In [26]:
for company in net_df.Symbol.unique():
    res = statsmodels.formula.api.ols(formula="Close ~ Sentiment", data=(net_df[net_df.Symbol == company])).fit()
    print(f'{company}:\n {dict(res.pvalues)}')
AAPL:
 {'Intercept': 4.9300605624809145e-05, 'Sentiment': 0.05901055371321618}
COF:
 {'Intercept': 1.3042474265647013e-05, 'Sentiment': 0.05614261511503919}
CVS:
 {'Intercept': 4.0184446690207104e-07, 'Sentiment': 0.03009512913411071}
WMT:
 {'Intercept': 1.5157401191704022e-06, 'Sentiment': 0.8805642222519954}
XOM:
 {'Intercept': 0.0008762760672667768, 'Sentiment': 0.3601334034008641}

There was a lot of variability, but unfortunately most of our models had a p-value greater than .05, so we cannot call the results significant. That said, our data did not cover a wide enough timeframe and the close prices were relatively static, so it is too soon to rule out a correlation entirely. We could probably do better if this analysis were repeated over a longer period, with more room for the stock prices to change.

Reddit API

Large companies commonly use Twitter to report news to the general public. We will now look at the Reddit API to see if we can find more interesting correlations that provide insight into stock performance.

In [27]:
# Connect to the Reddit API using our credentials
reddit = praw.Reddit(**config.reddit_creds)

Query the Reddit API for r/investing Posts

We will come up with a list of tickers and search for posts about those tickers within the last month.

In [28]:
# Search r/investing for posts about our target stocks
subs = []
rel_tickers = ['AAPL', 'TSLA', 'WMT', 'CVS']

for ticker in rel_tickers:
    for sub in reddit.subreddit('investing').search(ticker, time_filter='month', limit=50):
        # Store general data from these submissions
        subs.append([sub.title, sub.selftext, ticker, sub.score, sub.num_comments, datetime.fromtimestamp(sub.created)])
            
# Generate general dataframe
reddit_df = pd.DataFrame(subs, columns=['title', 'body', 'Symbol', 'reddit_score', 'num_comments', 'Date'])
reddit_df.Date = reddit_df.Date.dt.strftime('%Y-%m-%d')
reddit_df.head()
Out[28]:
title body Symbol reddit_score num_comments Date
0 What is your investment strategy on MSFT? AAPL? Like many of you, I am investing a large perce... AAPL 16 46 2020-12-10
1 NVDA vs AAPL \nAs title indicates, which one is better long... AAPL 16 31 2020-12-09
2 Stop looking for advice In 1997 there was a study conducted by Duke Un... AAPL 1200 302 2020-12-07
3 DraftKings (NASDAQ: DKNG) - Deep Dive Research... **Deep Dive #1 (Part 2 of 2) – DraftKings, bri... AAPL 75 47 2020-12-03
4 Short Term Profit Taking Considerations Profit taking. You're up a large percentage an... AAPL 41 24 2020-12-15

Run NLP on this Reddit data

We will focus our NLP efforts on the body instead of the title, since the body likely carries more information. Again, we will clean up the user text (using our clean_tweet function for simplicity) and then evaluate its sentiment.

In [29]:
# Drop irrelevant columns. It may be interesting to build a model on these other features in the future
reddit_df.drop(['title', 'reddit_score', 'num_comments'], axis=1, inplace=True)

# NLP for cleanup and sentiment score
reddit_df['body'] = reddit_df.body.apply(clean_tweet)
reddit_df['Sentiment'] = reddit_df.body.apply(lambda s: sent.polarity_scores(s)['compound'])

reddit_df.head()
Out[29]:
body Symbol Date Sentiment
0 like many investing large percentage portfolio... AAPL 2020-12-10 0.9571
1 title indicates one better long term investmen... AAPL 2020-12-09 0.8288
2 1997 study conducted duke university 1 followe... AAPL 2020-12-07 0.9186
3 deep dive 1 part 2 2 draftkings bringing sport... AAPL 2020-12-03 0.9998
4 profit taking large percentage afraid market t... AAPL 2020-12-15 0.9764

Use Yahoo Finance to Grab Stock Prices

Grab stock prices for the relevant tickers. We only grab about a month of data, since that matches the window we used when querying the Reddit posts.

In [30]:
today = datetime.today()
start = today - timedelta(days=31)

stocks_df = yf.download(tickers=rel_tickers, start=start.strftime('%Y-%m-%d'), end=today.strftime('%Y-%m-%d'))

# Clean up the same way we have before
stocks_df.drop(['Adj Close', 'Low', 'High'], axis=1, inplace=True)
stocks_df.columns = ['_'.join(multi_col) for multi_col in stocks_df.columns]
stocks_df['Date'] = stocks_df.index
stocks_df.index = list(range(len(stocks_df.index)))
stocks_df.head()
[*********************100%***********************]  4 of 4 completed
Out[30]:
Close_AAPL Close_CVS Close_TSLA Close_WMT Open_AAPL Open_CVS Open_TSLA Open_WMT Volume_AAPL Volume_CVS Volume_TSLA Volume_WMT Date
0 119.389999 67.139999 441.609985 149.369995 119.550003 68.250000 460.170013 150.500000 74271000 34429700 61188300 14237200 2020-11-17
1 118.029999 66.349998 486.640015 149.089996 118.610001 67.500000 448.350006 149.860001 76322100 11674600 78044000 8312300 2020-11-18
2 118.639999 66.059998 499.269989 152.119995 117.589996 66.019997 492.000000 149.460007 74113000 11656400 62475300 6196900 2020-11-19
3 117.339996 66.389999 489.609985 150.240005 118.639999 65.900002 497.989990 151.169998 73604300 8822500 32807300 7312400 2020-11-20
4 113.849998 66.779999 521.849976 150.929993 117.180000 66.870003 503.500000 150.369995 127959300 5889100 50260300 6185700 2020-11-23

Merge Stocks and Reddit Dataframe

In [31]:
# Merge with the Reddit dataframe similar to how we did with the Twitter API dataframe
stocks_df = pd.melt(stocks_df, id_vars=['Date'], value_vars=[f'Close_{tkr}' for tkr in rel_tickers])
stocks_df.columns = ['Date', 'Symbol', 'Close']
stocks_df['Symbol'] = stocks_df['Symbol'].str[6:]
stocks_df['Date'] = stocks_df['Date'].dt.strftime('%Y-%m-%d')
reddit_df = reddit_df.merge(stocks_df, how='left', on=['Date', 'Symbol'])
reddit_df.head()
Out[31]:
body Symbol Date Sentiment Close
0 like many investing large percentage portfolio... AAPL 2020-12-10 0.9571 123.239998
1 title indicates one better long term investmen... AAPL 2020-12-09 0.8288 121.779999
2 1997 study conducted duke university 1 followe... AAPL 2020-12-07 0.9186 123.750000
3 deep dive 1 part 2 2 draftkings bringing sport... AAPL 2020-12-03 0.9998 122.940002
4 profit taking large percentage afraid market t... AAPL 2020-12-15 0.9764 127.879997

Examine Stock-Sentiment Correlations

First, a violin plot of our sentiment distribution

In [32]:
ax = sns.violinplot(x='Symbol', y='Sentiment', data=reddit_df)
ax.set_title('Company Sentiment Distribution')
ax.set_xlabel('Companies')
ax.set_ylabel('Sentiment')
Out[32]:
Text(0, 0.5, 'Sentiment')

These results are quite interesting: our Twitter datasets have all had very neutral sentiment regarding companies, but r/investing posts carry very high sentiment on average. CVS follows the same sentiment distribution we have seen in our Twitter datasets, but Apple, Tesla, and Walmart all have sentiment distributions centered near 1.

One explanation is that Reddit users tend to post about stocks when they are doing well, to encourage others to invest, whereas our Twitter datasets capture people talking about companies in their daily lives, where feelings are more mixed. Part of the effect may also be an artifact of post length: VADER's compound score is a normalized sum over the whole text, so long posts with mostly positive wording tend to saturate near 1.
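A quick toy check of this length effect (illustrative, not drawn from our data):

# One positive word scores moderately; repeating it drives the compound score toward 1
print(sent.polarity_scores('great')['compound'])
print(sent.polarity_scores(' '.join(['great'] * 20))['compound'])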

In [33]:
fig, axs = plt.subplots(2, 3)
syms = reddit_df.Symbol
fig.tight_layout(pad=0.25)

# Graph Sentiment vs Time for 3 companies
axs[0, 0].set_ylim([-0.8, 0.8]); axs[0, 1].set_ylim([-0.8, 0.8]); axs[0, 2].set_ylim([-0.8, 0.8])
sentiment_kwargs = {'x': 'Date', 'y': 'Sentiment', 'figsize': (20, 16), 'ylabel': 'Sentiment', 'xlabel': 'Timestamp'}

reddit_df[syms == 'AAPL'].plot.line(ax=axs[0, 0], title='Apple Sentiment', **sentiment_kwargs)
reddit_df[syms == 'TSLA'].plot.line(ax=axs[0, 1], title='Tesla Sentiment', **sentiment_kwargs)
reddit_df[syms == 'WMT'].plot.line(ax=axs[0, 2], title='Walmart Sentiment', **sentiment_kwargs)

# Graph stock performance vs Time for 3 companies
price_kwargs = {'x': 'Date', 'y': 'Close', 'xlabel': 'Timestamp', 'ylabel': 'Close Prices'}

reddit_df[syms == 'AAPL'].plot.line(ax=axs[1, 0], title='Apple Close Prices', **price_kwargs)
reddit_df[syms == 'TSLA'].plot.line(ax=axs[1, 1], title='Tesla Close Prices', **price_kwargs)
reddit_df[syms == 'WMT'].plot.line(ax=axs[1, 2], title='Walmart Close Prices', **price_kwargs)
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x28e80406e10>

These correlations are still not great, but the plots definitely look a lot more readable than what we were getting with Twitter. We might get better results by looking at a larger timeframe, such as a year instead of a month. We would do so here, but we risk exhausting our Reddit API quota.
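For reference, PRAW's search accepts wider windows via its time_filter argument, so a follow-up with more quota to spare could simply widen the query; a hedged sketch:

# Same search as before, but over the past year (at the cost of more API calls)
for sub in reddit.subreddit('investing').search('AAPL', time_filter='year', limit=50):
    pass  # process submissions as above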

Discussion

  • We were not able to find direct insights into stock performance in a way that could make us money
    • The Kaggle Twitter dataset was too factual in nature, so the NLP sentiment analysis was ineffective. It also only covered one month of data
    • The Twitter API seemed promising, but being limited to 7 days' worth of tweets made it hard to see any correlation with stock prices, even after aggregating by day
    • The Reddit API seems to be the best choice going forward. We can query as much history as we need, and each post comes with much more metadata (Reddit score, number of comments, etc.)
    • Even so, we did not query more data for fear of exhausting our limited number of Reddit queries
  • Many of the tweets and Reddit posts we pulled are somewhat random. In the future, it may be interesting to target specific demographics or industries and analyze sentiment there

Notable Differences between Twitter and Reddit

  • The sentiment distribution for tweets is generally centered around 0, with an even mix of positive and negative tweets
  • The r/investing subreddit had much higher sentiment distributions, generally centered around 1
  • This trend would need to be confirmed across more companies and over a longer time range, but it was nonetheless interesting
  • A plausible explanation is that people generally post on r/investing when there is good news, while people tweet about any and all interactions they have with large companies

Next Steps

It would be interesting to repeat this entire process with larger Twitter and Reddit datasets. We were fairly limited by our APIs, but if we were to amass more data over time and save it locally, we could repeat our analysis. This might reveal specific correlations, or industries that are particularly sensitive to social media sentiment. It would also be interesting to see how well our Twitter linear regression models would perform in predicting stock prices on a rolling window. Similarly, it would be worthwhile to build linear regression models on the Reddit dataset, since beyond sentiment we also have information such as how popular the posts are and how many comments they have. We encourage our readers to continue this research, as we believe we have provided a strong baseline for sentiment-driven market analysis.
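To make the rolling-window idea concrete, here is a hedged sketch (a hypothetical helper, assuming a per-symbol dataframe like net_df with daily Sentiment and Close columns):

# For each day t, fit on the previous `window` days and predict day t's close
def rolling_predictions(df, window=5):
    preds = []
    for t in range(window, len(df)):
        train = df.iloc[t - window:t]
        model = lm.LinearRegression().fit(train[['Sentiment']], train['Close'])
        preds.append(model.predict(df[['Sentiment']].iloc[[t]])[0])
    return preds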

Further Resources

In case our readers would like to do further work in this area, there are a few academic papers we found that do more extensive work with sentiment-driven analysis.