The goal of this project is to see the overall sentiment of tweets containing the word "lucy" and to make a wordcloud for the positive tweets and the negative tweets.

Import packages

In [1]:
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import pandas as pd
import json
from textblob import TextBlob

Stream in 6000 tweets containing the word 'lucy' using tweepy and put the tweets into a dataframe.

In [2]:
#https://github.com/shreyans29/thesemicolon/blob/master/livesenti.py

df_tweets = pd.DataFrame(columns = ['count', 'tweet', 'senti'])
count=0

class listener(StreamListener):
    
    def on_data(self,data):
        all_data=json.loads(data)
        tweet=all_data["text"]
        #username=all_data["user"]["screen_name"]
        blob=TextBlob(tweet.strip())
       
        global count
        global df_tweets
        
        count=count+1
        
        if count%50 == 0:
            print(count)
        df_tweets = df_tweets.append(pd.DataFrame({'count':[count], 'tweet':[tweet.strip()]}))
        
        if count==6000:
            df_tweets.to_csv('df_tweet.csv')
            return(False)
        else:
            return(True)
        
    def on_error(self,status):
        print(status)

atoken = "1009156045988491264-nzRbKNnbbHOgE9Qx8MgYrGqUEIZzqO"
asecret = "yPsJFcSoeh1M20RnIkz9dC5VVfbSgiWkUI4EVFkj6nKc2"
ckey = "BaOWdsvJJwXvlHO1FNyfAIqk5"
csecret = "nmiAfhPun8pkh76DUY96FYeYxSWoxblt7V9AnMiKxZU8FwWMXE"

auth=OAuthHandler(ckey,csecret)
auth.set_access_token(atoken,asecret)

twitterStream=  Stream(auth, listener(count))
twitterStream.filter(track=["uber"], languages=['en'])
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\pandas\core\frame.py:6201: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.

To retain the current behavior and silence the warning, pass sort=False

  sort=sort)
50
100
150
200
250
300
350
400
450
500
550
600
650
700
750
800
850
900
950
1000
1050
1100
1150
1200
1250
1300
1350
1400
1450
1500
1550
1600
1650
1700
1750
1800
1850
1900
1950
2000
2050
2100
2150
2200
2250
2300
2350
2400
2450
2500
2550
2600
2650
2700
2750
2800
2850
2900
2950
3000
3050
3100
3150
3200
3250
3300
3350
3400
3450
3500
3550
3600
3650
3700
3750
3800
3850
3900
3950
4000
4050
4100
4150
4200
4250
4300
4350
4400
4450
4500
4550
4600
4650
4700
4750
4800
4850
4900
4950
5000
5050
5100
5150
5200
5250
5300
5350
5400
5450
5500
5550
5600
5650
5700
5750
5800
5850
5900
5950
6000
In [3]:
df_tweets = pd.read_csv('df_tweet.csv')

Take out the characters that the computer can't understand from the code.

In [4]:
df_tweets.tweet = df_tweets.tweet.map(lambda x: x.encode('ascii',errors='ignore'))

Turn all the tweets into lower case.

In [5]:
df_tweets.tweet = df_tweets.tweet.map(lambda x: x.lower())

Look at the first five tweets to see that they have been correctly modified.

In [6]:
df_tweets.tweet.head()
Out[6]:
0     b'this line up is trash...... take me back 2014'
1    b'rt @blackboyymagic: sitting in the back text...
2    b'rt @nickjohonas: summer 16 had:\r\n"one danc...
3    b'rt @blackboyymagic: sitting in the back text...
4    b'rt @blackboyymagic: sitting in the back text...
Name: tweet, dtype: object

Find the sentiments of each tweet and turn that into a column of our dataframe.

In [7]:
df_tweets['senti'] = df_tweets.tweet.map(lambda x: TextBlob(str(x)).sentiment.polarity)
In [8]:
df_tweets.head()
Out[8]:
Unnamed: 0 count senti tweet
0 0 1 0.0 b'this line up is trash...... take me back 2014'
1 0 2 0.3 b'rt @blackboyymagic: sitting in the back text...
2 0 3 0.0 b'rt @nickjohonas: summer 16 had:\r\n"one danc...
3 0 4 0.3 b'rt @blackboyymagic: sitting in the back text...
4 0 5 0.3 b'rt @blackboyymagic: sitting in the back text...

Find the overall sentiment score.

In [9]:
df_tweets.senti.sum()
Out[9]:
1067.345152419918

As we can see, the overall sentiment score is about 170, which means that the sentiment towards the name or word 'lucy' is overwhelmingly positive.

Now, we will make two dataframes, one for the positive tweets and one for the negative tweets.

In [10]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
pos_tweets = df_tweets.tweet[df_tweets.senti>0]
neg_tweets = df_tweets.tweet[df_tweets.senti<0]

Print the average positive sentiment score and the average negative sentiment score.

In [11]:
print(df_tweets.senti[df_tweets.senti>0].sum()/df_tweets.senti[df_tweets.senti>0].shape[0])
print(df_tweets.senti[df_tweets.senti<0].sum()/df_tweets.senti[df_tweets.senti<0].shape[0])
0.3173476337385158
-0.2972236284758037

Display the first five positive tweets.

Note: I decided to keep the retweets since they indicate the popularity of a tweet.

In [12]:
pos_tweets.head()
Out[12]:
1    b'rt @blackboyymagic: sitting in the back text...
3    b'rt @blackboyymagic: sitting in the back text...
4    b'rt @blackboyymagic: sitting in the back text...
5    b'rt @blackboyymagic: sitting in the back text...
7    b'rt @blackboyymagic: sitting in the back text...
Name: tweet, dtype: object

Display the first five negative tweets.

In [13]:
neg_tweets.head()
Out[13]:
6     b'rt @hamilton6connor: had a conversation with...
12    b'rt @larryelder: "black conservatives kicked ...
16    b'rt @hamilton6connor: had a conversation with...
30    b'rt @trumpera_2017: woman calls the cops on b...
34    b'rt @gagasyuyi: summer 09 had:\r\n\r\nbad rom...
Name: tweet, dtype: object

Create a wordcloud for the positive tweets.

In [14]:
wordcloud = WordCloud(background_color = 'white').generate(' '.join(pos_tweets.astype(str)))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Create a wordcloud for the negative tweets.

In [15]:
wordcloud2 = WordCloud(background_color = 'white').generate(' '.join(neg_tweets.astype(str)))
plt.imshow(wordcloud2, interpolation='bilinear')
plt.axis("off")
plt.show()