
Project 1



### Release Date: Monday, March 16
### Due Date: Monday April 6, 12:00 PM

Introduction

In this project, we will work with social media data to analyze politics. Along the way, we want to learn more about data-driven journalism (https://en.wikipedia.org/wiki/Data-driven_journalism) where journalists use datasets for reporting. Journalists will locate information, filter and transform tables, generate charts and perform investigations for news outlets.

We want to gain insights into politics through data. We will try to reproduce some of the findings from Buzzfeed (https://www.buzzfeednews.com/article/peteraldhous/trump-twitter-wars) on the usage of Twitter by the president. Note that the journalists have provided supporting details (https://buzzfeednews.github.io/2018-01-trump-twitter-wars/) of their analyses alongside the article.

As we explore the data, we want to gain practice with:

Searching for patterns of characters in strings
Working with data in nested formats instead of tabular formats
Handling dates and times

We will guide you through the problems step by step. However, we encourage you to discuss with us in Office Hours and on Piazza so that we can work together through these steps.

Submission Instructions

Submission of homework requires two steps. See Homework 0 for more information.

Step 1

You are required to submit your notebook on JupyterHub. Please navigate to the Assignments tab to

fetch
modify
validate
submit

your notebook. Consult the instructional video (https://nbgrader.readthedocs.io/en/stable/user_guide/highlights.html#student-assignment-list-extension-for-jupyter-notebooks) for more information about JupyterHub.

Step 2

You are required to submit a copy of your notebook to Gradescope. Follow these steps:

Formatting Instructions

1. Download as HTML (File -> Download As -> HTML (.html)).
2. Open the HTML in the browser. Print to .pdf.
3. Upload to Gradescope. Consult the instructional video (https://www.gradescope.com/get_started#student-submission) for more information about Gradescope.
4. Indicate the location of your responses on Gradescope. You must tag your answer's page numbers to the appropriate question on Gradescope. See the instructional video for more information.

Note that

You should break long lines of code into multiple lines. Otherwise your code will extend out of view from the cell. Consider using \ followed by a new line.
For each textual response, please include relevant code that informed your response.
For each plotting question, please include the code used to generate the plot. If your plot does not appear in the HTML / pdf output, then use Image('name_of_file', embed=True) to embed it.
You should not display large output cells such as all rows of a table.

Important : Gradescope points will be awarded if and only if all the formatting instructions are followed.

Collaboration Policy

Data science is a collaborative activity. While you may talk with others about the homework, we ask that you write your solutions individually. If you do discuss the assignments with others please include their names below.

Name: list name here

NetId: list netid here

Collaborators: list names here

Rubric

Gradescope Question Points: 1a, 1b, 1c, 2a, 2b, 2c, 2d, 2e, 3a, 3b, 3c, 3d, 3e, 3f, 4a, 4b, Total

In [ ]:

import pandas as pd
import numpy as np

import re
import datetime
import json

import dsua_112_utils

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Set some parameters in the packages
%matplotlib inline

sns.set(font_scale=1.5)

plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['figure.dpi'] = 150

pd.options.display.max_rows = 20
pd.options.display.max_columns = 15

# Some packages to help with configuration
import os, sys, pathlib, pickle
from IPython.display import Image

In [ ]:

home_path = os.environ["HOME"]
data_path_recent = f'{home_path}/shared/Project1/data/trump_tweets_recent.json'
data_path_old_1 = f'{home_path}/shared/Project1/data/old_trump_tweets_1.json'
data_path_old_2 = f'{home_path}/shared/Project1/data/old_trump_tweets_2.json'
img_path = f'{home_path}/shared/Project1/images'
lexicon_path = f'{home_path}/shared/Project1/vader_lexicon.txt'

In [ ]:

# TEST

assert 'pandas' in sys.modules and "pd" in locals()
assert 'numpy' in sys.modules and "np" in locals()
assert 'matplotlib' in sys.modules and "plt" in locals()
assert 'seaborn' in sys.modules and "sns" in locals()
assert 'datetime' in sys.modules
assert "home_path" in locals()

1. Loading Twitter Data

Donald Trump has made frequent use of Twitter. We want to focus on activity linked to his Twitter handle realdonaldtrump. After we access the data, we can try to understand the scope and temporality of the posts.

In [ ]:

Image(filename=img_path + '/usage.PNG', embed=True, width=750)

Question 1a

Recall from Section 7 that Twitter provides an Application Programming Interface (API) for developers. We can access the API with the tweepy package. With Twitter credentials we can collect data from the platform in the JavaScript Object Notation (JSON) format.

In [ ]:

# Run to load the data

with open(data_path_recent) as f:
    trump_tweets = json.load(f)

We have collected the last 2000 tweets from realdonaldtrump into trump_tweets_recent.json. We can load files in the JSON format with the json package. Remember that the JSON format is a nested format not a tabular format. Instead of rows and columns, we have keys and values resembling a dictionary.

In [ ]:

# TEST

assert 2000 <= len(trump_tweets) <= 3000

Here we have a list of dictionaries. Each dictionary corresponds to a post.

Note that Twitter limits the usage of its API. We will see in Question 2e that the data contains gaps stemming from restrictions on access. Since we cannot collect all posts from Twitter, we need to combine the recent data with historical data stored in separate files.

Before we link these records, we want to study trump_tweets. In particular, what is the oldest tweet in trump_tweets? We have the following keys for each post:

In [ ]:

list(trump_tweets[0].keys())

The key created_at corresponds to the date, time and timezone for the post. Create a list consisting of the values for created_at for each of the entries in the trump_tweets. Use the pandas function to_datetime to convert the list into a DatetimeIndex. See lab 7 for more information about storing dates and times.
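As a rough illustration of the conversion, here is a minimal sketch with made-up timestamp strings (the actual created_at values may be formatted differently, but pd.to_datetime handles many formats). The oldest entry then exposes year and month attributes:

# Hedged sketch with hypothetical timestamps; the real values come from created_at
sample_times = ["2020-03-14 09:30:00+0000", "2017-01-20 17:00:00+0000"]
sample_index = pd.to_datetime(sample_times, utc=True)   # tz-aware DatetimeIndex
sample_index.min().year, sample_index.min().month       # year and month of the oldest entry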

In [ ]:

list_datetime = ...

# YOUR CODE HERE
raise NotImplementedError()

index_datetime = ...

# YOUR CODE HERE
raise NotImplementedError()

In [ ]:

# TEST

assert str(index_datetime.tz) == "UTC"

Note that each entry is of the form YYYY-MM-DD HH:MM:SS+TZ corresponding to

YYYY: 4 digit year
MM: 2 digit month
DD: 2 digit day
HH: 2 digit hour
MM: 2 digit minute
SS: 2 digit second
+TZ: timezone as a 4 digit offset from the UTC timezone (https://www.wikiwand.com/en/List_of_UTC_time_offsets)

Determine the year and month of the oldest tweet. Remember from Lab 7 that objects storing dates have attributes year and month.

In [ ]:

# Enter the year of the oldest tweet

oldest_year = ...

# YOUR CODE HERE
raise NotImplementedError()

# Enter the month of the oldest tweet (e.g. 1 for January)

oldest_month = ...

# YOUR CODE HERE
raise NotImplementedError()

In [ ]:

# TEST

assert type(oldest_month) == int
assert 1 <= oldest_month <= 12

assert type(oldest_year) == int
assert oldest_year <= 2020

Question 1b

Using the Twitter API, we could collect more data. We have called these files old_trump_tweets_1.json and old_trump_tweets_2.json. We need to join these files with the information in trump_tweets_recent.json. Note that we have a nested format of data, not a tabular format, so we cannot use the pandas joining operations to combine the files.

In [ ]:

# Run to load additional data

with open(data_path_old_1) as f:
    old_trump_tweets_1 = json.load(f)

with open(data_path_old_2) as f:
    old_trump_tweets_2 = json.load(f)

We want to combine the data from old_trump_tweets_1, old_trump_tweets_2 and trump_tweets_recent. However, some posts will store the text in the text field and others will use the full_text field.

For each entry in the lists old_trump_tweets_1, old_trump_tweets_2 and trump_tweets_recent containing full_text, replace the key full_text with text.
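A minimal sketch of renaming a dictionary key in place, on a hypothetical dictionary rather than one of the tweets:

# Hypothetical dictionary, not the tweet data
d = {"full_text": "example post", "id": 1}
if "full_text" in d:
    d["text"] = d.pop("full_text")   # remove full_text and store its value under text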

In [ ]:

for tweets in [trump_tweets, old_trump_tweets_1, old_trump_tweets_2]:
    for tweet in tweets:
        # YOUR CODE HERE
        raise NotImplementedError()

In [ ]:

# TEST

assert "full_text" not in [tweet.keys() for tweet in trump_tweets] assert "full_text" not in [tweet.keys() for tweet in old_trump_tweets_1] assert "full_text" not in [tweet.keys() for tweet in old_trump_tweets_2]

Generate a DataFrame called all_tweets with columns

id: Unique identifier of the tweet
time: The date, time and timezone of the post from the created_at column
source: The source device of the tweet.
text or full_text: The text in the post
retweet_count: The retweet count of the tweet

Note that we can create a DataFrame by passing a list of dictionaries to pd.DataFrame. Here we need the keys to match for each dictionary. Generate three DataFrames corresponding to old_trump_tweets_1, old_trump_tweets_2 and trump_tweets_recent. Combine them using pd.concat.
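For instance, here is a minimal sketch with hypothetical records (not the tweet data) of building DataFrames from lists of dictionaries and stacking them with pd.concat:

# Hypothetical records standing in for the tweet dictionaries
records_a = [{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}]
records_b = [{"id": 3, "text": "again"}]
df_a = pd.DataFrame(records_a, columns=["id", "text"])
df_b = pd.DataFrame(records_b, columns=["id", "text"])
combined = pd.concat([df_a, df_b])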

In [ ]:

columns = ['created_at', 'id', 'text', 'source', 'retweet_count']

list_df = []
for tweets in [trump_tweets, old_trump_tweets_1, old_trump_tweets_2]:
    df = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    list_df.append(df)

all_tweets = pd.concat(list_df)

In [ ]:

all_tweets.head()

In [ ]:

# TEST

assert set(all_tweets.columns.values).issubset(set(['time', 'id', 'text', 'source', 'retweet_count']))
assert all_tweets.shape[1] == 5

Question 1c

Note that we have duplicates in the id column because old_trump_tweets_1, old_trump_tweets_2 and trump_tweets_recent contained overlapping posts. We can remove duplicates through

1. Grouping by id column
2. Using the method first to select the first row from each group

Call the resulting table trump
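A minimal sketch of this grouping-based deduplication on a hypothetical frame:

# Hypothetical frame with a duplicated id
demo = pd.DataFrame({"id": [1, 1, 2], "val": ["a", "b", "c"]})
deduped = demo.groupby("id").first()   # one row per id; id becomes the index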

In [ ]:

trump = ...

# YOUR CODE HERE
raise NotImplementedError()

In [ ]:

# TEST

assert len(trump.index.values) == len(np.unique(trump.index.values))

Before moving onto the analysis of the dataset:

1. Sort the rows of trump by the values in the index using the pandas method sort_index.
2. Use pd.to_datetime on the time column to convert the entries like in Question 1a.
3. Use the to_csv method to save a copy to the path /tmp/trump.csv.
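A minimal sketch of these three steps on a hypothetical frame (the path /tmp/demo.csv is only for illustration):

# Hypothetical frame; trump has more columns and many more rows
demo = pd.DataFrame({"time": ["2020-03-14 09:30:00+0000"]}, index=[2])
demo = demo.sort_index()
demo["time"] = pd.to_datetime(demo["time"], utc=True)
demo.to_csv("/tmp/demo.csv")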

In [ ]:

# YOUR CODE HERE
raise NotImplementedError()

In [ ]:

# TEST

assert 11000 < trump.shape[0] < 12000
assert 831846101179314177 in trump.index
assert np.any([('Twitter for iPhone' in s) for s in trump['source'].unique()])
assert str(trump["time"].dt.tz) == "UTC"

Question 2: Tweet Source Analysis

We want to study some of the characteristics of Trump's tweets. In particular, we want to determine the devices used for the tweets.

In [ ]:

trump['source'].unique()

Question 2a

We want to remove the HTML tags from the entries in the source column. Use trump['source'].str.replace with a regular expression. Remember that regular expressions are greedy, meaning that r"<.*>" will match the entire entry.
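One way around greediness is a pattern that stops at the first closing bracket. A minimal sketch on a hypothetical string, not the assignment column:

# Hypothetical source string with HTML tags
s = pd.Series(['<a href="http://example.com">Twitter for iPhone</a>'])
s.str.replace(r"<[^>]*>", "", regex=True)   # a non-greedy alternative is r"<.*?>"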

In [ ]:

# YOUR CODE HERE
raise NotImplementedError()

In [ ]:

# TEST
assert set(['Twitter for Android', 'Twitter for iPhone']) < set(trump['source'].unique())

We can see in the following plot that there are two device types that are more commonly used

In [ ]:

trump['source'].value_counts().plot(kind="bar")
plt.ylabel("Number of Tweets")
q2a_gca = plt.gca();

In [ ]:

# TEST

heights = [int(rect.get_height()) for rect in q2a_gca.get_children() if isinstance(rect, matplotlib.patches.Rectangle)]
assert max(heights) == max(trump['source'].value_counts().values)

Question 2b

Is there a difference between his Tweet behavior across these devices? Maybe Trump’s tweets from an Android come at different times than his tweets from an iPhone. Note that Twitter gives us his tweets in the UTC timezone (https://www.wikiwand.com/en/List_of_UTC_time_offsets). We see the +0000 in the time column.

Add a column est_time by converting from UTC to the EST timezone. Use trump['time'].dt.tz_convert("US/Eastern"). See Lab 7 for more information about time zones.
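A minimal sketch of the conversion on a hypothetical UTC series:

# Hypothetical tz-aware series in UTC
ts = pd.Series(pd.to_datetime(["2020-03-14 14:00:00+0000"], utc=True))
ts.dt.tz_convert("US/Eastern")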

In [ ]:

# Convert to Eastern Time

trump['est_time'] = ...

# YOUR CODE HERE
raise NotImplementedError()

In [ ]:

# TEST

assert str(trump["est_time"].dt.tz) == "EST"

Now add a column called hour to the trump table. The column should contain the hour of the day as a floating point number computed by:

hour + minute / 60 + second / 60^2
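A minimal sketch of that computation on a hypothetical datetime series (est_time plays this role in the assignment):

# Hypothetical tz-aware series
ts = pd.Series(pd.to_datetime(["2020-03-14 14:30:45+0000"], utc=True))
ts.dt.hour + ts.dt.minute / 60 + ts.dt.second / 60**2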

In [ ]:

trump['hour'] = ...

# YOUR CODE HERE
raise NotImplementedError()

In [ ]:

# TEST
assert max(trump['hour']) > 23
assert sum(trump['hour'] < 0) == 0

Question 2c

Use this data along with the seaborn distplot function to examine, for the two most commonly used devices, the distribution over the hours of the day (in Eastern time) at which Trump tweets from each device.
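A minimal sketch of selecting one device's hours and passing them to distplot, using a hypothetical frame in place of trump:

# Hypothetical frame standing in for trump
demo = pd.DataFrame({"source": ["Twitter for iPhone", "Twitter for Android", "Twitter for iPhone"],
                     "hour": [7.5, 21.0, 9.25]})
series = demo.loc[demo["source"] == "Twitter for iPhone", "hour"]
sns.distplot(series, hist=False, label="iPhone")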

In [ ]:

top_devices = trump['source'].value_counts().sort_values()[-2:].index.values

for device in top_devices:
    series = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    sns.distplot(series, hist=False, label=device[-7:])

plt.xlabel('hour')
plt.ylabel('fraction')
plt.legend()
q2c_gca = plt.gca();

In [ ]:

# TEST

with open(f" {home_path} /shared/Project1/data/image.pickle", "br") as fh: q2c_gca_benchmark = pickle.load(fh)

curves = dsua_112_utils.get_curves(q2c_gca) benchmark_curves = dsua_112_utils.get_curves(q2c_gca_benchmark)

diff = dsua_112_utils.generate_normalized_difference(q2c_gca_benchmark, curves[ "Android"], benchmark_curves[‘Android’]) dsua_112_utils.compare_curves(diff)

Question 2d

According to this Verge article (https://www.theverge.com/2017/3/29/15103504/donald-trump-iphone-using- switched-android), Donald Trump switched from an Android to an iPhone sometime in March 2017.

Create a figure identical to your figure from Question 2c, except that you should show the results only from 2016. If you get stuck, consider looking at the year_fraction function from the next problem.
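A minimal sketch of restricting rows to a single calendar year, on a hypothetical frame:

# Hypothetical frame with a datetime column
demo = pd.DataFrame({"est_time": pd.to_datetime(["2016-07-04 12:00:00", "2017-05-01 08:00:00"]),
                     "hour": [12.0, 8.0]})
in_2016 = demo.loc[demo["est_time"].dt.year == 2016]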

During the campaign, it was theorized that Donald Trump’s tweets from Android were written by him personally, and the tweets from iPhone were from his staff. Does your figure give support to this theory?

In [ ]:

top_devices = trump['source'].value_counts().sort_values()[-2:].index.values

for device in top_devices:
    series = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    sns.distplot(series, hist=False, label=device[-7:])

plt.xlabel('hour')
plt.ylabel('fraction')
plt.legend()
q2d_gca = plt.gca();

In [ ]:

# TEST

with open(f" {home_path} /shared/Project1/data/image1.pickle", "br") as fh: q2d_gca_benchmark = pickle.load(fh)

curves = dsua_112_utils.get_curves(q2d_gca) benchmark_curves = dsua_112_utils.get_curves(q2d_gca_benchmark)

diff = dsua_112_utils.generate_normalized_difference(q2d_gca_benchmark, curves[ "Android"], benchmark_curves[‘Android’]) dsua_112_utils.compare_curves(diff)

Question 2e

Which device did Donald Trump use between 2016 and 2018 in this dataset? To examine the distribution of dates, we will convert each date to a fractional year that can be plotted as a distribution. We will use the year_fraction function in the supporting code for the assignment.

In [ ]:

trump['year'] = trump['time'].apply(dsua_112_utils.year_fraction)

Use sns.distplot to overlay the distributions of the year values for the two most frequently used devices before 2019.

In [ ]:

top_devices = trump['source'].value_counts().sort_values()[-2:].index.values

for device in top_devices:
    series = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    sns.distplot(series, hist=False, label=device[-7:])

plt.xlabel('year')
plt.ylabel('fraction')
plt.legend()
q2e_gca = plt.gca();

In [ ]:

# TEST

with open(f" {home_path} /shared/Project1/data/image2.pickle", "br") as fh: q2e_gca_benchmark = pickle.load(fh)

curves = dsua_112_utils.get_curves(q2e_gca) benchmark_curves = dsua_112_utils.get_curves(q2e_gca_benchmark)

diff = dsua_112_utils.generate_normalized_difference(q2e_gca_benchmark, curves[ "Android"], benchmark_curves[‘Android’]) dsua_112_utils.compare_curves(diff)

Question 3: Sentiment Analysis

We can try to understand the sentiment behind the words in Trump’s posts. For example, the sentence "I love America!" has positive sentiment. However, the sentence "I hate taxes!" has a negative sentiment. In addition, some words have stronger positive / negative sentiment than others: "I love America." is more positive than "I like America."

In [ ]:

Image(filename=img_path + '/sentiment.PNG', embed=True, width=1000)

We will use the VADER (Valence Aware Dictionary and sEntiment Reasoner) (https://github.com/cjhutto/vaderSentiment) lexicon to analyze the sentiment of Trump’s tweets. VADER is a lexicon and rule-based sentiment analysis tool that is specifically useful for sentiments in social media. The VADER lexicon gives the sentiment of individual words. Run the following cell to show the first few rows of the lexicon:

In [ ]:

with open(lexicon_path) as fh:
    print(''.join(fh.readlines()[:10]))

Note that the lexicon contains words along with abbreviations, slang and emojis. Since some words can appear as abbreviations, the lexicon does include some duplication depending on context. For example, lol appears as both a word and an abbreviation with sentiment.

The first column of the lexicon is the token, meaning the word itself.
The second column is the polarity of the word, meaning how positive / negative it is.
The third column is the standard deviation of the polarity.
The fourth column contains the 10 raw scores determined by the annotators.

See the documentation (https://github.com/cjhutto/vaderSentiment) for more information.

Question 3a

Use pd.read_csv to load the lexicon into a DataFrame called sent.

The index should be the tokens in the lexicon.
The table should have one column containing the polarity.
The delimiter is a tab (\t), not a comma, so you need to set sep='\t'.
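A minimal sketch of one possible read_csv call, assuming the four tab-separated fields described above (this is an illustration, not necessarily the required solution):

# Assumes four tab-separated fields: token, polarity, standard deviation, raw scores
sent_sketch = pd.read_csv(lexicon_path, sep='\t', header=None,
                          names=['token', 'polarity', 'sd', 'raw_scores'],
                          usecols=['token', 'polarity'],
                          index_col='token')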

In [ ]:

path_to_use_for_sent = lexicon_path

sent = ...

# YOUR CODE HERE
raise NotImplementedError()

In [ ]:

#TEST

assert np.allclose(sent['polarity'].head(), [-1.5, -0.4, -1.5, -0.4, -0.7])

Question 3b

We want to use the lexicon to calculate the overall sentiment for each of Trump’s tweets:

1. For each tweet, find the sentiment of each word.
2. Calculate the sentiment of each tweet by taking the sum of the sentiments of its words.

First, let's lowercase the text in the tweets since the lexicon is also lowercase. Set the text column of the trump table to be the lowercase text of each tweet.
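A minimal sketch of lowercasing a text column on a hypothetical frame:

# Hypothetical frame
demo = pd.DataFrame({"text": ["I LOVE America!"]})
demo["text"] = demo["text"].str.lower()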

In [ ]:

# YOUR CODE HERE
raise NotImplementedError()

In [ ]:

# TEST

assert trump.loc[954722155463430145, "text"] == 'democrats are holding our military hostage over their desire to have unchecked illegal immigration. cant let that happen!'

Question 3c

We need to get rid of punctuation. Otherwise we won't match words in the lexicon. Create a new column called no_punc in trump containing the lowercase text of each tweet with all punctuation replaced by a single space. We consider punctuation characters to be any character that isn't a Unicode word character or a whitespace character. Remember that

The special character \w matches a word character (letters, digits and underscore)
The special character \s matches a whitespace character
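A minimal sketch of one pattern built from these character classes, applied to a hypothetical string (an illustration, not necessarily the pattern you should submit):

# One possible pattern: any character that is neither a word character nor whitespace
punct_re_sketch = r'[^\w\s]'
pd.Series(["tax cuts, now!"]).str.replace(punct_re_sketch, ' ', regex=True)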

In [ ]:

# Save your regex in punct_re
punct_re = r''
trump['no_punc'] = ...

# YOUR CODE HERE
raise NotImplementedError()

In [ ]:

# TEST
assert re.search(punct_re, 'this') is None
assert re.search(punct_re, 'this is ok') is None
assert re.search(punct_re, 'this is \n ok') is None
assert re.search(punct_re, 'this is not ok.') is not None
assert re.search(punct_re, 'this#is#ok') is not None
assert re.search(punct_re, 'this^is ok') is not None

assert trump['text'].loc[884740553040175104] == 'working hard to get the olympics for the united states (l.a.). stay tuned!'

Question 3d:

We should convert the tweets into a tidy format (https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). Changing the format will help us to analyze the sentiments. Use the no_punc column of trump to create a table called tidy_format.

1. The index should be the id repeated once for every word in the tweet.
2. The first column should be called num. It should give the location of the word in the tweet. For example, if the tweet was "i love america", then the location of the word "i" is 0, "love" is 1, and "america" is 2.
3. The second column should be called word. It should give the individual words of each tweet.

Some rows of the tidy_format table look like:

                    num  word
894661651760377856    0  i
894661651760377856    1  think
894661651760377856    2  senator
894661651760377856    3  blumenthal
894661651760377856    4  should

Since we should avoid using loops, we will take advantage of pandas methods. We will take three steps.

First we use a string method called split that breaks the words across different columns.

In [ ]:

# Step 1

no_punc_split = trump["no_punc"].str.split(expand=True)

no_punc_split.head()

Second we use a method called melt. Remember that we discussed melt in Week 6 lecture.

In [ ]:

# Step 2

numbered_columns = no_punc_split.columns.values
no_punc_split.reset_index(inplace=True)

tidy_format = pd.melt(no_punc_split, id_vars=['id'], value_vars=numbered_columns)

tidy_format.head()

Third we need to

Rename variable column to num
Rename value column to word
Drop any rows with missing values
Sort by ['id','variable']
Set index to be the id column
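A minimal sketch of these cleaning steps on a hypothetical melted frame (the sort here uses the renamed column names):

# Hypothetical melted frame standing in for tidy_format
demo = pd.DataFrame({"id": [2, 1, 1], "variable": [0, 1, 0], "value": ["x", None, "y"]})
demo = demo.rename(columns={"variable": "num", "value": "word"})
demo = demo.dropna()
demo = demo.sort_values(["id", "num"]).set_index("id")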

In [ ]:

# Step 3

# YOUR CODE HERE
raise NotImplementedError()

In [ ]:

# TEST
assert tidy_format.loc[894661651760377856].shape == (27, 2)

Question 3e:

Now that we have changed the format, we can study the sentiment of each tweet. In particular, we can join the table with the lexicon table.

Add a polarity column to the trump table. The polarity column should contain the sum of the sentiment polarity of each word in the text of the tweet.

Take a left join of tidy_format and sent
Fill missing values with 0
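A minimal sketch of the left join and fill on hypothetical frames:

# Hypothetical word table and lexicon
words = pd.DataFrame({"num": [0, 1], "word": ["great", "zzz"]}, index=pd.Index([10, 10], name="id"))
lex = pd.DataFrame({"polarity": [3.1]}, index=pd.Index(["great"], name="token"))
merged = words.merge(lex, how="left", left_on="word", right_index=True)
merged["polarity"] = merged["polarity"].fillna(0)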

In [ ]:

tidy_format_sent_merged = ...

# YOUR CODE HERE
raise NotImplementedError()

In [ ]:

# TEST

assert tidy_format_sent_merged["polarity"].isna().sum() == 0
assert set(tidy_format_sent_merged.columns.values).issubset({"polarity", "word", "num"})

Group tidy_format_sent_merged by id. Use agg on the polarity column with the sum function to add the numbers for each post.
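A minimal sketch of the aggregation on a hypothetical frame whose index is id:

# Hypothetical frame indexed by id
demo = pd.DataFrame({"polarity": [3.1, -1.5, 0.0]}, index=pd.Index([10, 10, 11], name="id"))
demo.groupby("id").agg({"polarity": "sum"})["polarity"]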

In [ ]:

trump['polarity'] = ...

# YOUR CODE HERE
raise NotImplementedError()

In [ ]:

# TEST
assert np.allclose(trump.loc[744701872456536064, 'polarity'], 8.4)
assert np.allclose(trump.loc[745304731346702336, 'polarity'], 2.5)
assert np.allclose(trump.loc[744519497764184064, 'polarity'], 1.7)
# If you fail this test, you dropped tweets with 0 polarity
assert np.allclose(trump.loc[744355251365511169, 'polarity'], 0.0)

Now we have a measure of the sentiment of each of his tweets. Run the cells below to see the most positive and most negative tweets from Trump in your dataset:

In [ ]:

print('Most negative tweets:')
for t in trump.sort_values('polarity').head()['text']:
    print('\n', t)

In [ ]:

print('Most positive tweets:')
for t in trump.sort_values('polarity', ascending=False).head()['text']:
    print('\n', t)

Question 3f

Plot the distribution of tweet sentiments broken down by whether the text of the tweet contains nyt or fox. You should obtain a chart like the following.
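A minimal sketch of splitting the polarity values by keyword on a hypothetical frame; str.contains selects the rows, and the seaborn call here is only illustrative:

# Hypothetical frame with a few made-up rows per keyword
demo = pd.DataFrame({"text": ["nyt is failing", "the failing nyt", "thank you fox", "great show on fox"],
                     "polarity": [-2.0, -1.1, 1.9, 2.4]})
for keyword in ["nyt", "fox"]:
    subset = demo.loc[demo["text"].str.contains(keyword), "polarity"]
    sns.distplot(subset, label=keyword)
plt.legend()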

In [ ]:

Image(filename=img_path + '/news.PNG', embed=True, width=750)

In [ ]:

keywords = ["nyt","fox"]

for keyword in keywords: # YOUR CODE HERE raise NotImplementedError ()

plt.legend() q3f_gca = plt.gca();

In [ ]:

# TEST

heights = [rect.get_height() for rect in q3f_gca.get_children() if isinstance(rect, matplotlib.patches.Rectangle)]

assert np.isclose(sorted(heights)[-2], 0.32, atol=1e-1)

Question 4: Engagement

We want to understand the posts that led to many retweets. If a post was retweeted, then a follower of Donald Trump shared the post with other users. The keywords in these posts should indicate topics of interest.

In [ ]:

Image(filename=img_path + '/keywords.PNG', embed=True, width=750)

Question 4a

We will determine the words that led to many retweets on average. For example, at the time of this writing, Donald Trump has two tweets that contain the word 'oakland' (one of them is tweet 932570628451954688), with 36757 and 10286 retweets respectively, for an average of 23,521.5 retweets.

We will take four steps to find the 20 most retweeted words. We will include only words that appear in at least 25 tweets. The format of the top_20 table will be:

word       retweet_count
jong            40592.
…               37918.
iranian         32982.
un              32677.
kim             32237.

Note that the table will contain some words outside of the Latin alphabet. Surrounding a visit to India, some posts include characters from Devanagari, a Unicode block containing the script used to write languages such as Hindi.

First, we can use the tidy_format_sent_merged table from Question 3 to study the words in each post. Joining with the retweet_count column of trump gives us the number of retweets of the post for each word in the post.

In [ ]:

# Step 1

retweets = pd.merge(left=trump[["retweet_count"]],
                    right=tidy_format_sent_merged,
                    left_index=True,
                    right_index=True,
                    how="left")

Second, group retweets by word. Use filter to remove any words that appear fewer than 25 times.
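A minimal sketch of groupby + filter on a hypothetical frame (with the threshold lowered to 2 so the example stays small):

# Hypothetical frame; the assignment uses a threshold of 25
demo = pd.DataFrame({"word": ["iran", "iran", "golf"], "retweet_count": [100, 200, 50]})
demo.groupby("word").filter(lambda g: len(g) >= 2)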

In [ ]:

# Step 2

retweets_filtered = ...

# YOUR CODE HERE
raise NotImplementedError()

In [ ]:

# TEST

assert retweets_filtered.groupby("word").size().min() > 24

Third, group retweets_filtered by word. We can use agg to compute the average of the retweet_count column in each group.
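A minimal sketch of the per-word average on a hypothetical frame:

# Hypothetical frame standing in for retweets_filtered
demo = pd.DataFrame({"word": ["iran", "iran", "golf", "golf"], "retweet_count": [100, 200, 50, 70]})
demo.groupby("word").agg({"retweet_count": "mean"})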

In [ ]:

# Step 3

retweets_average = ...

# YOUR CODE HERE
raise NotImplementedError()

In [ ]:

# TEST

assert retweets_average.index.name == "word"
assert retweets_average.columns.values == ["retweet_count"]

Now that we have the average number of retweets in retweets_average, use sort_values to determine the top 20.
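A minimal sketch of ranking and taking the largest rows on a hypothetical frame:

# Hypothetical averaged frame
demo = pd.DataFrame({"retweet_count": [3.0, 1.0, 2.0]}, index=pd.Index(["a", "b", "c"], name="word"))
demo.sort_values("retweet_count", ascending=False).head(20)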

In [ ]:

# Step 4

top_20 = ...

# YOUR CODE HERE
raise NotImplementedError()

In [ ]:

# TEST
assert 'iran' in top_20.index
assert 'nuclear' in top_20.index

Here’s a bar chart of your results:

In [ ]:

top_20['retweet_count'].plot.barh();

Question 4b

At some point in time, "kim", "jong" and "un" were popular in Trump's tweets. Can we conclude that tweets involving "jong" are more popular than his other tweets?

Consider each of the statements about possible confounding factors below. State whether each statement is true or false and explain. If the statement is true, state whether the confounding factor could have made kim jong un related tweets higher in the list than they should be.

1. We didn't restrict our word list to nouns, so we have unhelpful words like "let" and "any" in our result.
2. We didn't remove hashtags in our text, so we have duplicate words (eg. #great and great).
3. We didn't account for the fact that Trump's follower count has increased over time.
YOUR ANSWER HERE