report | assignment | R代写 | 统计代写 | data science代写 | stata代写 | spss代写 - Homework # 2 Assignment – 学霸代写– CS代写| CS作业代写| Computer Science留学生作业代写| 淘宝托管

Homework # 2 Assignment

report | assignment | R代写 | 统计代写 | data science代写 | stata代写 | spss代写 – 该题目是一个常规的report的练习题目代写, 涵盖了report等方面, 这个项目是assignment代写的代写题目

Xuanyu Zhou

For this assignment (and all assignments in this class), you are to turn in your work as an R Markdown document. Submit your solutions to GradeScope in PDF format.

Problem 1

For this problem, we will be using the netflix_titles.csv file which is a dataset from a Netflix Movies and TV Shows study. Most of the column names are reasonably descriptive. The only columns that may be confusing is AgeCHDdiag which is the age the patient was diagnosed as having CHD ( Chronic heart disease ) and MRW which stands for Metropolitan Relative Weight , an old medical index of a ratio of a persons weight compared to a desirable weight for that persons height and gender.

a. For each attribute, identify (by replacing the xxxin the table) whether it is a measure (quantitative) or a
category (qualitative) and then for the measures whether it is of type interval or ratio and for the categories
whether it is of type ordinal or nominal: (10 points)

netflix <- read.csv("netflix_titles.csv")
sapply(netflix, class)

## show_id type title director cast country
## "character" "character" "character" "character" "character" "character"
## date_added release_year rating duration listed_in
## "character" "integer" "character" "character" "character"

Attribute Category/Measure Interval/Ratio or Nominal/Ordinal

show_id Category Nominal

type Category Nominal

title Category Nominal

director Category Nominal

cast Category Nominal

country Category Nominal

date_added Measure Interval

release_year Measure Interval

rating Category Nominal

duration Measure Interval

listed_in Category Nominal

b. Read the file into a tibble, convert the char columns to factors, and add a new binary column PG_13 that
has a value of 1 if the movie is rated as PG-13 and a 0 otherwise (5 points)

library (tidyverse)

##  Attaching packages 
 tidyverse 1.3.1

##  ggplot2 3.3.5  purrr 0.3.
##  tibble 3.1.6  dplyr 1.0.
##  tidyr 1.1.4  stringr 1.4.
##  readr 2.1.2  forcats 0.5.

##  Conflicts 
tidyverse_conflicts() 
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()

netflix <- as_tibble(read.csv("netflix_titles.csv"))
for (col in colnames(netflix)){
if (class(netflix[[col]]) == "character"){
netflix[[col]] <- as.factor(netflix[[col]])
}
}
netflix[["PG-13"]] <- 0
netflix[netflix[["rating"]] == "PG-13","PG-13"] <- 1

c. Write a command to calculate the average duration for TV Shows and Movies. (5 points)

get_duration <- function (duration){
return (as.numeric(strsplit(duration, split = " ")[[ 1 ]][ 1 ]))
}

digit_duration <- NULL
for (row in c( 1 :nrow(netflix))){
digit_duration <- c(digit_duration, get_duration(as.character(netflix[["duration"]][ro
w])))
}
netflix[["digit_duration"]] <- digit_duration
aggregate(netflix[["digit_duration"]], by=list(netflix[["type"]]), mean, na.rm=T)

Group.
<fct>

x
<dbl>

Movie 99.

TV Show 1.

2 rows

Previous 1 2 3 4 5 6 ... 19 Next

d. Write a command to calculate the total number of movies in each country and the number of those
movies that are rated PG-13. Return the result in descending order of number of PG-13 movies. (5 points)

get_countries <- function (country){
return (strsplit(country, split=",")[[ 1 ]])
}

movie_df <- netflix[netflix[["type"]] == "Movie",]
countries <- NULL
for (country in movie_df[["country"]]){
countries <- c(countries, get_countries(country))
}
movie_tibble <- as_tibble(table(countries))

movie_pg_13_df <- movie_df[movie_df[["PG-13"]] == 1 , ]
countries <- NULL
for (country in movie_pg_13_df[["country"]]){
countries <- c(countries, get_countries(country))
}
movie_pg_13_tibble <- as_tibble(table(countries))
country_movie <- merge(movie_tibble, movie_pg_13_tibble, by.x = "countries", by.y="count
ries", all.x=T)
colnames(country_movie) <- c("Country", "Movie_num", "PG-13_movie_num")
country_movie[is.na(country_movie[["PG-13_movie_num"]]), "PG-13_movie_num"] = 0
country_movie <- country_movie[order(country_movie$`PG-13_movie_num`, decreasing = T),]
country_movie

Country
<chr>

Movie_num
<int>

PG-13_movie_num
<dbl>

182 United States 2364 362

104 United States 388 71

181 United Kingdom 382 51

103 United Kingdom 152 33

33 France 155 27

19 Canada 132 25

34 Germany 104 24

22 China 55 9

8 Australia 33 8

128 France 148 8

1-10 of 187 rows

Problem 2

Previous 1 2 3 4 5 Next

The file video_game_data.csv contains a variety of data about video games. Read this data into a tibble and convert it to be in tidy format.

video_games <- as_tibble(read.csv("video_game_data.csv"))
spread(video_games, key = "details", value = "value")

VideoGame
<chr>

client
<chr>

onsale
<chr>

price
<chr>

publish_year
<chr>

sales
<chr>

big friend war PC FALSE 39.99 2020 20

Flying Mario PC FALSE 39.99 2021 20

Human Falls4 Mobile TRUE 6.99 2019 3025

3 rows

Problem 3

The file orders.csv contains the quantities of clothing items purchased by various companies. Read this data into a tibble and then convert it to be in tidy format.

orders <- as_tibble(read.csv("Orders.csv"))
orders_tibble <- aggregate(orders[, c( 2 : 10 )], by=list(orders$Company), sum)
gather(orders_tibble, type, value, -Group.1)

Group.
<chr>

type
<chr>

value
<int>

Company A Hat.Small 753

Company B Hat.Small 285

Company C Hat.Small 475

Company D Hat.Small 803

Company E Hat.Small 532

Company A Hat.Medium 793

Company B Hat.Medium 487

Company C Hat.Medium 495

Company D Hat.Medium 524

Company E Hat.Medium 659

1-10 of 45 rows

Problem 4

The file 10_year_sales.xlsx contains weekly sales data for a 10 year period by city. Note that there are 10 tabs in this file.

a. Use the tidyverse read_excel function to read in each of these tabs and consolidate them into a single
tibble that is in a tidy format.

library (tidyverse)
library (readxl)

get_df <- function (file, sheet){
df <- as_tibble(read_excel(file, sheet=sheet))
year <- rep(as.numeric(sheet), nrow(df)- 1 )
month <- df[ 2 :nrow(df), 1 ]
day <- df[ 2 :nrow(df), 2 ]
for (col in ( 3 :ncol(df))){
if (col %% 2 == 1 ){
city <- rep(colnames(df)[col], nrow(df)- 1 )
price <- df[ 2 :nrow(df), col]
quantity <- df[ 2 :nrow(df), col+ 1 ]
temp <- tibble(year, month, day, city, price, quantity)
colnames(temp) <- c("year", "month", "day", "city", "price", "quantity")
if (col == 3 ){
result <- temp
} else {
result <- rbind(result, temp)
}
}
}
return (result)
}

for (sheet in excel_sheets("10_year_sales.xlsx")){
if (sheet == "2010"){
sales <- get_df("10_year_sales.xlsx", sheet)
} else {
sales <- rbind(sales, get_df("10_year_sales.xlsx", sheet))
}
}

## New names:

## * “ -> …

## New names:

## * “ -> … ## New names:

## * “ -> …

## New names:

* “ -> …
* “ -> …
* “ -> …
* “ -> …
* “ -> … ## New names:
* “ -> …
* “ -> …
* “ -> …
* “ -> … ## New names:
* “ -> …
* “ -> …
* “ -> …
* “ -> … ## New names:
* “ -> …
* “ -> …
* “ -> …
* “ -> …

New names:

## * “ -> …

## * “ -> … ## New names:

## * “ -> …

New names:

## * “ -> … ## New names:

## * “ -> …

* …

New names:

* “ -> …

* …

New names:

* “ -> …

* …

sales

year
<dbl>

month
<chr>

day
<chr>

city
<chr>

price
<chr>

quantity
<chr>

2010 1 1 Noblesville 7.6498290378267 2529.

2010 1 8 Noblesville 12.8198861460368 2342.

2010 1 15 Noblesville 10.009212503725 2403.

2010 1 22 Noblesville 12.5136330390549 2264.

2010 1 29 Noblesville 11.8081908578277 2312.

2010 2 5 Noblesville 6.14827970115427 2582.

2010 2 12 Noblesville 9.37422641842372 2448.

2010 2 19 Noblesville 12.8221199240939 2306.

2010 2 26 Noblesville 9.2498885747989 2413.

2010 3 5 Noblesville 12.1420270377188 2338.

1-10 of 1,380 rows Previous 1 2 3 4 5 6 ... 138 Next

b. Generate a tibble showing the total revenue (price times quantity) for each city from the years 2015 through
2019.

select_sales <- sales[sales[["year"]] > 2014 , ]
select_sales[["revenue"]] <- as.numeric(select_sales[["price"]]) * as.numeric(select_sal
es[["quantity"]])
city_revenue <- aggregate(select_sales[["revenue"]], by=list(select_sales[["city"]]), su
m, na.rm=T)
colnames(city_revenue) <- c("city", "revenue")
city_revenue

city
<chr>

revenue
<dbl>

Bloomington 10191661

Fort Wayne 5447501

Muncie 10767048

Noblesville 9447614

South Bend 8085843

5 rows

Problem 5

You are to use the tables found in Orders – Data Integration.csv and Product Costs – Data Integration.csv to generate a report of the total profit from each of your customers. Perform the following steps:

a. Read in the two files and determine if either (or both) of the files are not in Tidy format. (5 points)

orders <- as_tibble(read.csv("Orders - Data Integration.csv"))
products <- as_tibble(read.csv("Product Costs - Data Integration.csv"))

Summarize below which table(s) are not in Tidy format and why:

The Product Costs - Data Integration table is not in Tidy format because the column headers 2019 Cost
and 2020 Cost contain values.
b. If either table is not in Tidy format, correct it. Also, convert the costs from character to numeric types.

colnames(products) <- c("Product.Code", "2019", "2020")
products <- gather(products, key="Year", value="Cost", "2019", "2020")
products

Product.Code
<chr>

Year
<chr>

Cost
<dbl>

Previous 1 2 3 4 Next

Previous 1 2 Next

Product.Code
<chr>

Year
<chr>

Cost
<dbl>

X1388 2019 540.

ZZ52 2019 196.

X1189 2019 244.

Y12 2019 322.

HCK15 2019 335.

ZY13 2019 303.

ZZ522 2019 448.

HC155 2019 282.

BW243 2019 367.

GG2554 2019 428.

1-10 of 32 rows

c. Join the two tables (be careful which type of join to use)

total <- merge(orders, products, by.x=c("Year.Purchased", "Product.Code"), by.y=c("Year"
, "Product.Code"))
total

Year.Purchased
<int>

Product.Code
<chr>

Order_N...
<int>

Customer
<chr>

Line.Item
<int>

Quantity
<int>

Unit.Price
<int>

Cost
<dbl>

2019 A33 1000 Customer A 2 63 698 467.

2019 BW243 1000 Customer A 3 75 483 367.

2019 X1189 1000 Customer A 1 65 314 244.

2019 X1388 1000 Customer A 4 20 684 540.

2019 X1388 1003 Customer D 1 62 684 540.

2019 Y12 1000 Customer A 5 82 474 322.

2020 BW243 1001 Customer B 2 29 483 380.

2020 GG2554 1001 Customer B 3 70 595 447.

2020 GG2554 1002 Customer C 5 98 595 447.

2020 HC155 1002 Customer C 2 73 404 303.

1-10 of 16 rows

d. Create a new column in the joined table that is equal to the total profit for that line item (note: profit equals
the price that you charge for an item minus the cost that you have to pay to get that item from your

supplier) (5 points)

total[["total_profit"]] <- total[["Quantity"]] * (total[["Unit.Price"]] - total[["Cost"
]])
total

e. Generate a table that contains the total profit received from each of your customers. Print out that table in a
nice format using the knitr::kable() function. (5 points)

library (knitr)
tables <- aggregate(total[["total_profit"]], by=list(total[["Customer"]]), sum)
colnames(tables) <- c("Customer", "Customer Profit")
kable(tables)

Customer Customer Profit

Customer A 43006.

Customer B 23080.

Customer C 68533.

Customer D 8905.

f. Extra credit: Install the formattable library to display the Customer Profit column in your table in a
currency format

#install.packages("formattable")
library (formattable)
currency(tables$`Customer Profit`)

Previous 1 2 Next

Year.Purchased
<int>

Product.Code
<chr>

Order_N...
<int>

Customer
<chr>

Line.Item
<int>

Quantity
<int>

Unit.Price
<int>

Cost
<dbl>

2019 A33 1000 Customer A 2 63 698 467.

2019 BW243 1000 Customer A 3 75 483 367.

2019 X1189 1000 Customer A 1 65 314 244.

2019 X1388 1000 Customer A 4 20 684 540.

2019 X1388 1003 Customer D 1 62 684 540.

2019 Y12 1000 Customer A 5 82 474 322.

2020 BW243 1001 Customer B 2 29 483 380.

2020 GG2554 1001 Customer B 3 70 595 447.

2020 GG2554 1002 Customer C 5 98 595 447.

2020 HC155 1002 Customer C 2 73 404 303.

1-10 of 16 rows

## [1] $43,006.18 $23,080.80 $68,533.48 $8,905.

Problem 6

For this problem, you are to use the tables found in files Roster.csv, ISE535 Spring 2021.csv, ISE599 Fall 2020.csv, and ISE535 Fall 2021.csv

Perform the following steps:

a. Read the four csv files into tibbles and combine them into a single table that is in tidy format (call it
gradebook). Load the lubridate library and use it to convert the Date of Birth attribute to be a date type
and covert the grade to a factor type. (15 points)

roster <- as_tibble(read.csv("Roster.csv"))
ise535_spring <- as_tibble(read.csv("ISE535 Spring 2021.csv"))
ise599_fall <- as_tibble(read.csv("ISE599 Fall 2020.csv"))
ise535_fall <- as_tibble(read.csv("ISE535 Fall 2021.csv"))

colnames(roster) <- roster[ 1 , ]
roster <- roster[ 2 :nrow(roster), ]

colnames(ise535_spring) <- c("Student ID", "Spring Semester 2021")
ise535_spring <- ise535_spring[ 2 :nrow(ise535_spring),]

colnames(ise599_fall) <- c("Student ID", "Fall Semester 2020")
ise599_fall <- ise599_fall[ 2 :nrow(ise599_fall),]

colnames(ise535_fall) <- c("Student ID", "Fall Semester 2021")
ise535_fall <- ise535_fall[ 2 :nrow(ise535_fall),]

gradebook <- merge(roster, ise535_spring, all.x = T)
gradebook <- merge(gradebook, ise599_fall, all.x = T)
gradebook <- merge(gradebook, ise535_fall, all.x = T)

library (lubridate)

##
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union

gradebook[["Date of Birth"]] <- mdy(gradebook[["Date of Birth"]])
gradebook[["Spring Semester 2021"]] <- as.factor(gradebook[["Spring Semester 2021"]])
gradebook[["Fall Semester 2020"]] <- as.factor(gradebook[["Fall Semester 2020"]])
gradebook[["Fall Semester 2021"]] <- as.factor(gradebook[["Fall Semester 2021"]])
gradebook

b. Add a column that contains the grade point value for the class (A = 4, B = 3, C = 2) and then use that
column to display the GPA for each student. Hint: look at the Tidyverse case_when command. (10 points)

get_point <- function (grade){
if (is.na(grade)){
return ( 0 )
} else if (grade == "A"){
return ( 4 )
} else if (grade == "B"){
return ( 3 )
} else if (grade == "C"){
return ( 2 )
} else {
return ( 0 )
}
}

grade_book_long <- gather(gradebook, key="semester", value="level", "Spring Semester 202
1", "Fall Semester 2020", "Fall Semester 2021")
points = NULL
for (row in c( 1 :nrow(grade_book_long))){
points <- c(points, get_point(grade_book_long[row, "level"]))
}
grade_book_long[["point"]] <- points
results <- aggregate(grade_book_long$point, by=list(grade_book_long[["Student Name"]]),
mean)
colnames(results) <- c("Student Name", "GPA")
results

Previous 1 2 3 Next

Student ID
<chr>

Student Name
<chr>

Home Country
<chr>

Date of Birth
<date>

Spring Semester 2021
<fct>

Fall Seme
<fct>

Student ID
<chr>

Student Name
<chr>

Home Country
<chr>

Date of Birth
<date>

Spring Semester 2021
<fct>

Fall Seme
<fct>

161677 Danika Floyd USA 1994-08-21 A A

180376 Helena Gibbs USA 1994-06-21 A B

239998 Frankie Shields Japan 1992-03-

246074 Elliott Watts Mexico 1998-03-

265902 Dario Holt India 1996-03-

327654 Jadyn Henry India 1992-11-

383626 Keely Beasley USA 1995-10-13 B C

457169 Grayson Riddle Mexico 1990-02-21 A

476440 Gary Cohen India 1990-05-15 B

542015 Alfred Lang India 1994-11-18 A B

1-10 of 24 rows | 1-6 of 7 columns

NA NA

NA

Previous 1 2 3 Next

Student Name
<chr>

GPA

<dbl>

Student Name
<chr>

GPA

<dbl>

Jadyn Henry 0.

Janiyah Downs 0.

Tyree Levine 0.

Alfred Lang 2.

Charlie Franklin 0.

Ciara Hendrix 2.

Danika Floyd 3.

Dario Holt 0.

Darryl Rodriguez 0.

Elliana Patton 2.

1-10 of 24 rows