report | assignment | R代写 | 统计代写 | data science代写 | stata代写 | spss代写 – Homework # 2 Assignment

Homework # 2 Assignment

report | assignment | R代写 | 统计代写 | data science代写 | stata代写 | spss代写 – 该题目是一个常规的report的练习题目代写, 涵盖了report等方面, 这个项目是assignment代写的代写题目

ass代做 assignment代写 代写assignment

Xuanyu Zhou

For this assignment (and all assignments in this class), you are to turn in your work as an R Markdown document. Submit your solutions to GradeScope in PDF format.

Problem 1

For this problem, we will be using the netflix_titles.csv file which is a dataset from a Netflix Movies and TV Shows study. Most of the column names are reasonably descriptive. The only columns that may be confusing is AgeCHDdiag which is the age the patient was diagnosed as having CHD ( Chronic heart disease ) and MRW which stands for Metropolitan Relative Weight , an old medical index of a ratio of a persons weight compared to a desirable weight for that persons height and gender.

a. For each attribute, identify (by replacing the xxxin the table) whether it is a measure (quantitative) or a
category (qualitative) and then for the measures whether it is of type interval or ratio and for the categories
whether it is of type ordinal or nominal: (10 points)
netflix <- read.csv("netflix_titles.csv")
sapply(netflix, class)
## show_id type title director cast country
## "character" "character" "character" "character" "character" "character"
## date_added release_year rating duration listed_in
## "character" "integer" "character" "character" "character"
Attribute Category/Measure Interval/Ratio or Nominal/Ordinal
show_id Category Nominal
type Category Nominal
title Category Nominal
director Category Nominal
cast Category Nominal
country Category Nominal
date_added Measure Interval
release_year Measure Interval
rating Category Nominal
duration Measure Interval
listed_in Category Nominal
b. Read the file into a tibble, convert the char columns to factors, and add a new binary column PG_13 that
has a value of 1 if the movie is rated as PG-13 and a 0 otherwise (5 points)
library (tidyverse)
##  Attaching packages 
 tidyverse 1.3.1 
##  ggplot2 3.3.5  purrr 0.3.
##  tibble 3.1.6  dplyr 1.0.
##  tidyr 1.1.4  stringr 1.4.
##  readr 2.1.2  forcats 0.5.
##  Conflicts 
tidyverse_conflicts() 
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
netflix <- as_tibble(read.csv("netflix_titles.csv"))
for (col in colnames(netflix)){
if (class(netflix[[col]]) == "character"){
netflix[[col]] <- as.factor(netflix[[col]])
}
}
netflix[["PG-13"]] <- 0
netflix[netflix[["rating"]] == "PG-13","PG-13"] <- 1
c. Write a command to calculate the average duration for TV Shows and Movies. (5 points)
get_duration <- function (duration){
return (as.numeric(strsplit(duration, split = " ")[[ 1 ]][ 1 ]))
}
digit_duration <- NULL
for (row in c( 1 :nrow(netflix))){
digit_duration <- c(digit_duration, get_duration(as.character(netflix[["duration"]][ro
w])))
}
netflix[["digit_duration"]] <- digit_duration
aggregate(netflix[["digit_duration"]], by=list(netflix[["type"]]), mean, na.rm=T)
Group.
<fct>
x
<dbl>
Movie 99.
TV Show 1.

2 rows

Previous 1 2 3 4 5 6 ... 19 Next
d. Write a command to calculate the total number of movies in each country and the number of those
movies that are rated PG-13. Return the result in descending order of number of PG-13 movies. (5 points)
get_countries <- function (country){
return (strsplit(country, split=",")[[ 1 ]])
}
movie_df <- netflix[netflix[["type"]] == "Movie",]
countries <- NULL
for (country in movie_df[["country"]]){
countries <- c(countries, get_countries(country))
}
movie_tibble <- as_tibble(table(countries))
movie_pg_13_df <- movie_df[movie_df[["PG-13"]] == 1 , ]
countries <- NULL
for (country in movie_pg_13_df[["country"]]){
countries <- c(countries, get_countries(country))
}
movie_pg_13_tibble <- as_tibble(table(countries))
country_movie <- merge(movie_tibble, movie_pg_13_tibble, by.x = "countries", by.y="count
ries", all.x=T)
colnames(country_movie) <- c("Country", "Movie_num", "PG-13_movie_num")
country_movie[is.na(country_movie[["PG-13_movie_num"]]), "PG-13_movie_num"] = 0
country_movie <- country_movie[order(country_movie$`PG-13_movie_num`, decreasing = T),]
country_movie
Country
<chr>
Movie_num
<int>
PG-13_movie_num
<dbl>
182 United States 2364 362
104 United States 388 71
181 United Kingdom 382 51
103 United Kingdom 152 33
33 France 155 27
19 Canada 132 25
34 Germany 104 24
22 China 55 9
8 Australia 33 8
128 France 148 8
1-10 of 187 rows

Problem 2

Previous 1 2 3 4 5 Next

The file video_game_data.csv contains a variety of data about video games. Read this data into a tibble and convert it to be in tidy format.

video_games <- as_tibble(read.csv("video_game_data.csv"))
spread(video_games, key = "details", value = "value")
VideoGame
<chr>
client
<chr>
onsale
<chr>
price
<chr>
publish_year
<chr>
sales
<chr>
big friend war PC FALSE 39.99 2020 20
Flying Mario PC FALSE 39.99 2021 20
Human Falls4 Mobile TRUE 6.99 2019 3025
3 rows

Problem 3

The file orders.csv contains the quantities of clothing items purchased by various companies. Read this data into a tibble and then convert it to be in tidy format.

orders <- as_tibble(read.csv("Orders.csv"))
orders_tibble <- aggregate(orders[, c( 2 : 10 )], by=list(orders$Company), sum)
gather(orders_tibble, type, value, -Group.1)
Group.
<chr>
type
<chr>
value
<int>
Company A Hat.Small 753
Company B Hat.Small 285
Company C Hat.Small 475
Company D Hat.Small 803
Company E Hat.Small 532
Company A Hat.Medium 793
Company B Hat.Medium 487
Company C Hat.Medium 495
Company D Hat.Medium 524
Company E Hat.Medium 659
1-10 of 45 rows

Problem 4

The file 10_year_sales.xlsx contains weekly sales data for a 10 year period by city. Note that there are 10 tabs in this file.

a. Use the tidyverse read_excel function to read in each of these tabs and consolidate them into a single
tibble that is in a tidy format.
library (tidyverse)
library (readxl)
get_df <- function (file, sheet){
df <- as_tibble(read_excel(file, sheet=sheet))
year <- rep(as.numeric(sheet), nrow(df)- 1 )
month <- df[ 2 :nrow(df), 1 ]
day <- df[ 2 :nrow(df), 2 ]
for (col in ( 3 :ncol(df))){
if (col %% 2 == 1 ){
city <- rep(colnames(df)[col], nrow(df)- 1 )
price <- df[ 2 :nrow(df), col]
quantity <- df[ 2 :nrow(df), col+ 1 ]
temp <- tibble(year, month, day, city, price, quantity)
colnames(temp) <- c("year", "month", "day", "city", "price", "quantity")
if (col == 3 ){
result <- temp
} else {
result <- rbind(result, temp)
}
}
}
return (result)
}
for (sheet in excel_sheets("10_year_sales.xlsx")){
if (sheet == "2010"){
sales <- get_df("10_year_sales.xlsx", sheet)
} else {
sales <- rbind(sales, get_df("10_year_sales.xlsx", sheet))
}
}
## New names:

## * “ -> …

## * “ -> …

## * “ -> …

## New names:

## * “ -> … ## New names:

## * “ -> …

## * “ -> …

## New names:

  • * “ -> …

  • * “ -> …

  • * “ -> …

  • * “ -> …

  • * “ -> … ## New names:

  • * “ -> …

  • * “ -> …

  • * “ -> …

  • * “ -> … ## New names:

  • * “ -> …

  • * “ -> …

  • * “ -> …

  • * “ -> … ## New names:

  • * “ -> …

  • * “ -> …

  • * “ -> …

  • * “ -> …

New names:

## * “ -> …

## * “ -> …

## * “ -> … ## New names:

## * “ -> …

## * “ -> …

## * “ -> …

New names:

## * “ -> … ## New names:

## * “ -> …

## * “ -> …

## * “ -> …

## * “ -> …

* …

New names:

* “ -> …

* “ -> …

* “ -> …

* “ -> …

* “ -> …

* …

New names:

* “ -> …

* “ -> …

* “ -> …

* “ -> …

* “ -> …

* …

sales

year
<dbl>
month
<chr>
day
<chr>
city
<chr>
price
<chr>
quantity
<chr>
2010 1 1 Noblesville 7.6498290378267 2529.
2010 1 8 Noblesville 12.8198861460368 2342.
2010 1 15 Noblesville 10.009212503725 2403.
2010 1 22 Noblesville 12.5136330390549 2264.
2010 1 29 Noblesville 11.8081908578277 2312.
2010 2 5 Noblesville 6.14827970115427 2582.
2010 2 12 Noblesville 9.37422641842372 2448.
2010 2 19 Noblesville 12.8221199240939 2306.
2010 2 26 Noblesville 9.2498885747989 2413.
2010 3 5 Noblesville 12.1420270377188 2338.
1-10 of 1,380 rows Previous 1 2 3 4 5 6 ... 138 Next
b. Generate a tibble showing the total revenue (price times quantity) for each city from the years 2015 through
2019.
select_sales <- sales[sales[["year"]] > 2014 , ]
select_sales[["revenue"]] <- as.numeric(select_sales[["price"]]) * as.numeric(select_sal
es[["quantity"]])
city_revenue <- aggregate(select_sales[["revenue"]], by=list(select_sales[["city"]]), su
m, na.rm=T)
colnames(city_revenue) <- c("city", "revenue")
city_revenue
city
<chr>
revenue
<dbl>
Bloomington 10191661
Fort Wayne 5447501
Muncie 10767048
Noblesville 9447614
South Bend 8085843
5 rows

Problem 5

You are to use the tables found in Orders – Data Integration.csv and Product Costs – Data Integration.csv to generate a report of the total profit from each of your customers. Perform the following steps:

a. Read in the two files and determine if either (or both) of the files are not in Tidy format. (5 points)
orders <- as_tibble(read.csv("Orders - Data Integration.csv"))
products <- as_tibble(read.csv("Product Costs - Data Integration.csv"))

Summarize below which table(s) are not in Tidy format and why:

The Product Costs - Data Integration table is not in Tidy format because the column headers 2019 Cost
and 2020 Cost contain values.
b. If either table is not in Tidy format, correct it. Also, convert the costs from character to numeric types.
colnames(products) <- c("Product.Code", "2019", "2020")
products <- gather(products, key="Year", value="Cost", "2019", "2020")
products
Product.Code
<chr>
Year
<chr>
Cost
<dbl>
Previous 1 2 3 4 Next
Previous 1 2 Next
Product.Code
<chr>
Year
<chr>
Cost
<dbl>
X1388 2019 540.
ZZ52 2019 196.
X1189 2019 244.
Y12 2019 322.
HCK15 2019 335.
ZY13 2019 303.
ZZ522 2019 448.
HC155 2019 282.
BW243 2019 367.
GG2554 2019 428.

1-10 of 32 rows

c. Join the two tables (be careful which type of join to use)
total <- merge(orders, products, by.x=c("Year.Purchased", "Product.Code"), by.y=c("Year"
, "Product.Code"))
total
Year.Purchased
<int>
Product.Code
<chr>
Order_N...
<int>
Customer
<chr>
Line.Item
<int>
Quantity
<int>
Unit.Price
<int>
Cost
<dbl>
2019 A33 1000 Customer A 2 63 698 467.
2019 BW243 1000 Customer A 3 75 483 367.
2019 X1189 1000 Customer A 1 65 314 244.
2019 X1388 1000 Customer A 4 20 684 540.
2019 X1388 1003 Customer D 1 62 684 540.
2019 Y12 1000 Customer A 5 82 474 322.
2020 BW243 1001 Customer B 2 29 483 380.
2020 GG2554 1001 Customer B 3 70 595 447.
2020 GG2554 1002 Customer C 5 98 595 447.
2020 HC155 1002 Customer C 2 73 404 303.

1-10 of 16 rows

d. Create a new column in the joined table that is equal to the total profit for that line item (note: profit equals
the price that you charge for an item minus the cost that you have to pay to get that item from your
supplier) (5 points)
total[["total_profit"]] <- total[["Quantity"]] * (total[["Unit.Price"]] - total[["Cost"
]])
total
e. Generate a table that contains the total profit received from each of your customers. Print out that table in a
nice format using the knitr::kable() function. (5 points)
library (knitr)
tables <- aggregate(total[["total_profit"]], by=list(total[["Customer"]]), sum)
colnames(tables) <- c("Customer", "Customer Profit")
kable(tables)

Customer Customer Profit

Customer A 43006.

Customer B 23080.

Customer C 68533.

Customer D 8905.

f. Extra credit: Install the formattable library to display the Customer Profit column in your table in a
currency format
#install.packages("formattable")
library (formattable)
currency(tables$`Customer Profit`)
Previous 1 2 Next
Year.Purchased
<int>
Product.Code
<chr>
Order_N...
<int>
Customer
<chr>
Line.Item
<int>
Quantity
<int>
Unit.Price
<int>
Cost
<dbl>
2019 A33 1000 Customer A 2 63 698 467.
2019 BW243 1000 Customer A 3 75 483 367.
2019 X1189 1000 Customer A 1 65 314 244.
2019 X1388 1000 Customer A 4 20 684 540.
2019 X1388 1003 Customer D 1 62 684 540.
2019 Y12 1000 Customer A 5 82 474 322.
2020 BW243 1001 Customer B 2 29 483 380.
2020 GG2554 1001 Customer B 3 70 595 447.
2020 GG2554 1002 Customer C 5 98 595 447.
2020 HC155 1002 Customer C 2 73 404 303.
1-10 of 16 rows
## [1] $43,006.18 $23,080.80 $68,533.48 $8,905.

Problem 6

For this problem, you are to use the tables found in files Roster.csv, ISE535 Spring 2021.csv, ISE599 Fall 2020.csv, and ISE535 Fall 2021.csv

Perform the following steps:

a. Read the four csv files into tibbles and combine them into a single table that is in tidy format (call it
gradebook). Load the lubridate library and use it to convert the Date of Birth attribute to be a date type
and covert the grade to a factor type. (15 points)
roster <- as_tibble(read.csv("Roster.csv"))
ise535_spring <- as_tibble(read.csv("ISE535 Spring 2021.csv"))
ise599_fall <- as_tibble(read.csv("ISE599 Fall 2020.csv"))
ise535_fall <- as_tibble(read.csv("ISE535 Fall 2021.csv"))
colnames(roster) <- roster[ 1 , ]
roster <- roster[ 2 :nrow(roster), ]
colnames(ise535_spring) <- c("Student ID", "Spring Semester 2021")
ise535_spring <- ise535_spring[ 2 :nrow(ise535_spring),]
colnames(ise599_fall) <- c("Student ID", "Fall Semester 2020")
ise599_fall <- ise599_fall[ 2 :nrow(ise599_fall),]
colnames(ise535_fall) <- c("Student ID", "Fall Semester 2021")
ise535_fall <- ise535_fall[ 2 :nrow(ise535_fall),]
gradebook <- merge(roster, ise535_spring, all.x = T)
gradebook <- merge(gradebook, ise599_fall, all.x = T)
gradebook <- merge(gradebook, ise535_fall, all.x = T)
library (lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
gradebook[["Date of Birth"]] <- mdy(gradebook[["Date of Birth"]])
gradebook[["Spring Semester 2021"]] <- as.factor(gradebook[["Spring Semester 2021"]])
gradebook[["Fall Semester 2020"]] <- as.factor(gradebook[["Fall Semester 2020"]])
gradebook[["Fall Semester 2021"]] <- as.factor(gradebook[["Fall Semester 2021"]])
gradebook
b. Add a column that contains the grade point value for the class (A = 4, B = 3, C = 2) and then use that
column to display the GPA for each student. Hint: look at the Tidyverse case_when command. (10 points)
get_point <- function (grade){
if (is.na(grade)){
return ( 0 )
} else if (grade == "A"){
return ( 4 )
} else if (grade == "B"){
return ( 3 )
} else if (grade == "C"){
return ( 2 )
} else {
return ( 0 )
}
}
grade_book_long <- gather(gradebook, key="semester", value="level", "Spring Semester 202
1", "Fall Semester 2020", "Fall Semester 2021")
points = NULL
for (row in c( 1 :nrow(grade_book_long))){
points <- c(points, get_point(grade_book_long[row, "level"]))
}
grade_book_long[["point"]] <- points
results <- aggregate(grade_book_long$point, by=list(grade_book_long[["Student Name"]]),
mean)
colnames(results) <- c("Student Name", "GPA")
results
Previous 1 2 3 Next
Student ID
<chr>
Student Name
<chr>
Home Country
<chr>
Date of Birth
<date>
Spring Semester 2021
<fct>
Fall Seme
<fct>
Student ID
<chr>
Student Name
<chr>
Home Country
<chr>
Date of Birth
<date>
Spring Semester 2021
<fct>
Fall Seme
<fct>
161677 Danika Floyd USA 1994-08-21 A A
180376 Helena Gibbs USA 1994-06-21 A B
239998 Frankie Shields Japan 1992-03-
246074 Elliott Watts Mexico 1998-03-
265902 Dario Holt India 1996-03-
327654 Jadyn Henry India 1992-11-
383626 Keely Beasley USA 1995-10-13 B C
457169 Grayson Riddle Mexico 1990-02-21 A
476440 Gary Cohen India 1990-05-15 B
542015 Alfred Lang India 1994-11-18 A B

1-10 of 24 rows | 1-6 of 7 columns

NA NA
NA NA
NA NA
NA NA
NA
NA
Previous 1 2 3 Next
Student Name
<chr>
GPA
<dbl>
Student Name
<chr>
GPA
<dbl>
Jadyn Henry 0.
Janiyah Downs 0.
Tyree Levine 0.
Alfred Lang 2.
Charlie Franklin 0.
Ciara Hendrix 2.
Danika Floyd 3.
Dario Holt 0.
Darryl Rodriguez 0.
Elliana Patton 2.

1-10 of 24 rows