Stats 102A – Homework 6 – Output File
Miles Chen (example)
Homework questions and prompts copyright Miles Chen. Do not post, share, or distribute without permission.
Academic Integrity Statement
By including this statement, I, Joe Bruin, declare that all of the work in this assignment is my own original work. At no time did I look at the code of other students nor did I search for code solutions online. I understand that plagiarism on any single part of this assignment will result in a 0 for the entire assignment and that I will be referred to the dean of students.
I did discuss ideas related to the homework with Josephine Bruin for parts 2 and 3, with John Wooden for part 2, and with Gene Block for part 5. At no point did I show another student my code, nor did I look at another student's code.
library(tidyverse)
Warning: package ggplot2 was built under R version 4.0.
Warning: package tibble was built under R version 4.0.
Warning: package tidyr was built under R version 4.0.
Warning: package readr was built under R version 4.0.
Warning: package dplyr was built under R version 4.0.
Warning: package forcats was built under R version 4.0.
Part 1
Explain what a p-value is and how a researcher should use it. (150 words or less)
A p-value is the probability, computed under the assumption that the null hypothesis is true, of observing a test statistic at least as extreme as the one actually obtained. Researchers compare the p-value to a pre-specified significance level, commonly 0.05, though different fields demand different thresholds; if the p-value falls below that level, the result is called statistically significant and the null hypothesis is rejected. A researcher should treat the p-value as one piece of evidence, not a verdict: a single p-value provides only limited information, says nothing about effect size or practical importance, and can fall below 0.05 by chance alone. Fixating on the threshold also invites p-hacking, such as collecting data until the p-value drops low enough to publish. Conclusions should rest on study design, effect size, and replication, not on a single p-value.
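To illustrate the point that a p-value below 0.05 can occur by chance alone, here is a minimal sketch (not part of the assignment): when the null hypothesis is true, p-values are uniformly distributed, so roughly 5% of tests come out "significant" even though there is no real effect.

```r
# Simulate 10,000 experiments in which the null hypothesis is TRUE:
# both samples come from the same normal distribution.
set.seed(1)
p_values <- replicate(10000, {
  a <- rnorm(30)   # group A, no real difference from group B
  b <- rnorm(30)
  t.test(a, b)$p.value
})

# Roughly 5% of these null experiments are "significant" at the 0.05 level.
mean(p_values < 0.05)
```

The proportion below 0.05 hovers near 0.05, which is exactly why a single small p-value should not be over-interpreted.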
Part 2
Randomization test for numeric data
# Data credit: David C. Howell
# no_waiting is a vector that records the time it took a driver to leave the
# parking spot if no one was waiting for the driver
no_waiting <- c(36.30, 42.07, 39.97, 39.33, 33.76, 33.91, 39.65, 84.92, 40.70, 39.65, 39.48, 35.38, 75.07, 36.46, 38.73, 33.88, 34.39, 60.52, 53.63, 50.62)
# waiting is a vector that records the time it takes a driver to leave if
# someone was waiting on the driver
waiting <- c(49.48, 43.30, 85.97, 46.92, 49.18, 79.30, 47.35, 46.52, 59.68, 42.89, 49.29, 68.69, 41.61, 46.81, 43.75, 46.55, 42.33, 71.48, 78.95, 42.06)
mean(waiting)
## [1] 54.1055
mean(no_waiting)
## [1] 44.421
obs_dif <- mean(waiting) - mean(no_waiting)
Randomization test
Conduct a randomization test
Null Hypothesis: there is no difference in average time for drivers who have a person waiting vs those who do not have a person waiting
Alternative Hypothesis: drivers who have a person waiting will take longer than if they did not.
set.seed(1)
differences <- rep(NA, 10000)
records <- c(no_waiting, waiting)
n1 <- length(no_waiting)
n2 <- length(waiting)
for (i in seq_along(differences)) {
  randomized <- sample(records)
  groupA <- randomized[1:n1]
  groupB <- randomized[(n1 + 1):(n1 + n2)]
  differences[i] <- mean(groupB) - mean(groupA)
}
# the empirical p-value
mean(differences > obs_dif)
## [1] 0.
Since the empirical p-value is less than 0.05, we reject the null hypothesis: we have evidence that drivers with a person waiting take longer to leave than drivers with no one waiting.
Comparison to traditional t-test
Conduct a traditional two-sample independent t-test.
t.test(waiting, no_waiting, alternative = 'greater', paired = FALSE)
##
## 	Welch Two Sample t-test
##
## data:  waiting and no_waiting
## t = 2.1496, df = 37.984, p-value = 0.01901
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  2.088773      Inf
## sample estimates:
## mean of x mean of y
##   54.1055   44.4210
The p-value of the t-test is 0.01901, which is slightly larger than the empirical p-value.
Part 3
Another Randomization test for numeric data
Exploratory Analysis
Carry out an exploratory analysis.
data <- read.csv('AfterSchool.csv')
data$Treatment <- as.factor(data$Treatment)
data %>%
  group_by(Treatment) %>%
  summarise(min = min(Victim),
            `1st quantile` = quantile(Victim, 0.25),
            median = median(Victim),
            `3rd quantile` = quantile(Victim, 0.75),
            max = max(Victim),
            mean = mean(Victim))
## # A tibble: 2 x 7
##   Treatment   min `1st quantile` median `3rd quantile`   max  mean
## 1 0          41.5           41.5   47.3           58.7  81.6   50.
## 2 1          41.5           41.5   47.3           53.0  75.9   49.
ggplot() + geom_boxplot(aes(x = Treatment, y = Victim), data = data, orientation = 'x')

[Boxplot of Victim (roughly 40 to 80) by Treatment group (0 vs 1)]
Randomization Test
Use the randomization test.
Null Hypothesis: The after-school program has no effect on victimization.
Alternative Hypothesis: The after-school program has an effect on victimization.
treatment <- data$Victim[data$Treatment == 1]
control <- data$Victim[data$Treatment == 0]
mean(treatment)
## [1] 49.
mean(control)
## [1] 50.
obs_dif <- mean(control) - mean(treatment)
set.seed(1)
differences <- rep(NA, 10000)
records <- c(treatment, control)
n1 <- length(treatment)
n2 <- length(control)
for (i in seq_along(differences)) {
  randomized <- sample(records)
  groupA <- randomized[1:n1]
  groupB <- randomized[(n1 + 1):(n1 + n2)]
  differences[i] <- mean(groupB) - mean(groupA)
}
# empirical p-value
mean(abs(differences) > obs_dif)
## [1] 0.2173
Since the empirical p-value is 0.2173 > 0.05, we cannot reject the null hypothesis; we do not have evidence that the after-school program has an effect on victimization.
Part 4
Randomization test
Perform a randomization test
Null Hypothesis: ECMO is as effective as CMT at saving the lives of newborn babies with respiratory failure.
Alternative Hypothesis: ECMO is more effective at saving the lives of newborn babies with respiratory failure.
records <- c(rep(FALSE, 4), rep(TRUE, 6), rep(FALSE, 1), rep(TRUE, 28))
cmt <- records[1:10]    # 4 deaths, 6 survivors
ecmo <- records[11:39]  # 1 death, 28 survivors
obs_dif <- mean(ecmo) – mean(cmt)
set.seed(1)
differences <- rep(NA, 10000)
for (i in seq_along(differences)) {
  randomized <- sample(records)
  groupA <- randomized[1:10]
  groupB <- randomized[11:39]
  differences[i] <- mean(groupB) - mean(groupA)
}
# empirical p-value
mean(abs(differences) > obs_dif)
## [1] 0.0043
The empirical p-value is 0.0043 < 0.05, so we reject the null hypothesis; we have evidence that ECMO is more effective at saving the lives of newborn babies with respiratory failure.
Comparison to Fisher's Exact Test
Use R's fisher.test()
fisher.test(matrix(c(4, 6, 1, 28), nrow = 2), alternative = 'greater')
##
## 	Fisher's Exact Test for Count Data
##
## data:  matrix(c(4, 6, 1, 28), nrow = 2)
## p-value = 0.01102
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  1.833681      Inf
## sample estimates:
## odds ratio
##        16.
The p-value of Fisher's exact test is 0.01102, which is larger than the empirical p-value.
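For intuition, here is a minimal sketch (not part of the assignment) of what fisher.test() computes for this one-sided alternative: with all margins fixed (5 deaths, 34 survivors, 10 CMT babies, 39 total), the p-value is the hypergeometric probability of seeing 4 or more deaths in the CMT group.

```r
# P(X >= 4) where X ~ Hypergeometric(m = 5 deaths, n = 34 survivors, k = 10 CMT draws)
p_exact <- sum(dhyper(4:5, m = 5, n = 34, k = 10))
p_exact   # agrees with the fisher.test p-value of 0.01102
```

This is why the test is "exact": the p-value is a sum of hypergeometric probabilities rather than an approximation.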
Part 5
Comparing Groups, Chapter 7, Exercise 7.
Non-parametric bootstrap test
Use a non-parametric bootstrap test
data <- read.csv('HSB.csv')
data$Schtyp <- as.factor(data$Schtyp)
private <- data$Sci[data$Schtyp == 0]
public <- data$Sci[data$Schtyp == 1]
obs_dif <- var(public) - var(private)
set.seed(1)
differences <- rep(NA, 10000)
for (i in seq_along(differences)) {
  groupA <- sample(data$Sci, 15, replace = TRUE)
  groupB <- sample(data$Sci, 15, replace = TRUE)
  differences[i] <- var(groupB) - var(groupA)
}
mean(abs(differences) > abs(obs_dif))
## [1] 0.4068
The empirical p-value is 0.4068 > 0.05, so we cannot reject the null hypothesis; we do not have evidence that there is a difference in the variances of science scores between public and private school students.
Parametric bootstrap test
Use a parametric bootstrap test
private <- data$Sci[data$Schtyp == 0]
public <- data$Sci[data$Schtyp == 1]
obs_dif <- var(public) - var(private)
m <- mean(data$Sci)
s <- sd(data$Sci)
set.seed(1)
differences <- rep(NA, 10000)
for (i in seq_along(differences)) {
  groupA <- rnorm(length(private), m, s)
  groupB <- rnorm(length(public), m, s)
  differences[i] <- var(groupB) - var(groupA)
}
mean(abs(differences) > abs(obs_dif))
## [1] 0.1731
The empirical p-value is 0.1731 > 0.05, so we cannot reject the null hypothesis; we do not have evidence that there is a difference in the variances of science scores between public and private school students.
Part 6
light <- c(28, 26, 33, 24, 34, -44, 27, 16, 40, -2, 29, 22, 24, 21, 25, 30, 23, 29, 31, 19, 24, 20, 36, 32, 36, 28, 25, 21, 28, 29, 37, 25, 28, 26, 30, 32, 36, 26, 30, 22, 36, 23, 27, 27, 28, 27, 31, 27, 26, 33, 26, 32, 32, 24, 39, 28, 24, 25, 32, 25, 29, 27, 28, 29, 16, 23)
Non-parametric bootstrap test
Perform a bootstrap test
obs_dif <- 33 - mean(light)
data <- light + obs_dif  # shift the data so its mean is 33 under the null
set.seed(1)
differences <- rep(NA, 10000)
for (i in seq_along(differences)) {
  measures <- sample(data, 10, replace = TRUE)
  differences[i] <- mean(measures) - 33
}
mean(abs(differences) > abs(obs_dif))
## [1] 0.0521
The empirical p-value is 0.0521 > 0.05, so we cannot reject the null hypothesis; we do not have evidence that the mean of Newcomb's measurements differs from 33.
Non-parametric bootstrap test with outliers removed
Perform the bootstrap test again after removing the two negative outliers (-2 and -44).
data <- light[light >= 0]
obs_dif <- 33 - mean(data)
data <- data + obs_dif
set.seed(1)
differences <- rep(NA, 10000)
for (i in seq_along(differences)) {
  measures <- sample(data, 10, replace = TRUE)
  differences[i] <- mean(measures) - 33
}
mean(abs(differences) > abs(obs_dif))
## [1] 0.0011
The empirical p-value is 0.0011 < 0.05, so we reject the null hypothesis; we have evidence that the mean of Newcomb's measurements differs from 33.