Stats 102A – Homework 6 – Output File
Miles Chen (example)
Homework questions and prompts copyright Miles Chen. Do not post, share, or distribute without permission.
Academic Integrity Statement
By including this statement, I, Joe Bruin, declare that all of the work in this assignment is my own original work. At no time did I look at the code of other students nor did I search for code solutions online. I understand that plagiarism on any single part of this assignment will result in a 0 for the entire assignment and that I will be referred to the dean of students.
I did discuss ideas related to the homework with Josephine Bruin for parts 2 and 3, with John Wooden for part 2, and with Gene Block for part 5. At no point did I show another student my code, nor did I look at another student's code.
library(tidyverse)
Warning: package ggplot2 was built under R version 4.0.
Warning: package tibble was built under R version 4.0.
Warning: package tidyr was built under R version 4.0.
Warning: package readr was built under R version 4.0.
Warning: package dplyr was built under R version 4.0.
Warning: package forcats was built under R version 4.0.
Part 1
Explain what a p-value is and how a researcher should use it. (150 words or less)
A p-value is the probability, computed under the assumption that the null hypothesis is true, of observing a test statistic at least as extreme as the one actually obtained. Researchers compare the p-value to a pre-specified significance level, commonly 0.05, though different fields demand different thresholds; if the p-value falls below that level, the result is called statistically significant and the null hypothesis is rejected. A researcher should treat the p-value as one piece of evidence, not a verdict: a single p-value provides only limited information, says nothing about effect size or practical importance, and can fall below 0.05 by chance alone. Fixating on the threshold also invites p-hacking, such as collecting data until the p-value drops low enough to publish. Conclusions should rest on study design, effect size, and replication, not on a single p-value.
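To illustrate the point that a p-value below 0.05 can occur by chance alone, here is a minimal sketch (not part of the assignment): when the null hypothesis is true, p-values are uniformly distributed, so roughly 5% of tests come out "significant" even though there is no real effect.

```r
# Simulate 10,000 experiments in which the null hypothesis is TRUE:
# both samples come from the same normal distribution.
set.seed(1)
p_values <- replicate(10000, {
  a <- rnorm(30)   # group A, no real difference from group B
  b <- rnorm(30)
  t.test(a, b)$p.value
})

# Roughly 5% of these null experiments are "significant" at the 0.05 level.
mean(p_values < 0.05)
```

The proportion below 0.05 hovers near 0.05, which is exactly why a single small p-value should not be over-interpreted.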
Part 2
Randomization test for numeric data
# Data credit: David C. Howell
# no_waiting is a vector that records the time it took a driver to leave the
# parking spot if no one was waiting for the driver
no_waiting <- c(36.30, 42.07, 39.97, 39.33, 33.76, 33.91, 39.65, 84.92, 40.70, 39.65, 39.48, 35.38, 75.07, 36.46, 38.73, 33.88, 34.39, 60.52, 53.63, 50.62)
# waiting is a vector that records the time it takes a driver to leave if
# someone was waiting on the driver
waiting <- c(49.48, 43.30, 85.97, 46.92, 49.18, 79.30, 47.35, 46.52, 59.68, 42.89, 49.29, 68.69, 41.61, 46.81, 43.75, 46.55, 42.33, 71.48, 78.95, 42.06)
mean(waiting)
## [1] 54.1055
mean(no_waiting)
## [1] 44.421
obs_dif <- mean(waiting) - mean(no_waiting)
Randomization test
Conduct a randomization test
Null Hypothesis: there is no difference in average time for drivers who have a person waiting vs those who do not have a person waiting
Alternative Hypothesis: drivers who have a person waiting will take longer than if they did not.
set.seed(1)
differences <- rep(NA, 10000)
records <- c(no_waiting, waiting)
n1 <- length(no_waiting)
n2 <- length(waiting)
for (i in seq_along(differences)) {
  randomized <- sample(records)
  groupA <- randomized[1:n1]
  groupB <- randomized[(n1 + 1):(n1 + n2)]
  differences[i] <- mean(groupB) - mean(groupA)
}
# the empirical p-value
mean(differences > obs_dif)
## [1] 0.
Since the empirical p-value is less than 0.05, we reject the null hypothesis: we have evidence that drivers with a person waiting take longer to leave than drivers with no one waiting.
Comparison to traditional t-test
Conduct a traditional two-sample independent t-test.
t.test(waiting, no_waiting, alternative = 'greater', paired = FALSE)
##
## 	Welch Two Sample t-test
##
## data:  waiting and no_waiting
## t = 2.1496, df = 37.984, p-value = 0.01901
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  2.088773      Inf
## sample estimates:
## mean of x mean of y
##   54.1055   44.4210
The p-value of the t-test is 0.01901, which is slightly larger than the empirical p-value.
Part 3
Another Randomization test for numeric data
Exploratory Analysis
Carry out an exploratory analysis.
data <- read.csv('AfterSchool.csv')
data$Treatment <- as.factor(data$Treatment)
data %>%
  group_by(Treatment) %>%
  summarise(min = min(Victim),
            `1st quantile` = quantile(Victim, 0.25),
            median = median(Victim),
            `3rd quantile` = quantile(Victim, 0.75),
            max = max(Victim),
            mean = mean(Victim))
## # A tibble: 2 x 7
##   Treatment   min `1st quantile` median `3rd quantile`   max  mean
## 1 0          41.5           41.5   47.3           58.7  81.6   50.
## 2 1          41.5           41.5   47.3           53.0  75.9   49.
ggplot() + geom_boxplot(aes(x = Treatment, y = Victim), data = data, orientation = 'x')

[Boxplot of Victim (roughly 40 to 80) by Treatment group (0 vs 1)]
Randomization Test
Use the randomization test.
Null Hypothesis: The after-school program has no effect on victimization.
Alternative Hypothesis: The after-school program has an effect on victimization.
treatment <- data$Victim[data$Treatment == 1]
control <- data$Victim[data$Treatment == 0]
mean(treatment)
## [1] 49.
mean(control)
## [1] 50.
obs_dif <- mean(control) - mean(treatment)
set.seed(1)
differences <- rep(NA, 10000)
records <- c(treatment, control)
n1 <- length(treatment)
n2 <- length(control)
for (i in seq_along(differences)) {
  randomized <- sample(records)
  groupA <- randomized[1:n1]
  groupB <- randomized[(n1 + 1):(n1 + n2)]
  differences[i] <- mean(groupB) - mean(groupA)
}
# empirical p-value
mean(abs(differences) > obs_dif)
## [1] 0.2173
Since the empirical p-value is 0.2173 > 0.05, we cannot reject the null hypothesis; we do not have evidence that the after-school program has an effect on victimization.
Part 4
Randomization test
Perform a randomization test
Null Hypothesis: ECMO is as effective as CMT at saving the lives of newborn babies with respiratory failure.
Alternative Hypothesis: ECMO is more effective at saving the lives of newborn babies with respiratory failure.
records <- c(rep(FALSE, 4), rep(TRUE, 6), rep(FALSE, 1), rep(TRUE, 28))
cmt <- records[1:10]    # 4 deaths, 6 survivors
ecmo <- records[11:39]  # 1 death, 28 survivors
obs_dif <- mean(ecmo) – mean(cmt)
set.seed(1)
differences <- rep(NA, 10000)
for (i in seq_along(differences)) {
  randomized <- sample(records)
  groupA <- randomized[1:10]
  groupB <- randomized[11:39]
  differences[i] <- mean(groupB) - mean(groupA)
}
# empirical p-value
mean(abs(differences) > obs_dif)
## [1] 0.0043
The empirical p-value is 0.0043 < 0.05, so we reject the null hypothesis; we have evidence that ECMO is more effective at saving the lives of newborn babies with respiratory failure.
Comparison to Fisher's Exact Test
Use R's fisher.test()
fisher.test(matrix(c(4, 6, 1, 28), nrow = 2), alternative = 'greater')
##
## 	Fisher's Exact Test for Count Data
##
## data:  matrix(c(4, 6, 1, 28), nrow = 2)
## p-value = 0.01102
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  1.833681      Inf
## sample estimates:
## odds ratio
##        16.
The p-value of Fisher's exact test is 0.01102, which is larger than the empirical p-value.
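For intuition, here is a minimal sketch (not part of the assignment) of what fisher.test() computes for this one-sided alternative: with all margins fixed (5 deaths, 34 survivors, 10 CMT babies, 39 total), the p-value is the hypergeometric probability of seeing 4 or more deaths in the CMT group.

```r
# P(X >= 4) where X ~ Hypergeometric(m = 5 deaths, n = 34 survivors, k = 10 CMT draws)
p_exact <- sum(dhyper(4:5, m = 5, n = 34, k = 10))
p_exact   # agrees with the fisher.test p-value of 0.01102
```

This is why the test is "exact": the p-value is a sum of hypergeometric probabilities rather than an approximation.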
Part 5
Comparing Groups, Chapter 7, Exercise 7.
Non-parametric bootstrap test
Use a non-parametric bootstrap test
data <- read.csv('HSB.csv')
data$Schtyp <- as.factor(data$Schtyp)
private <- data$Sci[data$Schtyp == 0]
public <- data$Sci[data$Schtyp == 1]
obs_dif <- var(public) - var(private)
set.seed(1)
differences <- rep(NA, 10000)
for (i in seq_along(differences)) {
  groupA <- sample(data$Sci, 15, replace = TRUE)
  groupB <- sample(data$Sci, 15, replace = TRUE)
  differences[i] <- var(groupB) - var(groupA)
}
mean(abs(differences) > abs(obs_dif))
## [1] 0.4068
The empirical p-value is 0.4068 > 0.05, so we cannot reject the null hypothesis; we do not have evidence that there is a difference in the variances of science scores between public and private school students.
Parametric bootstrap test
Use a parametric bootstrap test
private <- data$Sci[data$Schtyp == 0]
public <- data$Sci[data$Schtyp == 1]
obs_dif <- var(public) - var(private)
m <- mean(data$Sci)
s <- sd(data$Sci)
set.seed(1)
differences <- rep(NA, 10000)
for (i in seq_along(differences)) {
  groupA <- rnorm(length(private), m, s)
  groupB <- rnorm(length(public), m, s)
  differences[i] <- var(groupB) - var(groupA)
}
mean(abs(differences) > abs(obs_dif))
## [1] 0.1731
The empirical p-value is 0.1731 > 0.05, so we cannot reject the null hypothesis; we do not have evidence that there is a difference in the variances of science scores between public and private school students.
Part 6
light <- c(28, 26, 33, 24, 34, -44, 27, 16, 40, -2, 29, 22, 24, 21, 25, 30, 23, 29, 31, 19, 24, 20, 36, 32, 36, 28, 25, 21, 28, 29, 37, 25, 28, 26, 30, 32, 36, 26, 30, 22, 36, 23, 27, 27, 28, 27, 31, 27, 26, 33, 26, 32, 32, 24, 39, 28, 24, 25, 32, 25, 29, 27, 28, 29, 16, 23)
Non-parametric bootstrap test
Perform a bootstrap test
obs_dif <- 33 - mean(light)
data <- light + obs_dif  # shift the data so its mean is 33 under the null
set.seed(1)
differences <- rep(NA, 10000)
for (i in seq_along(differences)) {
  measures <- sample(data, 10, replace = TRUE)
  differences[i] <- mean(measures) - 33
}
mean(abs(differences) > abs(obs_dif))
## [1] 0.0521
The empirical p-value is 0.0521 > 0.05, so we cannot reject the null hypothesis; we do not have evidence that the mean of Newcomb's measurements differs from 33.
Non-parametric bootstrap test with outliers removed
Perform the bootstrap test again after removing the two negative outliers (-2 and -44).
data <- light[light >= 0]
obs_dif <- 33 - mean(data)
data <- data + obs_dif
set.seed(1)
differences <- rep(NA, 10000)
for (i in seq_along(differences)) {
  measures <- sample(data, 10, replace = TRUE)
  differences[i] <- mean(measures) - 33
}
mean(abs(differences) > abs(obs_dif))
## [1] 0.0011
The empirical p-value is 0.0011 < 0.05, so we reject the null hypothesis; we have evidence that the mean of Newcomb's measurements differs from 33.