R代写|R语言代写|统计学代写: 这是一个利用R语言解决统计学问题的题目
In this document we’ll present both some simulations to indicate why multiple testing is a problem and then some tools for false discovery rates in R.
Simulation I – Pairwise Comparisons
This simulation is simply intended to demonstrate the potential effects of multiple testing on comparing factor levels. We’ll assume a completely randomized design in which 500 subjects are each assigned to one of 100 factor levels at random. This gives us 100 groups of five subjects:
mat = matrix(rnorm(500),5,100)
Note here that all groups are generated the same way – using a standard normal distribution – so that there
should be no difference between them.
We’ll now obtain the average value in each group (note that the apply function here usese the function mean on each column of the matrix mat)
means = apply(mat,2,mean)
and use the sort function to sort these from smallest to largest. The ‘ix’ component that is returned gives the
index of the smallest mean, then the second smallest etc. We can use this to re-order the columns of matt
If we produce a boxplot of these values, we see a generally increasing trend
boxplot(mat,cex.lab=1.5,cex.axis=1.5,col=’blue’)
ord = sort(means,index.return=TRUE)$ix mat = mat[,ord]
1
1 9 18 28 38 48 58 68 78 88 98
this is because we’ve sorted the columns so they should show that – lowest first, highest last. Now, of course, we were going to examine all pairwise comparisons, including testing between the highest and the lowest mean
t.test(mat[,1],mat[,100])
##
## Welch Two Sample t-test
##
## data: mat[, 1] and mat[, 100]
## t = -4.0935, df = 6.996, p-value = 0.004616
## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval:
## -3.870197 -1.035886
## sample estimates:
## meanofx meanofy
## -1.4961285 0.9569127
This is strongly significant! The reason for this is we are choosing the two means that are farther apart. With a large number of groups we expect the lowest mean among the groups to be pretty small, similarly for the group with the highest mean. A randomly-chosen pair of groups has a 5% chance of being declared significantly different, but if we look at all pairs of groups, we will chose these. Note that which pair of groups is farthest apart will change each time you do the simulation, but there will be some pair that is, and they will look significantly different.
We can also use this simulation to explore the use of Tukey’s HSD. For this we’ll need to turn mat back into a vector and add a group indicator
2
−3 −1 123
matdat = data.frame(y=as.vector(mat), x=as.factor( rep(1:100,1,each=5))) and we can run the TukeyHSD procedure as
HSD = TukeyHSD(aov(y~x,data=matdat))
This produces a 4950 (the number of possible pairwise comparisons) by 4 table with colums being – Difference between means – Lower confidence interval for difference – Upper confidence interval for difference – P-value of difference corrected for multiple comparisons If we look just at the smallest corrected p-values we get
min(HSD$x[,4]) ## [1] 0.1935257
which is now much larger than 0.05 and we can correctly conclude that none of the pairs of means are different from each other.
Simulation II – Inflation of P-values
This simulation is designed to illustrate the inflation of the effective p-value through multiple testing. That is, we normally conduct a test with the view that “If the null hypothesis was true, and we re-did the experiment many times, we would incorrectly reject the null 5% of the time.”
Here we will examine the situation where you are conducting 5 experiments and will publish whichever experiment results in a statistically significant result. The question posed is “Suppose all 5 null hypotheses are true, what is the probability that I will get to publish at least one paper?” That is, what is the chance that one out of five will come up. (Note the contrast to Simulation I above – there we had many levels to compare but only did one experiment; here we are repeating our five experiments many times).
We are going to suppose that all the hypotheses can be tested with a t-test with five degrees of freedom. So each time we repeat the experiment we get five t-statistics. If all five null hypotheses are true then each statistic has a t5 distribution and we can generate 1000 replicated experiments as follows:
tsim = matrix( rt(5000,5), 1000,5)
(we could complicate this so that we don’t the same distribution in each column, but this just makes this
simple).
Now let’s look at the first experiment (the first column of tsim). If we draw up a histograph of these replicated
experiments, it looks like a standard t distribution
hist(tsim[,1],cex.lab=1.5,cex.axis=1.5,100) abline(v=c(0,qt(0.95,5)))
3
Histogram of tsim[, 1]
−10 −5 0 5
tsim[, 1]
and we reject the null hypothesis about 5% of the time. We can see this empirically by asking how frequently is the first column of tsim larger than the critical value of a t5 distribution (we’re only doing one-sided tests here)
mean(tsim[,1]>qt(0.95,5)) ## [1] 0.045
However, this isn’t the way our experiment actually works – we get to write a paper if any of our five tests rejects. That is, if the largest t-statistic is significant, we write, so let’s look at the maximums of each of the five experiments
tmax = apply(tsim,1,max)
If we draw the histogram of these, we have a very different picture with values that are much more often positive, and much larger than before
hist(tmax,cex.lab=1.5,cex.axis=1.5,100) abline(v=c(0,qt(0.95,5)))
4
Frequency
0 20 40 60 80
Histogram of tmax
0246
tmax
How is this possible, when each column was generated the same way? Yes, but we got to choose which column to look at and we always chose the largest. Or, the maximum of five random numbers is going to tend to be larger than any individual one.
If we examine how frequently we get to write a paper out of this, we have
mean(tmax>qt(0.95,5)) ## [1] 0.232
or about 18% percent of the time. This is what is know as p-value inflation – given that you are testing five hypotheses, you are really rejecting at the 0.13 level, rather than 0.05.
Here we have independent tests, so it is best to apply a Bonferroni correction. That is we divide the α level by 5 and use the 0.01 quantile of the t distribution. If we do this, our rejection rate is
mean(tmax>qt(0.99,5)) ## [1] 0.049
much better.
False Discovery Rates Case 1: A Simulation For Single t-tests
The previous simulation looked at p-value inflation when you are conducting a small number of tests, and we see that a Bonferroni correction is fairly reasonable. However, in modern genomic testing we have many more than five tests that we run at any one time.
5
Frequency
0 10 30 50
Here, we will examine a single simulated experiment involving 2000 t-tests based to 10 observations, all of which correspond to the null case (we’ll use this later). We’ll think of these as examining 2000 genes for differential expression.
First we set up randomly-generated data:
sim.data = matrix( rnorm(20000),10,2000)
We can then construct the t-statistics all at once by taking means and standard deviations using the apply
function
sim.tstats = apply(sim.data,2,mean)/(apply(sim.data,2,sd)/sqrt(10)) and calculate the corresponding p-values
sim.pvals = 1-pt(sim.tstats,9)
If we draw a histogram of these, we see that they look approximately uniformly distributed.
hist(sim.pvals,xlab=’P-value’,cex.lab=1.5,cex.axis=1.5,col=’blue’)
Histogram of sim.pvals
0.0 0.2 0.4 0.6 0.8 1.0
P−value
This is no accident: if your p-value is generated under the null hypothesis, it should have a uniform distribution (and hence 5% of them should be less than 0.05).
Indeed, if we ask how many of these test are rejected in this experiment, we see that about 5% of them are
6
Frequency
050 150
sum(sim.pvals<0.05) ## [1] 99
and we clearly do not want to erroneously report that 100 genes were differentially expressed – particularly since we’d get a different list of 100 if we repeated the experiment.
However, if we use a Bonferroni correction, we reject only when the p-value is less than 0.05/2000:
sum(sim.pvals<0.05/2000) ## [1] 0
which should happen no more than 1 time in 20 that we re-do this entire simulation. (Note in comparision to Simulation II above, in 95% of the repeated experiments, no tests were returned as significant).
False Discovery RatesCase II: When Some Genes Are Not Null
Of course, we don’t actually believe that there are no genetic effects – if we did, we wouldn’t be doing the test. So the question is, what are our chances of picking them up?
In this case, we’ll run the same experiment as in Case I
Y = matrix( rnorm(20000), 10, 2000)
But in this case, we’ll make some of the genes non-null. To do this, I’ll add 0.5 times a Poisson random
variable to each of these genes. We’ll use a Poisson with intensity parameter 0.29
X = rpois(2000,0.29)
The reason for this is that about 75% of the time X is zero in which case the null hypothesis is still true. But
in the other
sum(X>0) ## [1] 479
The gene is differentially expressed and we would like to pick it up.
We can now construct our t-statistics and p-values for these tests. Note I’ve added 0.5X after taking the mean here – this is exactly the same as adding 0.5X to each value in the corresponding column in Y and then taking the mean as we normally would
If we look at the p-values for this data set we see a “spike” at very small values
sim2.tstats = ( apply(Y,2,mean) + 0.5*X)/(apply(Y,2,sd)/sqrt(10)) sim2.pvals = 1-pt(sim2.tstats,9)
7
hist(sim2.pvals,xlab=’P-value’,cex.lab=1.5,cex.axis=1.5,col=’blue’)
Histogram of sim2.pvals
0.0 0.2 0.4 0.6 0.8 1.0
P−value
This corresponds to the approximately 25% of genes (this will be different each time you randomly generate a new X) for which the null hypothesis isn’t true – so we should have a pretty good shot of picking them up.
Indeed, if we simply reject tests at the 0.05 level, and look only at the “true rejections” we get
sum(sim2.pvals[X>0]<0.05) ## [1] 216
usually around 1/2 of them are correctly rejected.
On the other hand, some of the null genes will be rejected, too
sum(sim2.pvals[X==0]<0.05) ## [1] 74
about 5% of those that are null, but this still makes for around 75 genes. This means that among 300-ish (the numbers will change each time you run this code) genes that you reject, about
sum(sim2.pvals[X==0]<0.05)/sum(sim2.pvals<0.05) ## [1] 0.2551724
8
Frequency
0 100 300
(usually somewhere between 20% and 30%) will be “false alarms”. This is probably too much to be useful, especially since we normally expecte many fewer than 25% of the genes to be genuinely non-null.
On the other hand, if we employ a Bonferroni correction, we reject only a handfull of genes
sum(sim2.pvals<0.05/2000) ## [1] 1
whether they are null or not – so we have lost almost all of what might be interesting.
False Discovery Rates
False Discovery Rates (FDR) is an in-between solution to this problem. Basically, we’ll accept that our list of differentially expressed genes will have some level of contamination with “false alarms”, but we want to control how much contamination there is.
The methods behind these become a bit mathematically obscure but they hinge around looking at the whole list of p-values and trying to separate them into the null distribution (which should be flat) and a distribution of alternative genes.
In our case, we know what these look like because we generated the X’s:
p1 = hist(sim2.pvals[X>0],20)
Histogram of sim2.pvals[X > 0]
0.0 0.2 0.4 0.6 0.8 1.0
sim2.pvals[X > 0]
9
Frequency
0 50 100 150 200
p2 = hist(sim2.pvals[X==0],20)
Histogram of sim2.pvals[X == 0]
0.0 0.2 0.4 0.6 0.8 1.0
sim2.pvals[X == 0]
plot(p1,xlab=’P-value’,cex.lab=1.5,cex.axis=1.5,col=rgb(0,0,1,1),xlim=c(0,1)) plot(p2,,col=rgb(1,1,1,0.25),add=TRUE)
10
Frequency
0 20 40 60 80
Frequency
0 50 100 200
sort.sim2pvals = sort(sim2.pvals,index.return=TRUE)
Histogram of sim2.pvals[X > 0]
0.0 0.2 0.4 0.6 0.8 1.0
P−value
Of course, we don’t actually know these distributions and so must estimate them.
On simple approach to this was suggested by Benjamini and Hochberg in 1995 – who originally coined the term False Discovery Rate. There are now many variations on this that improve its accuracy, but we can motivate the mathematics fairly easily:
Suppose we conduct m tests and declare any test with a p-value less than p to be significant. Imagine that k tests are declared significant under this cutoff. However, we expect at most mp null tests to be declared significant (since we can have at most m null cases), so we have that the False Discovery Rate is estimated to be less than mp/k.
(One of the ways this is improved on is by trying to estimate the number of null cases, m0 and using this instead of m).
This gives us a way to estimate the FDR for any threshhold, but normally we want to control it. The usual default value is at 0.1 – 10% of the genes (or SNP’s or anything else) you report are false alarms.
To do this, it’s helpful to first of all sort the p-values (so for the 57th p-value, there are 57 tests that would be declared significant at this threshhold)
We can plot these sorted values where we see that the p-values tend to stay low for a while in comparison to the Case 1 p-values above which all have a uniform distribution and which therefore tend to lie along a straight line:
plot(sort.sim2pvals$x,type=’l’,col=’red’,xlab=’Index After Sorting’,cex.lab=1.5,cex.axis=1 lines(sort(sim.pvals),col=’blue’)
abline(c(0,1/2000),col=’blue’,lty=2)
11
.5,ylab=’Sorte
abline(c(0,0.5/2000),col=’red’,lty=2)
legend(‘topleft’,c(‘Case I P-values’,’Case II P-values’,’Uniform Quantiles’,’50% False Discovery Line’),
Sorted P−values
0.0 0.4 0.8
0
Case I P−values
Case II P−values
Uniform Quantiles
50% False Discovery Line
500 1000 1500 2000
Index After Sorting
Now for each p-value, we calculate the corresponding q-value, given by pm/k where k is the number of p-values less than this one
sort.sim2qvals = sort.sim2pvals$x*2000/(1:2000)
plot(sort.sim2qvals,xlab=’Index After Sorting’,cex.lab=1.5,cex.axis=1.5,ylab=’Sorted Q-values’) abline(h=0.1)
12
Sorted Q−values
0.0 0.4 0.8
Of which
sum(sort.sim2qvals<0.1) ## [1] 31
0 500 1000 1500 2000
Index After Sorting
are declared to be significant. Since we know the true labels for the tests, let’s look at how we did. First of all, we’ll make sure that X is in the same order as the sorted p-values
Xsort = X[sort.sim2pvals$ix]
Then we can see how many non-null cases were retruned
sum(sort.sim2qvals[Xsort>0]<0.1) ## [1] 30
and how many null cases
sum(sort.sim2qvals[Xsort==0]<0.1) ## [1] 1
and the true (as opposed to estimated) false discovery rate in this data set is
13
sum(sort.sim2qvals[Xsort==0]<0.1)/sum(sort.sim2qvals<0.1)
## [1] 0.03225806
Sometimes the FDR calculation is presented a little bit differently (the result is the same) in that if we say that
qk =pkm/k
where pk is the kth sorted pvalue, then if we control at FDR q, we want the genes for which pk < qkk/m – that is, all the points that lie below the red dashed line two plots above (where we have chosen q = 0.5). This is illustrated below for the more realistic q = 0.1 where we have looked at pk − qk/m (often called “excess probability”) and declare the genes for which this is negative to be significant (of course, we have to look at many fewer genes)
plot( sort.sim2pvals$x[1:300] - 0.1*(1:300)/2000,col='blue',xlab='Index After Sorting',cex abline(h=0,col='red',lty=2)
.lab=1.5,cex.a
Excess Probability
0.00 0.02
0 50 100 150 200 250 300
Index After Sorting
Local False Discovery Rates
Beyond the Benjamini-Hochberg approach (and there are several variants), Efron, 2005 suggested an alternative false discovery rate methods that has become quite popular. This local in that is says “For p-values around this level, what proportion are from the null distribution?”. That is, it tries to make a division between the uniform distribution under the null hypothesis and an alternative distribution as in this plot (repeated from above)
14
plot(p1,xlab='P-value',cex.lab=1.5,cex.axis=1.5,col=rgb(0,0,1,1),xlim=c(0,1)) plot(p2,,col=rgb(1,1,1,0.25),add=TRUE)
Histogram of sim2.pvals[X > 0]
0.0 0.2 0.4 0.6 0.8 1.0
P−value
Without going into details of how to guess at this separation, it does so and then in each bin of the histogram we can report the local false discovery rate in that bin as being
[Expected null disbtrbution counts]/[Total counts in bin]
The local fdr tends to be larger than the FDR given by Benjamini and Hochberg and a cut-off of 0.2 is generally used (less than 20% of p-values around this size correspond to the null distribution).
In R
Fortunately, there is a library in R that makes these calculations for you
library(fdrtool)
To use this, you give fdrtool a vector of p-values (you can also use a vector of correlations or a statistic that
should be normal under the null hypothesis – you need to tell it which. Here we give it the p-values from above
qvals = fdrtool(sim2.pvals,statistic=’pvalue’) ## Step 1… determine cutoff point
## Step 2… estimate parameters of null distribution and eta0
15
Frequency
0 50 100 200
## Step 3… compute p-values and estimate empirical PDF/CDF
## Step 4… compute q-values and local fdr
## Step 5… prepare for plotting
Type of Statistic: p−Value (eta0 = 0.7921)
0.0
0.0
0.0
Mixture
Null Component
0.2 0.4
Alternative Component
0.6 0.8 1.0
1−pval
Density (first row) and Distribution Function (second row)
0.2 0.4
0.6 0.8 1.0
1−pval
(Local) False Discovery Rate
Fdr and fdr CDF Density
0.0 1.0 0.0 1.0 0 3
0.2 0.4
0.6
fdr (density−based) Fdr (tail area−based)
0.8 1.0
1−pval
fdrtool calculates both the FDR q-value and the local FDR; but note that its q-value calculation is different from the simple method we produced above. For this we reject
sum(qvals$qval<0.1) ## [1] 69
tests of which
sum(qvals$qval[X==0]<0.1) ## [1] 4
correspond to null cases. Using the local fdr we declare
sum(qvals$lfdr<0.2) ## [1] 114
16
to be significant of which
sum(qvals$lfdr[X==0]<0.2) ## [1] 10
are null cases
More Prostate Cancer Data
To illustrate the workings of false discovery rates with real data we’ll use a data set of 6033 genes whos expresssion level was measured in 50 patients with prostate cancer and 52 patients without (cancer cases came first).
These data are in
pgenes = read.table('prostategenes.dat')
Here columns correspond to patients with rows corresponding to genes. If we look at expression over genes
levels across patients
boxplot(pgenes)
−1 0 1 2 3 4 5
V1 V9 V18 V28 V38 V48 V58 V68 V78 V88 V98
we see that there are no patients that look very different from the others. Looking this way is a good way to check for large batch effects, for example.
17
We now need to run a two-sample t-test for each gene (ie, each row in the data set). I’ll use a for loop to do this and record the t-statistic and p-value for each
pvals = rep(1,nrow(pgenes)) tstats = rep(0,nrow(pgenes))
for(i in 1:nrow(pgenes)){
testres = t.test(pgenes[i,1:50],pgenes[i,51:102]) pvals[i] = testres$p.value
tstats[i] = testres$statistic
}
If we look at the collection of p-values, we see the classic spike at 0 indicating that there should be some differential expression
hist(pvals,200,xlab='p values',cex.lab=1.5,cex.axis=1.5)
Histogram of pvals
0.0 0.2 0.4 0.6 0.8 1.0
p values
We can also observe this by drawing a histogram of t-statistics and overlaying it with the standard t distribution
hist(tstats,200,prob=TRUE,cex.lab=1.5,cex.axis=1.5,xlab='t statistics') lines(seq(-6,4,0.1),dt(seq(-6,4,0.1),96),col='blue',lwd=2)
18
Frequency
0 40 80 120
Histogram of tstats
−6 −4 −2 0 2 4
t statistics
where there is a subtle excess probability in the tails of the distribution. To try and come up with a list we use fdrtool:
qval = fdrtool(pvals,statistic='pvalue')
## Step 1... determine cutoff point
## Step 2... estimate parameters of null distribution and eta0
## Step 3... compute p-values and estimate empirical PDF/CDF
## Step 4... compute q-values and local fdr
## Step 5... prepare for plotting
19
Density
0.0 0.1 0.2 0.3 0.4
Type of Statistic: p−Value (eta0 = 0.9101)
0.0
0.0
0.0
Mixture
Null Component
0.2 0.4
Alternative Component
0.6 0.8 1.0
1−pval
Density (first row) and Distribution Function (second row)
0.2 0.4
0.6
0.8 1.0
fdr (density−based) Fdr (tail area−based)
0.8 1.0
1−pval
(Local) False Discovery Rate
Fdr and fdr CDF Density
0.0 1.0 0.0 1.0 0.0 2.0
0.2 0.4
0.6
and we would report all the genes for which either the q-value was less than 0.1:
# FDR cutoff of 0.1
sum(qval$qval<0.1) ## [1] 60
or the local fdr was less than 0.2
# Local FDR cutoff of 0.2
sum(qval$qval<0.2) ## [1] 113
1−pval
Remember that this just gives us a list! In the first case, we expect that 10% of the list are not really differently expressed, but we don’t know which ones these are (the lfdr is usually smaller than the FDR). Also remember that there are likely many genes that are differentially expressed, but the effect is small enough that we can’t pick them up.
20