代做mining | Python作业 | Statistics代写 | 统计作业 – MTHM017 Advanced Topics in Statistics

Assignment

Please make sure that the submitted work is your own. This is NOT a group assignment, therefore approaches and solutions shouldnt be discussed with other students. Plagiarism and collusion with other students are examples of academic misconduct and will be reported. More information on academic honesty can be found here****.

The assignment has two main parts. Part A involves (i) fitting a Poisson regression model to assess the effect of using different priors, and (ii) fitting a fixed effects and a random effects model in order to compare posterior distributions under different models. Part B involves using different methods for classification of data into two groups.

A. Bayesian Inference [66 marks]

1. The first question of part A involves fitting a Poisson regression model using theScotlanddataset, which contains the observed and expected counts of lip cancer cases for the local authority administrative areas in Scotland between 1975 and 1986.
``````a. [3 marks] Calculate the Standard Mortality Rat ios (SMR=observed/expected) for each administrative
area and plot the distribution of the SMRs. Then plot a map of the SMRs by administrative area. To
map the SMRs you may use theScotlandMapfunction readily available on the ELE page.
The code below uses random numbers to demonstrate how theScotlandMapfunction works. Note
that in order for this to run you need to add the shapefiles for Scotland (.shx, .shp, .prj, .dbf) to your
working directory. TheScotlandMap.Rfile and the shapefiles are available on the ELE page.
library(tidyverse)
library(RColorBrewer)
library(sf)
library(rgdal)
``````
``````source("ScotlandMap.R") # need to read in the ScotlandMap function
testdat <- runif(56) # generate random numbers to use as observations
ScotlandMap(testdat,figtitle="Scotland random numbers")
``````
``````We are interested in estimating the relative risk (RR) for each area. The relative risk is the parameter
that quantifies whether an area i has a higher ( RRi> 1 ) or lower ( RRi< 1 ) occurrence of cases than
what we would expect from the reference rates. In order to estimate the relative risk we are going to fit
a Poisson model of the following form:
``````
``````Obs i  Pois ( i )
log( i ) = log(Exp i ) +  0 + i
RRi = exp (  0 ) exp ( i )
``````
``````Where i  Normal (0 ,^2 ). Here, the Exp(ected) numbers are an offset, which are treated as fixed
data (i.e. no coefficient is estimated for this term).
b. [3 marks] Describe the role (interpretation) of  0 and the set of  s in this model and how they contribute
to the estimation of RR. Focus on the meaning of these parameters in the estimation of the relative risk.
``````
``````c. [14 marks] Code up this Poisson(-Lognormal) model inJAGSto analyse the Scotland data. For the
parameter  0 use a Normal prior with mean 0 and variance 1000. And for the precision parameter
(= 1 /^2 ) of i use a prior   Gamma (1 , 0. 05)(which is parametrised so as to have a mean of 20).
Initialise 2 chains and run the model with these two chains. You will have to decide on the appropriate
values ofn.iterandburnin. Produce trace plots for the chains and summaries of all the model
parameters. Investigate whether the chains for all the parameters have converged.
d. [4 marks] Extract the posterior means for the RR and map them. Compare the posterior mean relative
risk to the SMR for each area. Note that we can think about the SMR as the maximum likelihood
estimate of the relative risk.
e. [7 marks] Now calculate the posterior probabilities that the relative risk in each area exceeds 1. Extract
these probabilities and map them. For which areas are you at least 95% certain that there is an excess
risk of lip cancer (i.e. relative risk > 1)?
f. [6 marks] Repeat the analysis with different priors for  0 and . The exact choice is yours, but explain
why you have chosen them and what they mean. Map the two sets of RRs and explain any differences
you see in the map and the summaries of the posteriors for the parameters of the model.
``````
1. In this question we will analyse the effect different models have on the posterior distribution. We will use thesurgical.csvfile, which contains data on 12 hospitals, where the columnngives the total number of heart surgery operations carried out on children under 1 year old by each centre between April 1991 and March 1995, and the columnrgives the number of these operations where the patient died within 30 days of the operation. A binomial model seems reasonable for these data, so we will focus on comparing different models for the hospital specific mortality rates. We will first fit a logistic random effects model of the following form:
``````ri  Binom ( ni,i ) ,
logit ( i ) N ( ,^2 ).
``````
``````This model assumes that failure rates ( i ) across hospitals are similar in some way. This similarity is
represented by the assumption that the mortality rates for the hospitals are drawn from the same (parent)
distribution.
a. [9 marks] Code this model in JAGS to analyse the surgical data. Use vague priors for the parameters
of the shared normal distribution. In particular you can use the following model definition:
jags.mod <- function (){
for (i in 1:12){
r[i] ~ dbin(theta[i],n[i])
logit(theta[i]) <- logit.theta[i]
logit.theta[i] ~ dnorm(mu, tau)
}
mu ~ dnorm(0,1.0E-3)
tau <- 1/sigma
sigma ~ dunif(0,100)
}
``````
``````Run the model for 10,000 iterations, with 2 chains, discarding the first 5,000 as burn-in. Produce
traceplots for the chains and summaries for the fitted parameters. Comment on whether the chains for
all the parameters have converged.
b. [8 marks] We want to know whether the assumptions of the above random effects model are reasonable,
or whether there is any evidence that the mortality rates for one or more of the hospitals are not drawn
from the same parent distribution. We will assess this by comparing the observed and predicted number
of deaths in each hospital.
``````
``````Edit your JAGS model so that it (i) finds the posterior prediction of the number of deaths ( rpredi ) in
each hospital, and (ii) computes the following posterior probability for each hospital:
``````
``````pi = P ( ripred> ri ) +
``````
2
``````P ( rpredi = ri ).
``````
``````The above probability is called the mid p-value. This allows us to summarise conflict between discrete
quantities.
Exa mining the mid p-values, are there any hospitals that appear to have an unusual mortality rate?
Confirm your conclusion by producing kernel density plots of the predicted number of deaths in each
hospital, and visually comparing these to the observed number of deaths in each hospital.
c. [8 marks] As an alternative model we will consider the independent (fixed) effects model, which has the
following hospital-specific mortality rates:
``````
``````i  Unif (0 , 1) ,
``````
``````for i = 1 , 2 ,..., 12.
Fit this alternative model, and compare the posterior distributions of the hospital mortality rates under
the random effects and the fixed effects models. Interpret the differences. (Hint: the comparison is
easiest to do by producing box plots of the mortality rates  under each model).
d. [4 mark] Compute the mid p-values for the fixed effects model and compare these to the mid p-values
of the random effects model. Does the fixed effects model better explain the unusual mortality rates?
``````

B. Classification [34 marks]

The following figure shows the information in the datasetClassification.csv- it shows two different groups, plotted against two explanatory variables. This is simulated data – the aim is to find a suitable method for classifying the 1000 datapoints into two groups from a selection of possible approaches.

``````
``````
``````
``````
``````
``````
``````0
``````
``````1
``````
``````2
``````
``````3
``````
``````2 0 2
``````
Group
``````0
1
``````
1. [5 marks] Summarise the two groups (separately) in terms of the variables X 1 and X 2 , and for each group plot the distribution of both variables. Describe your findings.
1. [4 marks] Considering the plot showing the observations, your density plots and the numerical summaries, which of the following classification methods do you think are suitable for classifying this data? For each method, you should explain the reasons behind your answer (i.e. why is the given method suitable/why it might not be suitable)? a. Linear discriminant analysis. b. Quadratic discriminant analysis. c. K-nearest neighbour classification. d. Support vector machines. e. Random forests.
2. [1 marks] Select 80% of the data to act as a training set, with the remaining 20% for testing/evaluation.
3. [22 marks] Choose three of the methods listed in Q2 that might be suitable to classify the data. Perform classification using these methods. (If you use more than three of the listed methods, only the first three will be considered for marking). In each case, briefly describe how the given classification method classifies the data, present the results of an evaluation of the model fit (highlighting different aspects of the model performance) and describe your findings. Where appropriate optimise the (hyper)parameters of the method.
4. [2 marks] Compare the results from your chosen three approaches and select what you think is the best method for classification in this case, explaining your reasoning.

Total for paper = 100 marks

The deadline for submission is Noon (12pm), 31st March. Note that late submissions will be penalised.

You should submit a pdf that contains your answers (and relevant output/plots!) to the questions via eBart. In Part A you should use the R programming language, but in Part B you can choose to use R or Python (or both).