### The data

VO₂ max, the maximum rate of oxygen consumption during exercise, is a useful measure of physical fitness, but measuring it accurately is unpleasant for the individual concerned. Models that describe VO₂ max using easy-to-measure, minimally invasive variables are therefore very useful.

The dataset contains a number of possible predictors of VO₂ max; please see the associated data dictionary. Note that these data are simulated to mirror real data. Simulating the data allows me to provide you with a suitably structured dataset with the right features.

### The tasks

There are two tasks associated with this ICA. You must complete both. Briefly (more details below):

- Write a short supplement to the existing STAT0006 course notes, which explains what outliers, points of high leverage, and influential observations are in the context of a linear model; this supplement should also describe the methods available to detect such observations. You should make sure that your supplement is suitable for students on this course.
- Analyse the VO₂ max data, culminating in building a model for VO₂ max, and write up your findings as a short report.

#### Task 1

As a group, investigate what is meant by the terms outlier, point of high leverage and influential observation in the context of linear models. How do the following methods/statistics help us to spot such observations?

- Hat matrix;
- Cook's distance;
- Difference in Fits (DFFITS);
- Difference in Betas (DFBETAS).
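As a rough illustration of what these four quantities look like in practice, the sketch below computes them from their standard closed-form expressions for an ordinary least squares fit. Everything in it is an assumption made for illustration only: the use of Python with NumPy, the hypothetical function name `influence_diagnostics`, and the small simulated dataset with one deliberately unusual point (this is not the ICA dataset). It is not a prescribed method, and you are free to use whatever software you prefer.

```python
import numpy as np


def influence_diagnostics(X, y):
    """Leverage, Cook's distance, DFFITS and DFBETAS for an OLS fit.

    X is the (n, p) design matrix, including the intercept column.
    The function name and the returned dictionary keys are hypothetical.
    """
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta

    # Leverage: diagonal of the hat matrix H = X (X'X)^{-1} X'
    leverage = np.einsum("ij,jk,ik->i", X, xtx_inv, X)

    # Residual variance, from all points and with point i deleted
    s2 = resid @ resid / (n - p)
    s2_del = ((n - p) * s2 - resid**2 / (1 - leverage)) / (n - p - 1)

    # Internally and externally studentised residuals
    r_int = resid / np.sqrt(s2 * (1 - leverage))
    r_ext = resid / np.sqrt(s2_del * (1 - leverage))

    # Cook's distance: overall shift in the fitted values when i is deleted
    cooks = r_int**2 * leverage / (p * (1 - leverage))

    # DFFITS: scaled change in the i-th fitted value when i is deleted
    dffits = r_ext * np.sqrt(leverage / (1 - leverage))

    # DFBETAS: scaled change in each coefficient when i is deleted
    delta_beta = (xtx_inv @ X.T) * (resid / (1 - leverage))  # shape (p, n)
    scale = np.sqrt(s2_del)[:, None] * np.sqrt(np.diag(xtx_inv))[None, :]
    dfbetas = delta_beta.T / scale  # shape (n, p)

    return {"leverage": leverage, "cooks": cooks,
            "dffits": dffits, "dfbetas": dfbetas}


# Simulated example: a straight-line relationship plus one planted point
# that has an extreme x value and lies far from the line.
rng = np.random.default_rng(0)
n = 30
x = rng.normal(size=n)
y = 2 + 3 * x + rng.normal(scale=0.5, size=n)
x[-1], y[-1] = 6.0, 10.0  # the underlying line would give about 20 at x = 6
X = np.column_stack([np.ones(n), x])

diag = influence_diagnostics(X, y)
worst = int(np.argmax(diag["cooks"]))
print(worst, diag["leverage"][worst], diag["cooks"][worst])
```

In this sketch the leverages always sum to the number of fitted coefficients, which is one reason leverage is usually judged relative to its average value rather than on an absolute scale. The deletion-based quantities are computed without refitting the model, using the identity that removing point i changes the coefficient vector by (X'X)⁻¹xᵢeᵢ/(1 − hᵢ).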

Write up your findings in the style of a supplement to the current STAT0006 course notes. That is, your supplement should be accessible (understandable) to STAT0006 students. Your supplement should not be overly mathematical. Instead, your focus should be on explaining:

- what outliers, points of high leverage and influential observations are;
- what the above methods/statistics are: what they do and how;
- how to use the methods/statistics listed above to detect such observations;
- what courses of action are sensible if you detect such observations.

This supplement should be no longer than 2 pages. Minimum font size is 11pt, but you may choose your margin size and font. Any graphics should be large enough to be easily readable, adequately labelled, and captioned.

#### Task 2

How do the factors in the given dataset collectively affect VO₂ max? Your analysis should include building a linear model for VO₂ max. Write a report on your findings, which should include the following things:

- An initial exploratory analysis of the dataset. The aim of this is to give someone who doesn't have access to the data an overview of what the data are and a feel for the variables in the dataset (e.g. summaries of each variable or simple relationships). This should be non-technical.
- A description of how you approached the model-building phase. Don't just show your chosen final model. How did you choose your particular model? What processes did you go through?
- A check of the final model for any outliers, points of high leverage and influential observations, using what you learned from doing Task 1 of this ICA. Comment on these, including any course of action you took as a result (if any was needed).
- A brief description of the final model. What does it tell you about the drivers of VO₂ max?
- Conclusion, including a brief discussion of limitations of the data and model. Do you think the model is reliable?

The maximum length for Task 2 is two sides of A4, including any plots/tables/figures. Make sure that any plots/tables/figures are legible (i.e. don't squeeze them in if they are not readable; you will be penalised for this). The minimum font size is 11pt. You may choose your own margin size. Given the maximum length, you are strongly advised to select plots/tables/figures with care.

### Administrative details

#### Basic details

- This assessment counts for 25% of your final mark for STAT0006.
- You should work in groups of no more than 5 students. You may work on the project alone if you wish, but note that this is not efficient. It is up to you to form your own groups. You should have already registered your group on Moodle.

In addition to the outputs from Tasks 1 and 2, all groups must submit an additional page where each group member briefly describes their contribution to the project.

- You will need to agree this in your groups before submitting the report.
- If all group members agree that everyone contributed equally, then it is sufficient to write a single sentence to that effect, or alternatively you are very welcome to describe your own personal contribution to the project.
- Note that I will not mark this page, nor allocate different marks to different group members based on this. The purpose is to encourage you all to be mindful about contributing to this piece of groupwork.

- If you feel that one or more of your peers is not contributing fairly, please contact me by email in the first instance BEFORE SUBMISSION of the report and as early as possible.

You should insert the student ID numbers of all students in your group on the report, but do not write your names. Your report will be marked anonymously. This also applies to the page with descriptions of contributions.

#### How do I get help with this assignment?

You can ask for help from me during office hours. Please note that I will not provide comments on draft reports. Note that it may not be appropriate for me to answer all your questions.

You may also post to the Moodle forum to ask questions. Please do not email me with statistical questions – if you do, I will ask you to post them to the Moodle forum instead. This being said, you should email me immediately if you have any technical difficulties with Moodle (e.g. with submitting your report).

#### Submitting your work

The outputs from both tasks should be submitted by 12 noon on the 22nd November 2019. Details of how to submit will be announced a few days prior to this date.

#### How will the report be marked?

Your report will be marked out of 50, with allocation as follows:

- 20 marks for Task 1, split as follows:
  - 15 marks for technical accuracy of your supplement;
  - 5 marks for overall presentation and clarity of the supplement, including suitability for the intended audience.

- 30 marks for Task 2, split as follows:
  - 20 marks for the content of the report, including whether you have selected appropriate information and supporting evidence (e.g. plots, tables), whether your interpretation of the results is accurate, etc.
  - 10 marks for the presentation and clarity of the report overall, including clarity of expression and how easy it is to read and understand, whether you have structured the report sensibly, good use of plots/tables where appropriate, adequately sized graphics with suitably informative captions and labelling, and so on.

The mark you will receive is your group mark – everyone in the group will be awarded the same mark, unless there are exceptional circumstances (e.g. a member of a group did not contribute to the project).

Elinor M Jones
October 2019