Math 308: Fundamentals of Statistical Learning
Assignment 3
(Q1) Multiple Correspondence Analysis: For this question we will look at a canonical breast cancer dataset from the machine learning community, which we previously examined in Assignment 2 for Bivariate Correspondence Analysis. The data set is publicly available from the UCI Machine Learning Repository and has the following variables (description can be found here).
- Recurrent event ("RecEv"): whether the patient experienced cancer recurrence or not;
- Age ("AgeGrp"): age group of the patient at the time of diagnosis;
- Menopause ("Meno"): whether the patient is pre- or postmenopausal at time of diagnosis ("lt40" means menopause occurred before 40, "ge40" means at or after 40)
- Tumor size ("Size"): the greatest diameter (in mm) of the excised tumor;
- Inv-nodes ("InvNodes"): the number (range 0 – 39) of axillary lymph nodes that contain metastatic breast cancer visible on histological examination;
- Node caps ("NodeCaps"): if the cancer does metastasize to a lymph node, although outside the original site of the tumor it may remain "contained" by the capsule of the lymph node. However, over time, and with more aggressive disease, the tumor may replace the lymph node and then penetrate the capsule, allowing it to invade the surrounding tissues;
- Degree of malignancy ("DegMal"): the histological grade (range 1-3) of the tumor. Tumors that are grade 1 predominantly consist of cells that, while neoplastic, retain many of their usual characteristics. Grade 3 tumors predominantly consist of cells that are highly abnormal;
- Breast side ("Side"): breast cancer may obviously occur in either the left or right breast;
- Breast quadrant ("Quad"): the breast may be divided into four quadrants, using the nipple as a central point;
- Irradiation ("Irrad"): radiation therapy is a treatment that uses high-energy x-rays to destroy cancer cells.
library(tidyverse)
library(kableExtra)
library(ggpubr)
library(FactoMineR)
breast_cancer <- read_csv("breast_cancer_data.csv", col_names = FALSE, col_types = cols())
names(breast_cancer) <- c("RecEv", "AgeGrp", "Meno", "Size", "InvNodes",
                          "NodeCaps", "DegMal", "Side", "Quad", "Irrad")
### Remove missing obs
breast_cancer <- breast_cancer %>% filter(Quad != "?", NodeCaps != "?")
Math 308: Winter 2023 Shomoita Alam
RecEv                 AgeGrp   Meno     Size   InvNodes  NodeCaps  DegMal  Side   Quad       Irrad
no-recurrence-events  AG30-39  premeno  30-34  IN0-2     no        3       left   left_low   no
no-recurrence-events  AG40-49  premeno  20-24  IN0-2     no        2       right  right_up   no
no-recurrence-events  AG40-49  premeno  20-24  IN0-2     no        2       left   left_low   no
no-recurrence-events  AG60-69  ge40     15-19  IN0-2     no        2       right  left_up    no
no-recurrence-events  AG40-49  premeno  0-4    IN0-2     no        2       right  right_low  no
no-recurrence-events  AG60-69  ge40     15-19  IN0-2     no        2       left   left_low   no
## Prefix the age groups with "AG" and the node-count groups with "IN"
breast_cancer <- breast_cancer %>% mutate(AgeGrp = paste("AG", AgeGrp, sep = ""),
                                          InvNodes = paste("IN", InvNodes, sep = ""))
head(breast_cancer) %>% kable(.) %>% kable_styling()
For this part, we will analyze all of the variables together in a single multiple correspondence
analysis. Note that in order to use MCA, we need to make one more change to the dataset:
## Convert all columns to factors
breast_cancer <- breast_cancer %>% mutate(across(everything(), factor))
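MCA works on categorical variables, so a grade stored as a number (as DegMal most likely is after reading the file) has to become a factor as well. The following is a minimal sketch of that idea in base R; the data frame `toy` and its values are hypothetical and are not the assignment data.

```r
# Hypothetical toy data (NOT the assignment data): two text columns and one numeric grade
toy <- data.frame(
  Meno   = c("premeno", "ge40", "premeno", "lt40"),
  Irrad  = c("no", "yes", "no", "no"),
  DegMal = c(3, 2, 2, 1),
  stringsAsFactors = FALSE
)

# Convert every column to a factor, keeping the data frame structure
toy[] <- lapply(toy, factor)

# Check that every column is now a factor, including the numeric grade
all_factor <- all(sapply(toy, is.factor))

# Number of categories per variable; MCA represents each category as one point
n_levels <- sapply(toy, nlevels)
n_categories <- sum(n_levels)

print(all_factor)
print(n_levels)
```

The total number of categories is worth noting because it determines how many points appear on the category factor map.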
a) (20 points) Conduct a multiple correspondence analysis for this data, being sure to complete the
following tasks:
- Report the table of eigenvalues for the first 5 components and explain how many components you think are sufficient to analyze the data.
- Generate a factor map for the first two dimensions of the correspondence analysis (regardless of what your answer is to the first bullet point). Give a summary of which levels of which variables are most strongly associated with each of the first two dimensions and how you made your decisions.
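The mechanics of the two tasks above can be sketched with FactoMineR as follows. This is only an illustration on hypothetical random data (the data frame `toy` is made up, so its output carries no meaning about breast cancer), and the 1 / (number of variables) cutoff is one common rule of thumb rather than a required criterion.

```r
library(FactoMineR)

set.seed(308)  # only so the hypothetical toy data is reproducible
n <- 60
# Hypothetical random factors standing in for the real breast_cancer data
toy <- data.frame(
  RecEv  = factor(sample(c("no-recurrence-events", "recurrence-events"), n, replace = TRUE)),
  DegMal = factor(sample(c("grade1", "grade2", "grade3"), n, replace = TRUE)),
  Irrad  = factor(sample(c("irr-no", "irr-yes"), n, replace = TRUE)),
  Side   = factor(sample(c("left", "right"), n, replace = TRUE))
)

# Fit the MCA without drawing the default plots
res <- MCA(toy, ncp = 5, graph = FALSE)

# Eigenvalue table: eigenvalue, percentage of variance, cumulative percentage
eig <- res$eig
print(head(eig, 5))

# Rule of thumb (an assumption, not the only option): keep dimensions whose
# eigenvalue exceeds the average eigenvalue, 1 / (number of variables)
keep <- which(eig[, 1] > 1 / ncol(toy))

# Factor map of the categories on dimensions 1 and 2, with individuals hidden
plot(res, axes = c(1, 2), invisible = "ind")

# Contributions (in %) of each category, sorted by contribution to dimension 1
contrib <- res$var$contrib
print(head(contrib[order(-contrib[, 1]), 1:2]))
```

Categories with large contributions and coordinates far from the origin on a given axis are the natural candidates for "most strongly associated" with that dimension.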
b) (10 points) Note that of particular interest is which variables are related to cancer recurrence.
Does the recurrence variable load highly on any of the dimensions that you discussed in part (a)?
If so, explain which dimensions those are and for each of those dimensions, indicate what other
variable levels also load highly on those dimensions.
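One possible way to inspect how a single variable relates to the dimensions is sketched below, again on hypothetical random data rather than the assignment data. It assumes the `eta2`, `contrib`, and `cos2` components of the FactoMineR MCA result, and picks out the recurrence levels by matching "recurrence" in the row names.

```r
library(FactoMineR)

set.seed(308)  # only so the hypothetical toy data is reproducible
n <- 60
# Hypothetical random factors (NOT the assignment data)
toy <- data.frame(
  RecEv  = factor(sample(c("no-recurrence-events", "recurrence-events"), n, replace = TRUE)),
  DegMal = factor(sample(c("grade1", "grade2", "grade3"), n, replace = TRUE)),
  Side   = factor(sample(c("left", "right"), n, replace = TRUE))
)
res <- MCA(toy, ncp = 2, graph = FALSE)

# Squared correlation ratio between each variable and each dimension (0 to 1)
eta2 <- res$var$eta2
print(eta2["RecEv", ])

# Rows of the category-level tables that belong to the recurrence variable
rec_rows <- grep("recurrence", rownames(res$var$contrib))
print(res$var$contrib[rec_rows, ])  # contribution (%) of each recurrence level
print(res$var$cos2[rec_rows, ])     # quality of representation of each level
```

For any dimension where the recurrence levels stand out, sorting the full contribution table for that dimension shows which other variable levels load alongside them.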