统计代写 | R代写 | data science | python | data mining – Alcohol and congenital malformations

Alcohol and congenital malformations

统计代写 | R代写 | data science | python – 本题是一个数据分析比较典型的等代写方向的题目

ios代写 代写ios swift代写 移动开发 苹果代写

Johanna G. Neslehova

09/03/

In this illustration, we will study the effect of alcohol on congenital malformations. The data are as follows.

alc <- factor ( c ("0","<1","1-2","3-5",">= 6"),levels= c ("0","<1","1-2","3-5",">= 6")) alc.s <- c (0,0.5,1.5,4,7) malf <- c (48,38,5,1,1) no.malf <- c (17066,14464,788,126,37) data <- data.frame ( cbind (malf,no.malf,alc,alc.s)) data

malf no.malf alc alc.s

1 48 17066 1 0.

2 38 14464 2 0.

3 5 788 3 1.

4 1 126 4 4.

5 1 37 5 7.

Here,alcrecords the average number of drinks a pregnant woman has per day,malfthe number of babies with malformations andno.malfthe number of babies without malformations.

To study whether alcohol consumption has an effect, we can first fit a logistic model to the data with the intercept andalcas predictors. Fit the model and look at its summary.

m1 <- glm ( cbind (malf,no.malf) ~ alc,family= binomial) summary (m1)

Call:

glm(formula = cbind(malf, no.malf) ~ alc, family = binomial)

Deviance Residuals:

[1] 0 0 0 0 0

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -5.87364 0.14454 -40.637 <2e-16 ***

alc<1 -0.06819 0.21743 -0.314 0.

alc1-2 0.81358 0.47134 1.726 0..

alc3-5 1.03736 1.01431 1.023 0.

alc>= 6 2.26272 1.02368 2.210 0.0271 *

Signif. codes: 0′‘0.001 ‘‘ 0.01’‘ 0.05′.’ 0.1” 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 6.2020e+00 on 4 degrees of freedom

## Residual deviance: 1.8163e-13 on 0 degrees of freedom
## AIC: 28.
##
## Number of Fisher Scoring iterations: 4
Are the data entered in a grouped form? Yes, the data entered in a group form. Why are the deviance
residuals all 0? Because this model is saturated. Now compare the observed and the fitted counts.
m <- malf + no.malf
fitm1.malf <- predict (m1,type="response") * m
fitm1.no.malf <- m - fitm1.malf
cbind (malf,no.malf,fitm1.malf,fitm1.no.malf)
## malf no.malf fitm1.malf fitm1.no.malf
## 1 48 17066 48 17066
## 2 38 14464 38 14464
## 3 5 788 5 788
## 4 1 126 1 126
## 5 1 37 1 37

Why are the observed counts and the fitted counts the same? It is no surprise that odds and odds rat ios estimated using the modelm1and the sample odds and odds ratios are the same. Now compare the odds and odds ratios estimated using the modelm1and the sample odds and odds ratios. prob.observed <- malf / (malf + no.malf) prob.fitted <- as.numeric ( predict (m1,type="response")) a <- log (prob.observed / (1 prob.observed)) b <- log (prob.fitted / (1 prob.fitted)) #exp(a) #exp(b) exp (a)[2 : 5] /exp (a)[1 : 4]

## [1] 0.9340835 2.4151750 1.2507937 3.
#log(exp(a)[2:5]/exp(a)[1])
#coef(m1)[2:5]
exp (b)[2 : 5] /exp (b)[1 : 4]
## [1] 0.9340835 2.4151750 1.2507937 3.
It seems that the odds of having a malformation increase with increasing alcohol consumption of the mother.
However, is effect of alcohol significant? the effect of alcohol is not significant
anova (m1,test= "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: cbind(malf, no.malf)
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 4 6.
## alc 4 6.202 0 0.000 0.

What do you conclude from the analysis of deviance? This does not mean that alcohol consumption as such

has no effect, It simply means that the data do not provide enough evidence that the factor predictoralcis significant.

Becausealcis ordinal (that is, its values can be ordered), we can try to turn it into a continuous predictor by assigning a score to each category. This is the variablealc.s. Fit the model, look at its summary and assess whether alcohol is significant.

alc.s

[1] 0.0 0.5 1.5 4.0 7.

m3 <- glm ( cbind (malf,no.malf) ~ alc.s,family=binomial) summary (m3)

Call:

glm(formula = cbind(malf, no.malf) ~ alc.s, family = binomial)

Deviance Residuals:

1 2 3 4 5

0.5921 -0.8801 0.8865 -0.1449 0.

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -5.9605 0.1154 -51.637 <2e-16 ***

alc.s 0.3166 0.1254 2.523 0.0116 *

Signif. codes: 0′‘0.001 ‘‘ 0.01’‘ 0.05′.’ 0.1” 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 6.2020 on 4 degrees of freedom

Residual deviance: 1.9487 on 3 degrees of freedom

AIC: 24.

Number of Fisher Scoring iterations: 4

anova (m3,test="Chi")

Analysis of Deviance Table

Model: binomial, link: logit

Response: cbind(malf, no.malf)

Terms added sequentially (first to last)

Df Deviance Resid. Df Resid. Dev Pr(>Chi)

NULL 4 6.

alc.s 1 4.2533 3 1.9487 0.03917 *

Signif. codes: 0′‘0.001 ‘‘ 0.01’‘ 0.05′.’ 0.1” 1

Now alcohol consumption becomes significant. Because the data are entered in a grouped form. Check the goodness of fit ofm3using a suitable statistical test.

pchisq ( deviance (m3),df=3,lower.tail= FALSE)
## [1] 0.

What do you conclude at the 5% level? The p value is high, so there is no evidence of a lack of fit. Alternatively, we can also merge some categories ofalcto reduce the number of parameters. Repeat the above steps: fit the model, check whether alcohol is significant, and assess goodness-of-fit. alc.g <- c (0,0,1,1,1) m4 <- glm ( cbind (malf,no.malf) ~ alc.g,family=binomial) summary (m4)

##
## Call:
## glm(formula = cbind(malf, no.malf) ~ alc.g, family = binomial)
##
## Deviance Residuals:
## 1 2 3 4 5
## 0.21136 -0.23221 -0.33919 0.07411 1.
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.9043 0.1080 -54.680 <2e-16 ***
## alc.g 0.9927 0.3944 2.517 0.0118 *
## ---
## Signif. codes: 0'***'0.001 '**' 0.01'*' 0.05'.' 0.1'' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6.2020 on 4 degrees of freedom
## Residual deviance: 1.3511 on 3 degrees of freedom
## AIC: 23.
##
## Number of Fisher Scoring iterations: 5
anova (m4,test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: cbind(malf, no.malf)
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 4 6.
## alc.g 1 4.8509 3 1.3511 0.02763 *
## ---
## Signif. codes: 0'***'0.001 '**' 0.01'*' 0.05'.' 0.1'' 1
pchisq ( deviance (m4),df=3,lower.tail=FALSE)
## [1] 0.
To comparem3andm4, we can look at the fitted odds ratios and compare them to the odds ratios in the
saturated model.
prob.fitted.m3 <- as.numeric ( predict (m3,type="response"))
c <- log (prob.fitted.m3 / (1 - prob.fitted.m3))
prob.fitted.m4 <- as.numeric ( predict (m4,type="response"))
d <- log (prob.fitted.m4 / (1 - prob.fitted.m4))
exp (c)[2 : 5] /exp (c)[1 : 4]
## [1] 1.171494 1.372399 2.206486 2.
exp (d)[2 : 5] /exp (d)[1 : 4]
## [1] 1.000000 2.698628 1.000000 1.
exp (b)[2 : 5] /exp (b)[1 : 4]
## [1] 0.9340835 2.4151750 1.2507937 3.

We can also compare the two models using AIC.

AIC (m3)
## [1] 24.
AIC (m4)
## [1] 23.
Also, we can look at the fitted counts and comparing them to the observed counts for the two models.
m <- malf + no.malf
fitm3.malf <- predict (m3,type="response") * m
fitm3.no.malf <- m - fitm3.malf
fitm4.malf <- predict (m4,type="response") * m
fitm4.no.malf <- m - fitm4.malf
round ( cbind (malf,fitm3.malf,fitm4.malf,no.malf,fitm3.no.malf,fitm4.no.malf),1)
## malf fitm3.malf fitm4.malf no.malf fitm3.no.malf fitm4.no.malf
## 1 48 44.0 46.6 17066 17070.0 17067.
## 2 38 43.7 39.4 14464 14458.3 14462.
## 3 5 3.3 5.8 788 789.7 787.
## 4 1 1.2 0.9 126 125.8 126.
## 5 1 0.9 0.3 37 37.1 37.

Which of the two models do you prefer? Your answer is subjective, which is OK, as long as it is justified.