Sharing or copying any part of the midterm is an infraction of the University’s rules on Academic Integrity and will be disciplined accordingly. Neither the TA nor I will be available to offer help with solving the exercises as we do with HW. Further, you are not to consult with fellow students or any other person regarding the Midterm. You are allowed to consult online notes, videos, lecture notes, R documentation, and other static resources – but not people.
You must complete the Midterm and turn in the .Rmd file and Report just like with HW. Submissions must be uploaded to our Compass 2g site on the Midterm page. No email, hardcopy, or late submissions will be accepted.
Your assignment must be submitted through the submission link on Compass 2g. You are required to attach one .zip file, named midterm_yourNetID.zip, which contains:
Your RMarkdown file, which should be saved as midterm_yourNetID.Rmd. For example, midterm_dunger.Rmd.
The result of knitting your RMarkdown file, saved as midterm_yourNetID.html. For example, midterm_dunger.html.
Your resulting .html file will be considered a “report” which is the material that will determine the majority of your grade. Be sure to visibly include all R code and output that is relevant to answering the questions. (You do not need to include irrelevant code you tried that resulted in error or did not answer the question correctly.)
You are granted an unlimited number of submissions, but only the last submission before the deadline will be viewed and graded.
Your .Rmd file should be written such that, if it is placed in a folder with any data you are asked to import, it will knit properly without modification.
Include your Name and NetID in the final document, not only in your filenames.
An Analysis of the Abalone Data
According to Wikipedia, Abalone is a common name for any of a group of small to very large sea snails, marine gastropod molluscs in the family Haliotidae. The dataset is an updated version of a study conducted by a team of scientists in 1994. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope. Other physical measurements, which are easier to obtain, can be used to predict the age.
The dataset contains 3977 observations and 9 variables. From the original data, examples with missing values were removed, and the ranges of the continuous values have been scaled for easier use with other statistical methods beyond what we’re doing.
Variables in order:
Sex – M, F, and I (infant)
Length – longest shell measurement (mm)
Diameter – measured perpendicular to length (mm)
Height – with meat in shell (mm)
Whole – whole abalone weight (g)
Shucked – weight of meat only (g)
Viscera – gut weight, after bleeding (g)
Shell – weight after being dried (g)
Rings – number of rings; adding 1.5 gives the age in years
The official training dataset for this Midterm to use for generating your model is the abalone_train.dat dataset which must be properly uploaded to R.
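One way to load the file is sketched below. The file layout is an assumption: the assignment does not say whether abalone_train.dat has a header row or which delimiter it uses, so this sketch assumes whitespace-delimited values with no header and assigns column names from the variable list above. The helper name read_abalone is hypothetical. Inspect the file first and adjust the arguments if the layout differs.

```r
# Hypothetical helper: read an abalone .dat file into a data frame.
# ASSUMPTIONS (not stated in the assignment): whitespace-delimited, no header row.
read_abalone <- function(path) {
  cols <- c("Sex", "Length", "Diameter", "Height",
            "Whole", "Shucked", "Viscera", "Shell", "Rings")
  dat <- read.table(path, header = FALSE, col.names = cols,
                    colClasses = c("character", rep("numeric", 8)))
  # Store Sex as a factor so lm() treats it as a categorical predictor.
  dat$Sex <- factor(dat$Sex, levels = c("F", "I", "M"))
  dat
}

# Relative path, so the .Rmd knits when the data file sits in the same folder.
if (file.exists("abalone_train.dat")) {
  abalone_train <- read_abalone("abalone_train.dat")
  str(abalone_train)  # confirm dimensions and column types before modelling
}
```

Reading Sex explicitly as character avoids R interpreting a value such as "F" as a logical before it is converted to a factor.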
Formulate and justify a best model for predicting Rings, and thus, by adding 1.5, you are predicting Age. You have access to all variables as predictors, but you do not have to include Sex in your analysis. Your analysis should be defensible using techniques and methods seen up to and including Chapter 11: Categorical Predictors and Interactions, and it should not include any analysis that we have not yet formally discussed.
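The mechanics of predicting Rings and converting to Age can be sketched as follows. This is only an illustration on simulated stand-in data: the data frame toy, its coefficients, and the formula are all made up for the example and are not a recommended model for the abalone data.

```r
# Simulated stand-in data; NOT the abalone measurements.
set.seed(420)
n <- 60
toy <- data.frame(
  Sex    = factor(rep(c("F", "I", "M"), each = n / 3)),
  Length = runif(n, 0.2, 0.8)
)
toy$Rings <- 3 + 12 * toy$Length + rnorm(n, sd = 1.5)

# Illustrative formula only: a categorical predictor, a numeric predictor,
# and their interaction.
fit <- lm(Rings ~ Sex * Length, data = toy)

# Predict Rings for new observations, then convert to Age by adding 1.5.
new_obs <- data.frame(
  Sex    = factor(c("F", "I"), levels = levels(toy$Sex)),
  Length = c(0.5, 0.3)
)
pred_rings <- predict(fit, newdata = new_obs)
pred_age   <- pred_rings + 1.5
pred_age
```

Because Age is just Rings shifted by a constant, the model can be fit on Rings and the 1.5 added at prediction time; the slope estimates are the same either way.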
Regarding final model selection
There is not necessarily one, singular correct answer/model, but certainly some methods and models are better than others in certain situations.
You do not necessarily have to use all predictors.
You may assume that the reader has limited knowledge of the STAT 420 concepts we’ve covered, but has no knowledge or background in the specific data problem. You will still need to explain the rationale for your decision-making, and the report should be a standalone document.
The popular question is, “How long should it be?” You’ll be creating an html file, so there’s no such thing as a page count. On one hand, you need to provide results and evidence to support your decisions, and you need to be thorough and diligent as you walk through the steps of finding your best model. On the other hand, a well-crafted data analysis will utilize brevity and conciseness. If you have a point to make, get to it. If you find yourself writing things simply for the sake of padding the word-count, you’re writing the wrong things.
Guidelines for the Report
The summary report, in html format, should include the following:
Title of the project.
Description of the original data file including description of all relevant variables. You can use the information on the Abalone data given above.
Description of the process you chose to follow.
Narrative of your step-by-step decision making process throughout the analysis as you adjusted the model.
Final model selected.
Write-up of the results. Point out notable information from tables and graphs as necessary.
Give an interpretation, in the context of this data, of all of the parameters in your final model selected.
Use the abalone_test.dat dataset, which contains 200 observations, to “test” your model. Report the mean and standard deviation of the errors when using your model to predict the ages of these observations.
Appendix section: (optional)
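The test-set item in the list above (mean and standard deviation of the prediction errors) could be computed along these lines. The helper name age_error_summary is hypothetical, and the sign convention (actual minus predicted) is an assumption, since the guidelines do not define "error". The demonstration uses simulated numbers, not the abalone data.

```r
# Hypothetical helper: summarise prediction errors on the Age scale.
# `fit` is a model whose response is Rings; `test` is a data frame holding
# the predictors used by `fit` plus the observed Rings.
age_error_summary <- function(fit, test) {
  pred_age   <- predict(fit, newdata = test) + 1.5
  actual_age <- test$Rings + 1.5
  # Assumed convention: error = actual - predicted.
  # The 1.5 offsets cancel, so this equals the error on the Rings scale.
  err <- actual_age - pred_age
  c(mean = mean(err), sd = sd(err))
}

# Toy demonstration with simulated numbers (not abalone data).
set.seed(1)
train <- data.frame(x = runif(50))
train$Rings <- 5 + 8 * train$x + rnorm(50)
test <- data.frame(x = runif(20))
test$Rings <- 5 + 8 * test$x + rnorm(20)

res <- age_error_summary(lm(Rings ~ x, data = train), test)
res
```

A mean error near zero suggests the predictions are not systematically high or low, while the standard deviation indicates how far individual predictions typically fall from the observed ages.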
Write in complete sentences and pay attention to grammar, spelling, readability and presentation. If you include a table or graph, make sure you say something about it. If you’re not discussing a result, then it doesn’t belong in your report.
Submit a .Rmd program file and the project report (.html file) in a .zip file just as you do in homework assignments.
The grading rubric for the final project is summarized by five criteria, each worth 20 points.
Methodology of model building
Correctness of results
Interpretation of results
.Rmd file/programming (similar to HW expectation)
Report organization and presentation (similar to HW expectation)
Maximum total points: 100