CP3403/ CP5634 Data Mining

Final Exams SP51, 2021

On-line Examination for JCUS student only

College Of Science and Engineering

Exam Duration:

2 hours. This includes uploading answers to the dropbox within the given duration; late submissions will not be processed.

Exam Availability:

The exam is available on LearnJCU for 4 hours between 2pm and 6pm. Once opened, it must be completed within the stipulated duration.

Exam Conditions:

Email [email protected] if you have any problems during the exam.

Exam Advisory:

Students are advised not to share their answer scripts with other students, copy material from online or offline sources, or discuss their answers with others.

Students caught through plagiarism checks will be referred to the Exam Misconduct Panel, with the possibility of their exam attempt being voided or recorded as a failure in the subject.

Instructions to Students (lecturer to amend accordingly):

  1. Students answer the exam by typing in a Word document or handwriting on paper.
  2. Type/write your full name, JCU number and subject code in the top right-hand corner of each answer page.
  3. Number your answers clearly on each answer page.
  4. Upon completion of the answer script (save the file on your computer hard drive, or an external drive/USB, to minimise the risk of access disruption when completing and uploading the exam):
     a. For handwritten scripts on paper: take pictures of your answer script using the Clear Scanner app and save the answer script as a single PDF file.
     b. For scripts typed in a Word document: save the file.

Student Number |__|__|__|__|__|__|__|__|
Family Name _____________________
First Name _____________________

Submission: Submit the exam, then ensure the upload is complete before pressing ‘submit’ again. Confirm that you see a ‘pending’ notification to indicate that the assessment has been submitted.

Instructions to Candidates (lecturer to amend accordingly): Answer ALL questions. The paper has two sections. Section A consists of MCQ and True/False questions worth 22 marks. Section B has short essay questions worth 18 marks. This examination is worth 40 marks in total.

This is an open book exam. You may refer to the lecture notes/ tutorial materials and your personal notes to answer the questions. However, you are not allowed to access the Internet during the exam. You are permitted to use Excel for your calculations.

Section A (22 marks)

  1. You recently joined a Ministry of Health as a data scientist and have embarked on a project to determine how eating habits impact diabetes. You are provided with a data sheet containing all the patients' IDs, their bio-data and medical history. What is one serious concern with this exercise? *a) Data privacy b) Statistical accuracy c) Data relevance d) None of the above 1 Mark
  2. In the figure below, which shows histograms of various attributes of banking customers, what type of mining does this belong to?

a) Predictive mining *b) Descriptive mining c) Prescriptive mining d) None of the above 1 Mark

  3. In the figure below, the daily price values were averaged to give monthly price values. This process can result in:

a) Dimensionality reduction *b) Loss of information c) Discretisation d) Data sampling 1 Mark

  4. In feature selection, some of the factors that should be considered are: (there may be more than 1 answer) *a) Domain knowledge *b) Dirty data c) Variable type d) None of the above 1 Mark
  5. Suppose a medical company has sales receipts (the fact table) and three dimensions: patients, clinic and time.
     • Patient dimension: 10,000 unique patients
     • Clinic dimension: 10 unique clinics/hospitals
     • Time dimension: 5 years of daily data (assume 260 business days in a year)

     What is the maximum possible number of records in the fact table? a) 26 million *b) 130 million c) Variable type d) Cannot be calculated

1 Mark

  6. Calculate the number of records that could participate in the summation for the following query: total sales for all patients in the year 2020 in clinic A.

a) 5.2 million b) 130 million *c) 2.6 million d) Cannot be calculated

1 Mark
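The arithmetic behind the two fact-table questions above can be sketched as follows. This is a minimal illustration that assumes at most one fact row per (patient, clinic, business day) combination; the variable names are illustrative only.

```python
# Dimension sizes taken from the question text.
patients = 10_000
clinics = 10
business_days_per_year = 260
years = 5

# Upper bound: one fact row for every (patient, clinic, day) combination.
max_fact_rows = patients * clinics * business_days_per_year * years

# Query slice: all patients, a single clinic (A), a single year (2020).
rows_in_query = patients * 1 * business_days_per_year

print(f"{max_fact_rows:,}")  # 130,000,000
print(f"{rows_in_query:,}")  # 2,600,000
```

Fixing a dimension to a single value (one clinic, one year) simply replaces that dimension's size with 1, or with the number of days in the chosen year, in the product.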

  7. For the following diagram with 8 points, suppose DBSCAN is run on them with minPts = 3 and eps = 1. How many clusters and noise points result? a) 2 clusters and 3 noise points b) 1 cluster and 6 noise points *c) 2 clusters and 2 noise points d) All noise points

1 Mark

  8. Clustering is a form of unsupervised learning that has applications in: (there may be more than 1 answer) *a) Target advertising amongst customers b) Prediction of manufacturing machines' half-life *c) Insurance groups of customers d) All of the above 1 Mark
  9. In k-means clustering, the number of groups k is determined by: (there may be more than 1 answer) *a) Business reasons b) The more groups the better c) Reduction of within-group errors as much as possible d) All the above 1 Mark
  10. In the dendrogram below for agglomerative clustering, suppose the clustering stage ends at the orange line. At this stage, the following clusters are formed: a) {P1, P7}, {P6}, {P8}, {P5}, {P2, P4, P3} b) {P1, P7}, {P6, P8}, {P5}, {P2, P4, P3} *c) {P1, P7, P6}, {P8}, {P5}, {P2, P4, P3} d) {P1, P7}, {P6, P8}, {P5}, {P2}, {P4, P3}

1 Mark

  11. The formation of the dendrogram in agglomerative clustering depends on the following factors: (there may be more than 1 answer) *a) Type of linkage (single, complete, etc.) b) Number of clusters determined at the start *c) Distance measure (e.g. Euclidean, Manhattan) d) All the above 1 Mark
  12. In the following frequent pattern tree with minSup = 2 and the associated transactions listed, the itemset {A} is a: (there may be more than 1 answer) *a) Frequent itemset b) Closed itemset c) Maximal itemset d) None of the above

1 Mark

  13. In the same frequent pattern tree with minSup = 2 and the associated transactions listed, the itemset {B} is a: (there may be more than 1 answer)

*a) Frequent itemset *b) Closed itemset c) Maximal itemset d) None of the above 1 Mark

  14. The table below shows a transaction database for 4 items: DM book, baby toy, diapers and milk powder. For the association rule DM book -> Milk powder, the support and confidence are respectively:

a) support = 1.0, confidence = 0. *b) support = 0.6, confidence = 1. c) support = 0.8, confidence = 0. d) None of the above 1 Mark
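As a reminder of how support and confidence are computed for a rule such as DM book -> Milk powder, here is a minimal sketch. The transactions below are hypothetical and are NOT the exam's table; the function names are also illustrative.

```python
# Hypothetical transactions (not the exam's data), for illustration only.
transactions = [
    {"DM book", "milk powder", "diapers"},
    {"DM book", "milk powder"},
    {"DM book", "milk powder", "baby toy"},
    {"diapers", "milk powder"},
    {"baby toy", "diapers"},
]


def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)


def confidence(antecedent, consequent, db):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(antecedent | consequent, db) / support(antecedent, db)


s = support({"DM book", "milk powder"}, transactions)
c = confidence({"DM book"}, {"milk powder"}, transactions)
print(s, c)
```

A rule is usually called strong when both values meet the chosen minimum support and minimum confidence thresholds.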

  15. Consider a minimum support of 0.6 and a minimum confidence of 0.8. The association rule DM book -> Milk powder is considered: a) Weak rule and trivial *b) Strong rule and inexplicable c) Strong rule and trivial d) None of the above 1 Mark
  16. In model training, suppose you use hold-out of a part of the data instead of re-substitution for model testing. In this case, you can expect to get higher accuracy results. a) True *b) False 1 Mark
  17. In model training, suppose you use bootstrap sampling with replacement. In this case, it is possible to have a much larger training set. *a) True b) False 1 Mark
  18. In the confusion matrix below, the accuracy is:

a) 0. *b) 0. c) 0. d) 0. 1 Mark

  19. In a COVID test, it is better for the test kits to have high precision than high sensitivity so that possibly infected persons are detected.

*a) False b) True 1 mark
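The difference between accuracy, precision and sensitivity (recall) can be sketched from confusion-matrix counts. The counts below are hypothetical, chosen only to illustrate the formulas, and do not come from the exam's figure.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """Of the cases predicted positive, the share that are truly positive."""
    return tp / (tp + fp)


def sensitivity(tp, fn):
    """Of the truly positive cases, the share that the test detects."""
    return tp / (tp + fn)


# Hypothetical counts (an assumption for illustration).
tp, fp, fn, tn = 80, 5, 20, 895

print(round(accuracy(tp, fp, fn, tn), 3))
print(round(precision(tp, fp), 3))
print(round(sensitivity(tp, fn), 3))
```

In this hypothetical case 20 of 100 infected people are missed (the false negatives), which is what sensitivity measures and precision does not.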

  20. A classifier that tries to learn even the noise in the data tends to overfit. a) False *b) True 1 mark
  21. In the two sentences below, the word "bank" has different ______: i. John went to the bank to withdraw money. ii. John went to the river bank to have a picnic. a) Part of speech *b) Word sense c) Stop words d) None of the above 1 mark
  22. In a decision tree classifier, the attribute that has the highest information gain is usually placed near the leaf nodes: a) True *b) False 1 mark

Section B (Short essay questions) 18 marks

  1. In a bundled promotion campaign, McDonough decided to bundle a BugMac (B) together with a McFishie (M) (B -> M). You advised the manager that this is not advisable because there is low support and low confidence for this itemset, but the manager doesn't quite understand. Explain in layman's terms to the manager why such a promotion is not advisable. (In your answer, please do not give textbook definitions. Rather, marks are given for specificity to the situation.)

3 marks

  2. Consider the decision table below. Which variable can best determine whether Man Utd wins or not? (Hint: use information gain)

4 marks
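The information-gain calculation that the hint refers to can be sketched as below. The rows, column names and values are hypothetical (they are not the exam's decision table); only the entropy and gain formulas are standard.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical decision table (an assumption for illustration).
rows = [
    {"venue": "home", "weather": "dry", "win": "yes"},
    {"venue": "home", "weather": "wet", "win": "yes"},
    {"venue": "away", "weather": "dry", "win": "no"},
    {"venue": "away", "weather": "wet", "win": "no"},
]

for attribute in ("venue", "weather"):
    print(attribute, information_gain(rows, attribute, "win"))
```

The attribute with the largest gain separates the outcome classes best and would be chosen for the split nearest the root.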

  3. Suppose k-means clustering is done on the data points below with k=.

The initial seeds chosen are P3 and P6, indicated by the orange dots. In the first iteration, which points are clustered together? Compute the centroid of each of these clusters. Does your answer depend on whether you use the Euclidean or the Manhattan distance? (Please do not calculate the second iteration.)

4 marks
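One k-means assignment step, followed by the centroid update, can be sketched as follows. The coordinates are hypothetical (the exam's figure is not reproduced here); only the seed choice of P3 and P6 comes from the question.

```python
# Hypothetical coordinates (an assumption; not the exam's data points).
points = {
    "P1": (1, 1), "P2": (2, 1), "P3": (1, 2),
    "P4": (5, 4), "P5": (6, 5), "P6": (5, 5),
}
seeds = {"C1": points["P3"], "C2": points["P6"]}


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def assign(points, seeds, dist):
    """Assign every point to its nearest seed under the given distance."""
    clusters = {name: [] for name in seeds}
    for label, p in points.items():
        nearest = min(seeds, key=lambda s: dist(p, seeds[s]))
        clusters[nearest].append(label)
    return clusters


def centroid(labels, points):
    """Coordinate-wise mean of the points in one cluster."""
    coords = [points[label] for label in labels]
    return tuple(sum(axis) / len(coords) for axis in zip(*coords))


for dist in (euclidean, manhattan):
    clusters = assign(points, seeds, dist)
    print(dist.__name__, clusters,
          {name: centroid(members, points) for name, members in clusters.items()})
```

With these well-separated hypothetical points both distances give the same clusters; points lying roughly midway between the seeds are the ones whose assignment can change with the metric.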

  4. Text mining. Calculate the cosine similarity between the following 2 sentences.
     i. James Cook University is ranked in the top 2% universities of the world.
     ii. Captain James Cook founded Australia.

     Use only the following vocabulary words (with their stemmed forms) for the calculation: James, Cook, university, Australia and rank. Ignore other words.

4 marks
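The cosine-similarity calculation can be sketched as below. This is a minimal illustration that assumes raw term counts as vector entries (a binary or TF-IDF weighting would give a different number) and uses a small hand-written stem map in place of a real stemmer; both choices are assumptions.

```python
import math
import re

vocabulary = ["james", "cook", "university", "australia", "rank"]

# Hand-written stem map standing in for a real stemmer (an assumption).
stems = {"ranked": "rank", "universities": "university"}


def term_vector(sentence):
    """Count how often each vocabulary word occurs after stemming."""
    tokens = [stems.get(t, t) for t in re.findall(r"[a-z]+", sentence.lower())]
    return [tokens.count(word) for word in vocabulary]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


s1 = "James Cook University is ranked in the top 2% universities of the world."
s2 = "Captain James Cook founded Australia."

v1, v2 = term_vector(s1), term_vector(s2)
similarity = cosine(v1, v2)
print(v1, v2, round(similarity, 4))
```

Note that under stemming "University" and "universities" map to the same vocabulary word, so with raw counts that entry is 2 for the first sentence.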

  5. A predictive classification using a single perceptron was done to determine whether a lift machine will break down within a year. The model was trained with three variables: time since last servicing in months ($x_1$), age of the lift in years ($x_2$) and frequency of usage ($x_3$, number of trips per week). The resulting model is:

$$\hat{y}(x_1, x_2, x_3) = 1.0 - 0.05\,x_1 - 0.1\,x_2 + 0.002\,x_3 > 0$$

Consider two different lifts. Lift A has undergone servicing recently but is quite old. It is heavily used (number of trips per week = 200). Lift B is quite new, but its last servicing was quite a while ago. It is not often used (number of trips per week = 50). Which lift is predicted to break down in the coming year? Which factor appears most important in determining whether the lift will break down?

  • Lift A – last servicing 1.5 months ago, 15 years old
  • Lift B – last servicing 1 year ago, 1.5 years old

3 marks
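Evaluating the perceptron for the two lifts can be sketched as follows. This sketch rests on several assumptions: that the model is 1.0 − 0.05·x1 − 0.1·x2 + 0.002·x3 with negative weights on x1 and x2, that a positive score means a breakdown is predicted, and that "1 year ago" is entered as 12 months.

```python
def score(x1, x2, x3):
    """Perceptron score; the signs of the x1 and x2 weights are assumed."""
    return 1.0 - 0.05 * x1 - 0.1 * x2 + 0.002 * x3


# (months since last servicing, age in years, trips per week)
lifts = {
    "A": (1.5, 15, 200),
    "B": (12, 1.5, 50),  # 1 year since servicing taken as 12 months
}

for name, (x1, x2, x3) in lifts.items():
    s = score(x1, x2, x3)
    verdict = "breakdown predicted" if s > 0 else "no breakdown predicted"
    # Per-variable contributions show how much each term moves the score.
    contributions = (-0.05 * x1, -0.1 * x2, 0.002 * x3)
    print(name, round(s, 3), verdict, contributions)
```

Comparing the per-variable contributions rather than the raw weights is one way to judge importance, since the variables are measured on very different scales.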