代写Data Structures | 作业C++ | 代做Scala – Coursework 7 (Scala)

代写Data Structures | 作业C++ | 代做Scala | Lab – 这是一个涉及C++ 以及scala的代写任务

Coursework 7 (Scala)

This coursework is worth 10%. You are asked to implement scala programs for measuring similarity in texts, and for recommending movies according to a ratings list. Note the second part might include material you have not yet seen in the first two lectures.

Important:

  • Make sure the files you submit can be processed by just calling scala <<filename.scala>>on the commandline.^1 Use the template files provided and do not make any changes to arguments of functions or to any types. You are free to implement any auxiliary function you might need.
  • Do not leave any test cases running in your code because this might slow down your program! Comment test cases out before submission, other- wise you might hit a time-out.
  • Do not use any mutable Data Structures in your submissions! They are not needed. This means you cannot create newArrays orListBuffers, for example.
  • Do not usereturnin your code! It has a different meaning in Scala than in Java.
  • Do not usevar! This declares a mutable variable. Only useval!
  • Do not use any parallel collections! No.partherefore! Our testing and marking infrastructure is not set up for it.

Also note that the running time of each part will be restricted to a maximum of 30 seconds on my laptop.

Disclaimer

It should be understood that the work you submit represents your own effort! You have not copied from anyone else. An exception is the Scala code I showed during the lectures or uploaded to KEATS, which you can freely use.

(^1) All major OSes, including Windows, have a commandline. So there is no good reason to not download Scala, install it and run it on your own computer. Just do it!

Reference Implementation

Like the c++ assignments, the Scala assignments will work like this: you push your files to GitHub and receive (after sometimes a long delay) some automated feedback. In the end we take a snapshot of the submied files and apply an automated marking script to them. In addition, the Scala assignments come with a reference implementation in form of ajar-file. This allows you to run any test cases on your own com- puter. For example you can call Scala on the command line with the option-cp docdiff.jarand then query any function from the template file. Say you want to find out what the functionoccurrencesproduces: for this you just need to prefix it with the object nameCW7a(andCW7brespectively fordanube.jar). If you want to find out what these functions produce for the listList("a", "b", "b"), you would type something like:

$ scala -cp docdiff.jar
scala> CW7a.occurrences(List("a", "b", "b"))
...

Hints

For Part 1: useful operations involving regular expressions:

reg.findAllIn(s).toList

finds all substrings insaccording to a regular regular expressionreg; useful list operations:.distinctremoving duplicates from a list,.countcounts the number of elements in a list that satisfy some condition,.toMaptransfers a list of pairs into a Map,.sumadds up a list of integers,.maxcalculates the maximum of a list.

For Part 2 + 3: use.split(",").toListfor spliing strings according to com- mas (similarlynn),.getOrElse(..,..)allows to querry a Map, but also gives a default value if the Map is not defined, a Map can be updated by using+, .containsand.filtercan test whether an element is included in a list, and respectively filter out elements in a list,.sortBy(_._2)sorts a list of pairs ac- cording to the second elements in the pairsthe sorting is done from smallest to highest,.take(n)for taking some elements in a list (takes fewer if the list contains less thannelements).

Part 1 (4 Marks, file docdiff.scala)

It seems source code plagiarismstealing and submiing someone elses code is a serious problem at other universities.^2 Detecting such plagiarism is time- consuming and disheartening for lecturers at those universities. To aid these poor souls, lets implement in this part a program that determines the similarity between two documents (be they source code or texts in English). A document will be represented as a list of strings.

Tasks

(1)Implement a function that cleans a string by finding all (proper) words in
this string. For this use the regular expressionnw+for recognising word
characters and the library functionfindAllIn. The function should re-
turn a document (a list of strings).
[1 Mark]
(2)In order to compute the overlap between two documents, we associate
each document with aMap. This Map represents the strings in a docu-
ment and how many times these strings occur in the document. A simple
(though slightly inefficient) method for counting the number of string-
occurrences in a document is as follows: remove all duplicates from the
document; for each of these (unique) strings, count how many times they
occur in the original document. Return a Map associating strings with
occurrences. For example
occurrences(List("a", "b", "b", "c", "d"))
producesMap(a -> 1, b -> 2, c -> 1, d -> 1)and
occurrences(List("d", "b", "d", "b", "d"))
producesMap(d -> 3, b -> 2). [1 Mark]
(3)You can think of the Maps calculated under (2) as memory-efficient rep-
resentations of sparse vectors. In this subtask you need to implement
the product of two such vectors, sometimes also called dot product of two
vectors.^3
For this dot product, implement a function that takes two documents
(List[String]) as arguments. The function first calculates the (unique)
strings in both. For each string, it multiplies the corresponding occur-
rences in each document. If a string does not occur in one of the docu-
ments, then the product for this string is zero. At the end you need to

(^2) Surely, Kings students, after all their instructions and warnings, would never commit such an offence. Yes? (^3) https://en.wikipedia.org/wiki/Dot_product

add up all products. For the two documents in (2) the dot product is 7,
because

(^1) |{z (^0) } a

  • (^2) |{z (^2) } b
  • (^1) |{z (^0) } c
  • (^1) |{z (^3) } d
= 7
[1 Mark]
(4)Implement first a function that calculates the overlap between two docu-
ments, sayd 1 andd 2 , according to the formula
overlap(d 1 ,d 2 )=
d 1 d 2
max(d^21 ,d^22 )
You can expect this function to return aDoublebetween 0 and 1. The
overlap between the lists in (2) is0.5384615384615384.
Second, implement a function that calculates the similarity of two strings,
by first extracting the substrings using the clean function from (1) and
then calculating the overlap of the resulting documents.
[1 Mark]

Part 2 (2 Marks, file danube.scala)

You are creating Danube.co.uk which you hope will be the next big thing in online movie renting. You know that you can save money by anticipating what movies people will rent; you will pass these savings on to your users by offering a discount if they rent movies that Danube.co.uk recommends. Your task is to generate two movie recommendations for every movie a user rents. To do this, you calculate what other renters, who also watched this movie, suggest by giving positive ratings. Of course, some suggestions are more popular than others. You need to find the two most-frequently suggested movies. Return fewer recommendations, if there are fewer movies suggested. The calculations will be based on the small datasets which the research lab GroupLens provides for education and development purposes.

https://grouplens.org/datasets/movielens/

The slightly adapted CSV-files should be downloaded in your Scala file from the URLs:

https://nms.kcl.ac.uk/christian.urban/ratings.csv (940 KByte)
https://nms.kcl.ac.uk/christian.urban/movies.csv (280 KByte)

The ratings.csv file is organised as userID, movieID, and rating (which is be- tween 0 and 5, with positive ratings being 4 and 5). The file movie.csv is organ- ised as movieID and full movie name. Both files still contain the usual CSV-file header (first line). In this part you are asked to implement functions that pro- cess these files. If bandwidth is an issue for you, download the files locally, but in the submied version useSource.fromURLinstead ofSource.fromFile.

Tasks

(1)Implement the functionget_csv_urlwhich takes an URL-string as argu-
ment and requests the corresponding file. The two URLs of interest are
ratings_urlandmovies_url, which correspond to CSV-files mentioned
above. The function should return the CSV-file appropriately broken up
into lines, and the first line should be dropped (that is omit the header of
the CSV-file). The result is a list of strings (the lines in the file). In case
the url does not produce a file, return the empty list.
[1 Mark]
(2)Implement two functions that process the (broken up) CSV-files from (1).
Theprocess_ratingsfunction filters out all ratings below 4 and returns
a list of (userID, movieID) pairs. Theprocess_moviesfunction returns a
list of (movieID, title) pairs.
[1 Mark]

Part 3 (4 Marks, file danube.scala)

Tasks

(3)Implement a kind of grouping function that calculates a Map containing
the userIDs and all the corresponding recommendations for this user (list
of movieIDs). This should be implemented in a tail recursive fashion us-
ing a Map as accumulator. This Map is set toMap()at the beginning of
the calculation. For example
val lst = List((" 1 ", "a"), (" 1 ", "b"),
(" 2 ", "x"), (" 3 ", "a"),
(" 2 ", "y"), (" 3 ", "c"))
groupById(lst, Map())
returns the ratings map
Map(1 -> List(b, a), 2 -> List(y, x), 3 -> List(c, a)).
In which order the elements of the list are given is unimportant.
[1 Mark]

(4)Implement a function that takes a ratings map and a movieID as argu- ments. The function calculates all suggestions containing the given movie in its recommendations. It returns a list of all these recommendations (each of them is a list and needs to have the given movie deleted, other- wise it might happen we recommend the same movie back). For exam- ple for the Map from above and the movie"y"we obtainList(List("x")), and for the movie"a"we getList(List("b"), List("c")). [1 Mark]

(5)Implement a suggestions function which takes a ratings map and a movieID as arguments. It calculates all the recommended movies sorted accord- ing to the most frequently suggested movie(s) sorted first. This function returns all suggested movieIDs as a list of strings. [1 Mark]

(6)Implement then a recommendation function which generates a maximum of two most-suggested movies (as calculated above). But it returns the actual movie name, not the movieID. If fewer movies are recommended, then return fewer than two movie names. [1 Mark]

发表评论

电子邮件地址不会被公开。 必填项已用*标注