Assignment 3: Deduplication

Beatrix Jones


Consider the file articles.csv, which contains information for the purposes of conducting a systematic review. A systematic review aims to capture all articles about a given topic by conducting a specific search in several databases, and then pooling the results. As most article databases (e.g. Scopus, Web of Science) have significant overlap, an important step is to deduplicate the pooled results. Because the goal is to capture all articles, it is important that articles that are not duplicates are not removed. This is complicated by the fact that there are frequently formatting changes between the different records (capitalization, punctuation, abbreviations, etc.). The following criterion has been agreed by the people conducting the review: two articles will be considered identical if they have the same title (disregarding changes in spacing, capitalization, and punctuation), the same year of publication, and the same first 6 characters of the first author's name (again stripping capitalization, spacing, and punctuation).
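The agreed matching rule can be sketched in base R as a comparison key. This is a minimal illustration only: the function names normalise and dedup_key and the example title and author are invented here, and keeping only a-z and 0-9 is an assumption that would also drop accented letters.

```r
# Normalise a string for matching: lower-case it and drop everything
# that is not an ASCII letter or digit (spaces and punctuation included).
# Assumption: accented characters are not needed for matching.
normalise <- function(x) {
  gsub("[^a-z0-9]", "", tolower(x))
}

# Build the comparison key from title, year and first author.
# Two records count as duplicates when their keys are identical.
dedup_key <- function(title, year, first_author) {
  paste(normalise(title),
        year,
        substr(normalise(first_author), 1, 6),
        sep = "|")
}

# Invented example records that differ only in formatting.
a <- dedup_key("Diet and Health: A Review", 2015, "O'Connor, J.")
b <- dedup_key("diet and health - a review", 2015, "OConnor J")
identical(a, b)
```

Both calls reduce to the same key, so the two records would be treated as one article under the rule above.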

We will not do the full deduplication in this assignment. There is a subproblem that must be solved first (although the context above is important). There is a subset of articles that each have a line in articles.csv but have very little information filled in. In particular, they have no title, making the deduplication strategy tricky. However, much of the information we might want for our strategy is embedded in the url field. Your task is to find the articles with no title, and repopulate any missing fields needed for the deduplication strategy using the url information. Note that because this will be used for the process of deduplication, retaining things like punctuation is not important. You will hand in either a PDF file containing everything, or an HTML file and the .Rmd file used to produce it. You should also hand in articlesNew.csv, which has just the improved version of the records you have altered. It should have the same number of columns (and the same headings) as articles.csv.
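As a rough sketch of the shape of this task, the base R code below flags records with a blank title and fills fields from a URL query string. Everything specific here is an assumption made for illustration: the toy data frame, the column names, the example.org URL, and the parameter names title, date and aulast are hypothetical, and the real url layout in articles.csv may differ and must be inspected first.

```r
# Toy stand-in for articles.csv (hypothetical columns and values).
articles <- data.frame(
  title  = c("A titled record", ""),
  year   = c("2015", ""),
  author = c("Lee, A.", ""),
  url    = c("", "http://example.org/resolve?title=Diet%20and%20health&date=2015&aulast=Smith"),
  stringsAsFactors = FALSE
)

# Rows whose title is missing or blank.
no_title <- is.na(articles$title) | trimws(articles$title) == ""

# Pull one named parameter out of a URL query string, or NA if absent.
get_param <- function(url, name) {
  query <- sub("^[^?]*\\?", "", url)                  # drop everything up to "?"
  pairs <- strsplit(query, "&", fixed = TRUE)[[1]]    # split into name=value pairs
  hit <- pairs[startsWith(pairs, paste0(name, "="))]
  if (length(hit) == 0) return(NA_character_)
  utils::URLdecode(sub("^[^=]*=", "", hit[1]))        # decode %20 and similar
}

# Apply get_param to a vector of URLs.
fill_from_url <- function(urls, name) {
  vapply(urls, get_param, character(1), name = name, USE.NAMES = FALSE)
}

articles$title[no_title]  <- fill_from_url(articles$url[no_title], "title")
articles$year[no_title]   <- fill_from_url(articles$url[no_title], "date")
articles$author[no_title] <- fill_from_url(articles$url[no_title], "aulast")

# Write only the altered records, keeping the original column headings.
# A temporary directory is used here so the sketch has no lasting side effects.
write.csv(articles[no_title, ], file.path(tempdir(), "articlesNew.csv"),
          row.names = FALSE)
```

Selecting the altered rows with the same logical index used to fill them keeps the output file limited to the improved records, with the same columns as the input.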

1. (20 marks) Write code to accomplish the task using the basic text functions we have learned about. You should explain and demonstrate the code; you will lose marks if you do not explain AND demonstrate. The code must work and be understandable, but does not need to have undergone refactoring.
2. (10 marks) Assuming you will need to perform this task for other files in the future, make a list of things about your code that you would address during a refactoring process. You do not need to make these changes, just describe the sort of things you would do. Significantly poor R code that is not mentioned in this section will lose marks.
3. (10 marks) How does the intended deduplication process inform the choices you have made? Explain. (Another way to think of this question: would you do anything differently if you were extracting this information for a different purpose, e.g. to appear in a bibliography?)