DATA MANAGEMENT AND VISUALIZATION ASSIGNMENT 4
UNIVERSITY OF VICTORIA
Due: Thursday, March 30th, 2017 in class. Late assignments will not be accepted.
Submit your answers to the questions below on paper by the deadline above. Addition- ally, you must upload a copy of all code you use (SQL, Python, R, Excel, LibreOffice, etc.) to conneX by 11:59pm on March 30th. All code submissions must be your own work. However, you are permitted to use the code discussed in lectures (or code posted on conneX) as the basis for your submissions if proper attribution is given.
Question 1: BC Ferries Data Visualizations [7 marks]
The questions below use the a3_28days.sql dataset from assignment 3. For all of the visualizations in this section, you may use any software you want (Python, R, Excel, LibreOffice, etc.), but you must hand in all of the source files and intermediate data you use. Additionally, you will lose marks if your visualizations are not well-formatted, with axis labels, titles, a legend, good choices for axis ranges and histogram bins (where applicable).
(a) The query below generates the duration of every sailing on route 1.
select (strftime(‘%s’,arrival)-strftime(‘%s’,scheduled_departure))/60 as duration_minutes from sailings where route_number = 1;
Create a histogram of the sailing times for route 1. Choose a set of bins which makes the distribution clear.
(b) Create a histogram of the sailing times for route 30 in the same format as the histogram above.
(c) The query below computes the average sailing time per day in February for each route.
select
route_number,
strftime(‘%d’,scheduled_departure) as day, avg((strftime(‘%s’,arrival)-strftime(‘%s’,scheduled_departure))/60) as avg_duration_minutes from sailings
where strftime(‘%m’,scheduled_departure) = ’02’
group by route_number, day
order by route_number, day;
Construct a line plot with the average sailing time per day for each of routes 1, 2 and 30 (on
the same plot). You will receive a zero if more than one plot is used. To separate the data by
route, you will need to either modify the query above (and make three variations) to produce separate tables for the three routes, or use np.where to filter the result of the query above.
Question 2: Regular Expressions [6 marks]
Design regular expressions to match the classes of text described below. You will lose marks if
your expressions matches anything not within the described class or if it does not match anything
1
which is. You can use regular parentheses () to delineate groups if necessary (there is no need to indicate non-capturing groups).
(a) Street addresses, which consist of a number, possibly followed by exactly one letter (as in ‘1234a Main St.’), followed by a street name, which must be one word long and capitalized. The street name is then followed by a capitalized extension, which must end in a period (e.g. ‘St.’, ‘Rd.’, ‘Ave.’, etc.). Any number of spaces (but at least one) may appear between the number and the street name or the street name and the extension.
(b) Sequences of lowercase letters (not necessarily forming an English word) which contain the letter ‘A’ at most 3 times.
(c) Sequences of lowercase letters (not necessarily forming an English word) with an even number of characters (0, 2, 4, 6, etc.) and where every second character is a vowel, starting with the first character. The other characters may any letter (including vowels). For example, sequences like ‘at’, ‘away’, ‘economic’, ‘avocados’ and ‘onomatopoeia’ should match, but sequences like ‘canoe’, ‘bore’, ‘omega’ and ‘avocado’ should not.
Question 3: Technological Literary Analysis and Deconstruction [11 marks]
Project Gutenberg (https://www.gutenberg.org) maintains a large database of public domain eBooks. Most of the books in their collection are in the public domain due to their age (since the copyright on literary works expires after a certain number of years since the death of the author). These books are useful for analysing the patterns found in English text.
A sign-up sheet has been posted to conneX, containing a selection of works from the Project Gutenberg repository. For this question, you must choose one of the books in the list and reserve it on conneX. Only one student is allowed to choose each book, and you will receive zero marks on this question if you analyse a book other than the one you have reserved on the online system. Since the analysis does not require actually reading the book1, the length or genre of the book should not affect the difficulty of this question.
Once you have chosen a book, follow the link from the signup sheet and download the UTF-8 text version. This will be your input data. You should not modify this file at all (leave all of the Project Gutenberg preamble in place).
The purpose of this question is to analyse the length and quantity of English words used in your chosen book. An “english word” is defined to be any sequence of characters that meets either of the following criteria.
• A sequence of one or more letters, which may be uppercase or lowercase in any combination.
• A sequence of one or more letters followed by exactly one hyphen (the ‘-’ character), followed by a sequence of one or more letters. The letters may be uppercase or lowercase in any
combination.
You can use regular expressions to extract all of the words from the downloaded text file. If you print all of the extracted words to a text file (one per line), you can treat the resulting file as a CSV spreadsheet (with one column) and import it into SQLite (as a table with one column). You can then perform analysis on the lengths or frequencies of words and export the resulting data for visualization.
1. Although some of the books are certainly worth reading…
2
In your submission, answer the questions below for your chosen book (and be sure to clearly indicate what book you chose).
(a) Generate a list of the 25 words that appear most frequently, along with their counts, ranked in descending order. This can be done with a simple aggregation-based SQL query. The result should be case-sensitive (so ‘The’ should be considered to be a different word than ‘the’). For Jane Austen’s Pride and Prejudice (which is not available as a choice for this question), the 25 most frequently occurring words are as follows.
word occurrences
———- ———–
the 4218
to 4163
of 3707
and 3493
her 2138
I 2070
a 1955
was 1844
in 1836
that 1540
not 1419
she 1384
it 1301
be 1255
you 1201
his 1190
had 1152
as 1134
he 1104
with 1066
for 1060
is 861
have 835
Mr 786
him 764
(b) Find the longest word in the book. In SQLite, you can use the length() function to get the length of a text field (for example, select word, length(word) from word_list; will list each word and its length). The length function can also be used for aggregation or WHERE clauses. For Jane Austen’s Pride and Prejudice, the longest word is ‘respectable-looking’
(c) Generate a table consisting of the number of words of each length. For Pride and Prejudice, the table is
word_length words_with_that_length
———– ———————-
1 4809
2 23900
3
3 28961
4 22163
5 12013
6 9284
7 8670
8 5126
9 5163
10 2490
11 1428
12 917
13 432
14 123
15 41
16 17
17 8 19 1
(d) Plot a histogram of the number of words of each length (that is, a bar chart of the data from part (c). You may use any visualization software, and your result should look superficially similar to the chart below.
30000
25000
20000
15000
10000
5000 0
Words in Jane Austen’s “Pride and Prejudice” Length vs. Frequency
1 2 3 4 5 6 7 8 9 10111213141516171819 Word Length (# characters)
Submission Instructions
Your primary submission should be a hard copy document containing the answers to all of the above questions. In addition to the hard copy, you must submit all code (and intermediate data such as CSV files or spreadsheets) to conneX.
The usual rules for electronic submissions apply. You are permitted to delete and resubmit your assignment as many times as you want before the due date, but no submissions or resubmissions will be accepted after the due date has passed.
Ensure that each code file contains a comment with your name and student number. If you do not submit the code files needed for a part of the assignment, it will not be possible to mark that part
4
Number of Occurrences
of your submission and you will receive a mark of zero.
After submitting your assignment, conneX will automatically send you a confirmation email. If you do not receive such an email, your submission was not received. If you have problems with the submission process, send an email to the instructor before the due date.