MAST30034: Applied Data Science
Workshop 1
1 Introduction
The aim of this lab is to get you set up with the tools we will be using, and to give you a first look at the data set.
2 Setting Up Git
On the Subject LMS you will find a Resources section with a selection of screencasts showing how to set up your first repository, clone it, add files, and commit changes. Run through these yourself, creating your first repository to check that your software is set up correctly and that you are able to use Git. If you are already familiar with Git you can skip this step. If you would rather use the command line, you can view the associated Git PDF linked on the Wiki page.
3 Accessing Slack
Slack is an increasingly common communication tool used in business. We have set up a Slack channel for this subject to give you a chance to familiarise yourself with it. We will be monitoring it throughout, and using it more intensively during the group projects, when we will expect groups to use Slack as one of their primary ways of communicating as a team. You can join the MAST30034-2019 Slack channel by visiting the following link:
https://join.slack.com/t/mast30034-2019/sharedinvite/enQtNjg2OTMxMTk4NjEwLWExNjk0ZDEyNWYwMjY0OWE1ZTEyY2M3YTVlZWU5Y2NkMjY5ODIzNDAxNzRkZTBiNDM0MzljZDMzOGZmNjNjYWE
The workspace is available at: https://mast30034-2019.slack.com
Slack can be accessed via a web interface, or you can install an app on your laptop, mobile phone, or tablet. Use this time to get the software set up and configured, try posting a message, and familiarise yourself with the interface.
4 MSE Cloud Instance
A Virtual Machine will be provided on the MSE Cloud Infrastructure. Please complete the pre-lab and submit your public key via the LMS to have an instance created for you. Additional guides on accessing your instance are provided on the LMS weekly schedule page.
5 Azure for Students
Whilst not an official part of the subject, if you have not yet registered for the Azure for Students program (https://azure.microsoft.com/en-au/free/students/) you may want to consider doing so. It will give you one year of free access worth approximately $100 (USD). Azure has a number of pre-configured data science virtual machines, which may be useful for running large-scale analysis. You are not required to use Azure; you are free to use your own equipment and the facilities available at the University, but we are happy to assist in supporting your usage of Azure.
6 First Look at the Data
The full data set can be downloaded from https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page. We have downloaded a full copy of the data, with data from 2015 available from https://cloudstor.aarnet.edu.au/plus/s/vttepW9CpXoaK5O. Inside that folder is a misc folder that contains the data dictionary and taxi area map. Additionally, on the LMS (Sample Data link on the menu) there is a sample of the first 100,000 rows of the data for May 2015. This smaller size is useful when first starting out, as it can be loaded quickly into memory and allows initial processing without the heavy load of a full month's worth of data. Take a look at the data dictionary and the sample data file, and start to think about what attributes you might want to analyse and visualise. Using your tool of choice, try loading the data and producing a basic visualisation. An example for R is given below:
6.1 Initial Load in R and Visualisation
We recommend using RStudio, but you are free to use whatever tool you are most familiar with, and whatever language you prefer. RStudio can be downloaded from https://www.rstudio.com/
6.1.1 Install ggmap
For this example we will be using ggmap. The installation process has become more complicated recently as a result of changes in access to Google Maps. Further instructions on installation are given below.
6.1.2 Load Data
Download a copy of the sample data file and save it in a folder. In R, make that folder your current working directory using setwd("/home/yourname/NYCTaxiData"), giving the path where you saved the file.
Load the data using:
mydata = read.csv("100kyellow 2015 05.csv")
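Once the file is loaded, a quick inspection helps when deciding which attributes to analyse. The following is a minimal sketch only: it writes a tiny, invented data frame to a temporary CSV so that it runs without the real sample file, and it assumes column names (pickup_longitude, pickup_latitude, passenger_count, trip_distance) that should be checked against the data dictionary.

```r
# Toy rows standing in for the sample file; the values are invented for illustration
toy <- data.frame(
  pickup_longitude = c(-73.98, -73.95, -74.00),
  pickup_latitude  = c(40.75, 40.78, 40.72),
  passenger_count  = c(1L, 2L, 1L),
  trip_distance    = c(1.2, 3.4, 0.8)
)

# Write the toy rows to a temporary CSV so read.csv has something to load
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# Load and inspect, as you would with the real sample file
mydata <- read.csv(path)
str(mydata)                    # column names and types
summary(mydata$trip_distance)  # range and quartiles of one numeric column
colSums(is.na(mydata))         # count of missing values per column
```

With the real data, replace path with the name of the sample file in your working directory.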
6.1.3 Basic Visualisation
Access to Google Maps has recently changed. It now requires an API key that can only be obtained by registering a payment method with Google. A workaround has been found, although it requires installation of some additional packages.
Required Packages. Some additional operating system packages might be required, depending on the OS you are using. Instructions and sample scripts for installing the necessary packages are available from https://github.com/mtennekes/tmap#installation. For Ubuntu 18.04 the required packages can be installed with the following command:
sudo apt-get install libgdal-dev libgeos-dev libproj-dev libudunits2-dev libv8-dev libjq-dev libprotobuf-dev protobuf-compiler libssl-dev libcairo2-dev
R Packages. Install the following R packages:
install.packages("dplyr")
install.packages("sf")
install.packages("curl")
# Restart your R session
install.packages("tmap")
ggmap. On some operating systems the official release of ggmap works fine, and can be installed as follows:
install.packages("ggmap")
If you receive any errors when running the example, the first thing to try is the development release, installed from GitHub:
install.packages("devtools")
devtools::install_github("dkahle/ggmap")
# Restart your R session
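Before moving on, it can help to confirm which of the packages named above are actually installed. The sketch below uses only base R; the variable names pkgs and not_installed are illustrative, not part of the workshop material.

```r
# Packages named in the installation steps above
pkgs <- c("dplyr", "sf", "curl", "tmap", "ggmap")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need installing
not_installed <- pkgs[!installed]
print(not_installed)
```

An empty result, character(0), means every listed package was found.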
Downloading the Map Image. The first stage is to download the map image. This can involve a bit of trial and error in setting the place and zoom. For this example the following will suffice:
library(ggmap)
library(tmaptools)  # provides geocode_OSM
map <- get_stamenmap(rbind(as.numeric(paste(geocode_OSM("Manhattan")$bbox))), zoom = 11)
This downloads a map of Manhattan from Stamen at zoom level 11. Try experimenting with different zoom levels. To view the map, type:
ggmap(map)
and look in the Plots tab in RStudio. A manual covering the use of ggmap is available at https://cran.r-project.org/web/packages/ggmap/ggmap.pdf
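The bounding box passed to get_stamenmap is a set of four numbers in left, bottom, right, top order (minimum longitude, minimum latitude, maximum longitude, maximum latitude). Points that fall outside the box cannot be drawn on the map, so it can be useful to see how many rows lie inside it. The sketch below uses base R only; the bounding box values and the toy rows are invented for illustration and are not taken from geocode_OSM or the sample data.

```r
# Hypothetical bounding box roughly around Manhattan; illustrative values only
bbox <- c(left = -74.05, bottom = 40.68, right = -73.90, top = 40.88)

# Toy pickup locations, invented for illustration
pickups <- data.frame(
  pickup_longitude = c(-73.98, 0, -73.95, -73.70),
  pickup_latitude  = c(40.75, 0, 40.78, 40.65)
)

# TRUE for rows whose coordinates fall inside the bounding box
inside <- with(pickups,
  pickup_longitude >= bbox["left"]   & pickup_longitude <= bbox["right"] &
  pickup_latitude  >= bbox["bottom"] & pickup_latitude  <= bbox["top"])

pickups_in_box <- pickups[inside, ]
nrow(pickups_in_box)
```

Here two of the four toy rows fall inside the box. The same test could be applied to mydata to check how much of the sample a given map covers.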
We are now going to plot the pick-up locations from the sample data set onto the map. To do that use the following command:
ggmap(map) + geom_point(aes(x = pickup_longitude, y = pickup_latitude), colour = "white", size = 0.01, data = mydata, alpha = .5)
The above command overlays points onto our map image using geom_point. We reference the x coordinate as the pickup_longitude field, and the y coordinate as the pickup_latitude field.
- aes – this is the aesthetic mapping. Do not worry too much about it for the moment; it becomes important when you are mapping more complex values, in particular when adding a third dimension.
- colour – sets a constant colour for the points, i.e. all points will be this colour
- size – the size of the points; because this is outside the aes function, it is a constant size
- data – the data we want to use to create the plot
- alpha – makes the points slightly transparent so that some of the underlying map is still visible
If everything has gone well you should end up with a plot that looks something like this:
Figure 1: Sample Plot
The above visualisation is just an example of using the tools. As a data visualisation in itself, it is not particularly useful. We will discuss this further in next week's lecture. You can use the remaining time to start exploring what can be done with ggmap, or equivalent plotting tools, and to start exploring the data set.
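As one starting point for that exploration, the difference between a constant setting and an aes mapping, described in the parameter list earlier, can be sketched with ggplot2 alone (the plotting package ggmap builds on). The data frame below is invented for illustration, and passenger_count is an assumed column name that should be checked against the data dictionary.

```r
library(ggplot2)

# Toy data, invented for illustration
toy <- data.frame(
  pickup_longitude = c(-73.98, -73.95, -74.00, -73.97),
  pickup_latitude  = c(40.75, 40.78, 40.72, 40.76),
  passenger_count  = c(1, 2, 1, 4)
)

# Constant settings: colour, size and alpha sit outside aes(),
# so every point gets the same value
p_constant <- ggplot(toy) +
  geom_point(aes(x = pickup_longitude, y = pickup_latitude),
             colour = "black", size = 2, alpha = 0.5)

# Mapped setting: colour sits inside aes(), so it varies with a data column
# and adds a third dimension to the plot
p_mapped <- ggplot(toy) +
  geom_point(aes(x = pickup_longitude, y = pickup_latitude,
                 colour = passenger_count),
             size = 2, alpha = 0.5)
```

Printing p_constant or p_mapped draws the plot. Replacing ggplot(toy) with ggmap(map) and passing data = toy to geom_point should place the same layer over the map image.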
