
Spark COVID-19 (1_spark_covid_19)


In [0]:

!pip install kaggle
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

# install Spark
# note: if this release has been archived, it may live under
# https://archive.apache.org/dist/spark/ instead
!wget -q https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz > /dev/null

# provides findspark.init() to make PySpark importable as a regular library
os.environ["SPARK_HOME"] = "spark-2.4.5-bin-hadoop2.7"
!pip install -q findspark
import findspark
findspark.init()

# The entry point to using Spark SQL is an object called SparkSession. It
# initiates a Spark application which all the code for that session will run on.
# To learn more, see https://blog.knoldus.com/spark-why-should-we-use-sparksession/
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .master("local[*]")
         .appName("Learning_Spark")
         .getOrCreate())
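Once the cell above has run, a quick sanity check (an addition, not part of the original cell) is to print the session's version:

print(spark.version)  # should print 2.4.5 for the install above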

Set Up Kaggle Access

To use the Kaggle API, sign up for a Kaggle account at https://www.kaggle.com. Then go to the 'Account' tab of your user profile (https://www.kaggle.com/<username>/account) and select 'Create API Token'. This will trigger the download of kaggle.json, a file containing your API credentials. Open this file in an editor and then enter the username and key in the form below.

In [0]:

from getpass import getpass
username = getpass('Enter username here: ')
key = getpass('Enter api key here: ')

In [0]:

# set up kaggle configuration
api_token = {"username": username, "key": key}
!mkdir -p /root/.kaggle
import json
import zipfile
import os
with open('/root/.kaggle/kaggle.json', 'w') as file:
    json.dump(api_token, file)
!chmod 600 /root/.kaggle/kaggle.json

In [0]:

# download the kaggle data.
# note: to be able to download this data, you might have to verify your account
# with kaggle and visit the competition page at
# https://www.kaggle.com/c/covid19-global-forecasting-week-2/data first.
# learn more about this competition at
# https://www.kaggle.com/c/covid19-global-forecasting-week-2/data
!kaggle competitions download -c covid19-global-forecasting-week-2

In [0]:

# the command above should have downloaded three files; we are interested in
# test.csv and train.csv
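Depending on the Kaggle CLI version, the competition files may arrive zipped (e.g. train.csv.zip); since zipfile is already imported above, a minimal, hedged sketch to unpack whatever archives landed in the working directory:

import glob
import zipfile

# extract any zip archives the kaggle command produced
# (a no-op if the CSVs arrived unzipped)
for archive in glob.glob('*.zip'):
    with zipfile.ZipFile(archive) as zf:
        zf.extractall('.')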

Task 1: Open the files, print the schema, and count the number of lines of the test and train files. Print the number of cases in India [20 points]

use spark.read.csv
use the count function
use groupBy and then filter to select the rows for a specific country

In [0]:

# write code to open the file and count it – 10 points
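One possible sketch for this cell, using spark.read.csv as hinted; header and inferSchema are standard options of that reader:

# read both files, letting Spark infer column types from the data
train = spark.read.csv('train.csv', header=True, inferSchema=True)
test = spark.read.csv('test.csv', header=True, inferSchema=True)

# print the inferred schema of each DataFrame
train.printSchema()
test.printSchema()

# count the number of rows in each file (excluding the header)
print('train rows:', train.count())
print('test rows:', test.count())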

In [0]:

# write code to group by and filter – 10 points
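A hedged sketch of the group-by-and-filter step; the column names Country_Region and ConfirmedCases are assumptions based on the week-2 competition files and should be checked against the printed schema:

from pyspark.sql import functions as F

# total confirmed cases per country, then keep only India
# (note: the competition's case counts are cumulative per date, so a plain
# sum over all dates overstates the total; it still demonstrates
# "group by, then filter")
cases_by_country = (train.groupBy('Country_Region')
                         .agg(F.sum('ConfirmedCases').alias('TotalCases')))
cases_by_country.filter(F.col('Country_Region') == 'India').show()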

Task 2: Get Data per Month for a Specific Country [ points]

In [0]:

# the code for task 2 comes here
# write a map and reduce function with Spark to count all Fatalities cases in
# Australia in January, February and March, then plot the trend
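A hedged sketch of the map and reduce step over the RDD API; Country_Region, Date, and Fatalities are again assumed week-2 column names, and str() is used so the month can be sliced out whether Date was read as a string or as a timestamp:

import matplotlib.pyplot as plt

# map each Australian row to (month, fatalities), then reduce by key
monthly = (train.rdd
           .filter(lambda r: r['Country_Region'] == 'Australia')
           .map(lambda r: (str(r['Date'])[5:7], float(r['Fatalities'] or 0)))
           .reduceByKey(lambda a, b: a + b)
           .sortByKey()
           .collect())

# plot the January-March trend
months, counts = zip(*monthly)
plt.plot(months, counts, marker='o')
plt.xlabel('month')
plt.ylabel('fatalities')
plt.title('Fatalities in Australia by month')
plt.show()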

Task 3: Implement the Power of Elapsed Time Model in Spark [40 points]

number of cases ≈ (time elapsed since first infection)^C, where C is a constant that depends on the size of the region (country), and ranges from around 2 to 9, depending strongly on latitude (affected by temperature).

See the existing work at https://www.kaggle.com/robertmarsland/covid-19-spreads-as-power-of-elapsed-time to start.

In [0]:

# the code for task 3 comes here.
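As a starting point, a hedged sketch that estimates the exponent C for one country by least squares on the log-log form of the model, log(cases) = C * log(t) + const; the column names are again week-2 assumptions:

import numpy as np
from pyspark.sql import functions as F

# cumulative confirmed cases per date for one country
rows = (train.filter(F.col('Country_Region') == 'India')
             .groupBy('Date')
             .agg(F.sum('ConfirmedCases').alias('cases'))
             .orderBy('Date')
             .collect())

# keep the days after the first infection and index them as t = 1, 2, ...
cases = [float(r['cases']) for r in rows if r['cases'] > 0]
t = np.arange(1, len(cases) + 1)

# fit log(cases) = C * log(t) + log(a); the slope of the fit is C
C, log_a = np.polyfit(np.log(t), np.log(cases), 1)
print('estimated exponent C:', C)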