TERM PROJECT: SEATTLE U. FILE SYSTEM (SUFS)
CPSC 4910/5910 CLOUD COMPUTING
In this project, your team will develop the Seattle University File System (SUFS). It is basically a clone of HDFS (the Hadoop Distributed File System) that runs inside AWS, using EC2 instances to form the cluster. You will write the code and deploy it on EC2 instances.
You may implement this project in any (not obscure) language you choose except Java. (Why not Java? HDFS is an open source project that is implemented in Java. This makes it harder to cheat by simply copy-pasting out of the public code. Note that it is still cheating, however, to copy-and-translate to another language! But it is okay for this assignment if you want to look at the open source code to understand how it works; just create your own implementation of it then.) That said, you may want to look at this link and see what languages have SDKs (software development kits) for AWS.
The first thing you need to do is understand what HDFS is, what it does, and how it works. Read the following two web pages (it doesn’t matter what order you read them in).
Feel free to explore the hadoop / HDFS documentation further, or to google for more information as well.
WHAT TO DO
Your team will implement SUFS, the HDFS clone, including DataNodes, a NameNode, and a simple client. You will deploy the nodes on EC2 instances. (The client should, in theory, be able to run anywhere, but for demo & grading purposes, it's okay to run it on an EC2 instance as well.) The easiest way to implement the client is probably to make it a command-line tool. More specific details about what functionality you do and do not have to implement are later in these instructions.
You will also write a brief document describing your design. The official length is “as long as necessary”, but I expect about 2-4 pages, including figures. (If yours is significantly longer than 4 pages, then you’re probably either giving more specific detail than I’m expecting or repeating too much general information about HDFS that I already know.) Your document should describe the architecture of your SUFS, although feel free to reference the HDFS architecture for aspects that are the same (or to contrast things that are different), and should also include the following information:
- Each system component will need some interface that the other components can use to interact with it. What are the APIs / protocols you designed? List each API call or message that you are using, its parameters / fields, and what it does. Also describe the interactions (i.e., when & how each API call or message will be used). You should document how the following components will interact:
o client – NameNode
o client – DataNode
o DataNode – NameNode
- What technologies or tools are you incorporating into your project, and how are they being used?
o You don't need to list development tools, just things that will be part of the SUFS design. If you do choose to list development tools as well, then please make that a separate list.
- What system parameters did you choose? i.e., What are the…
o block size?
o replication factor?
o any other system parameters you had to choose a value for?
- Anything else I need to know to understand your design and how your system is going to work
o but don't repeat too much information about HDFS from the web pages above or that was covered in class – you may assume the reader already knows these basics; your report should instead be about the next level of detail in your design
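To illustrate the level of detail expected for the API documentation, here is a hedged sketch (in Python) of how a couple of client–NameNode messages might be defined as typed structures. All names, fields, and types here are hypothetical – your actual calls, parameters, and transport are entirely up to you.

```python
from dataclasses import dataclass, field

# Hypothetical client -> NameNode request: ask the NameNode to create a
# file and plan where its blocks will go.
@dataclass
class CreateFileRequest:
    path: str          # absolute path of the new file in SUFS
    size_bytes: int    # total file size, so the NameNode can plan blocks

# Hypothetical NameNode -> client response: one entry per block, listing
# the DataNodes that should receive replicas of that block.
@dataclass
class BlockAssignment:
    block_id: str
    datanodes: list = field(default_factory=list)  # replica target addresses

@dataclass
class CreateFileResponse:
    assignments: list = field(default_factory=list)  # one BlockAssignment per block
```

Whatever form your messages take (JSON over HTTP, Thrift structs, etc.), documenting each one at this level – name, fields, direction, and when it is sent – is what the design report should capture.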
WHAT TO SUBMIT & DEMO
You should write your report in either MS Word or PDF format. Then archive/compress your report, all source code, supporting build files (e.g., Makefiles, etc.) into either a .zip, .tar, .gz, or .tgz file and upload it to Canvas as your submission. (Only make one submission per team.)
For your demo, you will show your system running in AWS using EC2 instances. You will demonstrate normal behavior of creating/writing and reading files. Then you will demonstrate abnormal behaviors you have accounted for, such as DataNodes crashing (which you can simulate by terminating/stopping that EC2 instance).
TOOLS YOU MAY USE
You might need to programmatically control EC2, S3, or other AWS services. This is possible using the AWS APIs. There are also convenient SDKs available in a number of different languages. Here is the general EC2 (first link) and S3 (second link) API documentation… and you already saw the link for the SDKs above.
There are a number of AWS Developer Tools that you may find useful when working in a group. (This isn't a requirement; just an option that I'm making you aware of.) For example, AWS CodeCommit is a source control repository (similar to GitHub) that can be accessed using the git client. Documentation about developer tools can be found at these links:
Finally, you are free to implement the communication between nodes and with the client using any reasonable mechanism you wish. You may design and implement a network protocol that runs on top of TCP/IP (although this is probably not the easiest option), or you may use some middleware to help you implement an API that remote systems can access. An API can be exposed as a web service, some other RESTful interface, via .Net Remoting, or using an RPC-like system (such as Apache Thrift) – how you choose to do it and the exact software you use to facilitate it is up to you.
The exact tools you choose to use are up to you. These are just some suggestions. But you should run some tests with your chosen tools early in the project to make sure they're going to work the way you think they will, and that you won't run into problems down the line, after you're too committed to the chosen tool to change your mind.
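As one concrete option among those mentioned above, here is a minimal sketch of exposing a NameNode API as a plain HTTP web service using only the Python standard library. The endpoint path, query parameter, and JSON payload are invented for illustration; a real design would define its own API.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import urlparse, parse_qs

# Toy in-memory metadata standing in for the NameNode's real state.
BLOCK_MAP = {"/data/file1": ["blk-0", "blk-1"]}

class NameNodeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Tiny router: expect GET /blocks?path=<sufs path>
        parsed = urlparse(self.path)
        if parsed.path == "/blocks":
            path = parse_qs(parsed.query).get("path", [""])[0]
            body = json.dumps({"blocks": BLOCK_MAP.get(path, [])}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # silence per-request logging for this sketch

def start_namenode(port=0):
    """Start the server on a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), NameNodeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A middleware framework (Flask, gRPC, Thrift, etc.) would reduce this boilerplate considerably; the point is only that any reasonable request/response mechanism between nodes is acceptable.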
In general, you should mimic the functionality and design of HDFS. Read and understand the HDFS documentation carefully to understand what those are. (Do pay attention to the “Assumptions and Goals” section!) You DO need to do the following:
- implement a NameNode, DataNodes, and a client program
- implement creating & writing a file (as a single operation)
o the client program should read the file data out of an S3 file and then write the data into SUFS
o you do need to divide the file into blocks that will be stored in SUFS; do not store files as a single chunk (unless they are smaller than the block size) – you may choose the block size for your SUFS
- implement reading a file
o the client program can take the file it read out of SUFS and write it to a file in the local filesystem
- implement DataNode fault tolerance
o if up to N-1 DataNodes fail concurrently, the system should continue to operate without data loss
o new block replicas should be created when the number of block replicas falls below N – you may choose the N (replication factor) for your SUFS
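The block-division and replica-placement requirements above can be sketched as follows. The block size, replication factor, and round-robin placement policy here are illustrative choices, not required values – you pick your own parameters and placement strategy.

```python
import itertools

# Illustrative parameters -- you choose your own values for your SUFS.
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, an HDFS-style default
REPLICATION = 3                 # the "N" in the fault-tolerance requirement

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Divide raw file bytes into fixed-size blocks (last one may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, datanodes: list, n: int = REPLICATION):
    """Round-robin placement: each block gets n distinct DataNodes."""
    placements = []
    ring = itertools.cycle(range(len(datanodes)))
    for _ in range(num_blocks):
        start = next(ring)
        chosen = [datanodes[(start + k) % len(datanodes)] for k in range(n)]
        placements.append(chosen)
    return placements
</ ```

Note that a file smaller than the block size yields exactly one (short) block, and a file that is an exact multiple of the block size yields no trailing short block – two of the edge cases you are asked to test later.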
However, here are some details that you do not need to worry about:
- NameNode failure or Metadata Disk Failure
o only worry about DataNodes failing
- Deleting files, or Undeleting and the /trash folder
- Staging and Replication Pipelining
o that is, you may have the client contact each DataNode to store the block instead of contacting just the first one and pipelining the writes
- Public Access
o Your NameNode & DataNodes do not need to be accessible from outside AWS – so it’s okay if your client only works when run from inside AWS (e.g., on another EC2 instance) and within the same Security Group. This could happen, for example, if you use the “private IP” instead of “public IP” for the nodes, or if you have Security Group settings that block access from outside the Security Group.
- Authentication or Authorization
o All files stored in SUFS are accessible to anyone who can connect to the NameNode & DataNodes.
- Configurable/Variable Block Sizes or Replication Factors
o You may choose a block size and a replication factor and hard-code them into your system. You do not need to allow these to be configurable, nor to allow them to be different for different files.
- Rack Awareness
o Don’t worry about being “rack aware” – just treat all nodes as equal.
- Building the Directory System like a Real File System
o You don’t have to use i-nodes and such to build the directory system on top of raw blocks (although you may if you choose to).
o The directory structure is stored only on the NameNode, and you may store it in any way you want, so long as it works.
o But you do need to make sure you store it on the NameNode's disk and not only in memory. (If the NameNode gets rebooted, for example, it should be able to just pick up where it left off.)
- Using an EditLog on the NameNode
o You do not need to use the logging mechanism described for the NameNode.
o You do not need to create files named EditLog and FsImage, but may store the NameNode data on the disk however you want and using whatever filenames (or structure) you want.
- Cluster Rebalancing
o Make a reasonably intelligent decision when you place file blocks initially, and then just leave them in that location from then on (unless necessary for fault recovery purposes).
- Checksums
o You don’t need to checksum blocks as described in the “Data Integrity” section of the HDFS documentation.
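Since the directory metadata must survive a NameNode reboot but no EditLog/FsImage format is required, one simple approach is to serialize the NameNode's metadata to a single JSON file on disk. The sketch below is one possible scheme, not a required one; the filename and layout are arbitrary as long as a restarted NameNode can pick up where it left off.

```python
import json
import os
import tempfile

# Sketch: persist the NameNode's file -> block-list metadata as JSON on
# disk, so a rebooted NameNode can reload it. Any format works; the only
# requirement is that the metadata is not held solely in memory.

def save_metadata(metadata: dict, path: str):
    # Write to a temp file, then atomically rename over the target, so a
    # crash mid-write cannot leave a half-written metadata file behind.
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(metadata, f)
    os.replace(tmp, path)

def load_metadata(path: str) -> dict:
    if not os.path.exists(path):
        return {}   # fresh NameNode with no files yet
    with open(path) as f:
        return json.load(f)
```

The write-temp-then-rename pattern is a common way to get crash-safe saves without implementing a real EditLog.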
It’s up to you how you want to design your client program. A GUI is not required. If you want to keep it simple then you can just make a command line tool. However you do it, it should support the following operations:
- Create a new file in SUFS
o when the file is created, an S3 object should be specified and the data from S3 should be written into the file
- Read a file
o the file will be read from SUFS and a copy of the file is returned
- List the DataNodes that store replicas of each block of a file
o be sure to keep the output from this somewhat neat – it could get long for large files (i.e., many blocks) – one suggestion is to output a separate line for each block and list the DataNodes for that block on that line
Note that you do not have to support the concept of a “current directory” or relative paths; you may instead require entering a complete absolute path every time.
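If you go the command-line route, the three required operations map naturally onto subcommands. Here is a skeleton of such an interface using Python's argparse; the program name, subcommand names, and argument names are all illustrative.

```python
import argparse

# Skeleton of a command-line client supporting the three required
# operations: create (from S3), read (to a local file), and list
# (DataNodes per block). Hook your actual SUFS logic onto each branch.
def build_parser():
    parser = argparse.ArgumentParser(prog="sufs")
    sub = parser.add_subparsers(dest="command", required=True)

    create = sub.add_parser("create", help="create a file from an S3 object")
    create.add_argument("s3_object")     # e.g. an S3 bucket/key identifier
    create.add_argument("sufs_path")     # absolute SUFS path (no relative paths)

    read = sub.add_parser("read", help="read a file out of SUFS")
    read.add_argument("sufs_path")
    read.add_argument("local_path")      # where to write the local copy

    lst = sub.add_parser("list", help="list DataNodes per block of a file")
    lst.add_argument("sufs_path")
    return parser
```

Requiring an absolute path argument everywhere, as shown, sidesteps the "current directory" concept entirely.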
Since SUFS is intended to store large files, you can use the data files that you used in the MapReduce assignment. These are conveniently already in S3, so your client program can just pull them from there.
Be sure you test at least one really large file (> 1 GB) and at least one file small enough to fit in a single block. Also try at least one file that is an exact multiple of the block size and at least one file that isn't. Be sure the data you get from a read is bit-for-bit identical to the file that was initially written in.
Note: There is a fee for all data sent into and out of the AWS data center (which is small and normally not worth worrying about, but could add up quickly when we’re talking about many GB of file data). However, there is no fee for moving data around within AWS… So copying data from S3 to an EC2 instance avoids this fee, whereas downloading a file from S3 to your computer then uploading it from your computer to EC2 will hit you with an outgoing data fee and an incoming data fee. (Also, different ‘regions’ are considered different data centers, so fees apply between regions, but not when moving data around within a region. For this class, we’ll stick to the us-west-2 Oregon region, so that should be okay.)
Of course, be sure to test not only the normal operation and the expected failure states (i.e., the ones for which you designed fault tolerance) but also any other failure states and various edge cases!
Also, be sure that the data you get back from a read matches exactly, bit-for-bit, the data that was input when it was created!
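One convenient way to do this bit-for-bit check, especially on multi-GB files, is to compare cryptographic digests of the original and the read-back data rather than eyeballing contents. The helper below is a small sketch of that idea.

```python
import hashlib

# Hash the original bytes before the write and the bytes returned by the
# read; equal lengths plus equal SHA-256 digests give very high confidence
# that the round trip through SUFS was lossless.
def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def round_trip_ok(original: bytes, read_back: bytes) -> bool:
    return len(original) == len(read_back) and digest(original) == digest(read_back)
```

For files too large to hold in memory, the same idea works by feeding `hashlib.sha256().update()` chunk by chunk while streaming each file.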
This project is worth 35% of the course grade, divided as follows:
- 20% final submission & demo
o team grade weighted by individual grade
o grade is based on successful functioning
- 5% for submitting each of the two checkpoints
o checkpoint grades are based on effort and providing the requested materials with sufficient details, not on successful functioning of the project
o first checkpoint is a team grade only (individual feedback is for your information only)
o second checkpoint is a team grade weighted by individual grade
- 5% for submitting the Team Charter and completing all peer/self-feedback forms
o graded for completion and reasonable effort
The first checkpoint will be a team grade only. (A peer/self-feedback will be done but will not affect grades – this is informative only and helps you address any concerns before it does affect your grade.) The second checkpoint and final submission will be weighted by an individual component.
Everyone on the team gets the same team grade out of 100%, and each team member gets their own individual grade out of 100%. Your actual grade on the assignment is the two of these multiplied together, then rounded to the nearest 1% using natural rounding. For example:
- Team 90% and individual 100% = you get 90%
- Team 100% and individual 80% = you get 80%
- Team 90% and individual 90% = you get 81% (because 0.9 * 0.9 = 0.81)
- Team 90% and individual 95% = you get 86% (because 0.9 * 0.95 = 0.855, which rounds to 86%)
Individual grades will not be affected at all by how well your project does or does not work (that's what the team grade is for); they're determined only by how well you participate and contribute to your team.
Submitting the Team Charter will be a team grade only. Submitting the peer/self-feedback will be graded individually and only graded for completing the feedback surveys.