CpSc 418 Homework 4
Early-Bird: March 27, 2023, 11:59pm
90 points
Prelude
Please submit your solution using: handin cs-418 hw
Your solution should contain three files:
henon.cu: Your code for Q1.
saxpy.cu: Your code for Q2. As usual, we will compile and test your code using the versions of Makefile, hw4_lib.cu, and hw4_lib.h that are supplied with this assignment.
hw4.pdf: Your solutions to written questions.
The Questions
- Hénon Maps. (50 points) The Hénon map is a two-variable recurrence defined below:
x_{i+1} = 1 - a*x_i^2 + y_i
y_{i+1} = b*x_i        (1)
For this problem, we choose a = 1.3 and b = 0.3. The Hénon map is a chaotic dynamical system:
starting from two points that are initially close, the sequences will diverge from each other. On the
other hand, for initial points that aren't too large, all subsequent points stay in a bounded region.
For our use, the Hénon map provides a computation that is purely floating-point arithmetic bound:
the SMs can hold all of the values that they need in registers. We use this as a way to see how fast a
GPU can perform floating point operations.
The file henon.cu provides CPU and GPU implementations of the Hénon map. You can build the
executable henon with the command
make henon
given that you've copied all of the source files into your current directory. The provided Makefile is
configured for use on the linXX.students.cs.ubc.ca machines. Once you have an executable, you
can run it with the command:
./henon n_blocks threads_per_block n_iter n_trials
Each of these has a default value provided. If you only give some of the arguments, the remaining ones
will get their default values. Likewise, if you give a single - as an argument, that parameter will get
its default value. The defaults are (currently):
n_blocks = 100, threads_per_block = 1024, n_iter = 100000, and n_trials = 5.
The code generates n_blocks*threads_per_block random initial points for the Hénon map, and
performs n_iter iterations on each of these points using the GPU. The time for the CUDA kernel
execution is measured. This is repeated n_trials times, and the mean and standard deviation of the
execution time are reported. We also report the mean and standard deviation of the square of the
step size for the iteration, where
step2_i = (x_i - x_{i-1})^2 + (y_i - y_{i-1})^2
This should be robust even though the Hénon map sequence is chaotic.
With that out of the way, here are the questions:
(a) (15 points) Complete the body of the __global__ function henon.
Hints:
- This is really easy. Most of the code can be copied from the body of henon_cpu.
- You will need to compute this thread's index for accessing the arrays x0, y0, s2, and s4. You can look at other GPU kernels, for example in examples.cu, to figure out what you need.
- Likewise, you will need to check that the index for this thread is in bounds. Again, you can look at the kernels in examples.cu for examples.
- The GPU and CPU versions won't produce exactly the same answers, because C (running on the x86) converts all floats to doubles when doing arithmetic, and then rounds the final value when assigning to a float. The GPU code does everything with floats if the operands are floats. The rounding error makes the two sequences of points diverge because the Hénon map is chaotic. The good news is that both versions converge to the same estimates of RMS step size and variance. I might find some other ways you can test your code, or you can.
(b) (5 points) The linXX.students.cs.ubc.ca machines have nVidia GTX 1060 GPUs. How many SMs does the GTX 1060 GPU have? Cite your source. Better yet, run the program cuprop (built from cuprop.cu in the source directory for this assignment). Either way, explain how you got your answer.
(c) (5 points) How many floating point operations are performed by the GPU when evaluating the Hénon map kernel? Express your answer as a function of n_blocks, threads_per_block, and n_iter.
(d) (10 points) Experiment with different choices for n_blocks, threads_per_block, and n_iter to find a choice that maximizes the number of floating point operations performed per second, with the reported mean execution time being less than 0.5 seconds. What is the maximum number of floating point operations per second that you were able to obtain? Make your timing measurements using one of the linXX machines. Please state which machine, and use the same machine for all questions about the Hénon map. Include a table with your measurements in your solution.
This doesn't require running a huge number of trials. Try setting threads_per_block to something reasonable (i.e. 512 or 1024). Then experiment with n_blocks. I like n_trials = 10 or 20, and setting n_iter to get an average run-time of 0.01 to 0.1 seconds. Don't go much longer than that, because someone might be doing their graphics homework on the linXX box that you are using.
I found that it was helpful to run a test two or three times if the value seemed out-of-line with the other measurements. Sometimes, the first run (after a pause) took 20% or so longer than the others, while the remaining runs were within 2% or 3% of each other. I don't know why. You should find that the execution time makes steps at critical values of n_blocks.
(e) (5 points) You should observe that
ExecutionTime / (threads_per_block * n_iter)
is a stair-step function of n_blocks. At what values of n_blocks do the steps occur? Why?
Include a brief justification of your answer.
Hints:
- Think about scheduling blocks to run on the SMs of the GPU.
- The spacing of the steps is specific to the GTX 1060. Other GPUs would have their own spacings.
(f) (15 points) What is the speed-up relative to the host CPU? The main function reports the squared-step-size mean and standard deviation for both henon_cpu and henon_gpu so you can compare the results. Report the timing data that you get and report the speed-up.
(g) Just for fun. Does henon_gpu achieve the peak floating-point throughput of the GPU? Justify your answer. If it doesn't, what are the bottlenecks? Can you modify the code to make it faster (while keeping it correct)?
- Is saxpy fast? (40 points) The file saxpy.cu provides GPU and CPU implementations of saxpy. You can build saxpy with the same Makefile as henon. Once you have an executable, you can run it with the command:
./saxpy n_data n_iter a
Each of these has a default value provided. If you only give some of the arguments, the remaining ones will get their default values. Likewise, if you give a single - as an argument, that parameter will get its default value. The defaults are (currently): n_data = 1024, n_iter = 10000, and a = 1.234.
- (5 points) Complete the body of main so that it reports the same data for runs of saxpy_cpu as is reported for saxpy_gpu. Print the time elapsed for the CPU execution and the flops (floating-point operations per second). Hint: this is another "read the code you're given, understand it, do a quick cut-and-paste, and make a few changes" problem.
- (10 points) Run your code to get timing measurements for saxpy_gpu and saxpy_cpu with n_data = 1024, 2048, 3072, 4096, 5120, 6144, 7168, 8192, 9216, and 10240, and n_iter = 10000. I'm choosing multiples of 1024 (i.e. 2^10) to make the problem favorable for the GPU. Your table should have columns for n_data, t_gpu, t_cpu, flops_gpu, and flops_cpu. Make your timing measurements on one of the linXX machines and state which machine you used.
- (10 points) Let flop_count = 2*n_data*n_iter, i.e. the total number of floating point operations performed by either the CPU or GPU. Do a linear fit of t_gpu as a function of flop_count. The slope of the line is the time spent by the GPU doing floating point operations plus the time for data transfers between the CPU and GPU. What is this slope? Your answer should be in units of seconds per floating-point operation, which you can invert to get floating point operations per second.
- (10 points) The intercept of the linear fit, i.e. the extrapolated value for t_gpu when flop_count = 0, is the time to launch the n_iter kernels. Based on your linear fit, what is the time to launch a kernel? You can now write a model of the form:
t_gpu = t_launch * n_iter + t_op * n_data * n_iter
where you estimated t_op in the previous part of this problem and t_launch here. Try a few other values of n_iter, and report the predicted and measured values for t_gpu. Is the proposed model a plausible explanation of the data? Why or why not?
- (5 points) Is a GPU fast when computing saxpy? Answer yes or no and give a one-sentence justification. Note: this example runs saxpy as a stand-alone kernel, like the examples in the textbook. In practice, saxpy could be called as just one part of a larger kernel. In that case, the overheads of copying data between the CPU and GPU and the time for the kernel launch would be less of a concern. This connects real GPU performance back to the ideas we had in the performance modeling unit.
Note: You could try much larger values for n_data, e.g. 1,000,000 <= n_data <= 10,000,000. This will
significantly change the measured values for flops_gpu and flops_cpu (but it doesn't change which
is faster). I thought of asking you to make the measurements and to explain what you're observing,
but then I remembered that most of you are about to graduate, and I'm sure that most of you have
other things that you would rather do with your time.
Notes:
- The CUDA tools such as nvcc are in /cs/local/lib/pkg/cudatoolkit-11.6.2/bin on the linXX.students.cs.ubc.ca machines.
- The valid names for the undergraduate graphics lab linux machines are lin01.students.cs.ubc.ca through lin24.students.cs.ubc.ca. (nslookup also finds lin25.)
- Note that the linXX machines are also used for homework in graphics courses. Please use uptime to make sure that the machine you are using isn't heavily loaded, and try a different machine if it is. Keep your CUDA kernel execution times under a second.
- The source files for this homework are available at: hw4.html
Unless otherwise noted or cited, the questions and other material in this homework problem set are
copyright 2022 by Mark Greenstreet and are made available under the terms of the Creative Commons
Attribution 4.0 International license, http://creativecommons.org/licenses/by/4.0/