# CpSc 418 Homework 4

Early-Bird: March 27, 2023, 11:59pm

90 points

### Prelude

Submit the following files:

- `henon.cu`: your code for Q1.
- `saxpy.cu`: your code for Q2. As usual, we will compile and test your code using the versions of `Makefile`, `hw4_lib.cu`, and `hw4_lib.h` that are supplied with this assignment.
- `hw4.pdf`: your solutions to the written questions.

### The Questions

1. Hénon Maps (50 points). The Hénon map is a two-variable recurrence defined below:

   ```
   x_{i+1} = 1 - a*x_i^2 + y_i          (1)
   y_{i+1} = b*x_i
   ```

For this problem, we choose a = 1.3 and b = 0.3. The Hénon map is a chaotic dynamical system: starting from two points that are initially close, the sequences will diverge from each other. On the other hand, for initial points that aren't too large, all subsequent points stay in a bounded region. For our use, the Hénon map provides a computation that is purely floating-point-arithmetic bound: the SMs can hold all of the values that they need in registers. We use this as a way to see how fast a GPU can perform floating-point operations.

The file `henon.cu` provides CPU and GPU implementations of the Hénon map. You can build the executable `henon` with the command

```
make henon
```

given that you've copied all of the source files into your current directory. The provided `Makefile` is configured for use on the `linXX.students.cs.ubc.ca` machines. Once you have an executable, you can run it with the command:

```
./henon n_blocks threads_per_block n_iter n_trials
```

Each of these has a default value provided. If you only give some of the arguments, the remaining ones will get their default values. Likewise, if you give a single `-` as an argument, that parameter will get its default value. The defaults are (currently): `n_blocks` = 100, `threads_per_block` = 1024, `n_iter` = 100000, and `n_trials` = 5.

The code generates `n_blocks * threads_per_block` random initial points for the Hénon map, and performs `n_iter` iterations on each of these points using the GPU. The time for the CUDA kernel execution is measured. This is repeated `n_trials` times, and the mean and standard deviation of the execution time are reported. We also report the mean and standard deviation of the square of the step size for the iteration, where

```
step_i^2 = (x_i - x_{i-1})^2 + (y_i - y_{i-1})^2
```

This should be robust even though the Hnon map sequence is chaotic.

With that out of the way, here are the questions:

(a) (15 points) Complete the body of the `__global__` function `henon`.
    Hints:
• This is really easy. Most of the code can be copied from the body of `henon_cpu`.
• You will need to compute this thread's index for accessing the arrays `x0`, `y0`, `s2`, and `s4`. You can look at other GPU kernels, for example in `examples.cu`, to figure out what you need.
• Likewise, you will need to check to make sure that the index for this thread is in bounds. Again, you can look at the kernels in `examples.cu` to see examples.
(e) `ExecutionTime` is a stair-step function of `n_blocks`. At what values of `n_blocks` do the steps occur? Why?
    Hints:

• Think about scheduling blocks to run on the SMs of the GPU.
• The spacing of the steps is specific to the GTX 1060. Other GPUs would have their own spacings.

(f) (15 points) What is the speed-up relative to the host CPU? The `main` function reports the squared-step-size mean and standard deviation for both `henon_cpu` and `henon_gpu` so you can compare the results. Report the timing data that you get and report the speed-up.

(g) Just for fun. Does `henon_gpu` achieve the peak floating-point throughput of the GPU? Justify your answer. If it doesn't, what are the bottlenecks? Can you modify the code to make it faster (while keeping it correct)?
2. Is `saxpy` fast? (40 points) The file `saxpy.cu` provides GPU and CPU implementations of `saxpy`. You can build `saxpy` with the same `Makefile` as `henon`. Once you have an executable, you can run it with the command:

   ```
   ./saxpy n_data n_iter a
   ```

   Each of these has a default value provided. If you only give some of the arguments, the remaining ones will get their default values. Likewise, if you give a single `-` as an argument, that parameter will get its default value. The defaults are (currently): `n_data` = 1024, `n_iter` = 10000, and `a` = 1.234.
• (5 points) Complete the body of `main` so that it reports the same data for the runs of `saxpy_cpu` as is reported for `saxpy_gpu`. Print the time elapsed for the CPU execution and the flops (floating-point operations per second). Hint: this is another "read the code you're given, understand it, do a quick cut-and-paste, and make a few changes" problem.
• (10 points) Run your code to get timing measurements for `saxpy_gpu` and `saxpy_cpu` with `n_data` = 1024, 2048, 3072, 4096, 5120, 6144, 7168, 8192, 9216, and 10240, and `n_iter` = 10000. I'm choosing multiples of 1024 (i.e. 2^10) to make the problem favorable for the GPU. Your table should have columns for `n_data`, `t_gpu`, `t_cpu`, `flops_gpu`, and `flops_cpu`. Make your timing measurements on one of the `linXX` machines and state which machine you used.
• (10 points) Let `flop_count = 2 * n_data * n_iter`, i.e. the total number of floating-point operations performed by either the CPU or GPU. Do a linear fit of `t_gpu` as a function of `flop_count`. The slope of the line is the time spent by the GPU doing floating-point operations plus the time for data transfers between the CPU and GPU. What is this slope? Your answer should be in units of seconds per floating-point operation, which you can invert to get floating-point operations per second.
• (10 points) The intercept of the linear fit, i.e. the extrapolated value for `t_gpu` when `flop_count = 0`, is the time to launch the `n_iter` kernels. Based on your linear fit, what is the time to launch a kernel? You can now write a model of the form:

  ```
  t_gpu = t_launch * n_iter + t_op * n_data * n_iter
  ```

  where you estimated `t_op` in the previous part of this problem and `t_launch` here. Try a few other values of `n_iter`, and report the predicted and measured values for `t_gpu`. Is the proposed model a plausible explanation of the data? Why or why not?
• (5 points) Is a GPU fast when computing `saxpy`? Answer yes or no and give a one-sentence justification. Note: this example runs `saxpy` as a stand-alone kernel, like the examples in the textbook. In practice, `saxpy` could be called as just one part of a larger kernel. In that case, the overheads of copying data between the CPU and GPU and the time for the kernel launch would be less of a concern. This connects real GPU performance back to the ideas we had in the performance modeling unit.
Note: You could try much larger values for `n_data`, e.g. 1,000,000 ≤ `n_data` ≤ 10,000,000. This will significantly change the measured values for `flops_gpu` and `flops_cpu` (but it doesn't change which is faster). I thought of asking you to make the measurements and to explain what you're observing, but then I remembered that most of you are about to graduate, and I'm sure that most of you have other things that you would rather do with your time.

Notes:

• The CUDA tools such as `nvcc` are in `/cs/local/lib/pkg/cudatoolkit-11.6.2/bin` on the `linXX.students.cs.ubc.ca` machines.
• The valid names for the undergraduate graphics lab Linux machines are `lin01.students.cs.ubc.ca` through `lin24.students.cs.ubc.ca`. (`nslookup` also finds `lin25`.)
• Note that the `linXX` machines are also used for homework for graphics courses. Please use `uptime` to make sure that the machine you are using isn't heavily loaded, and try a different machine if it is. Keep your CUDA kernel execution times under a second.
• The source files for this homework are available at: `hw4.html`
Unless otherwise noted or cited, the questions and other material in this homework problem set are copyright 2022 by Mark Greenstreet and are made available under the terms of the Creative Commons