Description
CS610 – Assignment 4
Shivam Sharma (210983, sshivam21@iitk.ac.in)
The benchmarks were run on the GPU03 server of CC IIT Kanpur, which has an NVIDIA A100, while GPU3 of CSE has an A40. Timings should be somewhat higher on the A40, but the relative performance trends should be the same.
To compile, enter the relevant problem directory and run make.
1 Problem 1
1.1 Part 1
Implemented in problem1.cu (kernel1); takes about 2.2 ms to run for N = 128.
1.2 Part 2
Implemented in problem1.cu (kernel2) by caching the entire block and its halo regions in shared memory; takes about 0.29 ms to run for N = 128.
BLK (block side) Time (ms)
8 0.29
4 1.72
2 0.43
1 1.8
Table 1: Results on A100
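The caching idea in kernel2 can be sketched as below. This is an illustrative reconstruction, not the code from problem1.cu: it assumes a 2D 5-point averaging stencil with a 1-cell halo and N divisible by BLK; the actual update rule and boundary handling may differ.

```cuda
#define BLK 8  // block side length; Table 1 varies this value

// Each block first stages its BLK x BLK tile plus a one-cell halo into shared
// memory, so the stencil reads hit shared memory instead of global memory.
__global__ void stencil_shared(const float *in, float *out, int n) {
  __shared__ float tile[BLK + 2][BLK + 2];
  int gx = blockIdx.x * BLK + threadIdx.x;   // global coordinates
  int gy = blockIdx.y * BLK + threadIdx.y;
  int lx = threadIdx.x + 1, ly = threadIdx.y + 1;  // local (shifted by halo)

  tile[ly][lx] = in[gy * n + gx];
  // Edge threads also fetch the halo cells (zero outside the grid).
  if (threadIdx.x == 0)       tile[ly][0]       = (gx > 0)     ? in[gy * n + gx - 1] : 0.f;
  if (threadIdx.x == BLK - 1) tile[ly][BLK + 1] = (gx + 1 < n) ? in[gy * n + gx + 1] : 0.f;
  if (threadIdx.y == 0)       tile[0][lx]       = (gy > 0)     ? in[(gy - 1) * n + gx] : 0.f;
  if (threadIdx.y == BLK - 1) tile[BLK + 1][lx] = (gy + 1 < n) ? in[(gy + 1) * n + gx] : 0.f;
  __syncthreads();  // all loads must finish before any thread computes

  out[gy * n + gx] = 0.25f * (tile[ly - 1][lx] + tile[ly + 1][lx] +
                              tile[ly][lx - 1] + tile[ly][lx + 1]);
}
```

The halo is what makes small BLK values expensive in Table 1: the ratio of halo cells to interior cells grows as the tile shrinks.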
2 Problem 2
The implemented algorithm calculates the prefix sum in log n steps. The i-th step (0-based indexing) combines adjacent prefix sums of length 2^i into prefix sums of length 2^(i+1) using threads. For correctness, the i-th step must complete for all threads before any thread moves on to the next step. Since there is no convenient way to synchronize the entire grid from inside a kernel, the algorithm is implemented using log n sequential kernel launches.
Below are the benchmarks; the time for copying data to and from the device is not included for Thrust or for my implementation.
N CPU Time(ms) Thrust Time(ms) Custom Impl Time(ms)
2^20 1.59 3.7 34.75
2^24 26.18 2.52 48.28
2^28 369 5.2 134.53
Table 2: Results on A100
3 Problem 3
3.1 Part 1
To resolve this, let’s make the kernels return 0/1 (a single byte) per thread, indicating whether the search was successful or not. Once we have all the successful points, the corresponding (q1, q2 … q10) can simply be recalculated on the CPU itself. This works because such points are few in number (around 11K).
There is still a problem: even with only 1 byte per iteration point, the total memory requirement for 13^10 points would be about 128 GB. To cut this down, I have peeled the outermost two loops out of the kernel. The iteration space per kernel then reduces to 13^8 points, requiring only 0.75 GB. On the downside, this scheme requires 13^2 = 169 kernel invocations.
This is implemented in problem3-v1.cu and takes about 100s to run.
3.2 Part 2
Note that, unlike in Problem 2, there is no compulsion to run the grids sequentially; but since they use the same output buffer, its contents have to be copied back to the host first, which forces synchronization.
A solution to this is to use n device and host buffers, distributed to kernels in a round-robin way. Now n kernels can run asynchronously. When all the buffers are used up, we move the data from all of them to the host. This copying can itself be overlapped using cudaMemcpyAsync, provided all kernels run on different streams, so we create n CUDA streams as well.
This creates yet another bottleneck: the CPU now receives 13^8 · n bytes of data and has to iterate through all of it to find successful searches before launching the next batch of kernels. Luckily, this too is easily parallelized using CPU threads.
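The stream/buffer pipeline can be sketched as follows. All names, grid dimensions, and the value of n are illustrative, not copied from problem3-v2.cu; search_kernel stands in for the actual search kernel.

```cuda
const int NSTREAMS = 4;                 // n, an assumed value
const size_t CHUNK = 815730721;         // 13^8 bytes, one flag byte per point

cudaStream_t streams[NSTREAMS];
unsigned char *d_buf[NSTREAMS], *h_buf[NSTREAMS];
for (int i = 0; i < NSTREAMS; ++i) {
  cudaStreamCreate(&streams[i]);
  cudaMalloc(&d_buf[i], CHUNK);
  cudaMallocHost(&h_buf[i], CHUNK);     // pinned memory, needed for async copies
}

for (int outer = 0; outer < 169; ++outer) {   // 13^2 peeled-off outer iterations
  int s = outer % NSTREAMS;                   // round-robin stream/buffer choice
  if (outer >= NSTREAMS)                      // buffer reuse: wait for its
    cudaStreamSynchronize(streams[s]);        // previous copy to finish first
  search_kernel<<<grid, block, 0, streams[s]>>>(d_buf[s], outer);
  cudaMemcpyAsync(h_buf[s], d_buf[s], CHUNK,
                  cudaMemcpyDeviceToHost, streams[s]);
  // After a batch of NSTREAMS buffers drains, CPU threads scan each h_buf[s]
  // for set flag bytes and recompute the q-values for those indices.
}
```

Because each kernel and its device-to-host copy are enqueued on the same stream, the copy of one buffer overlaps with the kernels running on the other streams.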
All of this is implemented in problem3-v2.cu and takes about 36s to run.
3.3 Part 3
Implemented in problem3-v3.cu and takes around 71s to run.
3.4 Part 4
Implemented in problem3-thrust.cu using thrust::tabulate; takes about 140 s to run.
4 Problem 4
All versions implemented in problem4.cu. N is the input side length and K is the kernel side length. In the optimized version, I keep the kernel in constant memory and cache the entire block and the required halo regions into shared memory.
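The 2D version of the optimized scheme can be sketched as below; the names, tile size, and cooperative-load pattern are illustrative of the structure described above, not copied from problem4.cu, and the filter is assumed to fit the largest K in the tables (13):

```cuda
#define TILE 16
#define MAX_K 13

__constant__ float c_filter[MAX_K * MAX_K];  // kernel weights in constant memory

__global__ void conv2d_shared(const float *in, float *out, int n, int k) {
  extern __shared__ float tile[];            // (TILE + k - 1)^2 floats
  int r = k / 2, w = TILE + k - 1;
  int gx0 = blockIdx.x * TILE - r;           // global origin of tile incl. halo
  int gy0 = blockIdx.y * TILE - r;

  // Cooperative load of the block's tile plus halo, zero-padding off-image.
  for (int i = threadIdx.y * TILE + threadIdx.x; i < w * w; i += TILE * TILE) {
    int ly = i / w, lx = i % w;
    int gy = gy0 + ly, gx = gx0 + lx;
    tile[i] = (gx >= 0 && gy >= 0 && gx < n && gy < n) ? in[gy * n + gx] : 0.f;
  }
  __syncthreads();

  int gx = blockIdx.x * TILE + threadIdx.x;
  int gy = blockIdx.y * TILE + threadIdx.y;
  if (gx < n && gy < n) {
    float acc = 0.f;
    for (int fy = 0; fy < k; ++fy)
      for (int fx = 0; fx < k; ++fx)
        acc += c_filter[fy * k + fx] *
               tile[(threadIdx.y + fy) * w + (threadIdx.x + fx)];
    out[gy * n + gx] = acc;
  }
}
```

Every thread reads the same broadcast filter weight per inner iteration, which is exactly the access pattern constant memory is fast for; this is why the optimized times in Tables 3 and 4 grow so slowly with K.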
2D Conv –
N K CPU Time(ms) Naive(ms) Optimized(ms)
2^13 5 1929 97 86
2^13 7 3326 111 84
2^13 13 12439 184 96
Table 3: 2D Conv Results
3D Conv –
N K CPU Time(ms) Naive(ms) Optimized(ms)
2^9 5 23620 409 207
2^9 7 61424 844 214
2^9 13 367157 4456 735
Table 4: 3D Conv Results
Clearly, the optimized version scales much better with K.