CS610 – Assignment 3
Shivam Sharma (210983, sshivam21@iitk.ac.in)
All benchmarking was done on csews31 in the KD Lab. To build the code, cd into the corresponding problemx directory and run make to build the executables.
1 Problem 1
For all vectorized variants, I've chosen the ikj loop order, as it allows efficient vectorization. Before the j loop starts, A[i][k] is broadcast into a vector register. Inside the j loop, the WIDTH elements starting at B[k][j] are loaded into another register.
With N = 2048,
Variant                        Time (ms)   Speedup
Sequential                         33262      –
Sequential Blocked (BLK = 64)       7362     4.51
SSE4 Unaligned                      2199    15.12
SSE4 Aligned                        2209    15.05
SSE4 Blocked (BLK = 128)            1787    18.61
AVX2 Unaligned                      1807    18.40
AVX2 Aligned                        1752    18.98
AVX2 Blocked (BLK = 128)            1160    28.67
Table 1: Matrix multiplication Results on csews31
As expected, combining wider registers with cache blocking gives the best speedup.
2 Problem 2
The AVX2 implementation closely follows the given SSE4 implementation. The only challenge is that AVX2 does not provide an instruction to shift across the entire 256-bit register: its shift instructions operate on each 128-bit lane independently. To work around this, I first compute the prefix sums of the individual 128-bit lanes using the same method as the SSE4 version, then extract the fourth integer of the lower lane and broadcast it into another 128-bit register. The upper lane is added to this register and concatenated back with the lower lane.
With N = (1 << 30),
Variant   Time (us)
Serial       713449
OMP          707339
SSE4         717713
AVX2         714587
Table 2: Inclusive PS Results on csews31
3 Problem 3
The following changes have been made:
• pthread_create has been replaced by #pragma omp parallel with the required number of threads and shared data.
• pthread_barrier has been replaced by #pragma omp barrier.
• All pthread_mutex locks, except the one used with the condition variable, have been replaced by #pragma omp critical (name).
4 Problem 4
The unoptimized code runs in about 330s on csews31.
4.1 Part A
The innermost loop contains a lot of redundant computation that can be avoided by caching values in the enclosing loops. More formally, at each loop level i, we store a_{j,i} = c_{j,1}·x_1 + c_{j,2}·x_2 + ... + c_{j,i}·x_i for 1 <= j <= 10. This is implemented in problem4-v1.c. It reduces the runtime to about 160s, a 2x speedup.
4.2 Part B
Parallelizing the outermost loop with OpenMP reduces the runtime to about 45s, a 7.3x speedup. This is implemented in problem4-v2.c.