ME 759
High Performance Computing for Engineering Applications
Assignment 10
Submit responses to all tasks that don’t specify a file name to Canvas in a single file called assignment10.{txt, docx, pdf, rtf, odt} (choose one of the formats). All plots should also be submitted to Canvas. All source files should be submitted in the HW10 subdirectory on the master branch of your homework git repo, with no subdirectories.
• #SBATCH --nodes=2 --cpus-per-task=20 --ntasks-per-node=1
Please submit clean code. Consider using a formatter like clang-format.
1. In this task, you will explore optimizations using ILP (instruction-level parallelism) based on the code examples given in Lecture 26, slides 5-35. Some macros and utility functions are defined in the provided file optimize.h, following the same naming conventions as the code examples in the lecture slides. You will need to accomplish the following:
a) Write five optimization functions, each of which either represents the baseline or uses a different technique to achieve ILP, as follows:
• optimize1 will be the same as combine4 function in slide 9.
• optimize2 will be the same as unroll2acombine function in slide 20.
• optimize3 will be the same as unroll2aacombine function in slide 22.
• optimize4 will be the same as unroll2acombine function in slide 25.
• optimize5 will be similar to optimize4, but with K = 3 and L = 3, where K and L are the parameters defined in slides 28 and 29.
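Assuming, as the K and L parameters suggest, that optimize4 and optimize5 rely on loop unrolling with several independent accumulators, the idea can be sketched as below. This is an illustration only, not the lecture's code: the function name unroll2_two_accumulators is hypothetical, std::vector stands in for the vec type, and the data_t, OP, and IDENT definitions are local stand-ins for the macros in optimize.h, which is not reproduced in this handout.

```cpp
#include <cstddef>
#include <vector>

// Local stand-ins (assumptions) for the macros that optimize.h provides.
using data_t = int;
#define OP +
#define IDENT 0

// Hypothetical example: unroll by 2 and keep two independent accumulators,
// so the two OP chains do not depend on each other and can overlap in the
// processor's pipeline.
data_t unroll2_two_accumulators(const std::vector<data_t>& v) {
    const std::size_t n = v.size();
    const std::size_t limit = n - (n % 2);  // last index covered by full pairs
    data_t x0 = IDENT;                      // accumulates even-indexed elements
    data_t x1 = IDENT;                      // accumulates odd-indexed elements
    for (std::size_t i = 0; i < limit; i += 2) {
        x0 = x0 OP v[i];
        x1 = x1 OP v[i + 1];
    }
    // Handle the leftover element when n is odd.
    for (std::size_t i = limit; i < n; ++i) {
        x0 = x0 OP v[i];
    }
    return x0 OP x1;  // combine the partial results
}
```

For floating-point data_t, splitting the accumulation changes the order of operations, so the result may differ from the baseline by rounding error.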
b) Write a program task1.cpp that will accomplish the following:
• Create a vec v of length n, where n is the first command line argument (see below), and fill it with data_t-type numbers however you like.
• Call your optimizeX functions to get the results of the OP operations and save each result in dest.
• Print the result of dest.
• Print the time taken to run the optimizeX function in milliseconds.
• Compile: g++ task1.cpp optimize.cpp -Wall -O3 -o task1 -fno-tree-vectorize
• Run (where n is a positive integer):
./task1 n
• Example expected output:
3125 //from optimize1
0.706 //from optimize1
3125 //from optimize2
0.710 //from optimize2
3125 //from optimize3
0.353 //from optimize3
3125 //from optimize4
0.354 //from optimize4
3125 //from optimize5
0.236 //from optimize5
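One possible way to obtain the millisecond timings printed above is std::chrono from the C++ standard library. The helpers below are a sketch under that assumption; the names time_ms and mean_time_ms are hypothetical and are not part of the provided files. The second helper averages over repeated runs, which the plotting step later in this task asks for.

```cpp
#include <chrono>

// Hypothetical helper: run f once and return the elapsed wall time in ms.
template <typename F>
double time_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

// Hypothetical helper: run f `runs` times and return the mean time in ms.
template <typename F>
double mean_time_ms(F&& f, int runs) {
    double total = 0.0;
    for (int i = 0; i < runs; ++i) {
        total += time_ms(f);
    }
    return total / runs;
}
```

A call such as mean_time_ms([&] { optimize1(v, &dest); }, 10) would then time ten runs, assuming optimize1 takes the vector and a destination pointer as in the lecture examples.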
c) On an Euler compute node:
Table 1: Setting of macros for each file.
File         data_t   OP   IDENT
task11.pdf   int      +    0
task12.pdf   int      *    1
task13.pdf   float    +    0.f
task14.pdf   float    *    1.f
• Run task1 for n = 10^6 with the settings of data_t, OP, and IDENT and the PDF file names given in Table 1. Each PDF should plot the time taken by all five of your optimizeX functions, plus one additional data point from a SIMD version of optimize1, vs. X in linear-linear scale, where X = 1,…,6. Run each optimizeX function 10 times and use the average time for plotting.
• Note for optimize.h file: You can change the definition of macros in optimize.h file to run tests for plotting, but your code should not depend on any changes in the provided optimize.h file in order to compile and run.
2. In this task, you will implement a parallel reduction (summation of an array) using hybrid OpenMP+MPI. You will use OpenMP to speed up the reduction within each process, and two MPI processes, each running on its own node and executing the reduce function, to add further parallelism. Figure 1 demonstrates the expected workflow of your program.

Figure 1: Schematic for the execution of reduction program.
a) In a file called reduce.cpp, implement the function whose prototype is specified in reduce.h. It should use OpenMP to speed up the reduction as much as possible (i.e., use the simd directive).
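As a rough illustration of an OpenMP reduction that combines threading with the simd directive, consider the sketch below. The function name reduce_sketch and its signature (a pointer plus a half-open index range [l, r)) are assumptions made for this example; the actual prototype is the one in reduce.h, which is not shown here.

```cpp
#include <cstddef>

// Hypothetical signature; follow reduce.h for the real one.
// Sums arr[l], ..., arr[r - 1].
float reduce_sketch(const float* arr, std::size_t l, std::size_t r) {
    float sum = 0.0f;
    // Split iterations across threads, vectorize each thread's chunk,
    // and combine the per-thread partial sums with a + reduction.
#pragma omp parallel for simd reduction(+ : sum)
    for (std::size_t i = l; i < r; ++i) {
        sum += arr[i];
    }
    return sum;
}
```

The pragma only takes effect when the code is compiled with OpenMP enabled (for example, -fopenmp with g++); otherwise the loop runs serially and gives the same sum up to floating-point rounding.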
b) Your program task2.cpp should accomplish the following:
• Create an array arr of length n, where n is the first command line argument (see below), and fill it with float-type numbers however you like. Note that n is half the length of the array being reduced, since each of the two MPI processes holds its own arr of length n.
• Initialize necessary variables for MPI environment.
• Set the number of OpenMP threads as t, where t is the second command line argument, see below.
• Call the reduce function and save the result in each MPI process’s local res as indicated in Figure 1.
• Use MPI_Reduce to combine the local results and get the global res.
• Print the global res from one process.
• Print the time taken for the entire reduction process (including the call to the reduce function and MPI_Reduce) in milliseconds.
• Run (where n is a positive integer, t is an integer in the range [1, 20]):
mpirun -np 2 --bind-to none ./task2 n t
• Example expected output:
3562.7
0.352
c) On an Euler compute node:
• Run task2 for n = 10^6 and t = 1, 2, ···, 20. Generate a plot called task2.pdf that shows the run time of your program (the second output of your program) vs. t in linear-linear scale.
• (Optional, extra credit 10 points) Compare the timing you obtained from two MPI processes running on two nodes with a pure OpenMP implementation that runs on one node. Plot run time vs. t in linear-linear scale for both patterns in task2_op.pdf. Submit your code for timing the pure OpenMP implementation as task2_op.cpp, which should take its input arguments in the same way as task2.cpp. Note that here n should be 2 × 10^6 to compare with the previous results. Discuss the differences between the two and the best way to achieve parallelism and reduce run time for arrays of different sizes.
