Cilk: An Efficient Multithreaded Runtime System

Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, Yuli Zhou

MIT Laboratory for Computer Science
545 Technology Square
Cambridge, MA 02139

Cilk (pronounced “silk”) is a C-based runtime system for multithreaded parallel programming. In this paper, we document the efficiency of the Cilk work-stealing scheduler, both empirically and analytically. We show that on real and synthetic applications, the “work” and “critical path” of a Cilk computation can be used to accurately model performance. Consequently, a Cilk programmer can focus on reducing the work and critical path of his computation, insulated from load balancing and other runtime scheduling issues. We also prove that for the class of “fully strict” (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal.
The Cilk runtime system currently runs on the Connection Machine CM5 MPP, the Intel Paragon MPP, the Silicon Graphics Power Challenge SMP, and the MIT Phish network of workstations. Applications written in Cilk include protein folding, graphic rendering, backtrack search, and the Socrates chess program, which won third prize in the 1994 ACM International Computer Chess Championship.
Multithreading has become an increasingly popular way to implement dynamic, highly asynchronous, concurrent programs [1, 8, 9, 10, 11, 12, 15, 19, 21, 22, 24, 25, 28, 33, 34, 36, 39, 40]. A multithreaded system provides the programmer with a means to create, synchronize, and schedule threads. Although the schedulers in many of these runtime systems seem to perform well in practice, none provide users with a guarantee of application performance. Cilk is a runtime system whose work-stealing scheduler is efficient in theory as well as in practice. Moreover, it gives the user an algorithmic model of application performance based on the measures of “work” and “critical path” which can be used to predict the runtime of a Cilk program accurately.
A Cilk multithreaded computation can be viewed as a directed acyclic graph (dag) that unfolds dynamically, as is shown schematically in Figure 1. A Cilk program consists of a collection of Cilk procedures, each of which is broken into a sequence of threads that form the vertices of the dag.

This research was supported in part by the Advanced Research Projects Agency under Grants N00014-94-1-0985 and N00014-92-J-1310. Robert Blumofe is supported in part by an ARPA High-Performance Computing Graduate Fellowship. Keith Randall is supported in part by a Department of Defense NDSEG Fellowship.

Figure 1: The Cilk model of multithreaded computation. Threads are shown as circles, which are grouped into procedures. Each downward edge corresponds to a spawn of a child, each horizontal edge corresponds to a spawn of a successor, and each upward, curved edge corresponds to a data dependency. The numbers in the figure indicate the levels of procedures in the spawn tree.
The execution time of any Cilk program on a parallel computer with $P$ processors is constrained by two parameters of the computation: the work and the critical path. The work, denoted $T_1$, is the time used by a one-processor execution of the program, which corresponds to the sum of the execution times of all the threads. The critical path length, denoted $T_\infty$, is the total amount of time required by an infinite-processor execution, which corresponds to the largest sum of thread execution times along any path. With $P$ processors, the execution time cannot be less than $T_1/P$ or less than $T_\infty$. The Cilk scheduler uses “work stealing” [3, 7, 13, 14, 15, 19, 27, 28, 29, 34, 40] to achieve execution time very near to the sum of these two measures. Off-line techniques for computing such efficient schedules have been known for a long time [5, 16, 17], but this efficiency has been difficult to achieve on-line in a distributed environment while simultaneously using small amounts of space and communication.
The Cilk language is an extension to C that provides an abstraction of threads in explicit continuation-passing style. A Cilk program is preprocessed to C and then linked with a runtime library to run on the Connection Machine CM5 MPP, the Intel Paragon MPP, the Silicon Graphics Power Challenge SMP, or the MIT Phish [4] network of workstations. In this paper, we focus on the Connection Machine CM5 implementation of Cilk. The Cilk scheduler on the CM5 is written in about 30 pages of C, and it performs communication among processors using the Strata [6] active-message library.
The remainder of this paper is organized as follows. Section 2 describes Cilk’s runtime data structures and the C language extensions that are used for programming. Section 3 describes the work-stealing scheduler. Section 4 documents the performance of several Cilk applications. Section 5 shows how the work and critical path of a Cilk computation can be used to model performance. Section 6 shows analytically that the scheduler works well. Finally, Section 7 offers some concluding remarks and describes our plans for the future.
In this section we describe a C language extension that we have developed to ease the task of coding Cilk programs. We also explain the basic runtime data structures that Cilk uses.
In the Cilk language, a thread T is defined in a manner similar to a C function definition:
thread T (arg-decls ...) { stmts }
The Cilk preprocessor translates T into a C function of one argument and void return type. The one argument is a pointer to a closure data structure, illustrated in Figure 2, which holds the arguments for T. A closure consists of a pointer to the C function for T, a slot for each of the specified arguments, and a join counter indicating the number of missing arguments that need to be supplied before T is ready to run. A closure is ready if it has obtained all of its arguments, and it is waiting if some arguments are missing. To run a ready closure, the Cilk scheduler invokes the thread as a procedure using the closure itself as its sole argument. Within the code for the thread, the arguments are copied out of the closure data structure into local variables. The closure is allocated from a simple runtime heap when it is created, and it is returned to the heap when the thread terminates.
The Cilk language supports a data type called a continuation, which is specified by the type modifier keyword cont. A continuation
Figure 2: The closure data structure.
is essentially a global reference to an empty argument slot of a closure, implemented as a compound data structure containing a pointer to a closure and an offset that designates one of the closure’s argument slots. Continuations can be created and passed among threads, which enables threads to communicate and synchronize with each other. Continuations are typed with the C data type of the slot in the closure.
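As a concrete sketch, the closure and continuation described above might be declared as follows in plain C; the layout, field names, and fixed-size argument array are our illustrative choices, not Cilk's actual definitions.

```c
#include <stddef.h>

/* Sketch of a Cilk closure: a thread pointer, argument slots, and a
   join counter of missing arguments (illustrative layout only). */
struct closure {
    void (*thread_fn)(struct closure *); /* C function compiled from the thread */
    int join_counter;                    /* missing arguments still to arrive */
    long args[4];                        /* one slot per declared argument */
};

/* A continuation: a global reference to one empty argument slot,
   i.e., a closure pointer plus an offset designating the slot. */
struct cont {
    struct closure *clo;
    int slot;
};

/* A closure is ready when all of its arguments have been supplied. */
static int is_ready(const struct closure *c)
{
    return c->join_counter == 0;
}
```

In a real implementation the argument slots would be sized and typed per thread; the fixed `long args[4]` here just keeps the sketch self-contained.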
At runtime, a thread can spawn a child thread by creating a closure for the child. Spawning is specified in the Cilk language as follows:
spawn T (args ...)
This statement creates a child closure, fills in all available arguments, and initializes the join counter to the number of missing arguments. Available arguments are specified as in C. To specify a missing argument, the user specifies a continuation variable (type cont) preceded by a question mark. For example, if the second argument is ?k, then Cilk sets the variable k to a continuation that refers to the second argument slot of the created closure. If the closure is ready, that is, it has no missing arguments, then spawn causes the closure to be immediately posted to the scheduler for execution. In typical applications, child closures are usually created with no missing arguments.
To create a successor thread, a thread executes the following statement:
spawn_next T (args ...)
This statement is semantically identical to spawn, but it informs the scheduler that the new closure should be treated as a successor, as opposed to a child. Successor closures are usually created with some missing arguments, which are filled in by values produced by the children.
A Cilk procedure never returns values in the normal way to a parent procedure. Instead, the programmer must code the parent procedure as two threads. The first thread spawns the child procedure, passing it a continuation pointing to the successor thread’s closure. The child sends its “return” value explicitly as an argument to the waiting successor. This strategy of communicating between threads is called explicit continuation passing. Cilk provides primitives of the following form to send values from one closure to another:
send_argument (k, value)
This statement sends the value value to the argument slot of a waiting closure specified by the continuation k. The types of the continuation and the value must be compatible. The join counter of the waiting closure is decremented, and if it becomes zero, then the closure is ready and is posted to the scheduler.

thread fib (cont int k, int n)
{ if (n<2)
    send_argument (k, n);
  else
  { cont int x, y;
    spawn_next sum (k, ?x, ?y);
    spawn fib (x, n-1);
    spawn fib (y, n-2);
  }
}

thread sum (cont int k, int x, int y)
{ send_argument (k, x+y);
}

Figure 3: A Cilk procedure, consisting of two threads, to compute the $n$th Fibonacci number.
Figure 3 shows the familiar recursive Fibonacci procedure written in Cilk. It consists of two threads, fib and its successor sum. Reflecting the explicit continuation passing style that Cilk supports, the first argument to each thread is the continuation specifying where the “return” value should be placed.
When the fib function is invoked, it first checks to see if the boundary case has been reached, in which case it uses send_argument to “return” the value of n to the slot specified by continuation k. Otherwise, it spawns the successor thread sum, as well as two children to compute the two subcases. Each of these two children is given a continuation specifying to which argument in the sum thread it should send its result. The sum thread simply adds the two arguments when they arrive and sends this result to the slot designated by k.
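To make the dataflow concrete, here is a toy single-processor emulation of this fib procedure in plain C, with the closure plumbing that the Cilk preprocessor and runtime normally generate written out by hand. All helper names (`mk_closure`, `post`, `fib_cps`, and so on) are ours, and the single LIFO ready list stands in for the real scheduler.

```c
#include <stdlib.h>

struct closure;
typedef void (*thread_fn)(struct closure *);
struct cont { struct closure *clo; int slot; };
union slot { long i; struct cont k; };

struct closure {
    thread_fn fn;
    int join_counter;        /* missing arguments */
    union slot args[3];
    struct closure *next;    /* ready-list link */
};

static struct closure *ready_head;   /* the (single) ready list */
static long result;                  /* where the root continuation points */

static struct closure *mk_closure(thread_fn fn, int missing)
{
    struct closure *c = calloc(1, sizeof *c);
    c->fn = fn;
    c->join_counter = missing;
    return c;
}

static void post(struct closure *c) { c->next = ready_head; ready_head = c; }

/* send_argument: fill the slot named by continuation k, decrement the
   join counter, and post the closure when it becomes ready. */
static void send_argument(struct cont k, long value)
{
    k.clo->args[k.slot].i = value;
    if (--k.clo->join_counter == 0)
        post(k.clo);
}

static void sum_thread(struct closure *self)
{
    send_argument(self->args[0].k, self->args[1].i + self->args[2].i);
    free(self);
}

static void fib_thread(struct closure *self)
{
    struct cont k = self->args[0].k;
    long n = self->args[1].i;
    if (n < 2) {
        send_argument(k, n);
    } else {
        /* spawn_next sum (k, ?x, ?y): successor with two missing args */
        struct closure *succ = mk_closure(sum_thread, 2);
        succ->args[0].k = k;
        struct cont x = { succ, 1 }, y = { succ, 2 };
        /* spawn fib (x, n-1) and spawn fib (y, n-2): ready children */
        struct closure *c1 = mk_closure(fib_thread, 0);
        c1->args[0].k = x; c1->args[1].i = n - 1; post(c1);
        struct closure *c2 = mk_closure(fib_thread, 0);
        c2->args[0].k = y; c2->args[1].i = n - 2; post(c2);
    }
    free(self);
}

static void root_thread(struct closure *self)
{
    result = self->args[0].i;    /* the "return" value arrives here */
    free(self);
}

long fib_cps(int n)
{
    struct closure *done = mk_closure(root_thread, 1);
    struct closure *root = mk_closure(fib_thread, 0);
    root->args[0].k = (struct cont){ done, 0 };
    root->args[1].i = n;
    post(root);
    while (ready_head) {         /* toy scheduling loop */
        struct closure *c = ready_head;
        ready_head = c->next;
        c->fn(c);
    }
    return result;
}
```

Note that no thread ever blocks: each closure runs only once all of its arguments have arrived, exactly the discipline the text describes.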
Cilk supports a variety of features that give the programmer greater control over runtime performance. For example, when the last action of a thread is to spawn a ready thread, the programmer can use the keyword call instead of spawn, which produces a “tail call” that runs the new thread immediately without invoking the scheduler. Cilk also allows arrays and subarrays to be passed as arguments to closures. Other features include various abilities to override the scheduler’s decisions, including on which processor a thread should be placed and how to pack and unpack data when a closure is migrated from one processor to another.
Cilk’s scheduler uses the technique of work-stealing [3, 7, 13, 14, 15, 19, 27, 28, 29, 34, 40] in which a processor (the thief) who runs out of work selects another processor (the victim) from whom to steal work, and then steals the shallowest ready thread in the victim’s spawn tree. Cilk’s strategy is for thieves to choose victims at random [3, 27, 37].
At runtime, each processor maintains a local ready queue to hold
ready closures. Each closure has an associated level, which corresponds to the number of spawns (but not spawn_nexts) on the path from the root of the spawn tree. The ready queue is an array in which the $l$th element contains a linked list of all ready closures having level $l$.
Cilk begins executing the user program by initializing all ready queues to be empty, placing the root thread into the level-0 list of Processor 0’s queue, and then starting a scheduling loop on each processor. Within a scheduling loop, a processor first checks to see whether its ready queue is empty. If it is, the processor commences “work stealing,” which will be described shortly. Otherwise, the processor performs the following steps:
1. Remove the thread at the head of the list of the deepest nonempty level in the ready queue.
2. Extract the thread from the closure, and invoke it.
When a thread at level $l$ spawns a child thread $T$, the scheduler executes the following operations:
1. Allocate and initialize a closure for T.
2. Copy the available arguments into the closure, initialize any continuations to point to missing arguments, and initialize the join counter to the number of missing arguments.
3. Label the closure with level $l+1$.
4. If there are no missing arguments, post the closure to the ready queue by inserting it at the head of the level-$(l+1)$ list.
Execution of spawn_next is similar, except that the closure is labeled with level $l$ and, if it is ready, posted to the level-$l$ list.
A processor that executes send_argument(k, value) performs the following steps:
1. Find the closure and argument slot referenced by the continuation $k$.
2. Place value in the argument slot, and decrement the join counter of the closure.
3. If the join counter goes to zero, post the closure to the ready queue at the appropriate level.
When the continuation refers to a closure on a remote processor, network communication ensues. The processor that initiated the send_argument function sends a message to the remote processor to perform the operations. The only subtlety occurs in step 3. If the closure must be posted, it is posted to the ready queue of the initiating processor, rather than to that of the remote processor. This policy is necessary for the scheduler to be provably good, but as a practical matter, we have also had success with posting the closure to the remote processor’s queue, which can sometimes save a few percent in overhead.
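The three send_argument steps above can be sketched as a single C function over the illustrative closure layout used earlier (our own names; the real runtime also handles the remote case via active messages):

```c
#include <stddef.h>

struct closure {
    int join_counter;          /* missing arguments */
    int level;                 /* level in the spawn tree */
    long args[4];              /* argument slots */
    struct closure *next;      /* ready-list link */
};

struct cont { struct closure *clo; int slot; };

/* Per-level ready lists, as in the text's ready queue. */
#define MAX_LEVELS 64
static struct closure *ready[MAX_LEVELS];

static void post(struct closure *c)
{
    c->next = ready[c->level];
    ready[c->level] = c;       /* insert at the head of its level list */
}

/* send_argument(k, value): fill the slot, decrement the join counter,
   and post the closure at its level once it becomes ready. */
void send_argument(struct cont k, long value)
{
    k.clo->args[k.slot] = value;            /* step 1+2: place the value */
    if (--k.clo->join_counter == 0)         /* step 2: decrement counter */
        post(k.clo);                        /* step 3: now ready */
}
```

In the remote case, steps 1 and 2 run on the processor that owns the closure, but step 3 posts to the initiating processor's queue, as the text explains.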
If the scheduler attempts to remove a thread from an empty ready queue, the processor becomes a thief and commences work stealing as follows:
1. Select a victim processor uniformly at random.
2. If the victim’s ready queue is empty, go to step 1.
3. If the victim’s ready queue is nonempty, extract a thread from the tail of the list in the shallowest nonempty level of the ready queue, and invoke it.
Work stealing is implemented with a simple request-reply communication protocol between the thief and victim.
Why steal work from the shallowest level of the ready queue? The reason is two-fold. First, we would like to steal large amounts of work, and shallow closures are likely to execute for longer than deep ones. Stealing large amounts of work tends to lower the communication cost of the program, because fewer steals are necessary. Second, the closures at the shallowest level of the ready queue are also the ones that are shallowest in the dag, a key fact proven in Section 6. Consequently, if processors are idle, the work they steal tends to make progress along the critical path.
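The queue discipline just described can be sketched in C as follows (a single-address-space toy with our own names, ignoring the synchronization between thief and victim): local execution pops from the head of the deepest nonempty level, while a thief takes from the tail of the shallowest nonempty level.

```c
#include <stddef.h>

#define MAX_LEVELS 64

/* A ready closure, reduced to its level and a queue link. */
struct closure {
    int level;
    struct closure *next;
};

/* One processor's ready queue: element l holds a linked list of
   ready closures at level l (head insertion, as in the text). */
struct ready_queue {
    struct closure *level[MAX_LEVELS];
};

static void post(struct ready_queue *q, struct closure *c)
{
    c->next = q->level[c->level];
    q->level[c->level] = c;          /* insert at head of its level list */
}

/* Local work: head of the deepest nonempty level. */
static struct closure *pop_local(struct ready_queue *q)
{
    for (int l = MAX_LEVELS - 1; l >= 0; l--)
        if (q->level[l]) {
            struct closure *c = q->level[l];
            q->level[l] = c->next;
            return c;
        }
    return NULL;
}

/* Steal: tail of the shallowest nonempty level of the victim. */
static struct closure *steal(struct ready_queue *victim)
{
    for (int l = 0; l < MAX_LEVELS; l++) {
        struct closure **p = &victim->level[l];
        if (*p) {
            while ((*p)->next)       /* walk to the tail of the list */
                p = &(*p)->next;
            struct closure *c = *p;
            *p = NULL;
            return c;
        }
    }
    return NULL;
}
```

Taking from opposite ends of opposite extremes of the queue is what makes local execution depth-first (good locality, bounded space) while steals grab the shallow, long-running work.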
This section presents several applications that we have used to benchmark the Cilk scheduler. We also present empirical evidence from experiments run on a CM5 supercomputer to document the efficiency of our work-stealing scheduler. The CM5 is a massively parallel computer based on 32MHz SPARC processors with a fat-tree interconnection network [30].
The applications are described below:
fib is the same as was presented in Section 2, except that the second recursive spawn is replaced by a “tail call” that avoids the scheduler. This program is a good measure of Cilk overhead, because the thread length is so small.
queens is a backtrack search program that solves the problem of placing $n$ queens on an $n \times n$ chessboard so that no two queens attack each other. The Cilk program is based on serial code by R. Sargent of the MIT Media Laboratory. Thread length was enhanced by serializing the bottom 7 levels of the search tree.
pfold is a protein-folding program [35] written in conjunction with V. Pande of MIT’s Center for Material Sciences and Engineering. This program finds Hamiltonian paths in a three-dimensional grid using backtrack search. It was the first program to enumerate all Hamiltonian paths in a grid. We timed the enumeration of all paths starting with a certain sequence.
ray is a parallel program for graphics rendering based on the serial POV-Ray program, which uses a ray-tracing algorithm. The entire POV-Ray system contains over 20,000 lines of C code, but the core of POV-Ray is a simple doubly nested loop that iterates over each pixel in a two-dimensional image. For ray we converted the nested loops into a divide-and-conquer control structure using spawns. Our measurements do not include the approximately 2.4 seconds of startup time required to read and process the scene description file.

knary(k,n,r) is a synthetic benchmark whose parameters can be set to produce a variety of values for work and critical path. It generates a tree of branching factor $k$ and depth $n$ in which the first $r$ children at every level are executed serially and the remainder are executed in parallel. At each node of the tree, the program runs an empty “for” loop for 400 iterations.

Initially, the serial POV-Ray program was about 5 percent slower than the Cilk version running on one processor. The reason was that the divide-and-conquer decomposition performed by the Cilk code provides better locality than the doubly nested loop of the serial code. Modifying the serial code to imitate the Cilk decomposition improved its performance. Timings for the improved version are given in the table.
Table 4 shows typical performance measures for these Cilk applications. Each column presents data from a single run of a benchmark application. We adopt the following notations, which are used in the table. For each application, we have an efficient serial C implementation, compiled using gcc -O2, whose measured runtime is denoted $T_{\text{serial}}$. The work $T_1$ is the measured execution time for the Cilk program running on a single node of the CM5. The critical path length $T_\infty$ of the Cilk computation is measured by timestamping each thread and does not include scheduling or communication costs. The measured $P$-processor execution time of the Cilk program running on the CM5 is given by $T_P$, which includes all scheduling and communication costs. The row labeled “threads” indicates the number of threads executed, and “thread length” is the average thread length (work divided by the number of threads).
Certain derived parameters are also displayed in the table. The ratio $T_{\text{serial}}/T_1$ is the efficiency of the Cilk program relative to the C program. The ratio $T_1/T_\infty$ is the average parallelism. The value $T_1/P + T_\infty$ is a simple model of the runtime, which will be discussed in the next section. The speedup is $T_1/T_P$, and the parallel efficiency is $T_1/(P \cdot T_P)$. The row labeled “space/proc.” indicates the maximum number of closures allocated at any time on any processor. The row labeled “requests/proc.” indicates the average number of steal requests made by a processor during the execution, and “steals/proc.” gives the average number of closures actually stolen.
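These derived quantities are simple functions of the measured ones; for instance (a sketch with our own function names, using the fib column of Table 4 as example inputs):

```c
/* Derived measures from Table 4, as C helpers (names are ours).
   Example values refer to the fib column: T_serial = 8.487 s,
   T_1 = 73.16 s, T_inf = 0.000326 s, T_32 = 2.298 s. */

double efficiency(double t_serial, double t1)      /* ~0.116 for fib */
{
    return t_serial / t1;
}

double avg_parallelism(double t1, double t_inf)    /* ~224417 for fib */
{
    return t1 / t_inf;
}

double model_time(double t1, double t_inf, int p)  /* T_1/P + T_inf, ~2.287 */
{
    return t1 / p + t_inf;
}

double speedup(double t1, double tp)               /* 73.16/2.298 ~ 31.84 */
{
    return t1 / tp;
}

double parallel_efficiency(double t1, double tp, int p) /* ~0.995 */
{
    return t1 / (p * tp);
}
```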
The data in Table 4 shows two important relationships: one between efficiency and thread length, and another between speedup and average parallelism.
Considering the relationship between efficiency $T_{\text{serial}}/T_1$ and thread length, we see that for programs with moderately long threads, the Cilk scheduler induces very little overhead. The queens, pfold, ray, and knary programs have threads with average length greater than 50 microseconds and have efficiency greater than 90 percent. On the other hand, the fib program has low efficiency, because the threads are so short: fib does almost nothing besides spawn and send_argument.
Looking at the speedup measured on 32 and 256 processors, we see that when the average parallelism is large compared with the number of processors, Cilk programs achieve nearly perfect linear speedup, but when the average parallelism is small, the speedup is much less. The fib, queens, pfold, and ray programs,

For the Socrates program, $T_1$ is not the measured execution time, but rather it is an estimate of the work obtained by summing the execution times of all threads, which yields a slight underestimate. Socrates is an unusually complicated application, because its speculative execution yields unpredictable work and critical path. Consequently, the measured runtime on one processor does not accurately reflect the work of a $P$-processor execution.
                 fib        queens   pfold      ray        knary      knary     Socrates    Socrates
                 (33)       (15)     (3,3,4)    (500,500)  (10,5,2)   (10,4,1)  (depth 10)  (depth 10)
                                                                                (32 proc.)  (256 proc.)
(application parameters)
T_serial         8.487      252.1    615.15     729.2      288.6      40.993    1665        1665
T_1              73.16      254.6    647.8      732.5      314.6      45.43     3644        7023
T_serial/T_1     0.116      0.9902   0.9496     0.9955     0.9174     0.9023    0.4569      0.2371
T_inf            0.000326   0.0345   0.04354    0.0415     4.458      0.255     3.134       3.24
T_1/T_inf        224417     7380     14879      17650      70.56      178.2     1163        2168
threads          17,108,660 210,740  9,515,098  424,475    5,859,374  873,812   26,151,774  51,685,823
thread length    4.276 µs   1208 µs  68.08 µs   1726 µs    53.69 µs   51.99 µs  139.3 µs    135.9 µs

(32-processor experiments)
T_32             2.298      8.012    20.26      21.68      15.13      1.633     126.1       –
T_1/32 + T_inf   2.287      7.991    20.29      22.93      14.28      1.675     117.0       –
T_1/T_32         31.84      31.78    31.97      33.79      20.78      27.81     28.90       –
T_1/(32·T_32)    0.9951     0.9930   0.9992     1.0558     0.6495     0.8692    0.9030      –
space/proc.      70         95       47         39         41         42        386         –
requests/proc.   185.8      48.0     88.6       218.1      92639      3127      23484       –
steals/proc.     56.63      18.47    26.06      79.25      18031      1034      2395        –

(256-processor experiments)
T_256            0.2892     1.045    2.590      2.765      8.590      0.4636    –           34.32
T_1/256 + T_inf  0.2861     1.029    2.574      2.903      5.687      0.4325    –           30.67
T_1/T_256        253.0      243.7    250.1      265.0      36.62      98.00     –           204.6
T_1/(256·T_256)  0.9882     0.9519   0.9771     1.035      0.1431     0.3828    –           0.7993
space/proc.      66         76       47         32         48         40        –           405
requests/proc.   73.66      80.40    97.79      82.75      151803     7527      –           30646
steals/proc.     24.10      21.20    23.05      18.34      6378       550       –           1540

Table 4: Performance of Cilk on various applications. All times are in seconds, except where noted.
for example, have in excess of 7000-fold parallelism and achieve more than 99 percent of perfect linear speedup on 32 processors and more than 95 percent of perfect linear speedup on 256 processors. The Socrates program exhibits somewhat less parallelism and also somewhat less speedup. On 32 processors the Socrates program has 1163-fold parallelism, yielding 90 percent of perfect linear speedup, while on 256 processors it has 2168-fold parallelism yielding 80 percent of perfect linear speedup. With even less parallelism, as exhibited in the knary benchmarks, less speedup is obtained. For example, the knary(10,5,2) benchmark exhibits only 70-fold parallelism, and it realizes barely more than 20-fold speedup on 32 processors (less than 65 percent of perfect linear speedup). With 178-fold parallelism, knary(10,4,1) achieves 27-fold speedup on 32 processors (87 percent of perfect linear speedup), but only 98-fold speedup on 256 processors (38 percent of perfect linear speedup).
Although these speedup measures reflect the Cilk scheduler’s ability to exploit parallelism, to obtain application speedup, we must factor in the efficiency of the Cilk program compared with the serial C program. Specifically, the application speedup $T_{\text{serial}}/T_P$ is the product of efficiency $T_{\text{serial}}/T_1$ and speedup $T_1/T_P$. For example, applications such as fib and Socrates with low efficiency generate correspondingly low application speedup. The Socrates program, with efficiency 0.2371 and speedup 204.6 on 256 processors, exhibits application speedup of about 48.5. For the purpose of performance prediction, we prefer to decouple the efficiency of the application from the efficiency of the scheduler.
Looking more carefully at the cost of a spawn in Cilk, we find that it takes a fixed overhead of about 50 cycles to allocate and initialize a closure, plus about 8 cycles for each word argument. In comparison, a C function call on a CM5 processor takes 2 cycles of fixed overhead

In fact, the ray program achieves superlinear speedup even when compared with the efficient serial implementation. We suspect that cache effects cause this phenomenon.
(assuming no register window overflow) plus 1 cycle for each word argument (assuming all arguments are transferred in registers). Thus, a spawn in Cilk is roughly an order of magnitude more expensive than a C function call. This Cilk overhead is quite apparent in the fib program, which does almost nothing besides spawn and send_argument. Based on fib’s measured efficiency of 0.116, we can conclude that the aggregate average cost of a spawn/send_argument in Cilk is between 8 and 9 times the cost of a function call/return in C.
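These cycle counts give a simple cost model (the constants are the ones quoted above for the CM5; the linear form and function names are our sketch):

```c
/* Back-of-the-envelope CM5 cost models, in cycles. */

int spawn_cycles(int words)  /* closure allocation + initialization */
{
    return 50 + 8 * words;
}

int c_call_cycles(int words) /* C call: no register-window overflow,
                                all arguments passed in registers */
{
    return 2 + 1 * words;
}
```

For a two-word argument list this gives 66 versus 4 cycles, consistent with the observation that a spawn costs roughly an order of magnitude more than a C function call.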
Efficient execution of programs with short threads requires a low-overhead spawn operation. As can be observed from Table 4, the vast majority of threads execute on the same processor on which they are spawned. For example, the fib program executed over 17 million threads but migrated only 6170 (24.10 per processor) when run with 256 processors. Taking advantage of this property, other researchers [25, 32] have developed techniques for implementing spawns such that when the child thread executes on the same processor as its parent, the cost of the spawn operation is roughly equal to the cost of a C function call. We hope to incorporate such techniques into future implementations of Cilk.
Finally, we make two observations about the space and communication measures in Table 4.
Looking at the “space/proc.” rows, we observe that the space per processor is generally quite small and does not grow with the number of processors. For example, Socrates on 32 processors executes over 26 million threads, yet no processor ever has more than 386 allocated closures. On 256 processors, the number of executed threads nearly doubles to over 51 million, but the space per processor barely changes. In Section 6 we show formally that for Cilk programs, the space per processor does not grow as we add processors.
Looking at the “requests/proc.” and “steals/proc.” rows in Table 4, we observe that the amount of communication grows with the critical path but does not grow with the work. For example, fib, queens, pfold, and ray all have critical paths under a tenth of a second long and perform fewer than 220 requests and 80 steals per processor, whereas knary(10,5,2) and Socrates have critical paths more than 3 seconds long and perform more than 20,000 requests and 1500 steals per processor. The table does not show any clear correlation between work and either requests or steals. For example, ray does more than twice as much work as knary(10,5,2), yet it performs two orders of magnitude fewer requests. In Section 6, we show that for “fully strict” Cilk programs, the communication per processor grows linearly with the critical path length and does not grow as a function of the work.
In this section, we further document the effectiveness of the Cilk scheduler by showing empirically that it schedules applications in a near-optimal fashion. Specifically, we use the knary synthetic benchmark to show that the runtime of an application on $P$ processors can be accurately modeled as $T_P \approx c_1(T_1/P) + c_\infty T_\infty$, where $c_1$ and $c_\infty$ are constants close to 1.
A good scheduler should run an application with work $T_1$ in time $T_1/P$ on $P$ processors. Such perfect linear speedup cannot be obtained whenever $T_\infty > T_1/P$, since we always have $T_P \geq T_\infty$, or more generally, $T_P \geq \max(T_1/P, T_\infty)$. The critical path $T_\infty$ is the stronger lower bound on $T_P$ whenever $P$ exceeds the average parallelism $T_1/T_\infty$, and $T_1/P$ is the stronger bound otherwise. A good scheduler should meet each of these bounds as closely as possible.
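The combined bound $T_P \geq \max(T_1/P, T_\infty)$ and its crossover point can be written as small helpers (our own code, for illustration only):

```c
/* Lower bounds on P-processor execution time:
   T_P >= T_1/P (work) and T_P >= T_inf (critical path), i.e.
   T_P >= max(T_1/P, T_inf). */
double lower_bound(double t1, double t_inf, int p)
{
    double work_bound = t1 / p;
    return work_bound > t_inf ? work_bound : t_inf;
}

/* The critical path is the stronger bound exactly when P exceeds
   the average parallelism T_1/T_inf. */
int critical_path_dominates(double t1, double t_inf, int p)
{
    return p > t1 / t_inf;
}
```

For example, with $T_1 = 100$ s and $T_\infty = 1$ s (average parallelism 100), the work bound dominates on 10 processors but the critical path dominates on 1000.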
In order to investigate how well the Cilk scheduler meets these two lower bounds, we used our knary benchmark (described in Section 4), which can exhibit a range of values for work and critical path.
Figure 5 shows the outcome of many experiments of running knary with various values for $k$, $n$, $r$, and $P$. The figure plots the speedup $T_1/T_P$ for each run against the machine size $P$ for that run. In order to compare the outcomes for runs with different parameters, we have normalized the data by dividing the plotted values by the average parallelism $T_1/T_\infty$. Thus, the horizontal position of each datum is $P/(T_1/T_\infty)$, and the vertical position of each datum is $(T_1/T_P)/(T_1/T_\infty) = T_\infty/T_P$. Consequently, on the horizontal axis, the normalized machine size is 1.0 when the average available parallelism is equal to the machine size. On the vertical axis, the normalized speedup is 1.0 when the runtime equals the critical path, and it is 0.1 when the runtime is 10 times the critical path. We can draw the two lower bounds on time as upper bounds on speedup. The horizontal line at 1.0 is the upper bound on speedup obtained from the critical path, and the 45-degree line is the upper bound on speedup obtained from the work per processor. As can be seen from the figure, on the knary runs for which the average parallelism exceeds the number of processors (normalized machine size less than 1), the Cilk scheduler obtains nearly perfect linear speedup. In the region where the number of processors is large compared to the average parallelism (normalized machine size greater than 1), the data is more scattered, but the speedup is always within a factor of 4 of the critical-path upper bound.
The theoretical results from Section 6 show that the expected running time of an application on $P$ processors is $O(T_1/P + T_\infty)$. Thus, it makes sense to try to fit the data to a curve of the form $T_P = c_1(T_1/P) + c_\infty T_\infty$. A least-squares fit to the data to minimize the relative error yields estimates for $c_1$ and $c_\infty$. The curve fit is shown in Figure 5, which also plots the simpler curves $T_P = T_1/P + T_\infty$ and $T_P = T_1/P$ for comparison. As can be seen from the figure, little is lost in the linear speedup range of the curve by assuming that $c_1 = 1$. Indeed, a fit to $T_P = T_1/P + c_\infty T_\infty$ is in some ways better than the fit that includes a $c_1$ term: its correlation measure is a little worse, but its mean relative error is much better.
It makes sense that the data points become more scattered when $P$ is close to or exceeds the average parallelism. In this range, the amount of time spent in work stealing becomes a significant fraction of the overall execution time. The real measure of the quality of a scheduler is how much larger the average parallelism $T_1/T_\infty$ must be than $P$ before $T_P$ shows substantial influence from the critical path. One can see from Figure 5 that if the average parallelism exceeds $P$ by a sufficient factor, the critical path has almost no impact on the running time.
To confirm our simple model of the Cilk scheduler’s performance on a real application, we ran Socrates on a variety of chess positions. Figure 6 shows the results of our study, which confirm the results from the knary synthetic benchmarks. The curve shown is the best fit to $T_P = c_1(T_1/P) + c_\infty T_\infty$.
Indeed, as some of us were developing and tuning heuristics to increase the performance of Socrates, we used work and critical path as our measures of progress. This methodology let us avoid being trapped by the following interesting anomaly. We made an “improvement” that sped up the program on 32 processors. From our measurements, however, we discovered that it was faster only because it saved on work at the expense of a much longer critical path.
Using the simple model $T_P = T_1/P + T_\infty$, we concluded that on a 512-processor machine, which was our platform for tournaments, the “improvement” would yield a loss of performance, a fact that we later verified. Measuring work and critical path enabled us to use experiments on a 32-processor machine to improve our program for the 512-processor machine, but without using the 512-processor machine, on which computer time was scarce.
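The arithmetic behind this conclusion can be replayed with the model; the numbers below are hypothetical stand-ins of our own choosing, not Socrates’s actual measurements.

```c
/* The model T_P = T_1/P + T_inf from Section 5, used here to replay the
   tuning anomaly with illustrative (hypothetical) numbers. */
double model_time(double t1, double t_inf, int p)
{
    return t1 / p + t_inf;
}

/* Suppose an "improvement" cuts the work from 3600 s to 3200 s but
   doubles the critical path from 3 s to 6 s (illustrative values):
     on  32 processors: 115.5 s -> 106.0 s   (looks like a win)
     on 512 processors:  10.0 s ->  12.25 s  (actually a loss)  */
```

With ample parallelism relative to 32 processors, the work term dominates and the change looks good; at 512 processors the critical-path term dominates and the same change loses.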
In this section we use algorithmic analysis techniques to prove that for the class of “fully strict” Cilk programs, Cilk’s work-stealing scheduling algorithm is efficient with respect to space, time, and communication. A fully strict program is one for which each thread sends arguments only to its parent’s successor threads. For this class of programs, we prove the following three bounds on space, time, and communication:
Space The space used by a $P$-processor execution is bounded by $S_P \leq S_1 P$, where $S_1$ denotes the space used by the serial execution of the Cilk program. This bound is existentially optimal to within a constant factor [3].
Time With $P$ processors, the expected execution time, including scheduling overhead, is bounded by $T_P = O(T_1/P + T_\infty)$.
Since both $T_1/P$ and $T_\infty$ are lower bounds for any $P$-processor execution, our expected time bound is within a constant factor of optimal.
Communication The expected number of bytes communicated during a $P$-processor execution is $O(T_\infty P S_{\max})$, where $S_{\max}$ denotes the largest size of any closure. This bound is existentially optimal to within a constant factor [41].
The expected time bound and the expected communication bound can be converted into high-probability bounds at the cost of only a small additive term in both cases. Proofs of these bounds use generalizations of the techniques developed in [3]. We defer complete proofs and give outlines here.

Figure: Normalized speedups for the knary synthetic benchmark using from 1 to 256 processors. The horizontal axis is the machine size $P$ and the vertical axis is the speedup $T_1/T_P$, but each data point has been normalized by dividing both quantities by $T_1/T_\infty$.

Figure: Normalized speedups for the Socrates chess program.

Figure 7: The closures at some time during a $P$-processor execution. Data-dependency edges are not shown. The black nodes represent ready closures, the gray nodes represent waiting closures, and white nodes represent closures that have already been executed. The black and gray closures are allocated and consume space, but the white closures have been deallocated. Gray, curved edges represent the additional edges of the augmented dag $D'$ that do not also belong to the original dag $D$.
The space bound follows from the “busy-leaves” property, which characterizes the allocated closures at all times during the execution. At any given time during the execution, we say that a closure is a leaf if it has no allocated child closures, and we say that a leaf closure is a primary leaf if, in addition, it has no left-sibling closures allocated. Figure 7, which shows the allocated closures at some time during an execution, contains exactly one primary leaf. Another closure in the figure is a leaf but not primary, since it has left siblings, and yet another is not a leaf at all, because it has allocated children. The busy-leaves property states that every primary-leaf closure has a processor working on it. To prove the space bound, we show that Cilk’s scheduler maintains the busy-leaves property, and then we show that the busy-leaves property implies the space bound.
For any fully strict Cilk program, if $S_1$ is the space used to execute the program on 1 processor, then with any number $P$ of processors, Cilk’s work-stealing scheduler uses at most $S_1 P$ space.
Proof: We first show by induction on execution time that Cilk’s work-stealing scheduler maintains the busy-leaves property. We then show that the busy-leaves property implies the space bound.
To see that Cilk’s scheduler maintains the busy-leaves property, we consider the three possible ways that a primary-leaf closure can be created. First, when a thread spawns children, the leftmost of these children is a primary leaf. Second, when a thread completes and its closure is freed, if that closure has a right sibling and that sibling has no children, then the right-sibling closure becomes a primary leaf. And third, when a thread completes and its closure is freed, if that closure has no allocated siblings, then the leftmost closure of its parent’s successor threads is a primary leaf. The induction follows by observing that in all three of these cases, Cilk’s scheduler guarantees that a processor works on the new primary leaf. In the third case we use the fact that a newly activated closure is posted on the processor that activated it and not on the processor on which it was residing.
The space bound is obtained by showing that every allocated closure can be associated with a primary leaf and that the total space of all closures assigned to a given primary leaf is at most $S_1$. Since Cilk’s scheduler keeps all primary leaves busy, with $P$ processors we are guaranteed that at every time during the execution, at most $P$ primary-leaf closures can be allocated, and hence the total amount of space is at most $S_1 P$.
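The definitions of leaf and primary leaf can be made concrete with a toy model of the closure tree. The sketch below is not Cilk code: the `Closure` class and the tree shape are invented for illustration, and `primary_leaves` implements the text's literal definition (a closure with no allocated children and no allocated left siblings).

```python
class Closure:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)  # allocated children, left to right

def primary_leaves(root):
    """Return names of primary leaves: allocated closures with no
    allocated children (leaves) and no allocated left siblings."""
    found = []
    def walk(node, is_leftmost):
        if not node.children:       # a leaf
            if is_leftmost:         # primary: no allocated left siblings
                found.append(node.name)
            return
        for i, child in enumerate(node.children):
            walk(child, i == 0)
    walk(root, True)
    return found

# Invented tree: R has children A, B, C; A has leaf children A1, A2.
tree = Closure("R", [Closure("A", [Closure("A1"), Closure("A2")]),
                     Closure("B"), Closure("C")])
print(primary_leaves(tree))  # A1 is the only primary leaf
```

Here B is a leaf but not primary, because its left sibling A is still allocated; with $P$ processors the scheduler keeps every primary leaf busy, so at most $P$ such closures exist at any time.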
We now give the theorems bounding execution time and communication cost. Proofs for these theorems generalize the results of [3] for a more restricted model of multithreaded computation. As in [3], these proofs assume a communication model in which messages are delayed only by contention at destination processors, but no assumptions are made about the order in which contending messages are delivered [31]. The bounds given by these theorems assume that no thread has more than one successor thread.
The proofs of these theorems are analogous to the proofs of Theorems 12 and 13 in [3]. We show that certain “critical” threads are likely to be executed after only a modest number of steal requests, and that executing a critical thread guarantees progress on the critical path of the dag.
We first construct an augmented dag $D'$ that will be used to define the critical threads. The dag $D'$ is constructed by adding edges to the original dag $D$ of the computation. For each child procedure $F$ of a thread $t$, we add an edge to $D'$ from the first thread of $F$ to the first thread of the next child procedure spawned by $t$ after $F$ is spawned. We make the technical assumption that the first thread of each procedure executes in zero time, since we can add a zero-time thread to the beginning of each procedure without affecting work or depth. An example of the dag $D'$ is given in Figure 7, where the additional edges are shown gray and curved. We draw the children spawned by a node in right-to-left order in the figure, because the execution order by the local processor is left to right, corresponding to LIFO execution. The dag $D'$ is constructed for analytic purposes only and has no effect on the scheduling of the threads. An important property of $D'$ is that its critical path is the same as the critical path of the original dag $D$.
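The edge-addition step of this construction can be sketched as follows. This is a model for illustration only, not part of the runtime: `spawns` maps each thread to the first threads of the procedures it spawns, in spawn order, and the thread name `"t"` and first-thread names `f1`–`f3` are invented for the example.

```python
def augmenting_edges(spawns):
    """Given a map from each thread to the first threads of the child
    procedures it spawns (in spawn order), return the extra edges of the
    augmented dag: from the first thread of each child procedure to the
    first thread of the next child procedure spawned by the same thread."""
    edges = []
    for children in spawns.values():
        edges.extend(zip(children, children[1:]))
    return edges

# A single thread "t" spawning three procedures with first threads f1..f3.
print(augmenting_edges({"t": ["f1", "f2", "f3"]}))
```

Because each added edge leaves a first thread, and first threads are assumed to take zero time, these edges chain the spawns together without lengthening the critical path.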
During the execution of any fully strict Cilk program for which no thread has more than one successor thread, any critical thread must be the shallowest thread in its processor’s ready queue. Moreover, a critical thread is also first in the steal order.
Proof: For a thread $t$ to be critical, the following conditions must hold for the ready queue of the processor on which $t$ is enqueued:
1. No right siblings of $t$ are in the ready queue. If a right-sibling procedure $F$ of $t$ were in the ready queue, then the first thread of $F$ would not have been executed, and because the first thread of $F$ is a predecessor of $t$ in $D'$, $t$ would not be critical.
2. No right siblings of any of $t$’s ancestors are in the ready queue. This fact follows from the same reasoning as above.
3. No left siblings of any of $t$’s ancestors are in the ready queue. This condition must hold because all of these siblings occur before $t$’s parent in the local execution order, and $t$’s parent must have been executed for $t$ to be critical.
4. No successor threads of $t$’s ancestors are enabled. This condition must be true, because any successor thread must wait for all children to complete before it is enabled. Since $t$ has not completed, no successor threads of $t$’s ancestors are enabled. This condition makes use of the fact that the computation is fully strict, which implies that the only thread to which $t$ can send its result is $t$’s parent’s unique successor.
A consequence of these conditions is that no thread could possibly be above $t$ in the ready queue, because all threads above $t$ have either already been executed, been stolen, or not yet been enabled. At $t$’s level, $t$ is first in the work-stealing order, because it is the rightmost thread at that level.
For any number $P$ of processors and any fully strict Cilk program in which each thread has at most one successor, if the program has work $T_1$ and critical path length $T_\infty$, then Cilk’s work-stealing scheduler executes the program in expected time $O(T_1/P + T_\infty)$. Furthermore, for any $\epsilon > 0$, the execution time is $O(T_1/P + T_\infty + \lg P + \lg(1/\epsilon))$ with probability at least $1 - \epsilon$.
Proof: This proof is just a straightforward application of the techniques in [3], using our Lemma 2 as a substitute for Lemma 9 in [3]. Because the critical threads are first in the work-stealing order, they are likely to be stolen (or executed locally) after a modest number of steal requests. This fact can be shown formally using a delay sequence argument.
For any number $P$ of processors and any fully strict Cilk program in which each thread has at most one successor, if the program has critical path length $T_\infty$ and maximum closure size $S_{\max}$, then Cilk’s work-stealing scheduler incurs expected communication $O(T_\infty P S_{\max})$. Furthermore, for any $\epsilon > 0$, the communication cost is $O((T_\infty + \lg(1/\epsilon)) P S_{\max})$ with probability at least $1 - \epsilon$.
Proof: This proof is exactly analogous to the proof of Theorem 13 in [3]. We observe that, in expectation, $O(T_\infty P)$ steal attempts occur in an execution, and all communication costs can be associated with one of these steal requests such that at most $O(S_{\max})$ communication is associated with each steal request. The high-probability bound is analogous.
To produce high-performance parallel applications, programmers often focus on communication costs and execution time, quantities that are dependent on specific machine configurations. We argue that a programmer should think instead about work and critical path, abstractions that can be used to characterize the performance of an algorithm independent of the machine configuration. Cilk provides a programming model in which work and critical path are observable quantities, and it delivers guaranteed performance as a function of these quantities. Work and critical path have been used in the theory community for years to analyze parallel algorithms [26]. Blelloch [2] has developed a performance model for data-parallel computations based on these same two abstract measures. He cites many advantages to such a model over machine-based models. Cilk provides a similar performance model for the domain of asynchronous, multithreaded computation.
Although Cilk offers performance guarantees, its current capabilities are limited, and programmers find its explicit continuation-passing style to be onerous. Cilk is good at expressing and executing dynamic, asynchronous, tree-like, MIMD computations, but it is not yet ideal for more traditional parallel applications that can be programmed effectively in, for example, a message-passing, data-parallel, or single-threaded, shared-memory style. We are currently working on extending Cilk’s capabilities to broaden its applicability. A major constraint is that we do not want new features to destroy Cilk’s guarantees of performance. Our current research focuses on implementing “dag-consistent” shared memory, which allows programs to operate on shared memory without costly communication or hardware support; on providing a linguistic interface that produces continuation-passing code for our runtime system from a more traditional call-return specification of spawns; and on incorporating persistent threads and less strict semantics in ways that do not destroy the guaranteed performance of our scheduler. Recent information about Cilk is maintained on the World Wide Web at http://theory.lcs.mit.edu/˜cilk.
[6] Eric A. Brewer and Robert Blumofe. Strata: A multi-layer communications library. Technical report (to appear), MIT Laboratory for Computer Science. Available as ftp://ftp.lcs.mit.edu/pub/supertech/strata/strata.tar.Z.
[7] F. Warren Burton and M. Ronan Sleep. Executing functional programs on a virtual tree of processors. In Proceedings of the 1981 Conference on Functional Programming Languages
[26] Richard M. Karp and Vijaya Ramachandran. Parallel algorithms for shared-memory machines. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science—Volume A: Algorithms and Complexity, chapter 17, pages 869–941. MIT Press, Cambridge, Massachusetts, 1990.
[35] Vijay S. Pande, Christopher F. Joerg, Alexander Yu Grosberg, and Toyoichi Tanaka. Enumerations of the Hamiltonian walks on a cubic sublattice. Journal of Physics A, 27, 1994.
