Chapter 4 Nested Parallelism
This chapter discusses the features of OpenMP nested parallelism.
4.1 The Execution Model
OpenMP uses a fork-join model of parallel execution. When a thread encounters
a parallel construct, the thread creates a team composed of itself and zero
or more additional threads. The encountering thread becomes
the master of the new team. The other threads of the team are called slave
threads of the team. All team members execute the code inside the parallel
construct. When a thread finishes its work within the parallel construct,
it waits at the implicit barrier at the end of the parallel construct. When
all team members have arrived at the barrier, the threads can leave the barrier.
The master thread continues execution of user code beyond the end of the parallel
construct, while the slave threads wait to be summoned to join other teams.
OpenMP parallel regions can be nested inside each other. If nested parallelism
is disabled, then the new team created by a thread encountering a parallel
construct inside a parallel region consists only of the encountering thread.
If nested parallelism is enabled, then the new team may consist of more than
one thread.
The OpenMP runtime library maintains a pool of threads that can be used
as slave threads in parallel regions. When a thread encounters a parallel
construct and needs to create a team of more than one thread, the thread will
check the pool and grab idle threads from the pool, making them slave threads
of the team. The master thread might get fewer slave threads than it needs
if there is not a sufficient number of idle threads in the pool. When the
team finishes executing the parallel region, the slave threads return to the
pool.
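The pool behavior described above can be sketched as a toy model. This is only an illustration under stated assumptions: the names thread_pool, pool_form_team, and pool_release_team are hypothetical and are not part of the OpenMP API or of the runtime library's actual implementation.

```c
#include <assert.h>

/* Toy model of the thread pool; hypothetical, not the runtime's real code. */
typedef struct {
    int idle;   /* idle threads currently available in the pool */
} thread_pool;

/* Form a team for a parallel construct that requests `requested` threads.
 * The encountering thread is always the master, so only requested - 1
 * threads are taken from the pool. If the pool holds fewer idle threads
 * than that, the team is smaller than requested.
 * Returns the resulting team size. */
int pool_form_team(thread_pool *pool, int requested)
{
    assert(requested >= 1);
    int wanted = requested - 1;
    int granted = wanted < pool->idle ? wanted : pool->idle;
    pool->idle -= granted;
    return 1 + granted;
}

/* When the team finishes the parallel region, every member except the
 * master returns to the pool. */
void pool_release_team(thread_pool *pool, int team_size)
{
    assert(team_size >= 1);
    pool->idle += team_size - 1;
}
```

In this model, a pool with two idle threads gives a team of 2 to the first request for 2 threads, and a later request for 4 threads gets a team of only 2 because just one idle thread remains.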
4.2 Control of Nested Parallelism
Nested parallelism can be controlled at runtime by setting various environment
variables prior to execution of the program.
4.2.1 OMP_NESTED
Nested parallelism can be enabled or disabled by setting the OMP_NESTED environment variable or calling omp_set_nested().
The following example has three levels of nested parallel constructs.
Example 4–1 Nested Parallelism Example
#include <omp.h>
#include <stdio.h>

void report_num_threads(int level)
{
    #pragma omp single
    {
        printf("Level %d: number of threads in the team - %d\n",
               level, omp_get_num_threads());
    }
}

int main()
{
    omp_set_dynamic(0);
    #pragma omp parallel num_threads(2)
    {
        report_num_threads(1);
        #pragma omp parallel num_threads(2)
        {
            report_num_threads(2);
            #pragma omp parallel num_threads(2)
            {
                report_num_threads(3);
            }
        }
    }
    return(0);
}
Compiling and running this program with nested parallelism enabled produces
the following (sorted) output:
% setenv OMP_NESTED TRUE
% a.out
Level 1: number of threads in the team - 2
Level 2: number of threads in the team - 2
Level 2: number of threads in the team - 2
Level 3: number of threads in the team - 2
Level 3: number of threads in the team - 2
Level 3: number of threads in the team - 2
Level 3: number of threads in the team - 2
Compare with running the same program but with nested parallelism disabled:
% setenv OMP_NESTED FALSE
% a.out
Level 1: number of threads in the team - 2
Level 2: number of threads in the team - 1
Level 3: number of threads in the team - 1
Level 2: number of threads in the team - 1
Level 3: number of threads in the team - 1
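Nested parallelism can also be switched from inside the program with omp_set_nested(), the routine named above. The following is a minimal sketch, not a definitive implementation: the helper name inner_team_size is hypothetical, and the serial fallback definitions are an assumption added only so that the code still builds with a compiler that has OpenMP support turned off. Newer OpenMP versions deprecate omp_set_nested() in favor of omp_set_max_active_levels(), so a compiler may warn about the call.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks (assumption): used only when OpenMP is not enabled. */
static void omp_set_nested(int flag)  { (void)flag; }
static void omp_set_dynamic(int flag) { (void)flag; }
static int  omp_get_num_threads(void) { return 1; }
#endif

/* Runs a 2-thread outer region whose threads each start a 2-thread inner
 * region, and returns the largest inner team size that was observed. */
int inner_team_size(int enable_nested)
{
    int observed = 0;

    omp_set_dynamic(0);             /* do not let the runtime shrink teams */
    omp_set_nested(enable_nested);  /* same effect as setting OMP_NESTED   */

    #pragma omp parallel num_threads(2)
    {
        #pragma omp parallel num_threads(2)
        {
            #pragma omp single
            {
                int n = omp_get_num_threads();
                /* Both inner teams update the same variable. */
                #pragma omp critical
                if (n > observed)
                    observed = n;
            }
        }
    }
    assert(observed >= 1);  /* a team always contains its master */
    return observed;
}
```

With nesting disabled the function returns 1, matching the second output above. With nesting enabled it normally returns 2, although the runtime may still grant fewer threads than requested.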
4.2.2 SUNW_MP_MAX_POOL_THREADS
The OpenMP runtime library maintains a pool of threads that can be used
as slave threads in parallel regions. Setting the SUNW_MP_MAX_POOL_THREADS environment variable controls the maximum number of threads in the pool.
The default value is 1023.
The thread pool consists of only non-user threads that the runtime library
creates. It does not include the initial thread or any thread created explicitly
by the user’s program. If this environment variable is set to zero,
the thread pool will be empty and all parallel regions will be executed by
one thread.
The following example shows that a parallel region can get fewer threads
than it requests if the pool does not contain enough threads. The code is the
same as in Example 4–1. For all the parallel regions to be active at the same
time, 8 threads are needed. Because the initial thread does not come from the
pool, the pool needs to contain at least 7 threads. If we set SUNW_MP_MAX_POOL_THREADS to 5, two of the four inner-most parallel
regions may not be able to get all the slave threads they ask for. One possible
result is shown below.
% setenv OMP_NESTED TRUE
% setenv SUNW_MP_MAX_POOL_THREADS 5
% a.out
Level 1: number of threads in the team - 2
Level 2: number of threads in the team - 2
Level 2: number of threads in the team - 2
Level 3: number of threads in the team - 2
Level 3: number of threads in the team - 2
Level 3: number of threads in the team - 1
Level 3: number of threads in the team - 1
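The arithmetic behind the figures 8 and 7 can be written out as a short sketch. The function names threads_needed and min_pool_threads are hypothetical helpers introduced for illustration, and the calculation assumes that every thread at each level encounters the nested parallel construct, as in Example 4–1.

```c
#include <assert.h>

/* Threads needed for every nested region to be active at the same time:
 * the product of the team sizes requested at each nesting level. */
long threads_needed(const int *team_sizes, int levels)
{
    assert(levels >= 0);
    long total = 1;
    for (int i = 0; i < levels; i++) {
        assert(team_sizes[i] >= 1);
        total *= team_sizes[i];
    }
    return total;
}

/* The initial thread is not drawn from the pool, so the pool needs
 * one thread fewer than the total. */
long min_pool_threads(const int *team_sizes, int levels)
{
    return threads_needed(team_sizes, levels) - 1;
}
```

For three levels of num_threads(2), threads_needed returns 8 and min_pool_threads returns 7. A pool limited to 5 threads allows at most 6 threads in total, which is consistent with the output shown above, where two of the inner-most teams have only one thread.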
4.2.3 SUNW_MP_MAX_NESTED_LEVELS
The environment variable SUNW_MP_MAX_NESTED_LEVELS controls
the maximum depth of nested active parallel regions that require more than
one thread.
Any active parallel region that has an active nested depth greater than
the value of this environment variable will be executed by only one thread.
A parallel region is considered active if it has no IF clause,
or if it has an IF clause that evaluates to true.
The default maximum number of active nesting levels is 4.
The following code will create 4 levels of nested parallel regions.
If SUNW_MP_MAX_NESTED_LEVELS is set to 2, then nested parallel
regions at nested depth of 3 and 4 are executed single-threaded.
#include <omp.h>
#include <stdio.h>
#define DEPTH 5

void report_num_threads(int level)
{
    #pragma omp single
    {
        printf("Level %d: number of threads in the team - %d\n",
               level, omp_get_num_threads());
    }
}

void nested(int depth)
{
    if (depth == DEPTH)
        return;
    #pragma omp parallel num_threads(2)
    {
        report_num_threads(depth);
        nested(depth+1);
    }
}

int main()
{
    omp_set_dynamic(0);
    omp_set_nested(1);
    nested(1);
    return(0);
}
Compiling and running this program with a maximum nesting level of 4
gives the following possible output. (Actual results will depend on how the
OS schedules threads.)
% setenv SUNW_MP_MAX_NESTED_LEVELS 4
% a.out |sort
Level 1: number of threads in the team - 2
Level 2: number of threads in the team - 2
Level 2: number of threads in the team - 2
Level 3: number of threads in the team - 2
Level 3: number of threads in the team - 2
Level 3: number of threads in the team - 2
Level 3: number of threads in the team - 2
Level 4: number of threads in the team - 2
Level 4: number of threads in the team - 2
Level 4: number of threads in the team - 2
Level 4: number of threads in the team - 2
Level 4: number of threads in the team - 2
Level 4: number of threads in the team - 2
Level 4: number of threads in the team - 2
Level 4: number of threads in the team - 2
Running with the nesting level set at 2 gives the following as a possible
result:
% setenv SUNW_MP_MAX_NESTED_LEVELS 2
% a.out |sort
Level 1: number of threads in the team - 2
Level 2: number of threads in the team - 2
Level 2: number of threads in the team - 2
Level 3: number of threads in the team - 1
Level 3: number of threads in the team - 1
Level 3: number of threads in the team - 1
Level 3: number of threads in the team - 1
Level 4: number of threads in the team - 1
Level 4: number of threads in the team - 1
Level 4: number of threads in the team - 1
Level 4: number of threads in the team - 1
Again, these examples only show some possible results.
Actual results will depend on how the OS schedules threads.
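The number of output lines per level in the two runs above can be reproduced with a small model. This is a sketch under stated assumptions: team_size_at and teams_at are hypothetical helpers, every region requests the same number of threads, dynamic adjustment is off, and the pool is large enough that the nesting limit is the only constraint.

```c
#include <assert.h>

/* Team size at a nesting depth (1-based) when each region requests
 * `requested` threads and at most `max_levels` nested levels may use
 * more than one thread. */
int team_size_at(int depth, int requested, int max_levels)
{
    assert(depth >= 1 && requested >= 1);
    return depth <= max_levels ? requested : 1;
}

/* Number of teams at `depth`: every thread of every team one level up
 * encounters the nested construct and starts its own team. Each team
 * prints one line, because report_num_threads() uses a single construct. */
long teams_at(int depth, int requested, int max_levels)
{
    long teams = 1;
    for (int d = 1; d < depth; d++)
        teams *= team_size_at(d, requested, max_levels);
    return teams;
}
```

With a limit of 4, the model gives 1, 2, 4, and 8 teams at levels 1 through 4. With a limit of 2, it gives 1, 2, 4, and 4 teams, and the teams at levels 3 and 4 have one thread each.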
4.3 Using OpenMP Library Routines Within Nested Parallel Regions
Calls to the following OpenMP routines within nested parallel regions
deserve some discussion.
- omp_set_num_threads()
- omp_get_max_threads()
- omp_set_dynamic()
- omp_get_dynamic()
- omp_set_nested()
- omp_get_nested()
The 'set' calls affect future parallel regions at the same or inner
nesting levels encountered by the calling thread only. They do not affect
parallel regions encountered by other threads.
The 'get' calls return the values set by the calling thread. When a
thread becomes the master of a team executing a parallel region, all other
members of the team inherit the values of the master thread. When the master
thread exits a nested parallel region and continues executing the enclosing
parallel region, the values for that thread revert to their values in the
enclosing parallel region just before executing the nested parallel region.
Example 4–2 Calls to OpenMP Routines Within Parallel Regions
#include <stdio.h>
#include <omp.h>

int main()
{
    omp_set_nested(1);
    omp_set_dynamic(0);
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0)
            omp_set_num_threads(4);       /* line A */
        else
            omp_set_num_threads(6);       /* line B */

        /* The following statement will print out
         *
         * 0: 2 4
         * 1: 2 6
         *
         * omp_get_num_threads() returns the number
         * of the threads in the team, so it is
         * the same for the two threads in the team.
         */
        printf("%d: %d %d\n", omp_get_thread_num(),
               omp_get_num_threads(),
               omp_get_max_threads());

        /* Two inner parallel regions will be created
         * one with a team of 4 threads, and the other
         * with a team of 6 threads.
         */
        #pragma omp parallel
        {
            #pragma omp master
            {
                /* The following statement will print out
                 *
                 * Inner: 4
                 * Inner: 6
                 */
                printf("Inner: %d\n", omp_get_num_threads());
            }
            omp_set_num_threads(7);       /* line C */
        }

        /* Again two inner parallel regions will be created,
         * one with a team of 4 threads, and the other
         * with a team of 6 threads.
         *
         * The omp_set_num_threads(7) call at line C
         * has no effect here, since it affects only
         * parallel regions at the same or inner nesting
         * level as line C.
         */
        #pragma omp parallel
        {
            printf("count me.\n");
        }
    }
    return(0);
}
Compiling and running this program gives the following as one possible
result:
% a.out
0: 2 4
Inner: 4
1: 2 6
Inner: 6
count me.
count me.
count me.
count me.
count me.
count me.
count me.
count me.
count me.
count me.
4.4 Some Tips on Using Nested Parallelism
- Nesting parallel regions provides an immediate way to allow
  more threads to participate in the computation.
  For example, suppose you have a program that contains two levels of
  parallelism and the degree of parallelism at each level is 2. Also, suppose
  your system has four CPUs and you want to use all four CPUs to speed up the
  execution of this program. Parallelizing only one level will use only two
  CPUs, so you want to parallelize both levels.
- Nesting parallel regions can easily create too many threads
  and oversubscribe the system. Set SUNW_MP_MAX_POOL_THREADS and
  SUNW_MP_MAX_NESTED_LEVELS appropriately to limit the number of threads
  in use and prevent runaway oversubscription.
- Creating nested parallel regions adds overhead. If there is
  enough parallelism at the outer level and the load is balanced, generally
  it will be more efficient to use all the threads at the outer level of the
  computation than to create nested parallel regions at the inner levels.
  For example, suppose you have a program that contains two levels of
  parallelism. The degree of parallelism at the outer level is 4 and the load
  is balanced. You have a system with four CPUs and want to use all four CPUs
  to speed up the execution of this program. Then, in general, using all 4 threads
  for the outer level could yield better performance than using 2 threads for
  the outer parallel region, and using the other 2 threads as slave threads
  for the inner parallel regions.
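The last tip can be illustrated with a minimal sketch that puts all four threads at the outer level and keeps the inner loop serial, so no nested team is created. The function name sum_outer_only and the row-major matrix layout are assumptions made for this illustration only. Whether this form is actually faster depends on the workload and should be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Sums a rows x cols matrix stored in row-major order.
 * All threads work at the outer level, one or more rows per thread.
 * The inner loop has no parallel construct, so no nested region
 * (and no nested-region overhead) is involved. */
double sum_outer_only(const double *m, int rows, int cols)
{
    assert(m != NULL && rows >= 0 && cols >= 0);
    double total = 0.0;

    #pragma omp parallel for num_threads(4) reduction(+:total)
    for (int r = 0; r < rows; r++) {
        double row_sum = 0.0;
        for (int c = 0; c < cols; c++)
            row_sum += m[(size_t)r * cols + c];
        total += row_sum;
    }
    return total;
}
```

With four rows of balanced work, each of the four threads handles one row, which matches the scenario in the tip. The alternative of 2 outer threads, each starting a 2-thread inner region, would use the same number of CPUs but pay for creating the inner teams.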