Inom
Hitta mer dokumentation
Supportresurser som ingår
| Ladda ner denna bok i PDF
Performance Tuning
A
- This appendix presents information about performance tuning. Tuning code for performance can be broken down into two distinct parts: finding the performance critical paths and tuning those paths.
- This appendix details methodologies for finding performance problems and describes both high-level and low-level techniques for alleviating them. The following topics are covered:
-
- Finding the performance critical paths
- Selecting good benchmarks
- Tuning the performance critical paths
- Tips and techniques for faster code
Finding the Performance Critical Paths
- Being able to find the performance critical paths is as important as tuning them. However, finding these paths is not always easy. Your intuition about where the performance problems lie can mislead you. Unless you are personally familiar with a particular section of code, it is best to approach this process with no preconceptions and to gather profile information from an application to direct your investigation.
- There are currently three ways you can gather profile information. These methods are introduced here and described more fully in "Tuning Performance Critical Paths" on page 403.
-
- Build profile libraries.
Libraries built with the -pg option produce gprof output. This output gives you a very close approximation of how much time was spent in each function of the library, an exact count of how many times each function was executed, and a function call graph. The disadvantages of profile libraries are that they must be compiled using special flags, they must be built statically (this restriction may be removed at a later time), they don't measure system time, they require re-linking the application with -pg, and they don't provide any information about the memory system (for example, page faults). Although profile libraries have disadvantages, they are currently the only standard mechanism capable of providing function call counts and the function call graph. These capabilities make profile libraries a very attractive analysis tool for in-depth performance tuning.
- Use the performance collector and analyzer tools included in ProWorks.
The collector is used from the debugger to gather information about a program while it is running, and the analyzer is a user interface that sorts and displays that information in various ways. This tool has the benefit of being able to measure any code that hasn't been stripped. No special compile flags are needed, and it doesn't matter whether libraries are dynamically or statically linked or even dlopen'd. It also is aware of page faults.
- One attractive feature of the analyzer is that it shows you the total amount of time spent in each shared library. This is useful for doing a overall analysis of your program. For example, if you're spending a lot of time in libc, you are probably doing a lot of malloc,free, or signal handling. Even though the analyzer can't show you what routine is calling these routines, you as a library developer may immediately know where to start looking.
- However, this tool is not able to show the function call graphs or counts on the number of times a function was executed.
-
- Use the Shared Library Interposer.
The Shared Library Interposer (SLI) installs hooks to trap function calls, which is a Sun OS 5.x special feature. However, it can't catch C++ virtual functions or static functions. The SLI can work on any number of shared libraries at the same time. A disadvantage of SLI is that it requires additional interposing libraries to work. If you're just interested in measuring the performance of your API without the details of the code underneath, then these interposing libraries can be constructed once and easily be referenced later on. If you want the details of all the internal functions that were called, then you need to point SLI at your source tree and construct a new interposing library any time functions are created, destroyed or renamed (but not if just the body of functions were changed). This rebuilding takes approximately ten minutes for large libraries. Unlike all of the above options, SLI gets exact time for each function. It does this by bracketing each function call with gethrtime(). All the other schemes interrupt the process every 10 milliseconds or so and note which function they are in. Therefore, all of the non-SLI schemes only generate statistical approximations to how much time was spent in each function. Assuming your application runs for at least a few seconds, this statistical approximation is quite close to reality. SLI does not work with static libraries or applications, nor does it know about the memory system. SLI has a GUI to allow easy interpretation of the gathered data. SLI also has the ability to log all the API calls and their arguments made during a session for later playback. The functionality has to be coded into SLI; therefore, it only works on a small number of libraries.
At-a-Glance Comparison of Performance Tools
-
Table A-1 compares the different performance tools used to gather profile information.
-
Table A-1
| Features | -pg | Collector | SLI |
| Time spent in each function | Y | Y | Y |
| Call counts for each function | Y |
| Y |
| Function call graph | Y |
| Y |
| Measures system time |
| Y |
|
| Page fault aware |
| Y |
|
| Library can be dynamic |
| Y | Y |
| Library can be static | Y | Y |
|
| Handles multiple libraries | Y | Y | Y |
| Does not need special compile flags |
| Y | Y |
| No recompilation/linking of application needed |
| Y | Y |
| Works on dlopen()'d libraries |
| Y | Y |
| Has a GUI for display |
| Y | Y |
| Needs no additional libraries | Y | Y |
|
| Internal library functions measured | Y | Y | 1 Y |
| Virtual/static functions profiled | Y | Y |
| Supports playback of library calls |
|
| 2 Y |
- 1. If functions are created/destroyed/renamed, then the new interposing libraries need to be created for SLI. This does not take a tremendous amount of time, but it is an additional step.
- 2. Only a handful of libraries support this feature.
Recommendations for Performance Tools
- Choosing a performance analysis method is a matter of individual preference. This section provides recommendations, and you can determine which methods you are most comfortable with.
- If you are interested in giving one or two areas a boost in performance, but the areas are not critical, use the collector and see if it can give you the information you need. If you're going to be spending a lot of time tuning code or if the collector does not meet your needs, then it's worth the effort to build a profile library.
- SLI is not recommended for library developers because SLI's strength is in logging the library API calls for statistical analysis of how the library is used. As such, it is more useful for tuning the application to use the library more efficiently than it is for tuning the library itself.
- For serious performance tuning, the profile library is recommended over the collector tool. This is because the collector does not produce the function call graphs or function call counts which are crucial for finding and tuning your critical paths.
- All of the above schemes work at a functional level. If you are interested in finding out how many times a line of code is executed within a function, see the tcov man pages.
Selecting Good Benchmarks
- When searching for performance bottlenecks, it is important to use the right benchmarks. It is often easy to find and fix a bottleneck that makes benchmarks run faster, while the performance of real customer applications remains unchanged. In an ideal world, all performance tuning would be guided by real customer workload. In our less than ideal world, we must use approximations to real customer workload.
- It is becoming more and more of a requirement for vendors to run a customer's application as part of the sales process. Improving your marketing-oriented benchmarks undoubtedly catches the customer's eye. However, improving your customer-oriented benchmarks will help close the sale and generate future business from your satisfied customers. In today's market, both types of
- benchmarks need to be used to maximize your company's profit. The advantages and disadvantages of the three types of benchmarks are discussed below.
-
- Shared Library Interposer - Customer oriented
The Shared Library Interposer (SLI) is an excellent tool for logging and playing back library calls from real customer applications. Not only will SLI help you find exactly what needs to be tuned for a given application, it will allow you to give feedback to the application writers on how to use your library more effectively. Unfortunately, there is currently no industry standard way of reporting performance of real customer applications.
- Raw primitive benchmarks - Marketing oriented
Be cautious about using raw primitive benchmarks to guide tuning efforts. Although these benchmarks are good tools to measure peak performance, they produce results that are the least likely to match real customer workloads. However, reporting peak performance numbers is still the most common way for vendors to market their products. These benchmarks are well-suited for tuning the inner loops of a particular primitive's rendering code, and they can help identify library overhead for poorly-batched primitives (like single vector polylines).
- GPC Picture Level Benchmark - Customer and marketing oriented
The GPC Picture Level Benchmark's (PLB) exploits the strengths of real customer applications and raw primitive benchmarks. It is currently the closest industry standard benchmark to real customer applications. Because it's a standard, it allows you to compare your product against the competition. Improved results will translate well to customer-visible performance improvements.
Tuning Performance Critical Paths
- Performance tuning can occur on three different levels. The first level involves looking for a central body of code, in which the application spends most of its time. The second level of performance tuning consists of algorithmic improvement, and the third level involves tuning assembly language.
Locating the Central Body of Code
- The first level of performance tuning involves looking at your profile output and checking for any obvious problems. For example, there might be a transform evaluation on every primitive, or new and delete might be called for every primitive. Fixing these types of problems usually requires little work. A simple yet extremely useful performance technique for this is to cache values in software for later use. You may also need to add an if test in several places and restructure the code, but the basic algorithm can remain intact. The difficult part in fixing these bugs is finding them. If a lot of time is spent in some functions that you know shouldn't be called frequently, then the collector will point you to the problem. If this isn't the case, then you may need to use the profile libraries so that you get the count and call graph information. Finally, gprof output may show you that you are calling a function many more times than you expected.
Changing the Underlying Algorithm
- The next level of performance tuning involves changing the underlying algorithm. Some examples of this are speeding up the special cases of a general algorithm, using fixed point arithmetic instead of floating point, using software caching schemes, and reducing the number of malloc/free calls from many to one. This type of tuning is frequently needed when a feature of the library that was previously deemed unimportant turns out to be useful to customers. Because the feature had a low priority, not much time was originally spent implementing it efficiently. Now it's necessary to go back and perform an in-depth analysis to make it run fast.
- Coming up with a new algorithm requires designers to have a clear understanding of what is fast and what is slow on the current hardware. Some performance techniques are widely known (square root is slower than addition, hash tables speed up linked list searches, multiplication is faster than
- division, and so on), but there are dozens of techniques that you can used to make algorithms more efficient. See "Tips and Techniques for Faster Code" for a discussion of these techniques.
Tuning at the Assembly Language Level
- To really tune a chunk of code well, you must look at the assembly language output of the compiler1. At some point, the algorithm you're working on must be turned into machine readable format. You need to ensure that all the effort you've put into avoiding expensive operations isn't being overshadowed by some unintelligent process the compiler is doing. This type of tuning should be reserved for performance critical paths. Spending time tuning your code to produce near-optimal assembly output isn't free. Reality dictates that you spend your time using the most cost-effective philosophy. It is good practice to frequently look at the assembly output. As you gain experience, looking at this information can become part of your development process, and it becomes a good way to verify your design.
Tips and Techniques for Faster Code
- As you read through the suggested techniques in this section, you will note that a knowledge of assembly language can be useful. Many of the techniques presented here can be applied without inspecting the assembly output, but a knowledge of assembly language becomes more essential as you progress through the tuning techniques suggested here.
Tune the Innermost Loops First
- Once you have identified your performance critical paths, you need a starting point to begin your tuning. The way to get the most cost-effective performance is to tune the innermost loops first. These loops are executed many times (potentially hundreds of times) for every iteration of an outer loop. Once the
- 1. There is a big difference between reading the assembly output from the compiler and writing assembly language routines. It's very similar to being able to read a foreign language versus being able to speak it. If you're not totally familiar with a particular instruction, you can make an educated guess at what it is or look it up in a manual. This is much easier than creating an assembly language routine from scratch.
- absolute innermost loop has been tuned, start expanding your view outward to the next innermost loops. Time permitting, continue out to your library entry points.
- By tuning your innermost loops first, you can substantially increase the performance of your moderately and highly batched cases. The performance of your poorly batched cases may improve slightly, but not by much. The performance of poorly batched cases tend to be limited by start up costs and has little to do with the inner loops. Improving the performance of poorly batched cases is a more difficult task than tuning highly batched performance, and it requires applying virtually all of the tips in this appendix. If you are interested in tuning for poorly batched cases, assume that every loop is executed once and start counting CPU cycles in your assembler output. You will need to look all the way from your library entry point down to the lowest level function.
- A simple source code transformation to improve inner loops is moving loop invariant code outward. Suppose you want to construct the vertices of a sphere. The straightforward implementation of this is as follows:
-
for ( theta=0.0 ; theta<2.0*PI ; theta+=theta_step ) {
for ( omega=-0.5*PI ; omega<0.5*PI ; omega+=omega_step ) {
pt->x = cos(theta)*cos(omega);
pt->y = sin(theta)*cos(omega);
pt->z = sin(omega);
pt++;
}
}
|
- This could easily be changed to:
-
for ( theta=0.0 ; theta<2.0*PI ; theta+=theta_step ) {
cos_theta = cos(theta);
sin_theta = sin(theta);
for ( omega=-0.5*PI ; omega<0.5*PI ; omega+=omega_step ) {
cos_omega = cos(omega);
pt->x = cos_theta*cos_omega;
pt->y = sin_theta*cos_omega;
pt->z = sin(omega);
pt++;
}
}
|
- You have reduced the number of calls to cos() and sin() in our inner loop from 5 to 2. Assuming that omega_step is small enough that the inner loop executes a large number of times relative to the outer loop, you should see a performance increase by a factor of 2.5. You could try exploiting the relationship cos2 + sin2 = 1, but that would depend on the relative speeds of cos/sin and sqrt. If the hardware supports sqrt, use it.
- The above example was a fairly simple one. Less obvious cases are more prevalent. Below is an example for copying one string to another. Assume the string structure has a length field and a character array with sufficient space to hold the string:
-
dest->length = src->length;
for ( i=0 ; i<src->length ; i++ ) {
dest->string[i] = src->string[i];
}
|
- Since you are operating through pointers, the compiler cannot assume that src->length isn't changed during each iteration of the loop. To get around this, keep the string length in a local variable as shown below:
-
length = dest->length = src->length;
for ( ; length>=0 ; length-- ) {
dest->string[length] = src->string[length];
}
|
- On the SPARC architecture, the upper loop takes six instructions per iteration while the bottom loop takes four instructions. This simple example shows us the most important lesson that can be learned about performance tuning and that is, don't trust the compiler. No matter how efficient the compiler gets, it cannot surpass a knowledgeable programmer.
- As you move loop invariant code outward, you'll notice a proliferation of local variables. This is perfectly acceptable. These local variables can be thought of as a cache of values created by the programmer. While there are cases where too many local variables hurt performance, they are rare and their penalties are low in comparison to the much more likely gains they offer. It is quite common for local variables to map directly to hardware registers and never get stored to memory. One way to help the compiler to realize this is to declare local variables within the smallest scope they will be used in.
Don't Optimize Uncommon Cases at the Expense of Common Cases
- Although this rule is intuitively obvious, it is perhaps the easiest to forget. It is often quite tempting to add code that makes a seldom used operation run faster. At times you will need to add a little logic someplace else to make this optimization work. Note how this affects your common cases, and if it does, make sure that the performance trade-offs you are making are good ones.
- This rule could also be called a "keep it simple" rule. To a first approximation, the more complicated and convoluted the code, the slower it will run. If you're just getting started in performance tuning, then go for simplicity. After you've had a chance to get familiar with the types of trade-offs that are made in the name of performance, you'll be in a better position to estimate the consequences of additional complexity.
Special-Case the Common Cases
- Most libraries have a set of attributes that can be changed by the user. Libraries will need to switch on the attribute or employ a hierarchical set of if tests to decode the attribute. In either case, it can be worthwhile to special case a small number of the most frequently set attributes (like line color). By having an if test that succeeds most of the time, you can decrease the average amount of time spent setting attributes for most applications. Certainly an application that never sets the line color will run slower, but the difference is likely to be quite small since attributes will have to go through the full decode cycle.
Choose Your Software Layers Carefully
- You need to define software layers that don't limit performance. An example is when A calls B(X). The function A knows some property of X which B does not, so B spends time checking for the property, or it simply does things more generally (and less efficiently). Either A and B should be in the same software layer, or perhaps a special case version of B can be written which assumes the knowledge of the properties for X.
- An example is memcpy(), which assumes only character-aligned data. If you are copying word-aligned data (or double-word aligned data), you can copy faster than memcpy() with a simple loop.
Move If Tests Outward
- Although if tests are certainly necessary for programming, it is advantageous to remove as many of them as possible from your performance critical paths. In today's high clock rate, super-scalar RISC chips, each branch in the code carries along with it the possibility of dozens of wasted CPU cycles.
- Removing if tests can often mean replicating code. A simple example of this is shown below for drawing a polyline on a hardware device where the first vertex must be handled differently from all subsequent vertices:
-
first_vertex = 1;
for ( i=0 ; i<num_pts ; i++ ) {
vertex_registers[X_OFFSET] = pt->x;
vertex_registers[Y_OFFSET] = pt->y;
vertex_registers[Z_OFFSET] = pt->z;
pt++;
if (!first_vertex) {
// wait for DRAW operation to finish
while(vertex_registers[DRAW_STATUS] != ALL_DONE) ;
}
first_vertex = 0;
}
|
- This code can be restructured as:
-
vertex_registers[X_OFFSET] = pt->x;
vertex_registers[Y_OFFSET] = pt->y;
vertex_registers[Z_OFFSET] = pt->z;
for ( i=1 ; i<num_pts ; i++ ) {
pt++;
vertex_registers[X_OFFSET] = pt->x;
vertex_registers[Y_OFFSET] = pt->y;
vertex_registers[Z_OFFSET] = pt->z;
// wait for DRAW operation to finish
while(vertex_registers[DRAW_STATUS] != ALL_DONE) ;
}
|
- By replicating the code which sends the vertex information to the hardware, an if test on every vertex was removed.
- The above example was a simple one because it dealt with a very small and manageable portion of code. As you expand your focus outward from the innermost loops, it gets more and more difficult to replicate code. You start to
- have huge chunks of code, each of which does basically the same thing. This becomes a maintenance problem. One technique for overcoming this is to have source code files with #ifdefs that are included by other files. Suppose you wanted to augment the above line renderer to handle polylines with color at each vertex. The straightforward way to do this is with an if test inside the inner loop:
-
vertex_registers[X_OFFSET] = pt->x;
vertex_registers[Y_OFFSET] = pt->y;
vertex_registers[Z_OFFSET] = pt->z;
for ( i=1 ; i<num_pts ; i++ ) {
pt++;
vertex_registers[X_OFFSET] = pt->x;
vertex_registers[Y_OFFSET] = pt->y;
vertex_registers[Z_OFFSET] = pt->z;
if (pt_type & VERTEX_COLOR) {
vertex_registers[R_OFFSET] = pt->color.r;
vertex_registers[G_OFFSET] = pt->color.g;
vertex_registers[B_OFFSET] = pt->color.b;
}
// wait for DRAW operation to finish
while(vertex_registers[DRAW_STATUS] != ALL_DONE);
}
|
- This keeps your maintenance costs down but at the expense of performance. If things like line patterning, homogeneous coordinates, or vertex flags are added, you will end up with a large number of if tests performed for every vertex. Fortunately, you can keep the maintenance costs down and still have optimal performance by keeping the code below in a separate file called PolylinesProto.h:
-
{
// setup code here (probably with #ifdef's in it)
vertex_registers[X_OFFSET] = pt->x;
vertex_registers[Y_OFFSET] = pt->y;
vertex_registers[Z_OFFSET] = pt->z;
for ( i=1 ; i<num_pts ; i++ ) {
pt++;
vertex_registers[X_OFFSET] = pt->x;
vertex_registers[Y_OFFSET] = pt->y;
vertex_registers[Z_OFFSET] = pt->z;
# ifdef(VERTEX_COLOR)
vertex_registers[R_OFFSET] = pt->color.r;
vertex_registers[G_OFFSET] = pt->color.g;
vertex_registers[B_OFFSET] = pt->color.b;
# endif
// wait for DRAW operation to finish
while(vertex_registers[DRAW_STATUS] != ALL_DONE) ;
}
}
|
- The PolylinesProto.h file is included in another file as shown below:
-
#define FUNCNAME PolylinesXyz
#undef VERTEX_COLOR
#include PolylinesProto.h
#undef FUNCNAME
#define FUNCNAME PolylinesXyzRgb
#define VERTEX_COLOR
#include PolylinesProto.h
#undef FUNCNAME
|
- Thus, whenever a bug is filed, you fix it once for all polyline renderers. Unlike embedding if(constant_expression)'s in a macro definition, this technique allows the debugger to step through the #include'd code. This technique was used for the Sun XGL GX pipeline, which has nearly one hundred special purpose polyline renderers.
Unroll Loops Where Appropriate
- It was shown above that it is worthwhile to minimize the number of if tests on your performance critical paths. This rule applies to loops as well. If you have some knowledge about how your loop is going to be used, then you can exploit that knowledge to reduce the number of branches in your code along with your loop overhead. Just like the examples in the preceding section, this means replicating code. Unlike the preceding section, loop unrolling is only effective inside loop constructs and, therefore, is really only applicable in your innermost loops.
- One example of where this can be used is in an xgl_multi_simple_polygon() rendering routine. You could have a specialized renderer which handles the case where the SIDES_ARE_3 flag is set. Instead of having an outer loop for each polygon and an inner loop for each vertex, the inner loop can be completely unrolled to send down three vertices at a time. This way, you have reduced the number of if tests per polygon from four to one and saved other per loop-iteration overhead such as incrementing loop variables. A similar optimization could be used for SIDES_ARE_4 (possibly be used in conjunction with XGL_FACET_FLAG_SHAPE_CONVEX).
- Another common area for loop unrolling is in memory copy operations. The canonical copy operation shown below takes six instructions to copy each word of data when compiled at -O2 on SPARC (-O4 does some loop unrolling for you). Of these six instructions, only two are actually useful (the loading of the src value and the storing of that value to dst). The other four instructions are purely loop overhead (testing size, incrementing dst, incrementing src, and decreasing size).
-
for ( ; size>0 ; size-- ) {
*dst++ = *src++;
}
|
- By unrolling the loop once, you can get the loop to use nine instructions to copy two words of data. The example of unrolling the loop once is as follows:
-
for ( ; size>1 ; size-=2) {
dst[0] = src[0];
dst[1] = src[1];
dst += 2;
src += 2;
}
if (size) {
*dst = *src; // in case "size" is odd
}
|
- Unrolling the loop again produces 13 instructions to handle four words of data. Remembering that two instructions per word is the least possible, the efficiency has improved from 33% (2/6) to 61% (8/13). The example of unrolling the loop again is as follows:
-
for ( ; size>3 ; size-=4) {
dst[0] = src[0];
dst[1] = src[1];
dst[2] = src[2];
dst[3] = src[3];
dst += 4;
src += 4;
}
if (size<2) {
if (size==1) {
dst[0] = src[0];
}
} else {
if (size==2) {
dst[0] = src[0];
dst[1] = src[1];
} else {
dst[0] = src[0];
dst[1] = src[1];
dst[2] = src[2];
}
}
|
- Unfortunately, as the loop gets more and more unrolled, the cleanup code after the loop gets more and more complicated. For massively unrolled loops, the cleanup code might best be handled with a switch statement:
-
for ( ; size>15 ; size-=16 ) {
dst[0] = src[0];
dst[1] = src[1];
...
dst[15] = src[15];
dst += 16;
src += 16;
}
switch(size) {
case 15: dst[14] = src[14];
case 14: dst[13] = src[13];
...
case 1: dst[0] = src[0];
}
|
- You will need to keep in mind just how the code will be used. If you plan to copy large amounts of data (like on the order of Kwords), then it's perfectly reasonable to unroll your loops 16 or 32 times. If you plan to only copy a handful of words, then extreme loop unrolling can actually hurt your performance. The loop should only be unrolled to the extent that a typical use will execute at least a few iterations.
Reduce the Cost of Multiple Clause If Tests
- In C/C++, if tests treat the logical operations && and || specially when performing expression evaluation. The compiler produces code that will only continue to evaluate the expression as long as the result is not known. For example, the code below tests for b==3 which will only occur if a==1. If a!=1, then the expression cannot possibly be true regardless of what b is.
-
if ( (a==1) && (b==3) ) {
/* do something */
}
|
- To the compiler, the code above looks like:
-
if (a==1) {
if (b==3) {
/* do something */
}
}
|
- Likewise, the following code:
-
if ( (a==1) || (b==3) ) {
/* do something */
}
|
- looks to the compiler as follows:
-
if (a==1) {
/* do something */
} else if (b==3) {
/* do something */
}
|
- So when using &&, you should put the sub-expression most likely to fail at the beginning of the expression. When using ||, put the sub-expression most likely to succeed at the beginning. This will reduce the average number of if tests your code executes.
- Sometimes you will have a long list of && separated == expressions which you think will frequently succeed. This commonly happens when you are checking current state versus cached state. By avoiding the use of &&, you can reduce the number of if tests by using integer math. For example, the following code:
-
-
if ( (v1->a == v2->a) && (v1->b == v2->b) && ... )
- can be transformed to:
-
-
if ( !( (v1->a - v2->a) | (v1->b - v2->b) | ... ) )
- It's worth noting that the recommended code will be slower than the original code if, for example, v1->a != v2->a. This technique should only be used when all clauses are expected to frequently be true.
- Similar techniques can be used with || separated if tests. For example, the following code:
-
-
if ( (v1->a != v2->a) || (v1->b != v2->b) || ... )
- can be transformed to:
-
-
if ( (v1->a - v2->a) | (v1->b - v2->b) | ... )
- Using fast greater-than or less-than operators takes a bit more effort but is still useful. For example, the following code:
-
-
if ((x>=xmin) && (x<=xmax) && (y>=ymin) && (y<=ymax))
- can be transformed to:
-
if (!(((x-xmin) |
(xmax-x) |
(y-ymin) |
(ymax-y)) >> 31))
or:
if (!(((x-xmin) |
(xmax-x) |
(y-ymin) |
(ymax-y)) & 0x80000000))
|
- This takes advantage of the sign bit from the subtractions to do branch free comparisons. Again, these techniques should only be used when all clauses of an if test are likely to be evaluated.
Optimize for the Common Code Path
- When you write code, there are often many places where optional cases or error conditions are tested for. Whenever possible, try to make the common case the main path, as this causes fewer branches to be taken. Straight-line code has a better cache hit rate and also helps processors prefetch and execute more useful instructions in a given number of clock cycles. The disadvantage is that the code can sometimes be harder to read because the test and the action for each exception is separated. A simple example is shown below; note that more benefit is gained when multiple options and error conditions are being handled.
-
/* do something */
if ( special case )
/* handle special case */
else
/* handle typical case */
return
|
- This code can be rewritten as:
-
/* do something */
if ( !special case)
/* handle typical case */
return
else
/* handle special case */
return
|
Avoid Using Malloc/Free and New/Delete
- Allocating and freeing memory is an expensive operation. If a section of code on a performance critical path requires its own temporary space, try to either allocate it on the stack or cache it somewhere.
- Allocating space on the stack is easiest when you know in advance how much space you will need and the amount is reasonably small (like space for a handful of 4x4 transforms). If you don't know how much space you will need at compile time, but you do know that it's small, you can try using alloca() instead of malloc(). The alloca() function gives you an amount of memory by bumping the stack pointer. This method of memory allocation should be used with caution since it will fail if you exceed the stack limit1. Since this method of allocation uses the stack, there is an implicit free when you leave the calling function.
- If you need to malloc/new space, try to cache a pointer to that space in whatever structure is both handy and likely to be around the next time through the code. You will need to check each time that you have enough space (and if you don't, free the old space and allocate a bigger chunk), but this is very cheap when compared to the costs of memory allocation.
- Sometimes neither of these schemes is appropriate. If this is the case, try to minimize the number of allocations you do. Calculate the total size you will need, allocate one big chunk, and set your pointers to the appropriate offsets. It's worth noting that this performance recommendation is in direct opposition with object-oriented design principles. You will need to decide before you begin which is more important to you.
- 1. Not only is alloca()'s behavior on failure undefined, the man page strongly discourages its use.
Cache Whatever You Need
- Another basic technique is caching values that you need. If you think it's likely for the same path to be taken through the code many times in a row, then look for calculated or constructed values that can be cached for future use. This applies particularly to requests that involve context switching (for example, system calls or Xlib inquires). Although caching is a useful technique, you need to keep in mind the complexity of invalidating your caches. Don't leave this crucial aspect out of your design phase.
Preserve Batching
- When a library is handed data from the application, it is almost always a bad idea for the library to break it into smaller pieces. Not only does breaking up data make the library code more complicated and harder to understand, it is virtually guaranteed to reduce performance1. The only way to increase the batching factor beyond what the application gives you is to go through a copy operation. A low-level routine should not have to perform a copy just because a high-level routine broke up data.
Keep Parallelism As High As Possible
- In an immediate mode accelerated graphics environment, it is critical to consider the parallelism between the CPU and the accelerator to get anywhere near maximum performance. Some accelerator attributes will require the hardware pipeline to be empty while other attributes will not. You must look closely at the attributes that require the pipeline to be empty. As soon as one of these attributes come down, flush any outstanding data to the accelerator. Attempt to delay sending the attribute as long as possible. That may mean you should return to the application after setting some state to indicate that an attribute change is pending. On the next call to your library, wait for the accelerator to become idle, send the attribute, and finally process whatever the current call was.
- 1. An exception to this rule is if you have a multiprocessor system. In this case, it may be best to hand whatever data you've got to an idle processor. Even in cases such as this, before and after measurements should take place to ensure that the performance does actually go up.
Avoid Using Global Variables
- Because this is the age of dynamic libraries, all code must be relocatable. This also applies to global variables (local variables are kept on the stack and are therefore easy to locate). This need forces global variables to be referenced via a table indirection. Depending on whether your library is compiled -pic or -PIC, this will either be a single level or double level indirection. In a multi-threaded environment, global variables have the additional overhead of needing protection via locks. If you must use global variables, minimize their usage by creating local variables that point to the globals. The following code:
-
int pt_size = global_data.pt_size;
...
float *pt = global_data.pt;
...
float *colors = global_data.vertex_colors;
|
- would run faster as:
-
gd *lgd = &global_data;
int pt_size = lgd->pt_size;
...
float *pt = lgd->pt;
...
float *colors = lgd->vertex_colors;
|
- In the second case, the indirection is only performed once per function instead of every time the global data is referenced.
Reduce Function Call Depth
- Depending on your hardware, function calls can be relatively cheap or expensive. Regardless, they are never free. An effort should be made to keep the function call depth to a reasonable limit. Not only can this result in less instructions executed, but it will reduce the number of code pages you touch so that your working set will be smaller. However, it is not recommended that you have complicated functions in order to reduce the number of code pages. As you are designing your code, just bear in mind the cost of function calls.
- SPARC is optimized around register windows. Programs that are well-behaved in their function call depth will benefit from this and those that are not will suffer when their register windows frequently spill to memory.
- x86 has relatively expensive function calls. Even ignoring the issues of pushing and popping parameters from the stack, the instructions call, leave, and ret take 3, 5, and 5 cycles, respectively on a 486 (the cycles are 7, 4 and 10 on a 386).1
Use Fixed Point Arithmetic
- Depending on the following listed criteria, it may be advantageous to convert your floating point data to fixed point. These criteria are as follows:
-
- Relative speeds of integer and floating point code on your hardware
- Whether your hardware uses super-scalar technology
- How much precision you need
- This is particularly true if you are walking scanlines and you would like to avoid a floating point to integer cast for every pixel. Addition and subtraction are the two most common operations on fixed point numbers (since floating point multiplication/division is usually faster than integer multiplication/division).
Exploit the Math That Your Hardware Does Well
- If you know which specific platform your code will be running on, you can exploit the hardware to its fullest. If your code needs to run well on a variety of hardware, you may be forced to use the lowest common denominator. If this is the case, avoid using square root, integer
- multiplication/division/remainder, and to a lesser extent, floating point division. It is safe to assume that you will have fast addition and subtraction of both integer and floating point data. In addition, floating point multiplication is fast.
- Some examples of things you can do are the following:
-
- Avoid using square root if you don't need to. For example, don't normalize vectors if you only need them for backface culling. Put an if test in your vector code to normalize only when necessary.
- 1. Cycle counts from Microsoft's 80386/80486 Programming Guide, Ross P. Nelson, 1991.
-
- If a variable will be used to divide two or more other variables, calculate its reciprocal once and use that to multiply with the other variables. Sometimes you can avoid both integer and floating point division of variables by multiplying other variables. For example:
-
-
if (a/10 >= b) and if (a/10.0 >= b)
- can be replaced with:
-
-
if (a >= b*10) and if (a >= b*10.0)
- Notice that the integer technique works with ">=" but not with ">".
-
- If you know something about one of the operands of an integer multiplication, you may be able to use shifts and adds to get the result. For example, if you know that one operand will always be between 0 and 15, then use a switch with 16 cases that multiplies the other operand by a constant. Be sure to check to make sure the compiler turns this into a series of shifts and adds.
- The only kind of fast integer division and remainder is when the divisor is a power of 2. If you have such a divisor, then code using either shifts and ands, or verify that the compiler is smart enough to notice.
Use Single Precision Floating Point Constants
- ANSI C/C++ dictates that floating point constants by default are double precision. This affects your code in several ways:
-
- Two words of data are loaded from memory instead of one for every floating point constant.
- Variables and temporary values may undergo a conversion to double precision (for example, more instructions).
- Fewer floating point registers are available because you have double precision copies of single precision data.
- Expression evaluation is done using double precision instructions, which are potentially slower than single precision instructions.
- Fortunately ANSI allows you to keep all your floating point constants in single precision by adding an f suffix. For example, change:
-
-
if (val < 0.0)
- to:
-
-
if (val < 0.0f)
- to get 0.0 to be single precision. Note that if you are using cc, you will also need to apply the -fsingle compile line option to get single precision expression evaluation. You can think of the f suffix as merely registering with the compiler that the constant's data type is float and not double.
Avoid Careless Use of the Stack
- Use the stack sparingly. RISC CPUs tend to have an abundance of general purpose registers that are quite effective in increasing performance. Keeping your function's local variables in these registers can dramatically increase the speed of your code. On SPARC, look for references to the frame pointer (%fp) in your performance critical functions. Pay close attention to the references inside your inner loops. For the most performance-critical functions, it is an achievable goal to have absolutely no references to %fp.
- Removing references to %fp is not easy. You may find that you need to break up a function into many smaller, specialized functions. Frequently, however, you will be able to tune the code so that it can be easily processed by the compiler. Creating new local variables can be used to move %fp references outside of a loop.
- Be advised that declaring variables to be of type register does not guarantee that they will actually be in a register. The register keyword is only a hint to the compiler. Different compilers (and even different versions of the same compiler) will consider this keyword differently. It is perfectly legal for a compiler to completely ignore this hint.
- Experiment with changing the code to see just what it is that your compiler needs. This level of tuning requires looking at the assembly output of your compiler. Every compiler has its quirks, and your task is to figure out these quirks.
Optimized Leaf Functions
- CPUs with register windows typically have a much lower function call cost than CPUs that don't have register windows. Above and beyond this, register windows support an even faster kind of mini-function called an optimized leaf function. The idea is that if a function only uses windowed out registers (no local or floating point registers, and no stack or frame pointers), and calls no
- other functions, then the function can operate using the caller's register window and stack frame. The benefits of optimized leaf procedures is a savings of one or two instructions per call plus the possibility of not overflowing the register windows.
Try to Minimize Loads
- On high-clock rate RISC machines, loads are much more expensive than stores. This is because loads require a round trip message. First, the request is made by the CPU for some data, and at some later time the data is given to the CPU. Stores are faster because the CPU simply issues a request for the data to be stored, and the memory subsystem worries about the rest. Loads can also be slower because they may have to wait for the store buffer to drain (see "Cluster Loads and Cluster Stores" on page 423). CPU caches exist to alleviate the problems associated with loads, but the cache will never get a 100% hit rate, nor will it help with uncachable data like a device's registers. The penalty for a cache miss is high enough to factor it into a design. As CPU speed continues to go up, this penalty will get higher.
- Techniques for minimizing loads have already been brought up, but it is worth repeating here. Make sure that loops have local variables that directly reference the data you are interested in. Don't have code like:
-
for ( i=0 ; i<size ; i++ ) {
...
sum += a->b[i];
...
}
|
- This should be changed as follows:
-
bptr = a->b;
for ( i=0 ; i<size ; i++ ) {
...
sum += bptr[i];
...
}
|
- Also keep pointers to global variables around if possible. Look for stack references and minimize their number. Loading data from across a bus is particularly expensive, so you should try to limit this process as much as possible.
Cluster Loads and Cluster Stores
- Current RISC hardware handles back-to-back loads and back-to-back stores well, but does not handle load-store-load-stores well. This is partly due to the cache. Every time you store a new value into the cache, you run the risk of invalidating some data that you're about to load. As caches get more associative, this risk goes down, but it never goes away.
- Most processors also have what is known as a store buffer. This is typically a small FIFO queue that fills up with data to store if the memory subsystem is busy. A CPU may need to wait for the store buffer to empty before a load can be issued. The problem here is basically parallelism. The ideal is to have your CPU and memory subsystem doing productive work at all times.
- If you were to rewrite the body of the 4-way unrolled copy loop below:
-
for ( ; size>3 ; size-=4) {
dst[0] = src[0];
dst[1] = src[1];
dst[2] = src[2];
dst[3] = src[3];
dst += 4;
src += 4;
}
|
- to cluster loads and cluster stores, it would look like:
-
for ( ; size>3 ; size-=4) {
t0 = src[0];
t1 = src[1];
t2 = src[2];
t3 = src[3];
dst[0] = t0;
dst[1] = t1;
dst[2] = t2;
dst[3] = t3;
dst += 4;
src += 4;
}
|
- This compiles to the same number of assembly instructions, but the loads and stores are handled in blocks rather than interleaved for every word copied. As tends to happen, this code uses more local variables (and also more registers) to force the compiler to do what you want in the order you want. Even though
- you are programming in C, the source code compiles almost line for line to assembly code, and the local variables tend to map one to one with hardware registers.
Use Double Word Loads and Stores
- SPARC supports the concept of 64-bit quantities through C and C++. This can be advantageous for setting and moving blocks of data. Each double word load/store instruction takes the place of two single word load/stores. Depending on the characteristics of the memory subsystem, you may be able to achieve an almost perfect 2:1 speedup (going over an I/O bus is such a case).
- Double word load/stores can only be used on double word-aligned data. For setting or clearing a block of data, this is easily handled by testing the starting address and possibly writing out a single word of data before entering the main double word loop. For copy operations, there is the additional complexity of the source address being double-aligned, but the destination address is not (and vice versa). In general, there is no way in C to exploit double word load/stores in this situation. You would need to have an assembly language routine to do it. However, if you are writing data to a device input buffer, you may be able to write a one word NO_OP directive to get the source and destination address alignments synchronized.
- To convert the body of our unrolled loop to use double word load/stores requires casting things properly, as shown below:
-
for ( ; size>3 ; size-=4) {
d0 = *(double*)(src+0);
d1 = *(double*)(src+2);
*(double*)(dst+0) = d0;
*(double*)(dst+2) = d1;
dst += 4;
src += 4;
}
|
- Of course, you should have your src and dst declared to be double (or long) to improve the code readability. Also, the ProWorks compiler needs to have the -dalign flag to use double word load/stores.
Be Cache-Aware
- Certain types of algorithms need to take into account the hardware caches present in the systems they will run on. If you are writing code that accesses large amounts of data (memory rasters, Z-buffers, texture maps), you should bear in mind how the hardware cache will affect the performance of your code. Try to keep your data references bounded within small local regions. Also, when allocating space for data structures, try to keep adjacent data from mapping to the same cache line. For example, if you need 1024-byte scanlines, allocate 1152 bytes per scanline so that pixel X,Y doesn't map to the same cache line as pixel X,Y+1.
Compiler Options
- A general recommendation is to know the optimizations that your compiler supports. Although compiler options will vary depending on your code and the system, some possible options are listed below. Check your reference pages for more information.
-
Table A-2
| cc | CC |
| -xcgXX | -cgXX |
| -dalign | -dalign |
| -fast | -fast |
| -fsingle | -ispace vs -ispeed |
| -xlibmil | -libmil |
| -native | -native |
| -xO | -O |
-K pic vs -K PIC
-xunroll | -pic vs -PIC
-Qoption fbe -cgXX |
|
|