Scheme for Optimizing Calls to Functions through Pointers on Pentium Processors
Articles and Tips: article
Software Engineer
Java Technology Group
Dana Henriksen
Software Engineer
Novell Core Operating Systems
Allan Neill
Software Engineer
Java Technology Group
01 Aug 1998
Defines and illustrates a method used by NetWare engineers to eliminate branch misprediction as much as possible in optimizing NetWare. While profiling the uniprocessor context switch code in the NetWare 5 kernel in an attempt to accelerate it, many branch mispredictions were observed.
- Introduction
- Illustration
- Application
- Investigation
- Generalization of the Stub Fix-up
- Implementation
- Details
- Summary
- Acknowledgements
- Appendix A FixupStubToFunc and FixupStubToData
- Appendix B FixupCallToFunc and FixupCallToData
Introduction
While profiling the uniprocessor context switch code in the NetWare 5 kernel in an attempt to accelerate it, a great number of branch mispredictions were observed. A branch misprediction on Pentium processors is associated with the phenomenon of multiple instructions being executed simultaneously over an instruction pipeline on the processor. An instruction enters the pipeline at one end and its execution is completed when it leaves the other. In order to keep the processor's pipeline full and therefore most effectively used, the processor somewhat arbitrarily guesses the outcome of each branch (jump from one instruction to another, often remote instruction) and begins executing at that instruction as if that would be the right one. When the branch instruction's execution is completed, the processor then knows whether or not it guessed right. If it guessed wrong, it needs to flush the entire pipeline and start filling it with the right instructions. Eliminating branch misprediction as much as possible became an important effort in optimizing NetWare. This Developer Note defines and illustrates one method used by NetWare engineers in this effort.
Illustration
In layperson's terms, if a car arrives at a fork in the road, a turn one way or the other must be chosen. If the road chosen doesn't ultimately lead where the car was to go, then the car must be restarted back at the fork before continuing the proper journey. The time it takes to put the car back to go down the right road is the penalty for misguessing added to the wrong journey already taken. The Pentium processor doesn't have to back trace all the instructions executed in imitation of the car retracing its path, but it must nevertheless rid itself of the false path and fetch the new one that will be executed.
This wrong guess is called a branch misprediction and it stalls the processor, costing a great deal of time. Using a profiling tool, StatsM.NLM, written by Dana Henriksen, the main cause of the branch misprediction, the use of function pointers in C, was identified.
When this situation is corrected by coding a direct call rather than the indirect one (function pointer) that was there before, the very location of the target procedure being known beforehand tells the processor where the branch will go. Therefore, the processor starts reading and decoding instructions at that address. It can also remember where the call came from and can correctly predict the ret branch instruction when it encounters it. When the call instruction finally leaves the pipeline, the instructions in the target procedure are already being executed and the processor is able to continue executing multiple instructions in parallel.
It is as if our car journey can restart at the correct place and resume the correct fork without the additional loss of finding the fork in the road again. The processor merely ignores the false journey and begins the true one. The processor does not have to be stalled in order to reset to the branch place again.
Application
Getting a time stamp had been optimized by Intel for the Pentium and Pentium Pro processors so that it would use the internal time stamp counter of the processor. Conversely, on older x86 processors, the process of getting a time stamp required accessing the timer chip on the motherboard. Because NetWare must support both methods of getting a time stamp because it still runs on x86 processors, the operating-system engineers added a level of indirection by creating a function pointer initialized at startup time to point to the appropriate method for getting a time stamp, and, in this way, abstracted the process from the underlying hardware. Therefore, to get a time stamp, NetWare simply made an indirect call through the function pointer to access the appropriate source for the time stamp.
In similar fashion, the C runtime and other libraries for NetWare were also using function pointers as a means of abstracting themselves from the underlying version of the NetWare OS on which they were loaded. In this way, the library developers are able to support several versions of NetWare using not only a single code base, but also a single binary. On x86 processors (prior to Pentium), this had only the effect of adding a level of indirection to the overhead of abstraction considered acceptable given the gains made in not having to support multiple code bases, numerous preprocessor distinctions, separate builds to produce the library binaries, or complicated software distribution. Dana first identified this problem and performed the initial work using the case of the time stamp call to verify the principle of the effect of using function pointers in C upon branch misprediction.
Because developers must react to continually changing software versions and environments, we think the opportunity to apply this method of optimization is as widespread in all code as it is in NetWare. The limitation of its present implementation lies in the Intel-specific instruction set, but the principles might well apply within the framework of other processors, though it is not presently our concern to explore that.
Investigation
When the profiling tool identified this indirect call to get the time stamp as the source of branch mispredictions, it became immediately obvious that the processor wouldn't know where the indirect call would go until it actually got there unless there was an entry in the branch target buffer (BTB), the processor's small cache of recently-taken branches, a sort of history that can be used to help predict branches using recent behavior as a guide. In our case, this wasn't happening. The only way to discover the target of the call would be to read the value of the pointer unlike other branches where there is information about the branch target imbedded in the code that can be used to predict the branch.
To eliminate the problem, the time stamp call was changed back to a direct rather than indirect call. Then, to solve again the problem of not knowing which source to use for the time stamp abstraction of this source code was written to run at initialization time which would modify itself to choose the appropriate source of the time stamp depending on the platform loading NetWare.
After the change, the code resulting from the self-modification was profiled and it was found that the cost of each branch misprediction had been around 15 clocks. The speed improvement was easily measured using the aforementioned profiling tool.
We also noticed that not only the branch mispredictions for the time stamp call disappeared, but also a number of other branch mispredictions, leading us to conclude that branch mispredictions on calls also disrupt the processor's ability to properly predict the returns on the stack. Each indirect call through the time stamp routine function pointer had caused multiple branch mispredictions, each at a cost of around 15 clocks.
This led to the further conclusion that indirect calls through function pointers are not only unfortunate but especially degrading when done in frequently-used code paths. That is what got us started in looking for a more general solution to supporting a level of indirection for portability abstraction without suffering the cost paid hitherto by the function pointer method. The result was a method of creating stubs that would be fixed up at initialization time.
Generalization of the Stub Fix-up
The first stub fix-up was to choose the correct function or data item to access which would furnish, for the underlying platform, the desired result. This was satisfactory, but enthusiasm with the original success at fix-up led Allan Neill and Russell Bateman to speculate as to how the indirection of making the stub call in the first place might be eliminated as well by replacing the stub call with a call to the right function desired all along. The next logical step was to replace the function call itself with a simple mov instruction in the case of a data item which would no longer require a stub function to return the dereferenced data, but the scheme would provide it directly. This was possible because the coded stub call requires 5 bytes and the replacement instruction, be it another call or mov instruction, also only requires 5 bytes, as will be illustrated later. Thus, the stub fix-up approach was generalized not only to remove the additional level of function call overhead, but also to access data items directly.
Implementation
The solution was studied in the framework of the C runtime libraries. Dana and Russ wrote the initial level of fix-ups and then Russ added and refined the approach to include referencing data directly and, later, overcame the problem of fix-ups executing in a user-address space (ring 3) made necessary by the fact that the majority of the runtime library code loads into user space as well as the kernel. This work was performed live in the substantial code base that would become NetWare 5 Beta 3 and shipped in that product. In parallel, Allan used this work in another product under development at Novell.
Details
The method settled upon finally included the definition of a number of stubs representing a metalayer between the library and the operating system it interfaces. Examples include getting the identity of the currently-executing process, getting a pointer to the currently-executing thread's private data area, a thread yielding control to the processor, requesting a mutually-exclusive lock (mutex), and so on. For the purpose of this paper, we will illustrate the entire solution adopted by Novell's C runtime libraries for NetWare using a few sample functions that need this treatment. (Nevertheless, Novell does not guarantee that these functions here-named are real nor represent real application.) The functions are the following:
CurrentProcess GetContext _LockMutex
Once the metacalls or stubs were defined, they were coded in assembly using nop. The amount of room for rewrite by the stub initialization functions was calculated as being 15 bytes maximum regardless of the stub future as a function call or data item. Attempts to write these stubs in C proved uninteresting because it was necessary to experiment with "tricking" the compiler optimizer into leaving sufficient room. The resulting code was never anything like a usable default so it was decided that the nop instructions looked better and were a quick, decisive sign in the debugger that fix-up had not yet been performed. The tool used for this is the PharLap 386 assembler. For example,
CurrentProcessproc nop ; +00 nop ; +01 . .. nop ; +0e CurrentProcessend
Actually, since all the stubs would be identically coded, macros were set up to make it easy for any engineer to add a new stub later without exercising any mental effort (in other words, a no-brainer):
STUB CurrentProcess
With the metacalls defined, the stub initialization was written to associate, differently for each platform abstracted, one stub with its correct underlying function or data item. (The address of the target function or data item is usually discovered by dynamic import.) This stub initialization was done by calling a different fix-up function depending on whether the result was to be a function call or a data move. Some items actually vacillated between the two depending on the underlying platform. An example of this is the library retrieving a thread's library data context which is maintained on a per-CPU, context-switch basis by NetWare 5's multiprocessing kernel (MPK), but which had been merely an offset in the process-control block (PCB) on earlier versions of NetWare. Since MPK made the data easy to retrieve by symbol name, no function to retrieve a pointer to the PCB and add an offset to it before arriving at the pointer was necessary.
void ImplementLibraryStubs( int version ) { if (version < 400) /* NetWare v3.12 */ { FixupStubToData(CurrentProcess, RunningProcess, FixupCallToData); FixupStubToFunc(GetContext, __312GetContext,FixupCallToFunc); FixupStubToFunc(_LockMutex, __UPKLockMutex, FixupCallToFunc); } else if (version < 500) /* NetWare v4.10 and NetWare v4.11 */< { FixupStubToData(CurrentProcess, RunningProcess, FixupCallToData); FixupStubToFunc(GetContext, __4xGetContext, FixupCallToFunc); FixupStubToFunc(_LockMutex, __UPKLockMutex, FixupCallToFunc); } else /* NetWare 5 */ { FixupStubToData(CurrentProcess, RunningProcess, FixupCallToFunc); FixupStubToData(GetContext, MPKThreadLibraryContext, FixupCallToData); FixupStubToFunc(_LockMutex, NWMPKMutexLock, FixupCallToFunc); } }
Next, after coding the calls to associate stubs with appropriate platform resources, the fix-up code was written to modify the stubs during library initialization. Different approaches were tried, but with the necessity of supporting self-modifying code in user space (ring 3), the final solution was to set the fix-up code in the kernel (ring 0) and use one function to rewrite function stubs and a different one to rewrite data stubs even though the distinction between the result of both is a slim one.
The first-stage fix-up code was written in assembly since it was much easier to do so and it runs in two stages. First is the initial fix-up already noted and which runs at library initialization time. Second, the resulting fixed-up code, when called during library execution, itself calls fix-up code to rewrite the calling code. Another method might have been to locate all the places in the library where coded stub calls are found and fix them up directly at initialization, but for various reasons including reliability, portability and applicability to general solution, the live or incremental fix-up method was chosen. Besides, it was amusing to step through the code in the early days and see it fix itself up.
The effective prototype for the initial-stage fix-up functions called from ImplementLibraryStubs is:
void FixupCallToFunc/Data (void *targetFunc/DataAddr, void *callerReturnAddr);
This code is reproduced in appendix A. Assumed variables for both functions are:
EAX: the stub function address (through which the stub is recoded) ECX: the target function/data address EDX: the fix-up function address
Pushed on the stack before these calls are significantly:
ESP+C: fixup-function ESP+8: target-function/data ESP+4: stub-function
As noted, the purpose is to overwrite the nop code in the stub function to look like this pseudo x86 code (with additional assumptions noted):
; (return in caller already pushed) MOV EAX, <target-func/data<< PUSH EAX ; push target address CALL <fixup-func< ; and call the fix-up function< POP EAX ; get back target and (optionally)... JMP [EAX] ; ...if function call fix-up, invoke it! ; ...or... MOV EAX, [EAX] ; ...return with desired data in EAX! RET
For the function call case, the following code, which has an identical effect but is more succinct, will be used in place of the pop and jmp illustrated immediately above:
RET ; returns from stub, but cleverly through the target!
As examples, the nop stub becomes, for the function case, the code reproduced here:
mov eax, [Threads.NLM|__4xGetContext] push eax call Lib0.NLM|FixupCallToFunc nop nop nop
(The function case requires three fewer bytes of code than the fifteen allocated.) For the data case:
mov eax, [Server.NLM|RunningProcess] push eax call Lib0.NLM|FixupCallToData pop eax mov eax, [eax] ret
These examples are taken from Threads.NLM loaded on a NetWare 4.11 server.
The actual calls made here, FixupCallToFunc and FixupCallToData have not yet been introduced. They are written in C since writing them in assembly brought no real advantages. The C code may be found in appendix B and is adequately commented to show how this works. When the stub is called, the stub has already been fixed up to call the second-stage fix-up code. The first-stage fix-up also returns the result of the original function call since the stub was called in the first place in an effort to get the underlying operating system function or data item it abstracts. When the second-stage fix-up code is called, at execution time rather than initialization, the function uses the return address in the grandparent to replace the original stub call with either the real, target function or rewrite the grandparent's call instruction with a mov into eax of the real, data item. From then on, that point in the library will never call the stub again as long as the library remains loaded.
Summary
At the writing of this article, this fix-up optimization has been in place for the better part of two months. A number of bugs and even design flaws have been found and fixed. The sample code reproduced herein reflects the research, implementation, and testing of the scheme. Some profiling has been done and the results are very encouraging. However, no numbers are available that can be placed in the public domain. Reporting results exceeds the purpose of this paper, but obviously, 15 clocks per access is a very significant savings even when compared to other notably expensive operations like crossing the protection boundary between a user-address space and the kernel (some 40 clocks). This much time is saved plus additional units of 15 in the cases where, as noted, branch misprediction resulted on the return from call through function pointers. In addition to saving the extra time wasted which is only due to the behavior of the Pentium processor, one or two levels of function calls at up to a half-dozen or so clocks per call overhead are economized which would make the libraries perform slightly better on Pentium's forebearers, the 80486 and 80386. And Pentium also benefits from these. The fact that many of the abstracted items were accessed frequently, often as many as one or more times per invoked library function, the importance of this optimization cannot be overstated.
Acknowledgments
We wish to express our appreciation to Rob Carlson for the idea of publishing this procedure as well as to Greg Hundley and Thane Diamond for their occasional assembly expertise and suggestions during the implementation phase.
Appendix A FixupStubToFunc and FixupStubToData
FROM_STUB equ 0Bh ; useful manifest constant SETUP macro ; registers in both functions set up thus: mov eax, dword ptr [esp + 4] ; hold stub function address in EAX mov ecx, dword ptr [esp + 8] ; hold target function address in ECX mov edx, dword ptr [esp + 0Ch] ; and address of fix-up function in EDX sub edx, FROM_STUB ; calculate and zap the fix-up... sub edx, eax ;...with a relative offset to it endm REWRITE_TO_CALL macro mov byte ptr [eax], 0B8h ; opcode for MOV EAX, immediate value... mov dword ptr [eax+1], ecx ; ...which is the target function address mov byte ptr [eax+5], 050h ; opcode for PUSH EAX mov byte ptr [eax+6], 0E8h ; opcode for CALL instruction and... mov dword ptr [eax+7], edx ; ...calculated offset of fix-up function endm RELOAD_EDX macro mov edx, dword ptr [esp + 0Ch] ; reload EDX with address of fix-up endm FixupStubToFunc proc SETUP; REWRITE_TO_CALL; ; (next byte to overwrite will be eax+0Bh) RELOAD_EDX; mov byte ptr [eax+0Bh], 0C3h ; opcode for simple RET ; equivalent to: mov byte ptr [eax+0Bh], 058h ; opcode for POP EAX mov word ptr [eax+0Ch], 020FFh; opcode for JMP [EAX] ret FixupStubToFunc endp FixupStubToData proc SETUP; REWRITE_TO_CALL; ; (next byte to overwrite will be eax+0Bh) RELOAD_EDX; mov byte ptr [eax+0Bh], 058h ; opcode for POP EAX mov word ptr [eax+0Ch], 008Bh ; opcode for MOV EAX, [EAX] mov byte ptr [eax+0Eh], 0C3h ; opcode for RET ret FixupStubToData endp
Appendix B FixupCallToFunc and FixupCallToData
#define kCallInstruction 0xE8 #define kMovInstruction 0xA1 /* (not steak sauce, but MOV EAX, [addr]) */ typedef unsigned char BYTE; typedef unsigned long DWORD; void FixupCallToFunc ( void *targetFunctionAddr, /* function to call forevermore */ void *callerReturnAddr /* return address in grandparent */ ) { BYTE *caller = (BYTE *) callerReturnAddr; /* ** The address to return to in the caller (in fact, the grandparent since this ** call comes through a stub) is the width of the call instruction past the ** call (duh!). Check to ensure that there was a call instruction and do this ** fixup only if we were got to that way. If so, then zap in place of the ** original stub function address (at 4 bytes before the caller return) the ** target to be called from now on. The grandparent will never call the stub ** (or us) again. */ if (*(caller - 5) == kCallInstruction) *(DWORD *) (caller - 4) = (BYTE *) targetFunctionAddr - caller; } void FixupCallToData ( void *targetDataAddr, /* data address to use rather than stub */ void *callerReturnAddr /* return address in grandparent */ ) { /* ** This fix-up works identically to the above one for functions except that ** the stub calling us is to be fixed up merely to a data item whose address ** we know. */ if (*((BYTE *) callerReturnAddr - 5) == kCallInstruction) { *((BYTE *) callerReturnAddr - 5) = kMovInstruction; *(DWORD *) ((BYTE *) callerReturnAddr - 4) = (DWORD) targetDataAddr; } }
* Originally published in Novell AppNotes
Disclaimer
The origin of this information may be internal or external to Novell. While Novell makes all reasonable efforts to verify this information, Novell does not make explicit or implied claims to its validity.