Using Stack-Walking to Troubleshoot a NetWare Abend

PAUL COLETTI
NetWare Consultant
Coletti Publishing Ltd.

01 Jun 1999

For veteran troubleshooters comes this daring excursion into the mystic realm of "stack-walking", a little-known technique to help identify the cause of a server abend.

Introduction
Stack-Walking Theory
ABENDEMO Example
Conclusion

Introduction

NetWare 5 contains improvements over previous NetWare versions in terms of protecting applications from corrupting each other or the NetWare kernel itself. Nevertheless, as a network technician in today's critical environments, you probably recognize that even the most tested of software and hardware can cause a server to experience an abnormal end (abend) error. In many cases, the server displays a stack dump screen that few technicians know what to do with. Fortunately, NetWare has a built-in debugger that can help you in troubleshooting halted servers.

A previous AppNote on troubleshooting abends (see "ABEND Recovery Techniques for NetWare 3 and 4" in the June 1995 issue) discussed the types of abends and presented several abend recovery techniques. It also mentioned that complex abends which occur at interrupt time could be resolved via a technique known as stack-walking, or manual traversal of the stack.

For persistent and unresolvable abends, Novell Technical Services (NTS) may determine that it will be necessary to capture an image of server memory (known informally as a "coredump") and send it to Novell for analysis. Skilled NTS engineers will examine the memory image to ascertain the cause of the abend.

In some cases, however, you may want to take action on your end to assist in the troubleshooting process. This AppNote is for experienced troubleshooters who want to use stack-walking to uncover clues as to the cause of an abend in the period immediately after the server has halted. It describes how to use real-time debugging to trace the stack and try to identify the cause of an abend.

Stack-Walking Theory

Anyone who has ever coded on the NetWare platform will be familiar with the practice of stack-walking. Although some stack basics will be covered in this AppNote, the information is aimed at server technicians who already possess a good understanding of the NetWare OS, its symbols, and its use of multiple memory maps. It is assumed that these technicians are reasonably confident with traditional programming techniques including reading Assembler (not as difficult as coding in Assembler) and identifying common data types and sizes.

It Helps to Have ESP

Before you can start walking the stack, you have to know how the NetWare OS references the stack. It does this by using a special-purpose CPU register called ESP (Extended Stack Pointer). When an abend occurs, you can use NetWare's internal debugger to view the contents of the ESP register. This will contain a memory address that you will use to begin your quest.

To enter the NetWare debugger from an abend screen, hold down the following four keys simultaneously:

<left-Shift> <Esc> <right-Shift> <Alt>

Note: This key sequence is unavailable if the server console has been secured using the SECURE CONSOLE console command.

To view the contents of the ESP and all the other registers, use the "r" command as shown in Figure 1.

Figure 1: The output from the "r" debugger command displays the contents of the server's CPU registers, including ESP.

# r

EAX = 00EA7E30 EBX = 0087A380 ECX = 00000060 EDX = 00B3C340

ESI = 00000000 EDI = 00000000 EBP = 00879E00 ESP = 00879DF8

EIP = 00000000 FLAGS = 00007297 (CF PF AF SF IF NT)

Next, you need to dump a portion of the stack using the following command:

dd esp 40

This will produce output similar to that shown in Figure 2.

Figure 2: This "dd" command displays the memory contents starting at the location pointed to by ESP.

# dd esp 40

00879DF8   F10C4326 00B3C340-00879E0C F10DD07C  &C..@.......|...

00879E08   00B3C340 0087A3CE-F12367B6 0087A350  @........g#.P...

00879E18   00B3C340 00000000-00879EA0 00003046  @...........F0..

00879E28   F802B36B 00000000-00015882 000114AE  k........X......

What this command does is dump the next 64 bytes (40 hex = 64 decimal) of memory, starting at the memory location pointed to by ESP. Incidentally, you could have obtained the same result by typing the contents of ESP directly in the "dd" command, as follows:

dd 00879DF8 40

In Figure 2, the numbers in the left-most column are the stack addresses. What follows immediately to the right of each address is a line of 16 bytes that the NetWare debugger has divided into 4-byte chunks (longs) to make it easier to read. The right-most column contains the ASCII representation of these 16 bytes.

As you move down to the next line, notice that the stack address differs by 10 hex (16 decimal) from the address in the line above. The address numbers continue to increase by the same amount as you move down the list. The critical concept to keep in mind here is that in NetWare the stack grows down through memory. Therefore, the most recently pushed value in this stack example is F10C4326, which sits at the lowest address (00879DF8, the location ESP points to).

It is crucial to understand this fact so that as you go deeper into a stack you can correctly identify areas where the stack has been grown (ESP decremented) or shrunk (ESP incremented) in order to make way for or to regain space from a function's local variables.
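To make the layout concrete, here is a minimal Python sketch that turns "dd" output lines, typed in by hand, into a list of stack slots. The function names are illustrative only and are not part of any NetWare tool; the two input lines are copied from Figure 2. The first slot produced is the one ESP points to, that is, the most recently pushed value.

```python
def parse_dd_line(line):
    """Split one "dd" output line into (address, [four 32-bit values])."""
    # The debugger joins the 2nd and 3rd values with a hyphen; treat it as a space.
    fields = line.replace("-", " ").split()
    address = int(fields[0], 16)
    values = [int(field, 16) for field in fields[1:5]]
    return address, values


def stack_slots(lines):
    """Yield (stack_address, value) pairs, lowest address (newest value) first."""
    for line in lines:
        address, values = parse_dd_line(line)
        for index, value in enumerate(values):
            # Each value is a 4-byte "long", so consecutive slots are 4 bytes apart.
            yield address + 4 * index, value


dump = [
    "00879DF8   F10C4326 00B3C340-00879E0C F10DD07C  &C..@.......|...",
    "00879E08   00B3C340 0087A3CE-F12367B6 0087A350  @........g#.P...",
]

for address, value in stack_slots(dump):
    print(f"{address:08X}  {value:08X}")
# first line printed: 00879DF8  F10C4326
```

Listing one slot per line in this way also gives you a convenient place to note which values turn out to be return addresses.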

Return Address Requested

Stack-walking is all about finding return addresses. It is your job to examine the values on the stack and determine whether each one is a return address or not. A return address is put onto the stack whenever a CALL instruction is executed; it is the address of the instruction immediately following the CALL. If a value is not a return address, it is a data value, another address, or simply "garbage" left over from whatever process last used that portion of memory.

If this sounds complicated, perhaps an example will make it easier to understand. Consider the following code snippet:

F10DD076 50             PUSH    EAX

F10DD077 E89F72FEFF     CALL    F10C431B

F10DD07C 83C404         ADD     ESP, 00000004

F10DD07F 5D             POP     EBP

When the CPU executes the CALL instruction, it first determines the address of the next instruction. It then decrements ESP by 4 (in 32-bit code such as NetWare's, a return address is a 4-byte "long") and stores that address in the memory location ESP now points to. In other words, the return address is pushed onto the stack and the stack pointer is adjusted accordingly. In this example, the value F10DD07C will be pushed onto the stack.
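The arithmetic behind this can be checked by hand or with a few lines of code. The sketch below assumes the standard x86 encoding of a direct near CALL (opcode E8 followed by a 4-byte little-endian signed displacement measured from the next instruction); the function name is illustrative. The address and opcode bytes are taken from the snippet above.

```python
import struct


def decode_near_call(address, opcode):
    """Return (target, return_address) for a 5-byte E8 rel32 CALL at `address`."""
    if len(opcode) != 5 or opcode[0] != 0xE8:
        raise ValueError("not a direct near CALL (E8 rel32)")
    # The displacement is a signed 32-bit little-endian value.
    (displacement,) = struct.unpack("<i", opcode[1:])
    # The return address is simply the address of the next instruction.
    return_address = (address + 5) & 0xFFFFFFFF
    # The target is relative to that next instruction, wrapped to 32 bits.
    target = (return_address + displacement) & 0xFFFFFFFF
    return target, return_address


target, return_address = decode_near_call(0xF10DD077, bytes.fromhex("E89F72FEFF"))
print(f"CALL {target:08X}, pushes return address {return_address:08X}")
# prints: CALL F10C431B, pushes return address F10DD07C
```

The result agrees with the unassembled listing: the call goes to F10C431B and the value left on the stack is F10DD07C.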

What you need to do is take each value in the stack listing and unassemble the instructions immediately prior to it, looking for a CALL instruction. Fortunately the NetWare debugger comes with a "u" (unassemble) command to make this easier. In the example from Figure 2 above, you need to see the instructions immediately prior to F10C4326, the last value on the stack prior to the abend. The command to enter is as follows:

# u f10c4326 -9

The minus 9 in this command tells the debugger you want to start unassembling at a location 9 bytes just before location F10C4326. How far back you choose to unassemble depends on the instruction opcode sizes and how they are aligned on memory boundaries. Often you need to unassemble 5, 12, or more bytes before the address in order to see valid instructions.

It may be that 9 bytes before location F10C4326 lands you right in the middle of a long opcode belonging to another instruction. The debugger will try its best to unassemble this partial instruction, but it will probably fail, in which case it will display question marks. Or it might succeed with the undesired side-effect of "skewing" the unassembly of subsequent instructions.

Listed in Figure 3 is the partial result of the above unassemble command. The left-most column (column 1) shows the address of the instruction. Column 2 shows the machine code representation of the instruction, the opcode. Columns 3 and 4 show the unassembled human-readable form of the opcode.

Figure 3: The partial results of the "u" command, showing the individual code instructions.

# u f10c4326 - 9

F10C431D E5FF           IN      EAX, FF

F10C431F 7508           JNZ     F10C4322

F10C4321 E8EAE12FFF     CALL    F8028FDF

F10C4326 83C404         ADD     ESP, 00000004

F10C4329 5D             POP     EBP

In the unassembly listing, you should expect to see two things:

At some point in the lines of code, the address you are interested in (it should be listed in the left-most column)
In the line immediately before that, a CALL instruction of some kind

In Figure 3, the address F10C4326 is indeed listed in the left-most column, and right above it is a CALL instruction. You can therefore safely conclude that F10C4326 is a return address.

You now need to ask yourself the following questions:

To which portion of code was the call made?
In what portion of code is the return address, or from where did it make the call?

This is where the "?" debugger command comes in handy. This command takes as its input an address and queries that address to find out what code "owns" that address. If no address is given, it assumes that you are querying the EIP (Extended Instruction Pointer).

To find out the answer to the first question, query the address on the CALL instruction as shown:

# ? f8028fdf

Address in SERVER.NLM at code start +00028FDFh

From the result of this command, you can conclude that the call was made to a function inside SERVER.NLM.

To find out the answer to the second question, query the return address itself:

# ? f10c4326

Address in DS.NLM at code start +00000326h

Based on this information, you know that some function in DS.NLM made the call to a function inside SERVER.NLM.
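Conceptually, the "?" command finds the module whose code area contains the address and reports the offset from the start of that code area. The following sketch imitates this under loud assumptions: the code start addresses are inferred by subtracting the reported offsets from the queried addresses, the table and function name are hypothetical, and because module sizes are not known here the sketch cannot detect an address that falls outside every module.

```python
# Hypothetical module table. The start addresses are inferred from the "?"
# output above (address minus reported offset), not read from the server.
MODULES = {
    "SERVER.NLM": 0xF8028FDF - 0x00028FDF,  # F8000000
    "DS.NLM": 0xF10C4326 - 0x00000326,      # F10C4000
}


def owner(address):
    """Return (module_name, offset) for the nearest code start at or below address."""
    candidates = [(start, name) for name, start in MODULES.items() if start <= address]
    if not candidates:
        return None
    start, name = max(candidates)
    return name, address - start


for queried in (0xF8028FDF, 0xF10C4326):
    name, offset = owner(queried)
    print(f"{queried:08X} is in {name} at code start +{offset:08X}h")
```

Recording the offset rather than the absolute address is useful because the offset stays the same for a given build of a module, whereas the load address can change from one server to the next.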

At this point you have successfully matched your first call-return pair. The goal of stack-walking is to traverse the stack and find as many call-return pairs as you can and then link them up.

Who Are You Calling?

Sometimes it is not immediately clear what address the CALL is calling. Look at the following stack fragment, for example:

00879E68 00879E8C 00000008-F90A7670 F80CE026 ........pv..&...

00879E78 F90A7670 00879E88-00007246 F80B5A6D pv......Fr..mZ..

Suppose you are analyzing the address F80CE026. Unassembling it reveals that it is indeed a return address. However, as shown in the unassembled listing below, a register name is being used in the preceding CALL instruction.

# u f80ce026 - 12

F80CE014 7520           JNZ     F80CE036

F80CE016 FF751C         PUSH    dword ptr [EBP+1C]

F80CE019 FF7518         PUSH    dword ptr [EBP+18]

F80CE01C FF7514         PUSH    dword ptr [EBP+14]

F80CE01F 53             PUSH    EBX

F80CE020 FF9388000000   CALL    dword ptr [EBX+00000088]

F80CE026 . . .

To resolve [EBX+00000088] into an address, you need to find out what the value of EBX was at the time of the call and then add 88h to it. The clues you need to do this are all here. If you look closely, you will see that immediately prior to the CALL instruction, EBX is itself pushed onto the stack. In the sample stack fragment, this stack value is at location 00879E78 and contains the value F90A7670. So you might suspect that this CALL was to location F90A76F8 (F90A7670h + 88h).

Actually, it is not that simple because the square brackets around the operand tell us that a technique called indirection is being used. This means that the call was not to F90A76F8 but was to the value contained at memory location F90A76F8. You can find out what this value is by dumping the contents of F90A76F8, using the "dd" command as shown:

# dd f90a76f8 4

F90A76F8 00019471

You can now correctly conclude that the CALL statement "CALL dword ptr [EBX+00000088]" is resolved to "CALL 00019471".
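The two steps, adding the displacement and then following the pointer, can be sketched as below. The `memory` dictionary is a stand-in for what you read off the screen with "dd"; it holds only the one value shown above, and the function name is illustrative.

```python
# Memory contents copied by hand from the "dd" output; address -> 32-bit value.
memory = {0xF90A76F8: 0x00019471}


def resolve_indirect_call(base, displacement, memory):
    """Resolve CALL dword ptr [base+displacement] to (pointer, call_target)."""
    # Step 1: compute the address named inside the square brackets.
    pointer = (base + displacement) & 0xFFFFFFFF
    # Step 2: indirection. The call goes to the value stored at that address.
    return pointer, memory[pointer]


ebx_at_call = 0xF90A7670  # the EBX value found on the stack at 00879E78
pointer, target = resolve_indirect_call(ebx_at_call, 0x88, memory)
print(f"[EBX+88h] = [{pointer:08X}] = {target:08X}")
# prints: [EBX+88h] = [F90A76F8] = 00019471
```

Keep in mind that this relies on the pushed copy of EBX still being intact on the stack; if that slot has since been overwritten, the resolved target will be wrong.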

Incidentally, finding out the actual names of any internal kernel functions involved requires access to source code and OS symbols. This is why the best persons to debug abended servers are NTS engineers who have close links to the engineering departments actually responsible for the OS and its components.

ABENDEMO Example

This section presents the process involved in tracing the cause of a real NetWare 4.11 server abend. The abend in this example is triggered on purpose through the ABENDEMO.NLM program, available from DeveloperNet at:

http://www.novell.com/coolsolutions/tools/13516.html

Before you use ABENDEMO.NLM, you need to set the following parameter:

set auto restart after abend = 0

When troubleshooting abends, it is also essential to set the following parameters:

set page fault emulation = off

set read fault emulation = off

Without these parameter settings, you can never be totally sure you are troubleshooting the real abend.

To trigger an abend, load ABENDEMO.NLM and choose the "Generate a Stack Overflow" option. Once the abend occurs, enter the NetWare debugger by pressing "left Shift" "Esc" "right Shift" "Alt" simultaneously.

The first task is to dump through the stack, starting with the address in the ESP register. It usually works well to display 40h bytes at a time. The output of the "dd" command looks something like the following:

#dd esp 40

00CB0CE0   F802A68A F907FF70-00007246 00CA9010  ....p...Fr......

00CB0CF0   00000000 F907FF70-00CB0D04 F80D7952  ....p.......Ry..

00CB0D00   F907FF70 F907FF70-00000000 F907FF70  p...p.......p...

00CB0D10   F8167FAF F907FF70-00CB0D2C 00CB0D38  ....p...,...8...

The first column on the left shows the stack addresses, while the symbols on the far right are the ASCII representation of the stack values. You are primarily interested in the stack values themselves: the four 8-digit hex values in the middle of each line.

Note: If you are doing a stack trace on your own server with ABENDEMO.NLM, your addresses may differ from those listed here. The amount of RAM installed, when you loaded ABENDEMO.NLM, and many other factors may contribute to address differences.

By default, ESP represents the stack pointer of the running process at the time of the abend. This is usually the process you are interested in when debugging an abended server. When you are analyzing a hung server, it is often helpful to identify the other non-running processes that were present at the time the server froze up. You can use the ".p" command to get a list of processes and their pointers.

When analyzing the stack of an abended server, you are going to need some method of recording the stack values you see on the screen. You're only interested in the values, not the stack addresses and ASCII representation. One option is to hand-write the values on a piece of paper. I prefer to use a laptop computer and type the stack values directly into the Windows Notepad. Either way, be sure to allow ample space beneath each row for noting calls and returns.

Below is a sample list of the stack values displayed on the NetWare 4.11 server:

F802A68A     F907FF70     00007246     00CA9010



00000000     F907FF70     00CB0D04     F80D7952



F907FF70     F907FF70     00000000     F907FF70



F8167FAF     F907FF70     00CB0D2C     00CB0D38

Remember that in NetWare the stack grows down through memory, so the most recent stack values are the ones at the lowest addresses. In this example, F802A68A is the last value on the stack before the server abended -- this may or may not prove useful.

You next need to start looking for return addresses and noting them in your list as you find them. You can abbreviate using C for CALL TO and R for RETURNING TO, as shown in the sample below:

F802A68A       F907FF70     00007246    00CA9010

C=SERVER.NLM   EBX

R=SERVER.NLM



00000000       F907FF70     00CB0D04    F80D7952

                                        C=SERVER.NLM

                                        R=SERVER.NLM



F907FF70       F907FF70     00000000    F907FF70



F8167FAF       F907FF70     00CB0D2C    00CB0D38

Recall from earlier in this AppNote the method of ascertaining whether a stack value is a return address or not: use the "u" (unassemble) debugger command and examine the code immediately before it to see if the address is preceded by a CALL instruction. As an example, for address F802A68A the results of the unassemble command are displayed below:

# u f802a68a -11

F802A679 70E0           JO      F802A65B

F802A67B 0100           ADD     [EAX], EAX

F802A67D 007511         ADD     [EBP+11], DH

F802A680 834B2402       OR      [EBX+24], 00000002

F802A684 53             PUSH    EBX

F802A685 E81167FFFF     CALL    F8020D9B

F802A68A 5B             POP     EBX

Since the instruction at the address immediately preceding F802A68A is indeed a CALL, you can safely assume that this is a return address.

While you are here, note that the instruction at F802A684 is a PUSH of register EBX. This is another piece of the stack that you can identify. At the time of the abend, the value of EBX was the penultimate value on the stack (location 00CB0CE4). Always try to fill in as much of the puzzle as you can; you never know what may prove useful later.

Going back to our example, suppose you have determined that the first two lines of the stack dump contain two return addresses, and that close to the point of the abend SERVER.NLM (the NetWare OS) was performing some processing. This is not really helpful, as it will be the case in many NetWare-detected abends. What you really want to know is if any other NLMs contributed to the abend. This is where you'll encounter your first stumbling block.

There is a slight problem with the next return address on the stack. Analyzing F8167FAF yields the following:

# u f8167faf - 5

F8167FAA E8A9B525F8     CALL    F03C3558

F8167FAF 83C418         ADD     ESP, 00000018

F8167FB2 57             PUSH    EDI

# ? f03c3558

Address in UNKNOWN memory area

Current:   00000000  F03C3558

This information tells you that the call address F03C3558 does not lie in the code area of any known module, so you cannot tell which code space the call was made to. However, if you unassemble the call address directly, you will see the following:

#u F03C3558

F03C3558 E9EB42D107     JMP     F80D7848

F03C355D 60             PUSHAD

.

.

.

Observe that the CALL is immediately resolved into a JMP. Jump tables like this are a common practice in assembly programming, so watch out for them. The JMP address can now be identified the usual way:
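Following such a jump is the same relative-address arithmetic as for a direct CALL. The sketch below assumes the standard x86 encoding of a near JMP (opcode E9 followed by a 4-byte little-endian signed displacement measured from the next instruction); the function name is illustrative and the bytes are copied from the listing above.

```python
import struct


def follow_jmp_thunk(address, opcode):
    """If the bytes at `address` start with E9 rel32, return the JMP
    destination; otherwise return `address` unchanged."""
    if len(opcode) >= 5 and opcode[0] == 0xE9:
        (displacement,) = struct.unpack("<i", opcode[1:5])
        # The displacement is relative to the instruction after the 5-byte JMP.
        return (address + 5 + displacement) & 0xFFFFFFFF
    return address


destination = follow_jmp_thunk(0xF03C3558, bytes.fromhex("E9EB42D107"))
print(f"CALL F03C3558 really lands at {destination:08X}")
# prints: CALL F03C3558 really lands at F80D7848
```

Whenever "?" reports an unknown memory area for a call target, it is worth unassembling the target to see whether it is a one-instruction stub of this kind before giving up on it.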

? F80D7848

Address in SERVER.NLM at code start +000D7849h

Previous:  -000A7C34  F802FC14 EXPRTSYM.NLM|VMGetDirectoryEntry

Current:    00000000  F80D7848

Now you know that F8167FAF represents a call to an OS function in SERVER.NLM, and you can fill in a bit more of your stack tracing:

F802A68A       F907FF70     00007246     00CA9010

C=SERVER.NLM   EBX

R=SERVER.NLM



00000000       F907FF70     00CB0D04    F80D7952

                                        C=SERVER.NLM

                                        R=SERVER.NLM



F907FF70       F907FF70     00000000    F907FF70



F8167FAF       F907FF70     00CB0D2C    00CB0D38

C=SERVER.NLM

R=

To find out where the code returned to, the "?" command comes in handy again:

# ? f8167faf

Address in NWSNUT.NLM at code start +00001FAFh

Previous:  -000000B1  F8167EFE NWSNUT.NLM|NWSGetRawKey

Current:    00000000  F8167FAF

Next:      +000000CD  F816807C NWSNUT.NLM|NWSRawToNutKey

#

From this, you can tell it was an NWSNUT.NLM function that accessed the OS. NWSNUT is the NetWare utility user interface, so any applications using the old pre-NetWare 5 "C-Worthy" interface can immediately be added to your list of suspects.
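The "Previous" and "Next" lines are simply the nearest exported symbols on either side of the queried address, with their distances from it. A minimal sketch of that lookup is shown below; the symbol table contains only the two entries visible in the output above, and the function name is illustrative.

```python
import bisect

# Exported symbols copied from the "?" output, sorted by address.
SYMBOLS = [
    (0xF8167EFE, "NWSNUT.NLM|NWSGetRawKey"),
    (0xF816807C, "NWSNUT.NLM|NWSRawToNutKey"),
]


def neighbours(address):
    """Return ((offset, name) or None, (offset, name) or None) for the
    nearest symbol at or below `address` and the nearest one above it."""
    starts = [start for start, _ in SYMBOLS]
    index = bisect.bisect_right(starts, address)
    previous = following = None
    if index > 0:
        start, name = SYMBOLS[index - 1]
        previous = (address - start, name)
    if index < len(SYMBOLS):
        start, name = SYMBOLS[index]
        following = (start - address, name)
    return previous, following


previous, following = neighbours(0xF8167FAF)
print(f"Previous: -{previous[0]:08X}  {previous[1]}")
print(f"Next:     +{following[0]:08X}  {following[1]}")
# prints: Previous: -000000B1  NWSNUT.NLM|NWSGetRawKey
#         Next:     +000000CD  NWSNUT.NLM|NWSRawToNutKey
```

Note that the "previous" symbol is only the nearest exported name below the address. Unexported functions may lie in between, so the return address is not guaranteed to be inside NWSGetRawKey itself.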

Notice that on line three of the stack dump the value F907FF70 appears three times. Although this might appear to be a return address at first, closer inspection reveals otherwise. The unassembled code below shows not only the lack of a CALL instruction, but also a rather unlikely flurry of ADD instructions. With today's compiler-optimized code, this sequence is far more likely to be the debugger's attempt to unravel data or garbage than actual code.

# u f907ff70 - 4

F907FF6C B0FA           MOV     AL, FA

F907FF6E 0300           ADD     EAX, [EAX]

F907FF70 8912           MOV     [EDX], EDX

F907FF72 56             PUSH    ESI

F907FF73 3430           XOR     AL, 30

F907FF75 52             PUSH    EDX

F907FF76 07             POP     ES

F907FF77 F9             STC

F907FF78 0000           ADD     [EAX], AL

F907FF7A 0000           ADD     [EAX], AL

F907FF7C 0000           ADD     [EAX], AL

F907FF7E 0000           ADD     [EAX], AL

F907FF80 0000           ADD     [EAX], AL

F907FF82 0000           ADD     [EAX], AL

With a little practice, you can become adept at spotting bogus return addresses such as this one.
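One telltale sign can even be checked mechanically. The byte pair 00 00 unassembles as ADD [EAX], AL, so a long run of zero bytes shows up as a column of identical ADD instructions. The sketch below flags such a region; the function name and the threshold of 8 zero bytes are arbitrary assumptions, and the bytes are copied from the opcode columns of the listings in this AppNote.

```python
def looks_like_data(region, min_zero_run=8):
    """Return True if `region` contains at least `min_zero_run` consecutive
    zero bytes, which suggests zero-filled data rather than code."""
    run = 0
    for byte in region:
        run = run + 1 if byte == 0 else 0
        if run >= min_zero_run:
            return True
    return False


# Opcode bytes from F907FF6C onward (the suspect region above).
suspect = bytes.fromhex("B0FA030089125634305207F9" + "00" * 12)
# Opcode bytes from F802A680 onward (genuine code in SERVER.NLM).
genuine = bytes.fromhex("834B240253E81167FFFF5B")

print(looks_like_data(suspect))  # True
print(looks_like_data(genuine))  # False
```

This catches only the zero-fill case. Other kinds of data unassemble into less obvious nonsense, which is where the practice mentioned above comes in.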

Your objective now is to discover what piece of code called NWSNUT.NLM. Of course, you already know it's ABENDEMO.NLM, but in real life it is not until much later that you will finally find the culprit. For the purpose of this AppNote, we will leave out some of the laborious details and skip ahead to the next meaningful discovery. (If you would like to see the whole stack trace, you can download it from http://easyweb.easynet.co.uk/~coletti/stack.htm.)

Further down (technically that should be further up) the stack, you see that a function called NWSMenu, belonging to NWSNUT.NLM, gets involved. Some NLMs export their symbols (function names) and some may not. When they do, you can sometimes take advantage of this to narrow your search. It all depends on whether the programmers named the functions or methods in a meaningful way. In this case they did, so you can assume with reasonable confidence that the culprit code is not only an NLM using an NWSNUT interface, but that it is also utilizing a menu of some kind. You would record this information as follows:

00CB0F80   00000000 F8169542-0000000B 00000000

                    |        ECX

                    |

                    c=NWSNUT.NLM|NWSList

                    r=NWSNUT.NLM|NWSMenu

If you have downloaded the entire stack trace, you should be able to see that linking up the call-return pairs identifies the following code path:

Abend called by ->

server.nlm internal functions were called by ->

NWSSelectFromList called by ->

NWSLlist called by ->

NWSList called by ->

NWSMenu called by ?

There are other return addresses on the stack in addition to those listed above. As the stack is expanded and shrunk, it is constantly in a state of flux. Unrelated call-return pairs are often lying around like traps for the unwary stack tracer. Knowing when to ignore these takes some practice.

At this point, you have surmised that some kind of menu or list was being utilized by an NLM, and that some processing caused by the selection of one of the list options resulted in an abend. You do not have to go too much further in your stack trace to finally identify the culprit. Eventually you will discover the following:

00CB0FC0   00000019 0009C399-00CAAD90 00CE4960

00CB0FD0   FB001260 F101D412-0000000B 00000000

                    |

                    c=NWSNUT.NLM|NWSMenu

                    r=ABENDEMO|DemoMain

As was expected, NWSMenu was called by a module named ABENDEMO. You have found your culprit!

If you continue to dump the stack past this point, you will eventually receive a message indicating that a memory limit has been exceeded. This means there is no more stack left to trace.

Conclusion

The ABENDEMO example shown in this AppNote is admittedly contrived. In reality, you will rarely get a stack trace that is as conclusive, concise, or easy to follow as the one in this example. Applications are typically much more complex than ABENDEMO.NLM, and you should not entertain the notion that the success rate for debugging a live server is high.

You may sometimes find that the stack itself is corrupted after an abend, and that time factors are a heavy influence on how useful the stack information will be. Add to this the pressure that you as a server technician will usually be under when troubleshooting a non-functioning server. You can begin to realize that stack-walking is really for those rare ultra-critical situations where the abend is interfering with the organization's business. Often applying service packs and unloading suspect software one module at a time provides quicker results.

The best you should hope for with a stack trace is to get a better idea of what might be causing the abend. You can then be more selective in identifying which components of the server you should be examining.

* Originally published in Novell AppNotes
