Abend Recovery Techniques for NetWare 3 and 4 Servers


DANA HENRIKSEN
Lead Engineer
NetWare Operating System Group

RON LEE
Senior Research Engineer
Systems Research Department

01 Jun 1995

This AppNote describes how to use NetWare's internal debugger to recover gracefully from server Abends. This recovery process includes detailed information on how to determine which component detected the fault, how to identify processor exceptions, how to identify the operating system's execution state, how to use the debugger to gather helpful troubleshooting information, and two debug techniques to restart a server. This process is designed to help you preserve cached data at both client and server, and reduce the possibility of lengthy down times due to file system corruption in the aftermath of a server Abend.

Previous AppNotes in This Series
Feb 95 "Resolving Critical Server Issues"
Aug 91 "Using the NetWare v3.x Internal Debugger"

Introduction
Anatomy of an Abend
The Abend Recovery Process
Using NetWare's Internal Debugger to Restart NetWare
Conclusion

Introduction

During an Abend, when a production server abruptly halts, it can be left with an unknown quantity of pending disk I/O requests - both user data and file system I/Os. If you reboot the server in this state, these valuable I/Os are lost and the file system may be corrupted. Fortunately, a system reboot is not the only alternative.

Using NetWare's internal debugger, you can frequently restart an abended server and recover gracefully by shutting down the system properly. Depending on the circumstances, this restart and shutdown process can save cached user data, flush pending file system I/Os to disk, and avoid time-consuming repairs to the file system with VREPAIR.

Anatomy of an Abend

The term "Abend" is an acronym for "ABnormal END." An Abend condition is abnormal for two reasons. First, an Abend represents an unexpected critical error that prevents any further processing. Second, because the error condition is unexpected, the system is left in an abnormal state in which many tasks are left undone. Or worse, some tasks may be left partially done, causing file and database corruption as well as mirror mismatches in NetWare's redundant Directory Entry Tables (DET) and File Allocation Tables (FAT).

Abends can signal the failure of a low-level software consistency check on a data structure, such as a stack overflow. But they are most commonly caused by software and hardware exceptions caught by the server CPU, such as:

Page faults
General Protection Processor Exceptions (GPPE)
Nonmaskable Interrupts (NMI)
Machine checks
Invalid opcodes

Abends always signal a faulty system component--either a bad hardware component or a faulty software module. When you encounter an abended server, you should always take copious notes that include the entire Abend display.

The Abend Display

Listing 1 is a sample Abend message displayed on a NetWare 3.12 server console. The message includes the date and time of the event (line 1), a description of the Abend's cause (line 2), the version of NetWare running on the server (line 3), the name of the process that was running when the Abend occurred (line 4), the current value of the CPU's instruction pointer (line 5), a 48-byte dump of the stack's contents (lines 6 through 8), some instructions to the system administrator to generate a core dump for troubleshooting by Novell Services engineers (line 9), and finally instructions to reboot the system (line 10).

Listing 1: A NetWare 3.x server Abend message.

1:  System halted Thursday  May 18, 1995  10:31:43 pm
2:  Abend: General Protection Processor Exception (Error code 00000000)
3:      OS version: Novell NetWare 3.12 (250 user) 8/12/93
4:      Running Process: Polling Process
5:      EIP: 0000A3B7
6:      Stack: DD DD DD DD 08 00 00 00 46 72 01 00 5D 21 1A 00
7:             44 DB 1E 00 00 00 00 00 00 00 00 00 A8 D7 1E 00
8:             CC DA 1E 00 78 56 34 12 00 00 00 00 00 01 49 64
9:  Press "Y" to copy diagnostic image to disk.  Otherwise
10: Power Off and back on to restart.
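
The stack dump in lines 6 through 8 is printed one byte at a time, but Intel CPUs store 32-bit values least-significant byte first, so the bytes have to be regrouped before they read as addresses or values. The following is a minimal Python sketch of that regrouping, not a Novell tool; the function name stack_dwords is hypothetical.

```python
import struct

def stack_dwords(dump_rows):
    """Convert hex byte rows from an Abend stack dump into 32-bit values."""
    raw = bytes.fromhex("".join(dump_rows).replace(" ", ""))
    # "<" selects little-endian byte order, "I" an unsigned 32-bit value.
    # The dump length must be a multiple of four bytes.
    return list(struct.unpack("<%dI" % (len(raw) // 4), raw))

# First stack row of Listing 1.
row = ["DD DD DD DD 08 00 00 00 46 72 01 00 5D 21 1A 00"]
for value in stack_dwords(row):
    print("%08X" % value)
```

Read this way, the third and fourth values of that row are 00017246 and 001A215D rather than the byte strings shown on the console.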

The term "restart" used in the NetWare 3 Abend message (line 10) following a power-off-and-back-on routine is really a reboot. We'll use the term "restart" in this AppNote to refer to several techniques that allow you to get your abended server running without the power-off-and-back-on routine.

Listing 2 is a sample NetWare 4.1 Abend message. It is very similar to the NetWare 3.12 Abend message above. With the exception of the missing CPU instruction pointer (EIP), the two Abend messages are almost identical.

Listing 2: A NetWare 4.x server Abend message.

1:  System halted Thursday, May 4, 1995   2:10:25 pm MDT
2: 
3:  Abend: Page Fault Processor Exception (Error code 00000002)
4:      OS version: Novell NetWare 4.10 November 8, 1994
5:      Running Process: ProDemo Process
6:      Stack: AC 1F 65 01 E7 66 03 F1 50 CA 65 01 03 00 00 00
7:             D0 1F 65 01 09 00 00 00 B0 81 01 F9 54 CE 65 01
8:             39 67 03 F1 0B CB 65 01 B4 D0 65 01 B0 81 01 F9
9:  Press "Y" to copy diagnostic image to disk.
10: Otherwise press "X" to exit.

Throughout this AppNote, we'll look at a variety of Abend messages to teach you how to interpret their descriptions. We'll also show you how to determine the cause of an Abend, as well as the server's resulting state.

As you become more familiar with Abend messages, you'll notice several variations from the samples above, including:

Abend descriptions (line 3 in listing 2) often take up more than one line.
Abends generated at interrupt-time always include the name of the interrupted process (we'll discuss the meaning of "interrupt-time" later in the AppNote).
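
If you transcribe Abend displays into a log, the fields and the variations noted above can be separated mechanically. The sketch below is one possible way to do that in Python, assuming the listing line numbers have been removed; parse_abend and the dictionary keys are hypothetical names, and the sample text is taken from the NMI Abend shown later in this AppNote.

```python
import re

LABELS = ("OS version", "Running Process", "Interrupted Process", "EIP")
HEX_ROW = re.compile(r"^(?:[0-9A-Fa-f]{2} ?)+$")

def parse_abend(lines):
    """Split an Abend display (listing line numbers removed) into fields."""
    info = {"description": "", "stack": ""}
    section = None  # field that a continuation line would extend
    for raw in lines:
        text = raw.strip()
        label, _, value = text.partition(":")
        if label == "Abend":
            info["description"] = value.strip()
            section = "description"
        elif label == "Stack":
            info["stack"] = value.strip()
            section = "stack"
        elif label in LABELS:
            info[label] = value.strip()
            section = None
        elif section == "description" and text:
            # Descriptions often run past one line.
            info["description"] += " " + text
        elif section == "stack" and HEX_ROW.match(text):
            info["stack"] += " " + text
        else:
            section = None
    return info

sample = [
    "System halted Tuesday, May 9, 1995 11:54:46 am MDT",
    "",
    "Abend: Nonmaskable Interrupt Processor Exception (Error code 00000030)",
    "Parity error was generated by the system board.",
    "    OS version: Novell NetWare 4.10  November 8, 1994",
    "    Running Process: Server 01 Process",
    "    Stack: 02 70 00 00 00 20 02 f8 01 00 00 00 00 00 00 00",
    "           00 60 00 fb 73 73 73 73 f8 33 63 f0 60 38 00 00",
    "           20 8f 09 00 00 00 00 00 00 00 00 00 00 00 00 00",
    'Press "Y" to copy diagnostic image to disk.',
]
abend = parse_abend(sample)
print(abend["description"])
print(abend["Running Process"])
```

An "Interrupted Process" key appears in the result only when the display contains that line.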

Abend Processing

When an Abend message appears on the server console, either NetWare or the server CPU has detected a critical error condition (fault) and jumped into NetWare's fault handler. This handler idles NetWare and displays the Abend message on the server console for immediate action by the server administrator. We'll refer to all errors detected by the CPU as processor exceptions. We'll refer to NetWare-detected errors as software exceptions.

Processor Exception = Any error detected by the CPU
Software Exception = Any error detected by NetWare

NetWare's fault handler (function) is declared as a public function by the operating system so it can be used by any operating system module or Novell NLM. However, because this function isn't documented and therefore not used by third-party developers, you can safely assume that software exceptions resulting in Abends have been trapped by NetWare.

The Abended State and Side-Effects of a Reboot

Following an Abend, a server is left idling in an unknown state. At this point, your main concern should lie with the data in NetWare's cache as well as your clients' caches. NetWare's cache may include an unknown quantity of pending disk I/Os - both user data and file system I/Os. Client applications, such as word processors, may also be keeping data in client cache that hasn't yet been sent to the server for processing. This data is also in danger of being lost. If you follow the instructions in the Abend message to reboot the server or exit to DOS in this "ABnormally ENDed" state, all pending I/Os and client cached data are lost and the file system may be corrupted.

This potential for damage to file systems and other data structures can be reduced in two ways. The first depends on the hardware platform you're using for the server. A fast disk channel tends to have very few outstanding I/Os. If your server is idle or at a point where few, if any, pending I/Os exist, an Abend may not have a negative effect on the concurrency of your data and underlying file system. On the other hand, if a server is consistently busy with a good number of outstanding I/Os, the desirability of a graceful restart-and-shutdown technique following an Abend is high.

The other mitigating factor is NetWare's Transaction Tracking System (TTS). Operating system functions such as Directory Services and third-party NLMs that make use of TTS are shielded from the sometimes corrupting results of the Abend-and-reboot process. When a rebooted server comes up after an Abend, TTS recognizes any partial transactions and backs each partial transaction out in its entirety. This backout process assures the concurrency of TTS-supported data structures.

In every case you need to determine a proper course of action that allows you to isolate the fault, restart the system and gracefully shut down the server (if possible). You can then troubleshoot and replace the responsible software module or hardware component.

The Abend Recovery Process

The flow chart in Figure 1 describes the different courses of action you can take, depending on the type of Abend and the state of the server. The chart leads you through several decisions and then provides you with some general troubleshooting and system restart recommendations. The process includes five steps:

Step 1: Determine who detected the fault
Step 2: Identify the processor exception
Step 3: Gather troubleshooting information
Step 4: Identify the execution state
Step 5: Try to restart NetWare

At the top of the chart, you're required to determine who detected the fault (NetWare or the CPU). We'll call this "fault detection." Once you've determined the source of the Abend, you can proceed to the software or hardware sides of the chart. If the fault was detected by the CPU (hardware), you have to determine the kind of fault detected. Next, you have to determine the fault's context - whether the fault occurred during process-time or interrupt-time. And finally, we provide you with a recommended approach to isolate the offending module or component and a recommended technique for restarting the system and recovering gracefully.

Fault detection: Was the fault detected by NetWare or the CPU?
Fault identification: What type of fault was detected by the CPU?
Fault context: Did the fault occur during process-time or interrupt-time?

Figure 1: The Abend recovery process.

Step 1: Determine Who Detected the Fault

Following an Abend, your first task is to determine whether the fault was detected by NetWare or by the CPU. The Abend message syntax simplifies this step by including the phrase "Processor Exception" when the fault is detected by the CPU. If the phrase "Processor Exception" does not appear in the Abend description text, you know the Abend condition is a software exception detected by NetWare.

Processor Exception Example. Listing 3 is an Abend message displayed on the server console that was generated after the CPU detected a fault and transferred control to the NetWare fault handler. Lines 3 and 4 make up the Abend description. On line 3, the phrase "Processor Exception" tells you that this fault was detected by the CPU. In this case, you should follow the recovery path in Figure 1 for Processor Exceptions.

Listing 3: An Abend generated by the CPU - a "processor exception."

1:  System halted Tuesday, May 9, 1995 11:54:46 am MDT
2:  
3:  Abend: Nonmaskable Interrupt Processor Exception (Error code 00000030)
4:  Parity error was generated by the system board.
5:      OS version: Novell NetWare 4.10  November 8, 1994
6:      Running Process: Server 01 Process
7:      Stack: 02 70 00 00 00 20 02 f8 01 00 00 00 00 00 00 00
8:             00 60 00 fb 73 73 73 73 f8 33 63 f0 60 38 00 00
9:             20 8f 09 00 00 00 00 00 00 00 00 00 00 00 00 00
10: Press "Y" to copy diagnostic image to disk.
11: Otherwise press "X" to exit.

Software Exception Example. Listing 4 is an Abend message generated by NetWare after the NetWare kernel detected a stack overflow. The absence of the phrase "Processor Exception" in the Abend description (line 3) tells you that the Abend was due to a software exception.

Listing 4: An Abend generated by a software exception.

1:  System halted Friday, May 5, 1995   3:09:47 pm MDT
2:
3:  Abend: SERVER-4.10-288 Stack overflow detected by kernel.
4:      OS version: Novell NetWare 4.10 November 8, 1994
5:      Running Process: Server 00 Process
6:      Stack: 16 1A 02 F8 84 42 6A 01 90 19 82 01 00 00 00 00
7:             00 20 00 FB 73 73 73 73 CC 32 63 F0 20 00 00 00
8:             B0 F8 03 00 11 46 52 45 4E 43 48 20 2D 20 43 41
9:  Press "Y" to copy diagnostic image to disk.
10: Otherwise press "X" to exit.

If this kind of Abend occurs, you should follow the recovery path in Figure 1 for software exceptions. Skip Step 2 (for hardware-detected faults) and proceed to Step 3.
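
The Step 1 rule reduces to a single string test. As a minimal Python sketch (the function name detected_by is hypothetical, and the case-insensitive comparison is our own choice rather than anything the Abend format requires):

```python
def detected_by(description):
    """Return who detected the fault, based on the Abend description text."""
    if "processor exception" in description.lower():
        return "CPU"      # processor exception
    return "NetWare"      # software exception

# Descriptions from Listings 3 and 4.
print(detected_by("Nonmaskable Interrupt Processor Exception (Error code 00000030)"))
print(detected_by("SERVER-4.10-288 Stack overflow detected by kernel."))
```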

Step 2: Identify the Type of Processor Exception

This step is for processor exceptions only. If your Abend indicates that the fault was detected by software, skip this step and proceed to Step 3.

Recovery from processor exceptions requires you to identify the source of the error. The Intel Pentium can generate 17 different types of processor exceptions. The Intel 80486 can generate 16. In this AppNote, we provide restart techniques for the four most common of those exceptions: page fault, general protection processor exception, nonmaskable interrupt or machine check, and invalid opcodes.

Page Fault. In NetWare 4.x servers, memory paging is enabled. If an invalid address is used by any Novell or third-party software module, a Processor Exception is generated by the CPU. Because NetWare uses the CPU's memory paging and mapping facilities, invalid addresses include any address that is either protected or outside the module's registered address space.

General Protection Processor Exception (GPPE). In NetWare 3.11 and 3.12 servers, memory addressing is limited to the address space bounded by physical memory. (The exception is when you load a device driver that uses shared memory at an address above physical memory; in that case the bounds of available memory move up to the top of the device driver's shared memory.) If NetWare or a third-party module tries to access any memory address beyond these bounds, a GPPE is generated by the CPU.

NonMaskable Interrupts (NMI) and Machine Checks. NMIs are almost always produced by parity errors. These memory-related errors can occur in main memory on the system board, in add-in memory boards, and in shared-memory areas of I/O cards. Machine checks are produced by the Intel Pentium chip when an internal hardware error is detected.

Invalid Opcodes. This exception occurs when NetWare or a third-party module tries to execute code that does not contain valid opcodes (instructions that the CPU recognizes).

Although Processor Exceptions are detected by the CPU, most of these exceptions are caused by bugs in the software. For instance, page faults, GPPEs, and invalid opcodes are almost always caused by the software even though the error is detected and reported by the CPU. The NMI and Machine Checks are the exception - both exceptions stem from hardware errors that must be corrected, usually by replacing memory (in the case of an NMI) or by replacing the CPU and its cache (in the case of a Machine Check).
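
The distinction above can be kept as a small lookup table. The following Python sketch matches a phrase in the Abend description to the usual origin described in this section. The names are hypothetical, and the "machine check" phrase is an assumption, because no machine check Abend message is reproduced in this AppNote.

```python
# (phrase expected in the Abend description, usual origin per this section)
EXCEPTION_TYPES = [
    ("page fault", "software"),
    ("general protection", "software"),
    ("invalid opcode", "software"),
    ("nonmaskable interrupt", "hardware: memory parity"),
    ("machine check", "hardware: CPU or its cache"),  # phrase is assumed
]

def classify_exception(description):
    """Return (type, usual origin), or None if the type is not in the table."""
    lowered = description.lower()
    for phrase, origin in EXCEPTION_TYPES:
        if phrase in lowered:
            return phrase, origin
    return None

print(classify_exception("Page Fault Processor Exception (Error code 00000002)"))
print(classify_exception("Invalid Opcode Processor Exception"))
```

The table records only the most likely origin; a software-caused exception can still point to a hardware problem in a particular case.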

Identifying Processor Exceptions: Examples. Abends due to page faults, like the one in listing 5, will always include the phrase "page fault" in the Abend description.

Listing 5: Abend due to a page fault detected by the CPU while running NetWare 4.x.

1:  System halted Thursday, May 4, 1995   2:10:25 pm MDT
2: 
3:  Abend: Page Fault Processor Exception (Error code 00000002)
4:      OS version: Novell NetWare 4.10 November 8, 1994
5:      Running Process: ProDemo Process
6:      Stack: AC 1F 65 01 E7 66 03 F1 50 CA 65 01 03 00 00 00
7:             D0 1F 65 01 09 00 00 00 B0 81 01 F9 54 CE 65 01
8:             39 67 03 F1 0B CB 65 01 B4 D0 65 01 B0 81 01 F9
9:  Press "Y" to copy diagnostic image to disk.
10: Otherwise press "X" to exit.

The GPPE in Listing 6 is also easily identified.

Listing 6: Abend due to a general protection fault detected by the CPU while running NetWare 3.x

1:  System halted Thursday  May 18, 1995  10:31:43 pm
2:  Abend: General Protection Processor Exception (Error code 00000000)
3:      OS version: Novell NetWare 3.12 (250 user) 8/12/93
4:      Running Process: Polling Process
5:      EIP: 0000A3B7
6:      Stack: DD DD DD DD 08 00 00 00 46 72 01 00 5D 21 1A 00
7:             44 DB 1E 00 00 00 00 00 00 00 00 00 A8 D7 1E 00
8:             CC DA 1E 00 78 56 34 12 00 00 00 00 00 01 49 64
9:  Press "Y" to copy diagnostic image to disk.  Otherwise
10: Power Off and back on to restart.

Listing 7 is the Abend generated by NetWare when the CPU received an NMI. Lines 3 and 4 make up the multi-line Abend description. Line 3 tells you that the CPU received an NMI and line 4 provides more information by noting that a parity error was detected by the system board.

Listing 7: Abend reporting an NMI detected by the CPU

1:  System halted Tuesday, May 9, 1995 11:54:46 am MDT
2:  
3:  Abend: Nonmaskable Interrupt Processor Exception (Error code 00000030)
4:  Parity error was generated by the system board.
5:      OS version: Novell NetWare 4.10  November 8, 1994
6:      Running Process: Server 01 Process
7:      Stack: 02 70 00 00 00 20 02 f8 01 00 00 00 00 00 00 00
8:             00 60 00 fb 73 73 73 73 f8 33 63 f0 60 38 00 00
9:             20 8f 09 00 00 00 00 00 00 00 00 00 00 00 00 00
10: Press "Y" to copy diagnostic image to disk.
11: Otherwise press "X" to exit.

Listing 8 is a sample Abend message encountered when a software module tried to execute an invalid opcode.

Listing 8: Abend due to an Invalid Opcode detected by the CPU.

1:  System halted Wednesday, May 10, 1995   1:07:54 pm MDT
2:
3:  Abend: Invalid Opcode Processor Exception
4:      OS version: Novell NetWare 4.10 November 8, 1994
5:      Running Process: Server 00 Process
6:      Stack: 00 00 00 00 00 20 00 FB 73 73 73 73 CC 32 63 F0
7:             20 00 00 00 B0 F8 03 00 11 46 52 45 4E 43 48 20
8:             2D 20 43 41 4E 41 44 49 41 4E 00 FF 77 AA CC FF
9:  Press "Y" to copy diagnostic image to disk.
10: Otherwise press "X" to exit.

Step 3: Gather Troubleshooting Information

Before restarting or rebooting the server, there are several bits of information that you can gather that will help you troubleshoot the Abend condition. This information includes:

The entire Abend message
The name of the running process
The values of each of the CPU's registers
The values of each of the CPU's control registers
The execution state of the operating system
The current location of the instruction pointer

The Abend Recovery Process in Figure 1 lists the specific information that may be helpful depending on the Abend type and context. Don't skip this important step, because this information is lost once you restart the operating system.
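
One way to make sure nothing is missed is to keep the list above as a checklist next to the debugger commands that display each item. The sketch below does this in Python. The commands are taken from the H and Dot-H help screens (Listings 12 and 13); the pairing of items to commands and the function names are our own illustration.

```python
# (item to record, where to find it)
CHECKLIST = [
    ("Entire Abend message", "server console; .A redisplays the reason"),
    ("Running process", ".R"),
    ("CPU registers and flags", "R"),
    ("CPU control registers", "RC"),
    ("Execution state", "Running Process line of the Abend message"),
    ("Instruction pointer location", "?"),
]

def blank_notes():
    """Return an empty note sheet with one entry per checklist item."""
    return {item: "" for item, _ in CHECKLIST}

def missing(notes):
    """List the checklist items that have not been filled in yet."""
    return [item for item, _ in CHECKLIST if not notes.get(item, "").strip()]

notes = blank_notes()
notes["Running process"] = "Server 01 Process"
for item in missing(notes):
    print("Still needed:", item)
```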

Step 4: Identify the Execution State

Abend messages include information that can help you determine the execution state of the operating system - whether the fault occurred during process-time or interrupt-time. This differentiation is important because recovery from Abends that occur during interrupt-time is complex and isn't covered in this AppNote. This AppNote only covers restart techniques for Abends that occur during process-time.

The terms process-time and interrupt-time are used to describe two states that the OS can run in. The state is interrupt-time any time an (asynchronous) interrupt service routine (ISR) is running. The state is process-time the rest of the time, when normal (synchronous) processes are running, scheduled via the run queue.

Interrupt-time = When any ISR is running.
Process-time = All times an ISR is not running.

Listing 9 is an Abend displayed due to a software exception during interrupt-time. In this Abend message, the immediate clue that the execution state is interrupt-time is on line 3: the phrase "during interrupt time." But a more reliable source is found in line 5, which will always read "Interrupt service routine" and will always be followed by another line that lists the name of the interrupted process (line 6).

Listing 9: Abend following a software exception at interrupt-time.

1: System halted Friday, May 5, 1995   7:41:03 pm MDT
2:
3: Abend: SERVER-4.10-289 Kernel detected a process switch during interrupt time.
4:      OS version: Novell NetWare 4.10 November 8, 1994
5:      Running Process: Interrupt service routine (nested count 1)
6:      Interrupted Process: Server 00 Process
7:      Stack: 4F 28 02 F8 02 72 00 00 00 00 00 00 01 00 00 00
8:             00 00 00 00 01 00 00 00 00 00 00 00 00 20 00 FB
9:             73 73 73 73 CC 32 63 F0 20 00 00 00 B0 F8 03 00
10: Press "Y" to copy diagnostic image to disk.
11: Otherwise press "X" to exit

By default, any software-exception Abend without a reference to an Interrupt Service Routine, or ISR-related process name, has left the server in a process-time execution state.

The Abend in listing 10 shows a process-time exception. Line 3 tells you the Abend was detected by the kernel without any reference to an ISR. Line 5 tells you that the running process was a service process, still without any reference to an ISR. Due to the lack of the ISR references and the reference to known processes, you can safely assume that this Abend occurred at process-time and has left the server in a process-time execution state.

Listing 10: Abend following a software exception at process-time.

1:  System halted Friday, May 5, 1995   3:09:47 pm MDT
2:
3:  Abend: SERVER-4.10-288 Stack overflow detected by kernel.
4:      OS version: Novell NetWare 4.10 November 8, 1994
5:      Running Process: Server 00 Process
6:      Stack: 16 1A 02 F8 84 42 6A 01 90 19 82 01 00 00 00 00
7:             00 20 00 FB 73 73 73 73 CC 32 63 F0 20 00 00 00
8:             B0 F8 03 00 11 46 52 45 4E 43 48 20 2D 20 43 41
9:  Press "Y" to copy diagnostic image to disk.
10: Otherwise press "X" to exit.
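
The Step 4 test can be expressed as a check on the "Running Process" line. This is a minimal Python sketch with a hypothetical function name; it relies only on the rule stated above that interrupt-time Abends report "Interrupt service routine" as the running process.

```python
def execution_state(running_process):
    """Classify the value shown on the 'Running Process' line of an Abend."""
    if running_process.strip().lower().startswith("interrupt service routine"):
        return "interrupt-time"
    return "process-time"

# Values from Listings 9 and 10.
print(execution_state("Interrupt service routine (nested count 1)"))
print(execution_state("Server 00 Process"))
```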

Step 5: Try to Restart the Server

After writing down the Abend message and gathering any available troubleshooting information, you can use NetWare's internal debugger to try to restart NetWare. Your goal is to restart NetWare and allow it to run for a minute or two. Even several seconds is better than nothing because it allows the disk I/Os to flush. During this time, clients can retry their connections so that their applications can save their data and exit gracefully; pending I/Os at the server are also serviced by the disk channel. Afterward, you can shut the server down gracefully with the DOWN console command. The term "gracefully" refers to the normal closure of files and completion of DET and FAT operations that occur during server shutdown.

We provide you with two techniques:

Thread Quarantine
Trace-N-Go

Thread Quarantine. This technique is used when the current thread-of-execution, whether it belongs to NetWare or a third-party NLM, is thought to be the cause of the Abend. Using the thread quarantine process, you use the debugger to remove the thread from the run queue indefinitely and restart NetWare.

Trace-N-Go. This technique is used when an Abend reports an NMI or machine check. Using the trace-n-go process, you use the debugger to change the CPU's state and restart NetWare.

Sadly, these restart techniques don't always work. When you try to restart NetWare you're going on several assumptions:

The quarantined thread isn't required for the pending I/Os to execute.
The quarantined thread isn't holding onto important resources that are blocking the normal execution of other important functions.
The NMI or machine check is intermittent and will allow NetWare to continue to run.
The NMI or machine check didn't corrupt any data that may be written to disk during the restart and graceful shutdown process.

You should understand that these cases provide the possibility for additional errors and repeated Abends during the restart process. But these risks rarely outweigh the benefits of recovering pending I/Os and leaving the server with a stable file system.

In the case of parity errors (reported as NMIs), after using the Trace-N-Go restart technique you should shut down the server as soon as possible. Because of the large memory space required by the volume FATs, there is a high probability that the parity error lies within the memory-resident FAT. If you allow disk space to be allocated based on faulty FAT information, you risk additional losses.

Note: We don't provide a technique for handling Abends that occur at interrupt-time because of the complexity of the operation. Interrupt-time recovery techniques are beyond the scope of this AppNote because they require stack-walking (manual traversal of the stack), an understanding of OS symbols, and an understanding of multiple memory maps.
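
Taken together, Steps 1, 2, and 4 determine which restart technique to try. The Python sketch below is one reading of that decision, not a reproduction of Figure 1. The order of the checks, the function name, and the use of Thread Quarantine as the default for all remaining process-time Abends are assumptions.

```python
def recommend_restart(description, running_process):
    """Suggest a restart technique from the Abend description and process."""
    desc = description.lower()
    # Step 4: interrupt-time Abends are outside the scope of this AppNote.
    if running_process.strip().lower().startswith("interrupt service routine"):
        return "none (interrupt-time recovery is not covered)"
    # Steps 1 and 2: NMIs and machine checks are hardware-detected faults.
    if "processor exception" in desc and (
        "nonmaskable interrupt" in desc or "machine check" in desc
    ):
        return "Trace-N-Go"
    # Assumed default: the running thread is the suspect.
    return "Thread Quarantine"

print(recommend_restart(
    "Nonmaskable Interrupt Processor Exception (Error code 00000030)",
    "Server 01 Process"))
print(recommend_restart(
    "Page Fault Processor Exception (Error code 00000002)",
    "ProDemo Process"))
```

Whatever the function suggests, the cautions above still apply: a restart can fail, and after a parity error the server should be brought down as soon as the pending I/Os have flushed.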

Using NetWare's Internal Debugger to Restart NetWare

NetWare 3.x and 4.x include a diagnostic tool called a debugger. The debugger allows an OS or NLM programmer to insert breakpoints and obtain register and memory dumps at selected points during operating system execution. The internal debugger can also be used to modify the state of an abended server and restart the system. (For information on the internal debugger beyond the scope of this AppNote, including a programming example and quick reference, see "Using NetWare's Internal Debugger" in the August 1991 Novell Application Notes.)

Following an Abend, you can enter the debugger by pressing <left-shft><esc><right-shft><alt>.

Note: This key sequence is unavailable if the server console has been secured using the SECURE CONSOLE console command.

Sometimes the debugger key sequence is typed <shft><shft><alt><esc>, and sometimes <shft><alt><shft><esc>. Each of these is correct but can be confusing if you're not used to four-key sequences. Use your thumb and index finger on your right hand to press the <shft> and <alt> keys on the right side of the server keyboard. Finally, use your left thumb to press the <shft> key and your middle or index finger to press the <esc> key on the left side of the server keyboard.

You can enter the debugger at any point of server execution. However, we recommend this process only for abended servers because the NetWare operating system is halted during debugger execution.

Dropping Into the Debugger

Listing 11 shows the server console display after an Abend has occurred (lines 1 through 11). Lines 10 and 11 offer the traditional options usually taken by server administrators. Line 13 is the key sequence we pressed to drop into the debugger.

Listing 11: Entering the debugger after an Abend.

1:  System halted Tuesday, May 9, 1995 11:54:46 am MDT
2:  
3:  Abend: Nonmaskable Interrupt Processor Exception (Error code 00000030)
4:  Parity error was generated by the system board.
5:      OS version: Novell NetWare 4.10  November 8, 1994
6:      Running Process: Server 01 Process
7:      Stack: 02 70 00 00 00 20 02 f8 01 00 00 00 00 00 00 00
8:             00 60 00 fb 73 73 73 73 f8 33 63 f0 60 38 00 00
9:             20 8f 09 00 00 00 00 00 00 00 00 00 00 00 00 00
10: Press "Y" to copy diagnostic image to disk.
11: Otherwise press "X" to exit.
12:
13: <shft><esc><shft><alt>
14:
15: Novell 386 Debugger
16: (C) Copyright 1987-1993 Novell, Inc.
17: All Rights Reserved
18: 
19: Abend: Nonmaskable Interrupt Processor Exception (Error code 00000030)
20: Parity error was generated by the system board.
21: EAX = 00000005 EBX = 00000000 ECX = 00000000 EDX = 00001C8E
22: ESI = 00000000 EDI = F802A843 EBP = 00000001 ESP = 00035CE8
23: EIP = F8021DA5 FLAGS = 00007206 (PF IF NT)
24: F8021DA5 C705E0FF5FF007 MOV     [F05FFFE0]=00000006,00000007
25:          000000
26: #

Beginning at line 15, we enter the debugger and are automatically switched from the console screen to a separate debug screen. If the debugger has not been used since the server came up, the debugger displays a preamble like the one in lines 15 - 17. This is followed by a copy of the Abend description from the console screen on line 19, a dump of the CPU's registers on lines 21 through 23, a decode of the next instruction on lines 24 and 25, and the debugger prompt (#) on line 26.

Debugger Options

During your work inside the debugger, you'll need to use several debug options to gather troubleshooting information and restart the server. Because there are so many options and scenarios in which to use those options, we'll only introduce you to the options that are necessary to carry out the recovery process described in this AppNote. These include how to:

Get on-line help
Display the Abend message
Display the CPU's registers and flags
Display the CPU's control registers
Discover which NLM the instruction pointer is referencing
Discover which NLM function the instruction pointer is referencing

On-Line Help. There are four on-line help screens available to the debug user:

H for the general debugger help screen
.H for the dot help screen
HE for help with debugger expressions
HB for help with breakpoints

We include the two most frequently used help screens here so you can see what is available before you are called upon to handle a server Abend. Listing 12 is the on-line help displayed when you enter the "H" command. Listing 13 is the help displayed when you enter the ".H" or Dot-H command.

Listing 12: The H command general debugger help screen.

# h
B                    Breakpoint commands (see HB help screen)
C address            Change memory in interactive mode
C address=number(s)  Change memory value to the specified number(s)
C address=@text@     Change memory to the specified text ASCII values
D{D} address {length} Dump memory for optional length
D{D}L{+linkOffset} address {length}
                     Dump memory starting at address for optional length and
                     traverse a linked list (default address is ESP)
                     Use <ENTER> to dump the next link node
DDS address {length} Dump symbols on stack, default address is ESP
REG=value            Change the specified register to the new value
                     REG is EAX, EBX, ECX, EDX, ESI, EDI, EBP, EIP, ESP OR EFL
F FLAG=value         Change the FLAG bit to value (0 or 1)
                     Where FLAG is CF, AF, ZF, SF, IF, TF, PF, DF or OF
G {break address(s)} Begin execution at current EIP and set optional temporary
                     breakpoint(s)
H, HB, HE, .H        Display help screens
I{B;W;D} PORT        Input byte, word, or dword from PORT (default is byte)
M start {L length} pattern-byte(s)
                     Search memory for pattern (L length is optional and if not
                     specified, the rest of memory will be searched)
N symbolName address Define a new symbol name at address
N -symbolName        Remove defined symbol name (n-- remove all symbols)
O{B;W;D} PORT=value  Output byte, word, or dword value to PORT
P                    Proceed over the next instruction
Q                    Quit and exit back to DOS
R                    Display registers and flags
RC                   Display control registers
RSOFF                Turn off segment register display mode
RSON                 Turn on segment register display mode
T or S               Single step
U address {count}    Unassemble count instructions starting at address
V                    View server screens
X                    Exchange processor stack frames
Z{D;U;O} expression  Evaluates the expression (See HE help screen)
? {address}          If symbolic information has been loaded, the closest
                     Symbols to address (default is EIP) are displayed

Listing 13: The Dot-H command general debugger help screen.

# .h
  .A            Display the abend or break reason
  .C            Do a diagnostic core dump to disk
  .D            Display page directory map for current debugger domain
  .D <address>  Display page entry map for current debugger domain
  .F            Toggle ON/OFF the developer option flag
  .G            Display the GDT
  .H            Display this dot help screen
  .I            Display the IDT
  .I2           Display the IDT for Processor 2
  .M            Display loaded module names and addresses
  .L offset <offset> Display linear address given page map offsets
  .LA <linear-address> [<cr3>] Find all aliases of linear-address
  .LP <physical-address> [<cr3>]
                Find all linear mappings of physical-address
  .LX address   Display page offsets and values used for translations
  .P            Display all process names and addresses
  .P[L] <address> Display <address> as a process control block
  .R            Display the running process control block
  .S            Display all screen names and addresses
  .S <address>  Display address as a screen structure
  .T <address>  Display address as a TSS structure
  .TS<segnum>   Display GDT[segnum] as a TSS structure
  .V            Display server version

Abend Message Display. The Dot-A command (.a) displays the Abend error description (listing 14).

Listing 14: Dot-A command results following a page fault processor exception.

# .a
Debug entry: 14
Break caused by: Page Fault Processor Exception
Error code: 00000000 (set by processor during exception)

If you drop into the debugger without an Abend condition, the Dot-A command will reply with only a keyboard request (listing 15).

Listing 15: Dot-A command results without an Abend condition.

# .a
Debug entry: 257
Break caused by: Keyboard Debugger Request
Error code: None

Registers and Flags Display. The R command (r) displays the CPU's registers and flags (listing 16).

Listing 16: The R command results.

#R
EAX = 00000005 EBX = 00000000 ECX = 00000000 EDX = 00001C8E
ESI = 00000000 EDI = F802A843 EBP = 00000001 ESP = 00035CE8
EIP = F8021DA5 FLAGS = 00007202 (IF NT)
F8021DA5 C705E0FF5FF007 MOV     [F05FFFE0]=00000007,00000007
         000000

Where Am I (?). The ? command displays the location (both NLM and function) of the CPU's instruction pointer (EIP). Very often, EIP points into SERVER.NLM (listing 17). However, this isn't always helpful because 1) the Abend may not be due to a NetWare bug but to another NLM passing NetWare an invalid pointer or semaphore, and 2) debug symbols are not available inside SERVER.NLM. In these cases, you have to locate more information about the running process (via the Dot-P command) to see if you can identify what the calling process was.

Listing 17: Results of the ? command following a server Abend while execution is inside SERVER.NLM

# ?
Address in SERVER.NLM at code start +00021DA5h
Current:    00000000  F8021DA5

Listing 18 shows the results of a ? command following an Abend while execution was inside CLIB.NLM. In this case, CLIB's exported APIs are used as symbols inside the debugger. So you learn the name of the NLM (CLIB) as well as the name of the CLIB function (malloc) that was executing when the Abend occurred. This information is very helpful during the troubleshooting process.

Listing 18: Results of the ? command following a server Abend while execution is inside CLIB.NLM.

# ?
Address in CLIB.NLM at code start +0001EEA1h
Previous:  -00000020 F105AE81 CLIB.NLM|malloc
Current:    00000000 F105AEA1
Next:      +0000005D F105AEFE CLIB.NLM|_msize

For example, if an Abend occurs while execution is inside CLIB, the Abend may be due to a CLIB bug or, more frequently, to another NLM passing invalid information to CLIB. If the calling process passes a bad semaphore to CLIB, the software exception will initially point at CLIB or the kernel. So you need to run the ? command first, and, if that points to SERVER.NLM or some other core NLM, look at the Dot-R information (described below) for additional information concerning the calling process.
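The addresses in the ? display are related by simple hexadecimal arithmetic, which is handy when you want to double-check a hand-copied screen. The following Python sketch is not part of the debugger; the function names are made up for illustration, and the values are copied from listing 18.

```python
def code_start(current_address, offset_from_code_start):
    """Base address of the module's code segment, derived from the ? output."""
    return current_address - offset_from_code_start


def neighbor_address(current_address, relative_offset):
    """Address of a 'Previous' or 'Next' symbol, given its signed offset."""
    return current_address + relative_offset


# Values copied from listing 18 (execution inside CLIB.NLM).
current = 0xF105AEA1          # the "Current:" address, which equals EIP
offset = 0x0001EEA1           # "code start +0001EEA1h"

print(f"CLIB code start: {code_start(current, offset):08X}")
print(f"malloc at:       {neighbor_address(current, -0x20):08X}")
print(f"_msize at:       {neighbor_address(current, +0x5D):08X}")
```

If the computed Previous and Next addresses do not match the ones shown on the screen, the numbers were probably miscopied.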

Running Process Display. The Dot-R command (.r) displays information about the running process (listing 19). This information should be used to identify the NLM the server was in when the Abend occurred.

Listing 19: The Dot-R results.

#.R
Running process pointer: FB002000
Process name: Server 00 Process  Address: FB002000
Stack pointer: 35CE4
Stack limit: 32CF0
Scheduling priority: 0
Wait state: 00
00035CE4  24 7B 01 00 00 00 00 00-00 20 00 FB 73 73 73 73  ${....... .{ssss
00035CF4  CC 32 63 F0 20 00 00 00-B0 F8 03 00 11 46 52 45  L2cp ...0x...FRE
00035D04  4E 43 48 20 2D 20 43 41-4E 41 44 49 41 4E 00 FF  NCH - CANADIAN..

Control Register Display. The RC command is only available in 4.x versions of NetWare and is used to display the CPU's control registers (listing 20). The information in CR2 (control register 2) is most helpful during a page fault: the contents of CR2 indicate the address that generated the page fault. In this case, 00000000 falls in the first page of memory, where the DOS interrupt vector table is stored, and is an invalid address. However, this address is frequently used by buggy software that mistakenly dereferences a null pointer in its code. In server autopsies performed at Novell, this error produces the majority of Abends.

Listing 20: The RC command results following a page fault at address 00000000.

#RC
IDTR = 07FF:00018AA0 GDTR = 0028:0001D660 LDTR = 0000 TR = 0000
CR0 = 80000013 CR2 = 00000000 CR3 = 00280000
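The error code shown with a page fault and the address in CR2 can be read together. The Python sketch below is an illustration only, not a NetWare tool. It decodes the three low bits of the error code as the Intel 386 architecture defines them and applies a simple heuristic (an assumption of this sketch) that a fault address inside the first 4 KB page suggests a null pointer dereference.

```python
PAGE_SIZE = 4096  # the 386 paging unit uses 4 KB pages


def decode_page_fault_error(code):
    """Decode the low three bits of the x86 page fault error code."""
    return {
        "page_present": bool(code & 0x1),  # 0 = page not present, 1 = protection violation
        "write_access": bool(code & 0x2),  # 0 = read, 1 = write
        "user_mode": bool(code & 0x4),     # 0 = supervisor, 1 = user
    }


def looks_like_null_dereference(cr2):
    """Heuristic only: the faulting address lies in the first page of memory."""
    return 0 <= cr2 < PAGE_SIZE


# Error code 00000002 with CR2 = 00000000, as reported for the ProDemo page fault.
info = decode_page_fault_error(0x00000002)
access = "write" if info["write_access"] else "read"
state = "present" if info["page_present"] else "not-present"
print(f"Fault on a {access} to a {state} page")
print("Possible null pointer dereference:", looks_like_null_dereference(0x00000000))
```

Error code 00000000, as in listing 14, decodes the same way except that the access was a read.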

Recovering From a Page Fault

Following a page fault, the CPU state is preserved. This means that the invalid instruction that generated the fault has not been executed. When you drop into the debugger the immediate instruction is the instruction that triggered the fault. If you can successfully quarantine the process (using thread quarantine), there is a good chance that you'll be able to restart the server and recover gracefully.

The recovery process for page faults includes the following tasks:

1. Write down the Abend message.
2. Drop into the debugger using the <shft><alt><shft><esc> key sequence.
3. Display the running process using the .R debugger command and note the running process information.
4. Display your location using the ? command and note the NLM and function information (if available).
5. Display the control registers using the RC command and note the contents of CR2.
6. Quarantine the running thread by setting EIP to CSleepUntilInterrupt (case sensitive).
7. Try to restart the server using the G command.
8. If you're successful, retry your client connections, close your applications, and allow all pending I/Os to flush to disk. If you're unable to restart the server at this point, use the information you've gathered to ascertain which module may be generating the fault, reboot, and proceed to step 10.
9. Down the server using the DOWN console command.
10. Begin troubleshooting by removing the buggy software module, or load the module inside a protected domain using DOMAIN.NLM.

These tasks are demonstrated in listing 21.
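Because the first task is to write down the Abend message, it can help to record the same fields every time. The Python sketch below assumes you have typed the Abend screen into a text file by hand; the function name and field names are invented for this illustration and are not part of NetWare.

```python
import re


def parse_abend_screen(screen_text):
    """Pull the fields worth writing down out of a transcribed Abend screen."""
    patterns = {
        "abend_message": r"Abend:\s*(.+)",
        "error_code": r"\(Error code ([0-9A-Fa-f]{8})\)",
        "os_version": r"OS version:\s*(.+)",
        "running_process": r"Running Process:\s*(.+)",
    }
    notes = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, screen_text)
        # Some Abend messages show no error code; record None rather than guess.
        notes[field] = match.group(1).strip() if match else None
    return notes


# The first lines of the page fault Abend screen, transcribed by hand.
screen = """System halted Thursday, May 4, 1995   2:10:25 pm MDT

Abend: Page Fault Processor Exception (Error code 00000002)
    OS version: Novell NetWare 4.10 November 8, 1994
    Running Process: ProDemo Process
"""

for field, value in parse_abend_screen(screen).items():
    print(f"{field}: {value}")
```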

Analysis of the Page Fault Recovery. The recovery (listing 21) began with the original Abend message (lines 1 through 10). We then dropped into the debugger (line 12) and received the debugger preamble followed by the CPU registers and flags display (lines 14 through 22). It's here that we first noticed that the address 00000000 was being used as the destination of a memory write in the immediate instruction (line 22).

Listing 21: Recovery from a page fault.

1:  System halted Thursday, May 4, 1995   2:10:25 pm MDT
2: 
3:  Abend: Page Fault Processor Exception (Error code 00000002)
4:      OS version: Novell NetWare 4.10 November 8, 1994
5:      Running Process: ProDemo Process
6:      Stack: AC 1F 65 01 E7 66 03 F1 50 CA 65 01 03 00 00 00
7:             D0 1F 65 01 09 00 00 00 B0 81 01 F9 54 CE 65 01
8:             39 67 03 F1 0B CB 65 01 B4 D0 65 01 B0 81 01 F9
9:  Press "Y" to copy diagnostic image to disk.
10: Otherwise press "X" to exit.  
11:
12: <shft><esc><shft><alt>
13:
14: Novell 386 Debugger
15: (C) Copyright 1987-1993 Novell, Inc.
16: All Rights Reserved
17: 
18: Abend: Page Fault Processor Exception (Error code 00000002)
19: EAX = 00000000 EBX = 016A40E4 ECX = 00000000 EDX = FAC01000
20: ESI = F90181D0 EDI = 00000006 EBP = 01698F9C ESP = 01698F9C
21: EIP = F1036D82 FLAGS = 00017246 (PF ZF IF NT RF)
22: F1036D82 C70000000000   MOV     [EAX]= ?,00000000
23: # .R
24: Running process pointer: FB001000
25: Process name: ProDemo Process  Address: FB001000
26: Stack pointer: 1698E98
27: Stack limit: 1696010
28: Scheduling priority: 0
29: Wait state: 00
30: 00035CE4  24 7B 01 00 00 00 00 00-00 20 00 FB 73 73 73 73  ${....... .{ssss
31: 00035CF4  CC 32 63 F0 20 00 00 00-B0 F8 03 00 11 46 52 45  L2cp ...0x...FRE
32: 00035D04  4E 43 48 20 2D 20 43 41-4E 41 44 49 41 4E 00 FF  NCH - CANADIAN..
33: # ?
34: Address in PRODEMO.NLM at code start +00000D82h
35: Previous:  -00000014  F1036D6E PRODEMO.NLM|WriteAddress
36: Current:    00000000  F1036D82
37: Next:      +00000008  F1036D8A PRODEMO.NLM|DoPrivOp
38: # RC
39: IDTR = 07FF:00018AA0 GDTR = 0028:0001D660 LDTR = 0000 TR = 0000
40: CR0 = 80000013 CR2 = 00000000 CR3 = 00280000
41: # EIP = CSleepUntilInterrupt
42: Register changed
43: # G
44:
45: <alt><esc>
46:
47:  System halted Thursday, May 4, 1995   2:10:25 pm MDT
48: 
49:  Abend: Page Fault Processor Exception (Error code 00000002)
50:      OS version: Novell NetWare 4.10 November 8, 1994
51:      Running Process: ProDemo Process
52:      Stack: AC 1F 65 01 E7 66 03 F1 50 CA 65 01 03 00 00 00
53:             D0 1F 65 01 09 00 00 00 B0 81 01 F9 54 CE 65 01
54:             39 67 03 F1 0B CB 65 01 B4 D0 65 01 B0 81 01 F9
55:  Press "Y" to copy diagnostic image to disk.
56: Otherwise press "X" to exit.  
57:
58: <ENTER>
59: ABENDECTOMY:
60: ABENDECTOMY:<down>
61: Notifying stations that file server is down
62:
63: Downing the router...
64:
65:  5-04-95   2:12:14 pm:   DS-4.63-30
66:      Bindery close requested by the SERVER
67: 
68:  5-04-95   2:12:14 pm:   DS-4.63-30
69:      Directory Services: Local database has been closed
70: 
71: Dismounting volume SYS
72: 
73:  5-04-95   2:12:17 pm:   SERVER-4.10-2009
74:      ABENDECTOMY TTS shut down
75:      because backout volume SYS was dismounted
76: 
77: Type EXIT to return to DOS.
78: ABENDECTOMY:

At the debugger prompt we entered the .R command (line 23) to view information about the running process (lines 24 through 32). We didn't find anything helpful here except a confirmation that the running process was indeed the ProDemo Process (line 25). Then we entered the ? command to view the NLM and function we were executing at the time of the Abend (line 33). In this case, the server was executing code in an NLM other than SERVER.NLM so the debugger was able to use exported APIs as symbols (lines 35 through 37). It is here (line 35) that we learned that the WriteAddress function within PRODEMO.NLM was the offending function that caused the page fault.

Next, we used the RC command (line 38) to find the exact address that produced the page fault, which is stored in CR2. We then recorded the contents of CR2 for later troubleshooting. Next, we set EIP to CSleepUntilInterrupt (line 41), which has the effect of taking the ProDemo process off the run queue indefinitely. To do this, we changed the process's instruction pointer. Instead of pointing to the MOV instruction with an invalid destination located in the DOS interrupt vector table, it pointed to the CSleepUntilInterrupt function. This had the effect of forcing ProDemo to request to go to sleep until an interrupt that never came. With the abending process set up to go to sleep, we were free to use the G command to restart the server (line 43).

Once restarted, we watched the lights on the server's disk subsystem as several outstanding I/Os were flushed to disk. We then asked all of the users to retry their connections and exit their applications. Again, we watched the server's disk subsystem lights flicker as several hundred users exited their applications and logged out. Once the users had exited we returned to the Abend recovery process.

When we restarted, the operating system returned us to the ProDemo screen. But because we put ProDemo to sleep it didn't accept any input, so we used the console command Alt-Escape (line 45) to switch to the console screen that displayed the original Abend message. Our return to the console screen placed us at the end of the Abend message (line 56) without a prompt. Once we pressed <ENTER> (line 58) we received a command prompt and were able to down the server gracefully. We then noted the need to edit the server's AUTOEXEC.NCF and remove PRODEMO.NLM from the load sequence for further troubleshooting and testing. We then brought the server back up.

Recovering from an NMI or Machine Check

Following an NMI or Machine Check, the CPU state is preserved with one exception. NetWare clears the CPU's Resume Flag (RF) to remove the possibility of a system restart using the G command. The Trace-N-Go technique executes the immediate instruction, which in turn sets the RF flag and restores the system to a restartable state. If you can successfully restart the server using the Trace-N-Go technique, there is a good chance you'll be able to recover gracefully.

The recovery process for NMIs and Machine Checks includes the following tasks:

1. Write down the Abend message.
2. Drop into the debugger using the <shft><alt><shft><esc> key sequence.
3. Trace-N-Go using the T and G commands.
4. If you're successful, retry your client connections, close your applications, and allow all pending I/Os to flush to disk.
5. Down the server using the DOWN console command.
6. Begin troubleshooting by replacing the faulty hardware component.

These tasks are demonstrated in listing 22.

Analysis of the NMI Recovery. The NMI recovery (listing 22) began with an Abend message (lines 1 through 11). From the message, we learned that there was a parity error on the system board (line 4) that required immediate replacement. We then dropped into the debugger (line 13), which transferred us to the debugger screen (lines 15 through 25) and gave us the debugger prompt (line 26).

At the debugger prompt we entered the T command to trace, or step, through the immediate instruction. The T command returned a break message (line 27) with a new dump of the registers and the next instruction. If you look closely you can see the RF flag set (line 30). We were then free to restart the system with the G command (line 32).

The Trace-N-Go technique returned us to the console screen at the end of the Abend message (line 43) without a prompt. Once we hit <ENTER> we received a command prompt and were able to down the server gracefully (line 45). We then replaced the system board in the machine and brought the server back up.

Listing 22: Recovery from an NMI.

1:  System halted Tuesday, May 9, 1995 11:54:46 am MDT
2:  
3:  Abend: Nonmaskable Interrupt Processor Exception (Error code 00000030)
4:  Parity error was generated by the system board.
5:      OS version: Novell NetWare 4.10  November 8, 1994
6:      Running Process: Server 01 Process
7:      Stack: 02 70 00 00 00 20 02 f8 01 00 00 00 00 00 00 00
8:             00 60 00 fb 73 73 73 73 f8 33 63 f0 60 38 00 00
9:             20 8f 09 00 00 00 00 00 00 00 00 00 00 00 00 00
10: Press "Y" to copy diagnostic image to disk.
11: Otherwise press "X" to exit.
12:
13: <shft><esc><shft><alt>
14:
15: Novell 386 Debugger
16: (C) Copyright 1987-1993 Novell, Inc.
17: All Rights Reserved
18: 
19: Abend: Nonmaskable Interrupt Processor Exception (Error code 00000030)
20: Parity error was generated by the system board.
21: EAX =  00000005 EBX = 00000000 ECX = 00000000 EDX = 00001C8E
22: ESI = 00000000 EDI = F802A843 EBP = 00000001 ESP = 00035CE8
23: EIP = F8021DA5 FLAGS = 00007206 (PF IF NT)
24: F8021DA5 C705E0FF5FF007 MOV     [F05FFFE0]=00000006,00000007
25:          000000
26: # T
27: Break at F8021DAF because of single step
28: EAX = 00000005 EBX = 00000000 ECX = 00000000 EDX = 00001C8E
29: ESI = 00000000 EDI = F802A843 EBP = 00000001 ESP = 00035CE8
30: EIP = F8021DAF FLAGS = 00017206 (PF IF NT RF)
31: F8021DAF FF0534AD60F0   INC     dword ptr [F060AD34]=00000931
32: # G
33: System halted Tuesday, May 9, 1995 11:54:46 am MDT
34: 
35: Abend: Nonmaskable Interrupt Processor Exception (Error code 00000030)
36: Parity error was generated by the system board.
37:     OS version: Novell NetWare 4.10  November 8, 1994
38:     Running Process: Server 01 Process
39:     Stack: 02 70 00 00 00 20 02 f8 01 00 00 00 00 00 00 00
40:            00 60 00 fb 73 73 73 73 f8 33 63 f0 60 38 00 00
41:            20 8f 09 00 00 00 00 00 00 00 00 00 00 00 00 00
42: Press "Y" to copy diagnostic image to disk.
43: Otherwise press "X" to exit. <ENTER>
44: ABENDECTOMY:
45: ABENDECTOMY:<down>
46: Notifying stations that file server is down
47:
48: Downing the router...
49:
50:  5-09-95   11:56:14 pm:   DS-4.63-30
51:      Bindery close requested by the SERVER
52: 
53:  5-09-95   11:56:14 pm:   DS-4.63-30
54:      Directory Services: Local database has been closed
55: 
56: Dismounting volume SYS
57: 
58:  5-09-95   11:56:17 pm:   SERVER-4.10-2009
59:      ABENDECTOMY TTS shut down
60:      because backout volume SYS was dismounted
61: 
62: Type EXIT to return to DOS.
63: ABENDECTOMY:
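The flag change between lines 23 and 30 of listing 22 can be checked by decoding the FLAGS value by hand. The Python sketch below is an illustration, not a debugger feature. The bit positions are those of the Intel 386 EFLAGS register, and the function names are invented here.

```python
# Bit positions of the named EFLAGS bits on the 386, in ascending order.
EFLAGS_BITS = [
    (0, "CF"), (2, "PF"), (4, "AF"), (6, "ZF"), (7, "SF"), (8, "TF"),
    (9, "IF"), (10, "DF"), (11, "OF"), (14, "NT"), (16, "RF"),
]


def flag_names(eflags):
    """Return the names of the flags that are set, lowest bit first."""
    return [name for bit, name in EFLAGS_BITS if eflags & (1 << bit)]


def resume_flag_set(eflags):
    """True when the Resume Flag (bit 16) is set."""
    return bool(eflags & (1 << 16))


before_trace = 0x00007206   # FLAGS on line 23, before the T command
after_trace = 0x00017206    # FLAGS on line 30, after the T command

print("Before T:", " ".join(flag_names(before_trace)), "| RF set:", resume_flag_set(before_trace))
print("After T: ", " ".join(flag_names(after_trace)), "| RF set:", resume_flag_set(after_trace))
```

The two values differ only in bit 16, which is the RF flag that the T command sets.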

Recovery From an Invalid Opcode

Following a processor exception caused by an invalid opcode, the CPU state is preserved. But the opcode the software is trying to execute is invalid, either because the module's code segment has been corrupted or because the module is trying to execute garbage by jumping to an invalid address outside its code segment. When you drop into the debugger you'll see the invalid opcode displayed on the immediate instruction line as "??".

Your only option is to quarantine the process that attempted to execute the invalid opcode and remove the NLM from your production system until you're able to identify the problem. If you can successfully quarantine the process (using thread quarantine), there is a good chance that you'll be able to restart the server and recover gracefully. The recovery process for invalid opcodes includes the following tasks:

1. Write down the Abend message.
2. Drop into the debugger using the <shft><alt><shft><esc> key sequence.
3. Display the running process using the .R debugger command and note the running process information.
4. Display your location using the ? command and note the NLM and function information (if available).
5. Display module information with the .M command and determine whether the invalid opcode is in the module's code segment.
6. Quarantine the running thread by setting EIP to CSleepUntilInterrupt (case sensitive).
7. Try to restart the server using the G command.
8. If you're successful, retry your client connections, close your applications, and allow all pending I/Os to flush to disk. If you're unable to restart the server at this point, use the information you've gathered to ascertain which module may be generating the fault, reboot, and proceed to step 10.
9. Down the server using the DOWN console command.
10. Begin troubleshooting by removing the buggy software module.

The tasks are demonstrated in listing 23.

Analysis of the Invalid Opcode Recovery. The invalid opcode recovery (listing 23) began with the original Abend message (lines 1 through 10). We quickly identified this Abend using the Abend description (line 3). We also learned that we were executing inside NetWare (line 5). Usually, you'll see an EIP pointing into an "unknown" or data address space rather than inside NetWare's code space. In these cases, you need to identify the running module by the running process. When the EIP is pointing inside NetWare, it is likely that another NLM has corrupted NetWare's code segment, making restart less likely to succeed.

We then dropped into the debugger (line 12) and received a display of the CPU registers and flags (lines 14 through 18). We'd used the debugger the day before so the debugger preamble wasn't displayed this time. The opcode in the immediate instruction (line 18) was disassembled by the debugger and displayed for our reference. The question marks in the opcode "??U5" mean the opcode value is invalid. When the CPU ran into this invalid opcode a fault was generated and execution was passed to NetWare's fault handler. Our only option was to quarantine the process whose thread of execution included the invalid opcode and try to troubleshoot the problem.

At the debugger prompt (line 19) we entered the .R command to view information about the running process (lines 20 through 28). But it was little help in this case because we already knew that we were executing inside NetWare.

Then we entered the ? command to view the NLM and function we were executing at the time of the Abend (line 29). In this case, the server was executing code in SERVER.NLM so the debugger was not able to use exported APIs as symbols (lines 30 through 31). But we did find some useful information here. The ? command told us where the instruction pointer was pointing inside SERVER.NLM.

Our next move was to find out whether that offset was actually a valid code segment or some other area in memory. So we used the Dot-M command (line 32) to view a list of the NLMs running in the server with the start addresses and lengths of their code and data segments (lines 33 through 53). SERVER.NLM came up first, so we pressed <esc> to return to the debugger prompt and compared several numbers. We could have typed any other key to continue displaying the remaining NLMs.

First, we recorded the beginning of SERVER.NLM's code segment as F8000000h (line 36). Then we added the length of 000F2000h (line 36) to arrive at the ending address of SERVER.NLM's code segment, or F80F2000h. Then we compared the beginning and end of the code segment to the location of EIP (line 30) which was "code start plus 00021DA5h" and found that the instruction pointer (EIP) was well within the code segment. This meant that the code segment had somehow become corrupted. If the EIP had fallen outside of the code segment, we would have deduced that SERVER.NLM had jumped to an invalid address outside its own code segment. In both cases, the information is useful during the troubleshooting process.
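The comparison described above amounts to a simple range check. The following Python sketch is illustrative only; the function name is made up, and the addresses are the SERVER.NLM values reported by the Dot-M and ? commands.

```python
def in_code_segment(code_start, length, address):
    """True when address lies inside [code_start, code_start + length)."""
    return code_start <= address < code_start + length


# SERVER.NLM values from the Dot-M display.
server_code_start = 0xF8000000
server_code_length = 0x000F2000

# The ? command reported "code start +00021DA5h".
eip = server_code_start + 0x00021DA5

code_end = server_code_start + server_code_length
print(f"Code segment: {server_code_start:08X} through {code_end:08X}")
print(f"EIP {eip:08X} inside code segment:",
      in_code_segment(server_code_start, server_code_length, eip))
```

A result of True points toward a corrupted code segment; False would point toward a jump to an address outside the module's code.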

Listing 23: Recovery from an invalid opcode.

1:  System halted Wednesday, May 10, 1995   1:07:54 pm MDT
2:
3:  Abend: Invalid Opcode Processor Exception
4:      OS version: Novell NetWare 4.10 November 8, 1994
5:      Running Process: Server 00 Process
6:      Stack: 00 00 00 00 00 20 00 FB 73 73 73 73 CC 32 63 F0
7:             20 00 00 00 B0 F8 03 00 11 46 52 45 4E 43 48 20
8:             2D 20 43 41 4E 41 44 49 41 4E 00 FF 77 AA CC FF
9:  Press "Y" to copy diagnostic image to disk.
10: Otherwise press "X" to exit.
11:
12: <shft><esc><alt><shft>
13:
14: Abend: Invalid Opcode Processor Exception
15: EAX = 00000005 EBX = 00000000 ECX = 00000000 EDX = 00001C8E
16: ESI = 00000001 EDI = 00000000 EBP = 00000000 ESP = 00035CE8
17: EIP = F8021DBC FLAGS = 00017002 (NT RF)
18: F8021DBC FFFF               ??U5    EDI
19: # .R
20: Running process pointer: FB002000
21: Process name: Server 00 Process  Address: FB002000
22: Stack pointer: 35CE4
23: Stack limit: 32CF0
24: Scheduling priority: 0
25: Wait state: 00
26: 00035CE4  24 7B 01 00 00 00 00 00-00 20 00 FB 73 73 73 73  ${.......
27: 00035CF4  CC 32 63 F0 20 00 00 00-B0 F8 03 00 11 46 52 45  L2cp .FRE
28: 00035D04  4E 43 48 20 2D 20 43 41-4E 41 44 49 41 4E 00 FF  NCH - CAN
29: # ?
30: Address in SERVER.NLM at code start +00021DA5h
31: Current:    00000000  F8021DA5
32: # .M
33: SERVER.NLM       NetWare Server Operating System
34:   00280000 domain
35:   Version 4.10    November 8, 1994
36:   Code Address: F8000000h Length: 000F2000h
37:   Data Address: F05F2000h Length: 000A0000h
38: RSPX.NLM         NetWare Remote Console SPX Driver
39:   00280000 domain
40:   Version 4.10    October 20, 1994
41:   Code Address: F102D000h Length: 000023A0h
42:   Data Address: 0155D000h Length: 000038D8h
43: REMOTE.NLM       NetWare 4.1 Remote Console
44:   00280000 domain
45:   Version 4.10    October 20, 1994
46:   Code Address: F1027000h Length: 00005148h
47:   Data Address: 01575000h Length: 000014D4h
48: MONITOR.NLM      NetWare 4.10 Console Monitor
49:   00280000 domain
50:   Version 4.12    October 21, 1994
51:   Code Address: F1013000h Length: 00013946h
52:   Data Address: 0153D000h Length: 000037E4h
53: <Press ESC to terminate or any other key to continue> <esc>
54: # EIP = CSleepUntilInterrupt
55: Register changed
56: # G
57: System halted Wednesday, May 10, 1995   1:07:54 pm MDT
58: Abend: Invalid Opcode Processor Exception
59:     OS version: Novell NetWare 4.10 November 8, 1994
60:     Running Process: Server 00 Process
61:     Stack: 00 00 00 00 00 20 00 FB 73 73 73 73 CC 32 63 F0
62:            20 00 00 00 B0 F8 03 00 11 46 52 45 4E 43 48 20
63:            2D 20 43 41 4E 41 44 49 41 4E 00 FF 77 AA CC FF
64: Press "Y" to copy diagnostic image to disk.
65: Otherwise press "X" to exit. <ENTER>
66: 
67: ABENDECTOMY:
68: ABENDECTOMY:<down>
69: Notifying stations that file server is down
70:
71: Downing the router...
72:
73:  5-10-95   1:09:34 pm:   DS-4.63-30
74:      Bindery close requested by the SERVER
75: 
76:  5-10-95   1:09:34 pm:   DS-4.63-30
77:      Directory Services: Local database has been closed
78: 
79: Dismounting volume SYS
80: 
81:  5-10-95   1:09:37 pm:   SERVER-4.10-2009
82:      ABENDECTOMY TTS shut down
83:      because backout volume SYS was dismounted
84: 
85: Type EXIT to return to DOS.
86: ABENDECTOMY:

Next, we set EIP to CSleepUntilInterrupt, which had the effect of setting up the service process "Server 00 Process" to be taken off the run queue indefinitely (line 54). We then used the G command to restart the server (line 56).

Once restarted, we asked all of the users to retry their connections and exit their applications. We watched the server's disk subsystem lights flicker as the users exited their applications and logged out. Once the users had exited we returned to the Abend recovery process.

Our return to the console screen placed us at the end of the Abend message (line 65) without a prompt. Once we hit <ENTER> (also on line 65) we received a command prompt and were able to down the server gracefully (line 68).

Recovery from a Software Exception

A software exception occurs when a low-level consistency check on a data structure fails. The CPU state is not preserved in these cases because the error is serious enough to place the integrity of the entire system in question. NetWare automatically zeroes the instruction pointer (EIP) so the server cannot be restarted immediately via the G command. If you can successfully quarantine the process (using thread quarantine), there is a chance that you'll be able to restart the server and recover gracefully. The recovery process for software exceptions includes the following tasks:

1. Write down the Abend message.
2. Drop into the debugger using the <shft><alt><shft><esc> key sequence.
3. Quarantine the running thread by setting EIP to CSleepUntilInterrupt (case sensitive).
4. Try to restart the server using the G command.
5. If you're successful, retry your client connections, close your applications, and allow all pending I/Os to flush to disk. If you're unable to restart the server at this point, reboot the server and proceed to step 7.
6. Down the server using the DOWN console command.
7. Begin troubleshooting.

These tasks are demonstrated in listing 24.

Recovery from the software exception (listing 24) began with the Abend message (lines 1 through 10). We knew the exception occurred at process time because the phrase "Interrupt service routine" was not present.

We then dropped into the debugger (line 12) and received the debugger preamble followed by the CPU registers and flags display (lines 14 through 21). The ? command was invalid because EIP had been set to 0 by NetWare. So we proceeded to quarantine the running thread by setting EIP to CSleepUntilInterrupt (line 22). With the abending process now set up to go to sleep, we were free to use the G command to restart the server (line 24).

Once restarted, we waited for the outstanding I/Os to be flushed to disk. We then asked all of the users to retry their connections and exit their applications. Once the users had exited we returned to the Abend recovery process.

Listing 24: Recovery from a software exception that occurred at process time.

1:  System halted Friday, May 5, 1995   3:09:47 pm MDT
2: 
3:  Abend: SERVER-4.10-350: Free called with a memory block that has an
    invalid resource tag.
4:      OS version: Novell NetWare 4.10 November 8, 1994
5:      Running Process: Initialization Process
6:      Stack: 16 1A 02 F8 84 42 6A 01 90 19 82 01 00 00 00 00
7:             00 20 00 FB 73 73 73 73 CC 32 63 F0 20 00 00 00
8:             B0 F8 03 00 11 46 52 45 4E 43 48 20 2D 20 43 41
9:  Press "Y" to copy diagnostic image to disk.
10: Otherwise press "X" to exit.  
11:
12: <shft><esc><shft><alt>
13:
14: Novell 386 Debugger
15: (C) Copyright 1987-1993 Novell, Inc.
16: All Rights Reserved
17:
18: Abend: SERVER-4.10-288 Stack overflow detected by kernel.
19: EAX = 00032CF0 EBX = 00000000 ECX = 00040027 EDX = 00001C8E
20: ESI = 016A427C EDI = F1029344 EBP = 016A4284 ESP = 00035CDC
21: EIP = 00000000 FLAGS = 00007097 (PF ZF IF NT RF)
22: # EIP = CSleepUntilInterrupt
23: Register changed
24: # G
25:
26: System halted Friday, May 5, 1995   3:09:47 pm MDT
27: 
28: Abend: SERVER-4.10-350: Free called with a memory block that has an
    invalid resource tag.
29:     OS version: Novell NetWare 4.10 November 8, 1994
30:     Running Process: Server 00 Process
31:     Stack: 16 1A 02 F8 84 42 6A 01 90 19 82 01 00 00 00 00
32:            00 20 00 FB 73 73 73 73 CC 32 63 F0 20 00 00 00
33:            B0 F8 03 00 11 46 52 45 4E 43 48 20 2D 20 43 41
34: Press "Y" to copy diagnostic image to disk.
35: Otherwise press "X" to exit. <ENTER>
36: 
37: ABENDECTOMY:
38: ABENDECTOMY: down
39: Notifying stations that file server is down
40:
41: Downing the router...
42: 
43:  5-05-95   3:12:14 pm:   DS-4.63-30
44:      Bindery close requested by the SERVER
45:
46:  5-05-95   3:12:14 pm:   DS-4.63-30
47:      Directory Services: Local database has been closed
48:
49: Dismounting volume SYS
50:
51:  5-05-95   3:12:17 pm:   SERVER-4.10-2009
52:      ABENDECTOMY TTS shut down
53:      because backout volume SYS was dismounted
54:
55: Type EXIT to return to DOS.
56: ABENDECTOMY:
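The four recovery procedures differ mainly in which restart technique they use. As a memory aid, the following Python sketch maps an Abend message to the debugger commands described in this AppNote. It is a rough illustration under stated assumptions: the function name is invented, the "machine check" match string is a guess at the message wording, and any message without the phrase "Processor Exception" is treated as a software exception.

```python
QUARANTINE = ["EIP = CSleepUntilInterrupt", "G"]   # thread quarantine
TRACE_N_GO = ["T", "G"]                            # Trace-N-Go


def recovery_commands(abend_message):
    """Suggest the debugger commands for the restart attempt."""
    text = abend_message.lower()
    if "processor exception" not in text:
        # Assumption: software exceptions lack the phrase "Processor Exception".
        return list(QUARANTINE)
    if "nonmaskable interrupt" in text or "machine check" in text:
        return list(TRACE_N_GO)
    if "page fault" in text or "invalid opcode" in text:
        return list(QUARANTINE)
    # Other processor exceptions are not covered here; suggest nothing.
    return []


for message in (
    "Abend: Page Fault Processor Exception (Error code 00000002)",
    "Abend: Nonmaskable Interrupt Processor Exception (Error code 00000030)",
    "Abend: Invalid Opcode Processor Exception",
    "Abend: SERVER-4.10-350: Free called with a memory block that has an invalid resource tag.",
):
    print(message)
    print("   ->", recovery_commands(message))
```

The sketch only picks the restart commands. The information-gathering steps (.R, ?, RC, and .M) still apply before you change EIP.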

Conclusion

The recovery techniques in this AppNote are provided to help you save time and frustration when a server encounters an Abend condition. These techniques are not a sure bet, but the several minutes they require can potentially save you hours of repair.

For more information concerning Abends, including trouble-shooting techniques and server core dumps, see "Resolving Critical Server Issues" in the February 1995 Novell Application Notes.

For more information concerning NetWare's internal debugger use the debugger's on-line help screens via the "h" command, and see "Using NetWare's Internal Debugger" in the August 1991 Novell Application Notes.

* Originally published in Novell AppNotes
