IntranetWare Server Automated Abend Recovery
Worldwide Support Team
Novell Technical Services
01 Mar 1997
One of the niftiest features of IntranetWare is that the server can automatically recover from most types of abend error conditions. Find out how this all works and you'll never have to face another abend unprepared.
The term "abend" is short for "abnormal end." An abend represents an unexpected critical error in a server that prevents any further processing. Because the error condition is unexpected, the system is left in an idling state in which many tasks are left undone. Worse, some tasks may be left partially done, which can cause file and database corruption as well as mirror mismatches in the server's redundant Directory Entry Tables (DETs) and File Allocation Tables (FATs).
Two years ago, in a BrainShare session and subsequent AppNote entitled "Abend Recovery Techniques for NetWare 3 and 4 Servers," Novell introduced a methodology for using NetWare's internal debugger to recover gracefully from server abends. NetWare 4.11, the server operating system platform for IntranetWare 1, has improved recovery options that enable the server to automatically recover from an abend in various ways. These options are enabled by default when an IntranetWare server is installed; no user intervention is required.
The purpose of this AppNote is to help you become familiar with the new abend recovery features implemented in NetWare 4.11. It follows a lab project format, demonstrating the use of the SET parameters involved in determining how an IntranetWare server responds when it encounters a critical error.
Improved Abend Recovery in NetWare 4.11
The NetWare operating system continually monitors the status of various server activities to ensure proper operation. If either NetWare or the server CPU detects a condition that threatens the integrity of its internal data (such as an invalid parameter being passed in a function call, or certain hardware errors), the active process is abruptly halted and an "Abend" message is displayed on the screen. An abend is NetWare's way of protecting itself and users against the unpredictable effects of data corruption. For example, if the operating system detected invalid pointers to cache buffers and yet continued to run, user data being held in memory would soon become unusable or corrupted.
When an abend occurs and the server abruptly halts, it can be left with an unknown quantity of pending disk I/O requests from both user applications and internal file system processes. If the server is rebooted in this state, these pending requests are lost, often resulting in file system corruption that requires time-consuming repairs with the VREPAIR utility. On the other hand, if the server is allowed to shut down gracefully, saving cached user data and flushing pending file system I/O requests to disk, the integrity of the data is preserved and the server can be reinitialized and resume operations with a minimum of disruption to the user community.
In NetWare 4.1 and earlier versions, administrators' options were fairly limited in the event of a server abend. They could either shut down the server and restart it (and deal with the possibility of file system corruption), or they could enter NetWare's internal debugger and try to identify which process caused the error, terminate that process, and resume other server operations long enough to shut down the system properly.
With NetWare 4.11, the server operating system offers improved recovery options for handling an abend. These improvements include the following features and capabilities:
Additional information about the source of the abend is displayed on the server console, identifying the NLM or hardware problem that caused the abend so the administrator can take corrective actions.
When an abend occurs, information about the abend is automatically written to a text file named ABEND.LOG. This file is initially written to the server's DOS partition. Then, when the SYS volume is remounted, the information is appended to the ABEND.LOG file in the SYS:SYSTEM directory and removed from the DOS partition.
Two new SET parameters, "Auto Restart After Abend" and "Auto Restart After Abend Delay Time", enable the server to automatically recover from an abend in various ways. Using the default options for these parameters, IntranetWare servers will automatically recover from most abends and continue functioning normally. Users will be able to save their files before the server is brought down, and file system corruption will be avoided because volumes can be dismounted properly.
A third new SET parameter, "CPU Hog Timeout Amount", allows the abend recovery features to work when a server is "hung" because a thread is refusing to relinquish control of the CPU. Undisciplined NetWare Loadable Modules (NLMs) will no longer be able to monopolize the CPU and cause the server to appear hung. The NLM's process will be suspended and the server will continue functioning.
Note: Keep in mind that, when an abend occurs, the server is in a critical state and needs to be restarted as soon as possible. Under certain conditions, automatic abend recovery is not possible: for example, if an abend occurs at interrupt time, or if the server is scheduled to go down because of a previous abend and another abend occurs. In these situations, the server's cache is immediately discarded and the server is restarted without performing a graceful shutdown. Novell has also verified that certain drivers fail in the abend recovery process. If you find you have a driver that is preventing abend recovery, contact the driver/hardware vendor for an update.
Following is a description of the three SET parameters you can use to modify how the server will respond to critical situations. The default parameters should be sufficient to handle the majority of server abends and hangs.
Auto Restart After Abend: 1 (Possible values: 0, 1, 2)
Controls how the server responds after an abend. The possible values are:
0 = Do not try to recover from the abend. (This is effectively the same behavior as in NetWare 4.1, meaning the entire server is halted when an abend occurs.)
1 = (Default) For Page Fault abends, suspend the offending process and leave the server up. For Non-Maskable Interrupt (NMI), Machine Check, and software exceptions, force a delayed restart (attempt to recover from the problem, down the server in the configured amount of time, and then restart the server).
2 = For all software and hardware abends, always force a delayed restart (attempt to recover from the problem, down the server in the configured amount of time, and then restart the server).
This parameter can be set in the STARTUP.NCF file.
Auto Restart After Abend Delay Time: 2 (Range: 2 to 60 minutes)
Indicates the time (in minutes) the server will wait after an abend occurs before automatically going down and restarting itself. In most cases IntranetWare can recover from an abend; however, the server remains in a critical state and needs to be restarted as soon as possible. The delay time allows users to save their files and log out before the server is brought down, to prevent the loss of data. This parameter can be set in the AUTOEXEC.NCF file.
CPU Hog Timeout Amount: 60 (Range: 0 to 3600 seconds)
Indicates the time (in seconds) the server will wait before terminating a process that has not relinquished control of the CPU. A value of zero (0) disables this option. This is a new SET parameter in the Miscellaneous category.
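The defaults and ranges listed above can be captured in a small checker. The following Python sketch is illustrative only and is not a Novell tool: the names `PARAMS` and `set_command` are made up for this example, and the code simply validates a value against the documented range before formatting the console SET line.

```python
# Documented defaults and ranges for the three SET parameters (NetWare 4.11).
# The dictionary layout and the function name are hypothetical.
PARAMS = {
    "Auto Restart After Abend": {"default": 1, "min": 0, "max": 2},
    "Auto Restart After Abend Delay Time": {"default": 2, "min": 2, "max": 60},  # minutes
    "CPU Hog Timeout Amount": {"default": 60, "min": 0, "max": 3600},  # seconds
}


def set_command(name, value=None):
    """Return a console SET line, rejecting values outside the documented range."""
    spec = PARAMS[name]
    if value is None:
        value = spec["default"]
    if not spec["min"] <= value <= spec["max"]:
        raise ValueError(f"{name}: {value} is outside {spec['min']}..{spec['max']}")
    return f"SET {name}={value}"


if __name__ == "__main__":
    for name in PARAMS:
        print(set_command(name))  # the default settings
    print(set_command("CPU Hog Timeout Amount", 15))  # the value used in Demo 4
```

A checker like this is mainly useful when SET lines are generated for many servers, because an out-of-range value is caught before it reaches an .NCF file.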
Abend Recovery Lab Project
To become familiar with the automatic abend recovery options in IntranetWare, you can try them out on a test (non-production) server. Prior to starting this lab project, you should install IntranetWare on the test server. The abend recovery features of NetWare 4.11 are automatically installed and set up during the install process.
Next, obtain a copy of ABENDEMO.NLM from Novell. With this NLM, you can force various critical situations in the lab to verify how the server will respond according to how the abend recovery options are set. ABENDEMO can be downloaded from the Novell Consulting Toolkit at http://www.novell.com/consulting on the World Wide Web.
Demo 1: Automatic Abend Recovery
To demonstrate IntranetWare's automatic abend recovery in action, we will first generate a Page Fault exception. The Auto Restart After Abend and Auto Restart After Abend Delay Time SET parameters should be at their default settings.
At the server console, load ABENDEMO.NLM. From the menu of bad behavior, select "Generate a Page Fault at Process Time" (see Figure 1). You should hear a beep, indicating an abend has occurred.
Figure 1: Generating a Page Fault abend with ABENDEMO.NLM.
Press <Alt>+<Esc> to toggle back to the console prompt. You will see an abend display similar to the following:
System halted Tuesday, February 4, 1997 4:48:11 pm CDT
Abend: Page Fault Processor Exception (Error code 00000002)
OS version: Novell NetWare 4.11 June 14, 1996
Running Process: Abendemo Process
Stack: 26 43 0C F1 40 C3 B3 00 0C 9E 87 00 7C D0 0D F1
       40 C3 B3 00 CE A3 87 00 B6 67 23 F1 50 A3 87 00
       40 C3 B3 00 00 00 00 00 A0 9E 87 00 46 30 00 00
Additional Information:
The CPU encountered a problem executing code in ABENDEMO.NLM. The problem may be in that module or in data passed to that module by another NLM.
The Running process will be suspended.
2-4-97 4:48:54 pm: SERVER-4.11-4631
WARNING! Server TOAST experienced a critical error. The offending process was suspended or recovered. However, services hosted by this server may have been effected.
TOAST <1>:
Notice that the server is still operational; only the process that was running at the time the abend occurred has been halted. The abend message now identifies which process has been suspended (ABENDEMO.NLM in this example). The server prompt also displays a number ("<1>" in this example) after the server name to indicate how many threads have been suspended.
At this point, you could proceed to troubleshoot the cause of the abend, armed with the information displayed on the screen and the additional information written to the ABEND.LOG file (described later).
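Because the number in the prompt is the quickest sign that a thread has been suspended, it can be handy to pull it out of captured console text. The Python sketch below is a hypothetical helper, not part of NetWare; it assumes the prompt has the form shown in this demo ("TOAST <1>:") and that a server with no suspended threads shows only the name and a colon.

```python
import re

# Matches prompts such as "TOAST <1>:" (one suspended thread) or "TOAST:" (none).
PROMPT_RE = re.compile(r"^(?P<server>\S+?)\s*(?:<(?P<count>\d+)>)?\s*:\s*$")


def suspended_threads(prompt):
    """Return (server_name, suspended_count) parsed from a console prompt line."""
    match = PROMPT_RE.match(prompt.strip())
    if match is None:
        raise ValueError(f"not a console prompt: {prompt!r}")
    # A missing <n> marker is treated as zero suspended threads (assumption).
    return match.group("server"), int(match.group("count") or 0)


if __name__ == "__main__":
    print(suspended_threads("TOAST <1>:"))
```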
Demo 2: Automatic Abend Recovery with Delay
This time we will generate a Non-Maskable Interrupt (NMI) abend, which will force a delayed restart of the server after the default delay of 2 minutes.
From the server console, unload and reload ABENDEMO.NLM. From the menu of bad behavior, select "Generate an NMI" (see Figure 2). You should hear a beep, indicating an abend has occurred.
Figure 2: Generating an NMI abend with ABENDEMO.NLM.
Press <Alt>+<Esc> to toggle back to the console prompt. You will see the following abend screen:
System halted Tuesday, February 4, 1997 7:30:11 pm CDT
Abend: Nonmaskable Interrupt Processor Exception (Error code 00000020)
Parity error was generated by the system board.
OS version: Novell NetWare 4.11 June 14, 1996
Running Process: Abendemo Process
Stack: 26 43 0C F1 40 C3 B3 00 0C 9E 87 00 7C D0 0D F1
       40 C3 B3 00 CE A3 87 00 B6 67 23 F1 50 A3 87 00
       40 C3 B3 00 00 00 00 00 A0 9E 87 00 46 30 00 00
Additional Information:
The CPU encountered a problem executing code in ABENDEMO.NLM. The problem may be in that module or in data passed to that module by another NLM.
"Auto Restart After Abend" has been selected. The server will attempt to go down in 2 minutes. This should give users time to save out any files they are using. Because server critical data structures may have been corrupted, the server may encounter additional problems.
2-4-97 7:30:14 pm: SERVER-4.11-4631
WARNING! Server TOAST experienced a critical error. It is going down in 2 minutes. Save your files and logout.
TOAST <1>:
Notice the additional message stating that the server will restart after 2 minutes because the auto restart option was selected. The warning message has also changed to state that the server has experienced a critical error and will be going down. The message warning users to save files and logout will be sent as a broadcast message to all users.
In this case, the number in the prompt did not increment because no thread was suspended. The prompt still reads TOAST <1> from our previous Page Fault abend.
Wait for 2 minutes and watch the server restart. Notice that the operating system does not reboot the computer; it only recycles memory and restarts the server.
Demo 3: Manual Abend Recovery
If an abend keeps recurring, it may be helpful to use the manual abend recovery process described below to obtain more troubleshooting information.
Type "SET Auto Restart After Abend=0" at the server console prompt.
Unload and then reload ABENDEMO.NLM. From the menu of bad behavior, select "Generate a Page Fault at Process Time" again. You should hear a beep, indicating an abend has occurred.
The server immediately displays the following abend screen:
System halted Tuesday, February 4, 1997 6:40:11 pm CDT
Abend: Page Fault Processor Exception (Error code 00000002)
OS version: Novell NetWare 4.11 June 14, 1996
Running Process: Abendemo Process
Stack: 26 43 0C F1 40 C3 B3 00 0C 9E 87 00 7C D0 0D F1
       40 C3 B3 00 CE A3 87 00 B6 67 23 F1 50 A3 87 00
       40 C3 B3 00 00 00 00 00 A0 9E 87 00 46 30 00 00
Additional Information:
The CPU encountered a problem executing code in ABENDEMO.NLM. The problem may be in that module or in data passed to that module by another NLM.
Press:
"S" to suspend the running process and update the ABEND.LOG file.
"Y" to copy diagnostic image to disk (COREDUMP).
"X" to update ABEND.LOG and then exit.
Notice that, as in previous versions of NetWare, the entire server has halted and is awaiting user intervention. You have the familiar options to do a "coredump" (copy an image of the server's memory contents to disk for diagnostic purposes) or to exit and shut down the server. However, there is now an additional option, "S", which allows you to suspend just the running thread and continue.
At this point you could press "Y" to do a coredump if one is requested by Novell Technical Support. After you do so, you will also be given the option to suspend the running thread and keep the server up. Keep in mind that in order to see the option to do a coredump, you must first set the Auto Restart After Abend SET parameter to zero at the console prompt.
Press "S" to suspend the process and keep the server up.
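The behavior seen in the first three demos follows directly from the Auto Restart After Abend value and the type of abend. As a rough summary, the Python sketch below encodes that mapping as described in this AppNote. It is a simplified model, not NetWare code: the `Abend` enum and `recovery_action` function are hypothetical names, and the model ignores the special cases noted earlier (for example, abends at interrupt time, where automatic recovery is not possible).

```python
from enum import Enum


class Abend(Enum):
    PAGE_FAULT = "page fault"
    NMI = "non-maskable interrupt"
    MACHINE_CHECK = "machine check"
    SOFTWARE = "software exception"


def recovery_action(auto_restart, abend):
    """Map an "Auto Restart After Abend" value and abend type to the documented response."""
    if auto_restart == 0:
        # Server halts and waits for the operator, as in Demo 3.
        return "halt and wait for operator (S, Y, or X)"
    if auto_restart == 1 and abend is Abend.PAGE_FAULT:
        # Default behavior for Page Faults, as in Demo 1.
        return "suspend offending process, server stays up"
    if auto_restart in (1, 2):
        # Everything else under 1, and every abend under 2, as in Demo 2.
        return "delayed restart after the configured delay time"
    raise ValueError(f"unsupported setting: {auto_restart}")


if __name__ == "__main__":
    for setting in (0, 1, 2):
        for kind in Abend:
            print(setting, kind.value, "->", recovery_action(setting, kind))
```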
Demo 4: Recovering from a Server Hang Due to a CPU Hog
This demonstration shows how the CPU Hog Timeout Amount SET parameter automatically suspends a process that refuses to relinquish control of the server's CPU.
Type "Set CPU Hog Timeout Amount=15" at the server console prompt. (This is simply for the purposes of this lab, to reduce the time you must wait after generating a hang before the server will suspend the thread and continue running.)
From the server console, unload and reload ABENDEMO.NLM. From the menu of bad behavior, select "Generate a Hang" (see Figure 3). You should hear a beep, indicating an abend has occurred.
Figure 3: Generating a hang with ABENDEMO.NLM.
Press <Alt>+<Esc> to toggle back to the console prompt. You will see the following abend screen:
System halted Tuesday, February 4, 1997 8:00:21 pm CDT
Abend: SERVER-4.11-4431: CPU Hog Detected by Timer
OS version: Novell NetWare 4.11 June 14, 1996
Running Process: Abendemo Process
Stack: 26 43 0C F1 40 C3 B3 00 0C 9E 87 00 7C D0 0D F1
       40 C3 B3 00 CE A3 87 00 B6 67 23 F1 50 A3 87 00
       40 C3 B3 00 00 00 00 00 A0 9E 87 00 46 30 00 00
Additional Information:
The CPU encountered a problem executing code in ABENDEMO.NLM. The problem may be in that module or in data passed to that module by another NLM.
The Running process will be suspended.
2-4-97 8:00:21 pm: SERVER-4.11-4631
WARNING! Server TOAST experienced a critical error. The offending process was suspended or recovered. However, services hosted by this server may have been effected.
TOAST <1>:
At this point the server is still operational; only ABENDEMO has been halted. Notice the abend message states "CPU Hog Detected by Timer", meaning that the running process (ABENDEMO in this example) refused to relinquish control of the CPU and was suspended.
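The idea behind a CPU hog timer can be illustrated with a toy model. The Python sketch below is not how NetWare implements the feature (this AppNote does not describe the internals); it only shows the general watchdog pattern of comparing the time since a thread last yielded against a timeout. The class and method names are hypothetical, and whether the real check uses "greater than" or "greater than or equal to" is an assumption.

```python
class HogWatchdog:
    """Toy model of a CPU hog timer; not NetWare's actual scheduler logic."""

    def __init__(self, timeout_seconds=60):
        self.timeout = timeout_seconds  # 0 disables the check, as with the SET parameter
        self.last_yield = {}            # thread name -> time of its last yield (seconds)
        self.suspended = []             # threads suspended so far

    def yielded(self, thread, now):
        """Record that a thread gave up the CPU at time `now`."""
        self.last_yield[thread] = now

    def tick(self, running, now):
        """Periodic timer check: suspend `running` if it has held the CPU too long."""
        if self.timeout == 0 or running in self.suspended:
            return False
        held = now - self.last_yield.get(running, now)
        if held >= self.timeout:
            self.suspended.append(running)
            return True
        return False


if __name__ == "__main__":
    watchdog = HogWatchdog(timeout_seconds=15)  # the value used in this demo
    watchdog.yielded("Abendemo Process", 0)
    for second in (5, 10, 15):
        print(second, watchdog.tick("Abendemo Process", second))
```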
Reading the ABEND.LOG File
Each time the server experiences an abend, information about the abend is appended to the ABEND.LOG file. If this file does not exist, it is created. It is initially written to the local DOS partition (C: drive) and then copied to the SYS:SYSTEM directory when the server is restarted.
The information in this log file is helpful in:
Keeping an accurate history of abends that took place and when they occurred
Having correct information about which modules were loaded at abend time
Having complete information in an easily accessible form to send to Novell, if requested by Novell Technical Services
To view the contents of the log file, type the following command at the server console prompt:
LOAD EDIT SYS:SYSTEM\ABEND.LOG
This displays the log file in the Edit window. (You can also use any text editor to view the log file.)
Message. The first information written to the ABEND.LOG file is the message for the first abend. This includes the error message, date/time, registers, and the running NLM, as shown in the example below:
Server TOAST halted Tuesday, February 4, 1997 6:40:27 pm
Break: Server-4.11a: Page Fault Processor Exception (Error code 00000002)
Registers:
CS = 0008 DS = 0010 ES = 0010 FS = 0010 GS = 0010 SS = 0010
EAX = 00000000 EBX = 00748F10 ECX = F1000238 EDX = 00000009
ESI = 00742DF0 EDI = 00000000 EBP = 00748ED8 ESP = 00748EC8
EIP = F1000232 FLAGS = 00017246
F1000232 C600CC MOV [EAX]= ?,CC
EIP in ABENDEMO.NLM at code start +00000232h
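If you collect ABEND.LOG files from several servers, the message section is regular enough to summarize with a short script. The following Python sketch is a hypothetical helper, not a Novell utility. It assumes the entry looks like the example above and tolerates any line wrapping; the regular expressions were written against this one sample only.

```python
import re

SAMPLE = """\
Server TOAST halted Tuesday, February 4, 1997 6:40:27 pm
Break: Server-4.11a: Page Fault Processor Exception (Error code 00000002)
Registers:
CS = 0008 DS = 0010 ES = 0010 FS = 0010 GS = 0010 SS = 0010
EAX = 00000000 EBX = 00748F10 ECX = F1000238 EDX = 00000009
ESI = 00742DF0 EDI = 00000000 EBP = 00748ED8 ESP = 00748EC8
EIP = F1000232 FLAGS = 00017246
F1000232 C600CC MOV [EAX]= ?,CC
EIP in ABENDEMO.NLM at code start +00000232h
"""

HALT_RE = re.compile(r"Server (\S+) halted (.+?\d{1,2}:\d{2}:\d{2} [ap]m)")
REG_RE = re.compile(r"\b([A-Z]{2,5}) = ([0-9A-F]+)\b")
EIP_RE = re.compile(r"EIP in (\S+) at code start \+([0-9A-F]+)h")


def parse_abend_header(text):
    """Extract server name, timestamp, registers, and faulting module from one entry."""
    halt = HALT_RE.search(text)
    eip = EIP_RE.search(text)
    if halt is None or eip is None:
        raise ValueError("text does not look like an ABEND.LOG message section")
    return {
        "server": halt.group(1),
        "when": halt.group(2),
        "registers": {name: int(value, 16) for name, value in REG_RE.findall(text)},
        "module": eip.group(1),
        "offset": int(eip.group(2), 16),
    }


if __name__ == "__main__":
    info = parse_abend_header(SAMPLE)
    print(info["server"], info["module"], hex(info["offset"]))
```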
Processes. The next section in the ABEND.LOG file is the running process and stack information. This is useful to see which NLM was active when the server abended and what other processes were on the stack prior to the abend:
Running process: Abendemo Process
Created by: ABENDEMO.NLM
Stack pointer: 748CE0
Stack limit: 745000
Scheduling priority: 0
Wait state: 00
Stack: F10002C1 (ABENDEMO.NLM|MenuAction+89)
F8113585 (NWSNUT.NLM|NWSMenuAction+30)
--00000008 ?
--00000000 ?
--00748F28 ?
--00741010 ?
--00000001 ?
F81139F3 (NWSNUT.NLM|NWSLList+320)
--00000010 ?
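In the stack listing, the entries that name a module and symbol are the ones that point into code; the entries marked with "--" and "?" appear to be values the log could not resolve to a symbol. A short Python sketch (hypothetical, written against the sample above only) can pull out just the resolved entries to give a rough call trail. The offset after "+" is kept as text because the log excerpt does not state its number base.

```python
import re

STACK_TEXT = """\
Stack: F10002C1 (ABENDEMO.NLM|MenuAction+89)
F8113585 (NWSNUT.NLM|NWSMenuAction+30)
--00000008 ?
--00000000 ?
--00748F28 ?
--00741010 ?
--00000001 ?
F81139F3 (NWSNUT.NLM|NWSLList+320)
--00000010 ?
"""

# address (MODULE|Symbol+offset)
FRAME_RE = re.compile(r"([0-9A-F]{8}) \((\S+?)\|(\w+)\+([0-9A-F]+)\)")


def code_frames(text):
    """Return (address, module, symbol, offset_text) for each resolved stack entry."""
    return [
        (int(address, 16), module, symbol, offset)
        for address, module, symbol, offset in FRAME_RE.findall(text)
    ]


if __name__ == "__main__":
    for address, module, symbol, offset in code_frames(STACK_TEXT):
        print(f"{address:08X}  {module}  {symbol}+{offset}")
```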
Modules. The last piece written to the log file is a list of loaded modules, including versions, date and addresses. This information is very helpful to know exactly what was loaded at the time of the abend.
Loaded Modules:
SERVER.NLM NetWare Server Operating System
Version 4.11 June 14, 1996
Code Address: F8000000h Length: 00100000h
Data Address: F0611000h Length: 000DF000h
ABENDEMO.NLM NetWare 386 Statistics Utility
Version 3.00 April 4, 1996
Code Address: F1000000h Length: 00000456h
Data Address: 00028000h Length: 00000368h
NWSNUT.NLM NetWare NLM Utility User Interface
Version 4.15 April 29, 1996
Code Address: F8110000h Length: 00011B3Ah
Data Address: F900E000h Length: 0000071Ch
Global Code Address: F5000000h Length: 00001000h
Global Data Address: F0000000h Length: 00001000h
DIAG411.NLM Diagnostic/coredump utility for NW v4.11 (960520)
Version 1.01 May 20, 1996
Code Address: F810D000h Length: 00002AA5h
Data Address: F900A000h Length: 0000315Ch
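The module list also lets you check by hand which NLM an address belongs to: an address lies in a module's code if it falls between the code start and the code start plus the length. For the sample above, EIP F1000232 minus the ABENDEMO.NLM code address F1000000 gives 232h, which agrees with the "code start +00000232h" line in the message section. The Python sketch below automates that lookup; it is a hypothetical helper based only on the sample layout shown here.

```python
import re

MODULES_TEXT = """\
SERVER.NLM NetWare Server Operating System
Version 4.11 June 14, 1996
Code Address: F8000000h Length: 00100000h
Data Address: F0611000h Length: 000DF000h
ABENDEMO.NLM NetWare 386 Statistics Utility
Version 3.00 April 4, 1996
Code Address: F1000000h Length: 00000456h
Data Address: 00028000h Length: 00000368h
"""

# Module name, then the first "Code Address" entry that follows it.
MODULE_RE = re.compile(
    r"(\S+\.NLM)\b.*?Code Address: ([0-9A-F]+)h Length: ([0-9A-F]+)h", re.DOTALL
)


def code_ranges(text):
    """Return {module: (code_start, code_length)} parsed from the module list."""
    return {
        name: (int(start, 16), int(length, 16))
        for name, start, length in MODULE_RE.findall(text)
    }


def locate(address, ranges):
    """Return (module, offset_from_code_start), or None if no module contains it."""
    for name, (start, length) in ranges.items():
        if start <= address < start + length:
            return name, address - start
    return None


if __name__ == "__main__":
    ranges = code_ranges(MODULES_TEXT)
    module, offset = locate(0xF1000232, ranges)
    print(module, f"+{offset:08X}h")
```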
This "mini-dump" of information written to the ABEND.LOG file is not intended to replace full coredumps of server memory. However, it is a powerful debugging tool to keep an accurate log of abends that occur on the server, identifying the abend name along with other processes, and modules that were loaded at abend time.
Interaction with Network Management Applications
Many customers have asked whether Simple Network Management Protocol (SNMP)-based applications will be able to trap abend recovery error messages. Since these messages are standard NetWare Alerts, applications which know how to deal with NetWare Alerts can simply add these messages to the collection they already have.
Note: As of this writing, this functionality has not yet been added to the ManageWise trap agent and therefore abend alerts are currently discarded in ManageWise. Check Novell's web site (http://www.novell.com) for information about trap agent updates.
The following information provides details about the two new abend recovery alerts and who is notified.
Alert 1. The structure of the first new alert message is:
NetWareAlertStructure = AbendRecoveryAlert
#define nmAbendRecovery 286
The text of this message is:
WARNING! Server %S has experienced a critical error. It is going down in %d minutes. Save your files and logout.
where %S is the server name and %d is the delay time (in minutes) set in the Auto Restart After Abend Delay Time SET parameter. This message is displayed only when the abend meets the conditions of a restart resulting in a fresh image in memory (refer to the Auto Restart After Abend SET parameter).
This message is sent every two minutes to the error log, the console screen, and all connected users:
NOTIFY_ERROR_LOG_BIT = Error log
NOTIFY_CONSOLE_BIT = Console
NOTIFY_EVERYONE_BIT = All connected users
Alert 2. The structure of the second new alert message is:
NetWareAlertStructure = AbendRecoveredAlert
#define nmAbendRecovered 288
The text of this message is:
WARNING! Server %S experienced a critical error. The offending process was suspended or recovered. However, services hosted by this server may have been effected.
where %S is the server name. This message is sent when an abend has occurred and the server has recovered without having to restart or refresh its memory.
This message appears only once at the time of the abend. The console prompt changes as described above to help indicate (at a quick glance) that a problem has occurred at the server. This message is sent to:
NOTIFY_ERROR_LOG_BIT = Error log
NOTIFY_CONSOLE_BIT = Console
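For anyone building a management tool that recognizes these two alerts, the details above fit in a small lookup table. The Python sketch below is illustrative only and is not a Novell API. The alert numbers and message texts come from this AppNote, but the numeric values of the NOTIFY bits are not given here, so the sketch uses placeholder flag values; the `Notify`, `ALERTS`, and `render` names are hypothetical.

```python
from enum import Flag, auto


class Notify(Flag):
    """Notification targets named in the text. Their real bit values are not
    given there, so auto() assigns placeholder values for illustration."""
    ERROR_LOG = auto()
    CONSOLE = auto()
    EVERYONE = auto()


ALERTS = {
    286: {  # nmAbendRecovery: the server will restart after the delay
        "text": ("WARNING! Server %S has experienced a critical error. "
                 "It is going down in %d minutes. Save your files and logout."),
        "targets": Notify.ERROR_LOG | Notify.CONSOLE | Notify.EVERYONE,
    },
    288: {  # nmAbendRecovered: the process was suspended and the server stays up
        "text": ("WARNING! Server %S experienced a critical error. "
                 "The offending process was suspended or recovered. However, "
                 "services hosted by this server may have been effected."),
        "targets": Notify.ERROR_LOG | Notify.CONSOLE,
    },
}


def render(alert_id, server, minutes=None):
    """Fill in the %S (server name) and %d (delay in minutes) placeholders."""
    text = ALERTS[alert_id]["text"].replace("%S", server)
    if "%d" in text:
        if minutes is None:
            raise ValueError("this alert needs a delay in minutes")
        text = text.replace("%d", str(minutes))
    return text


if __name__ == "__main__":
    print(render(286, "TOAST", 2))
    print(render(288, "TOAST"))
```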
Using the improved abend recovery options in NetWare 4.11, IntranetWare servers can automatically recover from most abends. By allowing users time to save their files and providing for the proper dismounting of volumes, these features help you avoid time-consuming file system repairs. They also protect the server from faulty NLMs that dominate the CPU and cause the server to appear hung.
The information written to the ABEND.LOG file provides a good basis for troubleshooting the cause of the abend. Any time you experience a server abend or other critical condition, you should follow the guidelines outlined in TID 2917538, "Troubleshooting Operating System Abends," available on the Novell Support Connection CD-ROM or web site (http://support.novell.com).
For more general information about the different types of abends and how to obtain server coredumps, see "Resolving Critical Server Issues" in the February 1995 Novell Application Notes and "Abend Recovery Techniques for NetWare 3 and 4 Servers" in the June 1995 Novell Application Notes.