How to Troubleshoot Server Abends
01 Sep 2002
Taken from Technical Information Database #10017179 and #2917538 and
modified to include NetWare 6 information
from Kevin Burnett
If you are a Novell developer, you may find that your test servers Abend (stop operating) from time to time. Here are some suggestions on how to troubleshoot server Abends, thus reducing calls to Novell Technical Support. Novell's Support Web site offers a wealth of information about how to troubleshoot Abends. Some of the highlights are given below.
What Is a Server Abend?
The NetWare 6 and 5.x operating systems continually monitor the status of various server activities to ensure proper operation. If NetWare detects a condition that threatens the integrity of its internal data (such as an invalid parameter being passed in a function call, or certain hardware errors), it abruptly halts the active process and displays an "Abend" message on the screen. (Abend is a computer science term signifying an Abnormal END of program.)
When an Abend message appears on the server console, either NetWare or the server CPU has detected a critical error condition (fault) and has jumped into the NetWare fault handler. This handler idles the NetWare OS and displays the Abend message on the server console so the server administrator can take immediate action.
Error conditions come in two flavors:
An error condition, or fault, that is detected by the CPU is called a "Processor Exception."
An error condition that is detected by NetWare is called a "Software Exception."
The NetWare fault handler (function) is declared as a public function by the operating system so it can be used by any operating system module or Novell NLM. (See "Abend Recovery Techniques for NetWare Servers," Novell Application Notes, June 1995.)
The primary reason for Abends in NetWare is to ensure the stability and integrity of the internal operating system data. For example, if the operating system detected invalid pointers to cache buffers and yet continued to run, data would soon become unusable or corrupted. Thus an Abend is NetWare's way of protecting itself and users against the unpredictable effects of data corruption. (For more information on this issue, see "Resolving Critical Server Issues," Novell Application Notes, Feb. 1995.)
In this discussion, the term Abend will be used in its most generic sense, meaning any Software or Processor Exception, or any case where the server hangs or locks.
Note: Be aware that an NMI Parity error (Abend: Non-Maskable Interrupt) is a hardware error.
Troubleshooting an Abend: The Preliminary Steps
Any time you experience a server Abend (or almost any other server problem), consider the following maintenance steps before you do anything else. Experience has shown that a high percentage of Abends are corrected by applying current versions of the Operating System (OS) patches, and by updating LAN and disk drivers.
Novell Technical Support (NTS) strongly recommends that, regardless of the problem you are experiencing, you should apply the current patches and update drivers. See the Minimum Patch List at http://support.novell.com/misc/patlst.htm .
After this is done, consider the other items on the list that follows. The list is in no particular order.
Load all the current patches that are available for your version of NetWare. The patches are written to resolve known issues. If you are using a version of NetWare prior to 4.11, the patch download file will be called <OS version>PT<file revision number or letter>.EXE for the "PT" set, and <OS version>IT<file revision>.EXE for the "IT" set (for example, 410pt3.exe and 410it6.exe). Load both the "PT" patches and the "IT" patches. (See the Minimum Patch List for a detailed explanation of the patches.)
If you are using NetWare 4.11, or NetWare 4.2, all patches and updates are in the support pack, NW4SP<file revision number>.EXE. There are no "PT" or "IT" files for NetWare 4.11. If you are using NetWare 5, the patch name will be NW5SP<file revision number>.EXE.
Update Drivers. Each manufacturer of LAN and disk cards must develop their own drivers. The only way to assure that you have the latest version of these drivers is to download them from the respective vendor. Even new hardware does not usually ship with the most current drivers. Be certain that drivers are the newest available from the respective vendor.
Look for NLMs (NetWare Loadable Modules) that may be outdated. Remember, every time Novell updates a file, it is to make it more robust and usable. Commonly updated NetWare NLMs are compressed into self-extracting download files. Get a list of the current revisions of files and patches from the Minimum Patch List. Also, don't forget to check with third-party vendors to update their NLMs as well.
Run a virus scan on the server's DOS partition to make sure there are no viruses.
Clean and re-seat the cards and cables (always use precautions against static electricity).
Double check termination and SCSI IDs and so on.
Check fans for proper operation.
Use an anti-static air can to blow dust off of the system board and other cards and components.
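The patch file naming convention described in the first item of the list above can be captured in a few lines of Python. This is only a sketch for building the file name you expect to find on the support site; the function names are hypothetical, and the revision values in the usage lines are placeholders apart from the 410pt3.exe and 410it6.exe examples given above.

```python
def legacy_patch_name(os_version: str, patch_set: str, revision: str) -> str:
    """Build a pre-4.11 patch file name: <OS version>PT<rev>.EXE or <OS version>IT<rev>.EXE."""
    patch_set = patch_set.upper()
    if patch_set not in ("PT", "IT"):
        raise ValueError("patch_set must be 'PT' or 'IT'")
    return f"{os_version}{patch_set}{revision}.EXE"


def support_pack_name(major: int, revision: str) -> str:
    """Build a support pack name: NW4SP<rev>.EXE (NetWare 4.11/4.2) or NW5SP<rev>.EXE (NetWare 5)."""
    if major not in (4, 5):
        raise ValueError("only the NW4SP and NW5SP naming schemes are described here")
    return f"NW{major}SP{revision}.EXE"


print(legacy_patch_name("410", "PT", "3"))  # 410PT3.EXE
print(legacy_patch_name("410", "IT", "6"))  # 410IT6.EXE
print(support_pack_name(5, "1"))            # NW5SP1.EXE (revision "1" is a placeholder)
```

Always confirm the current revision against the Minimum Patch List rather than guessing it.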
Troubleshooting an Abend: Data Collection
If done properly, this step can be the most valuable toward identifying your problem. The most common shortcoming when performing data collection is not looking far enough beyond the immediate symptoms.
The following is a partial list of things to look for and questions to ask yourself. Don't feel embarrassed to raise a question that seems completely unrelated. Sometimes it is unrelated, but often it helps to get a more complete picture of the problem.
The questions and suggestions that follow should help you gain a higher level view of the problem. Sometimes your data collection will reveal more than one potential problem and you'll have to perform "computer triage." However, without a complete picture, you could waste your time troubleshooting in a meaningless direction.
Use this list to get you thinking about the kinds of things that could be happening on your server/network to cause problems. These are in no particular order.
Keep a record of the Abend messages and watch for trends. Abend messages that are consistent sometimes indicate that software is at fault, while messages that are not consistent may indicate a hardware failure. There is no hard-and-fast rule here; you simply want the data to point you in some troubleshooting direction.
Look through the system error log (SYS:SYSTEM\SYS$LOG.ERR) for clues that haven't surfaced anywhere else. For example, an error on a certain node just before the Abend, or an error on a certain file, a print queue, volume dismounts, etc.
What resources are being used at the time of the Abend (printing, file, tape, COM port, etc.)?
What time of day does the Abend occur? Is it consistent?
Is the room air conditioning off when the machine Abends? This may indicate a heat problem.
What is the environment like? Dry, dusty, or hot environments contribute to heat problems and static.
Are there power problems either at the power source or at the power supply?
Is there a certain database function running, such as reindexing?
Is LAN traffic high? Are disk reads or writes high?
With the Abend still on the screen, break into the debugger and record basic information such as the EIP (instruction pointer), running NLM, and running process.
Use CONLOG.NLM to capture console messages you would otherwise miss.
Note: CONLOG must be unloaded to close the log file and make it available to read. (See CONLOG.NLM on the Novell Technical Support Web site at http://support.novell.com.)
Is there a certain user, segment, application, server process (like a backup), or anything else that is common or consistent when the Abend occurs?
Is the client software current?
Ask, "What has changed in my server environment?" Consider questions along these lines:
Has the number of users increased?
Have any new software or software upgrades been installed recently?
Is someone using software in a way that is different than how it has been used previously, such as database indexing?
Is there new or different hardware?
Have there been changes to the LAN, the routers, or the cabling?
Have workstations or the file server been physically moved?
Are there new printers on the LAN?
Have there been any power outages?
Have the server's SET parameters been changed?
Is there any new or strong source of electromagnetic interference (EMI) near the server or cabling? This would include large motors, cabling run across fluorescent lights, vacuum cleaners, transmitters, and so on.
Has the hardware been handled without static protection?
At the server command prompt, type "SET DSTRACE=ON" and watch the DSTRACE screen for errors, or for an "All processed = NO" message. Be sure to allow time for any NDS errors to go away before you worry too much. Fifteen minutes to several hours is usually adequate.
Are any users dropping their connection?
Are there signs of file corruption?
Are there signs of drive deactivation? If partitions are mirrored, the drive can deactivate without bringing the server down. Check the error log.
Are there any printing problems?
Is power filtered? If so, has the filtering hardware been tested recently and is it still functioning?
MONITOR.NLM and INSTALL.NLM are valuable NetWare utilities for checking your server's health. Use them to find information such as:
Climbing packet receive buffers
No ECB available count that continues to climb
Low server memory. Cache buffer percentage should usually be around 60-70% or higher.
LRU sitting time
Dirty cache buffers that stay high
A high number of LAN errors (more than 10% of the total packets sent or received)
High utilization (if it stays high for more than 10 or 20 minutes at a time)
Check Service Processes to see if they have maxed out
Partition, volume, and mirroring information
View and edit NCF files
An invaluable tool for data collection is CONFIG.NLM (included in TABND2.EXE). When you run it at the server, CONFIG.NLM will create a file which includes information about your server's configuration.
You may notice something in the configuration report that you had not noticed before but that now raises a "red flag." Use it to document your configuration before you make any changes to the server. Also, if you place a call to Tech Support, you will often be asked for this information.
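One way to watch for trends in your record of Abend messages, as the first item in the list above suggests, is to tally how often each message recurs. The following Python sketch assumes you have typed the messages into a list by hand; the function names and the 60% cutoff for "consistent" are assumptions made for illustration, not a Novell rule, and the result is only a hint about where to look first.

```python
from collections import Counter
from typing import List, Tuple


def summarize_abends(messages: List[str]) -> Tuple[str, float]:
    """Return the most frequent message and the share of all Abends it accounts for."""
    if not messages:
        raise ValueError("no Abend messages recorded")
    counts = Counter(m.strip().lower() for m in messages)
    message, count = counts.most_common(1)[0]
    return message, count / len(messages)


def trend_hint(messages: List[str], consistent_share: float = 0.6) -> str:
    """Turn the tally into a starting direction (assumed cutoff, no hard-and-fast rule)."""
    _, share = summarize_abends(messages)
    if share >= consistent_share:
        return "consistent: look at software first"
    return "inconsistent: look at hardware first"


# Messages recorded by hand from the console (both appear in this article).
recorded = [
    "Abend: DeallocateMappedPage was supplied an invalid memory pointer.",
    "Abend: DeallocateMappedPage was supplied an invalid memory pointer.",
    "Abend: DeallocateMappedPage was supplied an invalid memory pointer.",
    "Abend: Non-Maskable Interrupt",
]
print(trend_hint(recorded))
```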
Note: It is important to establish what is normal for your environment so that you can accurately determine when you have a real problem and when you have simply hit against the limitations of your hardware and/or software.
Note: Sometimes understanding the data you've collected will require you to find out from other sources if what you are seeing is normal. For example, it's common for the server's utilization to stay at 100% for a few seconds or even a few minutes or more.
Note: It is also OK to get DSTRACE errors. If Directory Services is trying to process a request, and another request is already in process, you can get an NDS error until the first process completes.
Note: Likewise, the allocation of packet receive buffers and the size of the directory entry table are dynamic up to the settable maximum. They are allocated on an as-needed basis. It is often only through experience that you'll determine if what you're seeing is normal or if it is indicating a problem. Watch your server to determine what is normal in your environment and then tweak SET parameters or make other changes as needed.
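The rules of thumb from the MONITOR list above (cache buffers around 60-70% or higher, LAN errors above 10% of total packets, utilization that stays high for more than 10 or 20 minutes, and maxed-out service processes) can be written down as a simple checklist. The Python sketch below assumes you copy the numbers from the MONITOR screens by hand; the class, field names, and exact cutoffs are hypothetical and should be adjusted to what is normal in your environment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MonitorReading:
    """Values copied by hand from the MONITOR screens (field names are illustrative)."""
    cache_buffer_pct: float
    lan_errors: int
    total_packets: int
    minutes_at_high_utilization: float
    service_processes: int
    max_service_processes: int


def red_flags(r: MonitorReading, min_cache_pct: float = 60.0,
              max_high_util_minutes: float = 10.0) -> List[str]:
    """Return the rules of thumb that this reading violates."""
    flags = []
    if r.cache_buffer_pct < min_cache_pct:
        flags.append("low cache buffer percentage")
    if r.total_packets > 0 and r.lan_errors / r.total_packets > 0.10:
        flags.append("high LAN error rate")
    if r.minutes_at_high_utilization > max_high_util_minutes:
        flags.append("sustained high utilization")
    if r.service_processes >= r.max_service_processes:
        flags.append("service processes maxed out")
    return flags


reading = MonitorReading(cache_buffer_pct=45, lan_errors=150, total_packets=1000,
                         minutes_at_high_utilization=25, service_processes=40,
                         max_service_processes=40)
for flag in red_flags(reading):
    print(flag)
```

A short burst of 100% utilization would not be flagged here, which matches the note above that a few minutes at 100% is common.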
Troubleshooting an Abend: Narrowing (Isolation) and Duplication
Now that the preliminary steps have been covered and the initial data collected, troubleshooting is primarily a matter of going back and forth between "Problem Isolation" and "Problem Duplication." You are trying to narrow in on the problem, while at the same time discover a sequence of events that will reproduce the problem.
In simplest terms, an Abend is caused either by a hardware failure or by a misbehaving NLM. In either case the result is usually corrupted memory. Remember from the introduction, a software exception occurs when NetWare fails a consistency check (performed in memory, on memory). A processor exception occurs when the processor encounters an address or machine instruction that does not comply with the rules (again resulting from corrupted memory).
Problem isolation and problem duplication are almost the same. The main difference is that in problem duplication, you are specifically trying to reproduce a problem. There may be an NLM which causes the server to Abend every time it loads. Or perhaps the server Abends when someone performs a large file copy while someone else is logging in. If you are able to find a reproducible problem like this, you can now eliminate variables one at a time, try the test again, and see if the problem goes away.
In the other case, problem isolation (narrowing), the data may not have given you a clue so you have to probe around, trying different things to see if you can narrow the problem down to a system or component. You may be able to determine that the problem is isolated to the disk channel because the server only abends when the disk is being accessed. Or you may be able to relate the problem to a certain NLM, such as when you perform a backup.
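For a reproducible problem, the "eliminate variables one at a time" approach can be sketched as a simple loop. In the Python sketch below, the variables are NLM names and the still_abends callback stands in for you re-running the test by hand with that set of modules loaded; both the function name and the module names are hypothetical.

```python
from typing import Callable, Iterable, List, Set


def find_suspects(variables: Iterable[str],
                  still_abends: Callable[[Set[str]], bool]) -> List[str]:
    """Remove one variable at a time; a variable is a suspect if the
    problem goes away when only that variable is removed."""
    all_vars = list(variables)
    suspects = []
    for candidate in all_vars:
        remaining = set(all_vars) - {candidate}
        if not still_abends(remaining):
            suspects.append(candidate)
    return suspects


# Simulated test result: the server abends whenever the (hypothetical) BACKUP.NLM is loaded.
loaded = ["BACKUP.NLM", "AVSCAN.NLM", "LANMON.NLM"]
print(find_suspects(loaded, lambda modules: "BACKUP.NLM" in modules))
```

If the loop returns no suspects, the problem may depend on a combination of variables, or on something outside the list, and you are back to problem isolation.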
Consider these systems when trying to isolate a problem:
Disk channel
LAN channel
System board
Com port
NetWare Operating System
Third-party NLM product
Cabling
A certain type of workstation
A certain type of shell (VLM, client32, or a third-party client).
Remember that the main objective is to find a sequence of events that will reproduce the problem, or at least narrow down the problem to a system, an NLM, or a piece of hardware that is always involved when the Abend occurs. The following troubleshooting ideas should help you to "divide and conquer." This list is in no particular order, but it is grouped somewhat by LAN, disk, and general troubleshooting ideas.
Use "SERVER -NS" or "SERVER -NA" to bring the server up without executing the STARTUP.NCF or the AUTOEXEC.NCF files respectively. Loading "SERVER -NS" will allow you to bring up the server without the volume mounting automatically. These parameters also work for SFT III with the "ACTIVATE SERVER" command.
Does the Abend message itself suggest anything? This could include the LAN channel, disk channel, memory corruption, system board problem, a certain NLM, printing, a certain piece of hardware, a certain LAN segment, a workstation, a router, an environmental condition, etc. NetWare 4.11, NetWare 4.2, and NetWare 5 will create a file called ABEND.LOG in the SYS:SYSTEM directory that will contain the current and previous abend messages, as well as a list of modules loaded at the time of abend.
Use "SERVER -NA" to prevent the AUTOEXEC.NCF from running. Then load the NLMs manually, one at a time.
Use "SERVER -NDB" to prevent the DS database from loading and thereby eliminate Directory Services as a point of failure. Note, however, that you won't be able to log in without the database loaded.
What is the age of the hardware? If nothing has changed in the environment, then the hardware may have simply failed. Don't assume that new hardware is always good.
Could the hardware have received a static shock? Static is not always immediately destructive; often it causes degenerative damage, allowing the hardware to continue working for a time before it fails.
Check with third-party product vendors to see if they are aware of the kind of problem you are experiencing.
Temporarily unload any third-party NLMs.
Temporarily unload virus scan software, as well as server/LAN monitoring NLMs.
Could there be power problems either at the power source or from the power supply?
Check the cooling fan in the power supply, the case, and on the CPU. Heat will cause hardware failure.
A dry, hot or dusty environment can cause hardware degradation due to static electric discharge. It also increases the chance of NMI errors.
Avoid Interrupts 15, 2, and 9, in that order.
Try to isolate the problem to a hardware subsystem, such as the LAN channel, disk channel, and system board. You can swap hardware or try a different interrupt or slot.
Although the Abend message is very generic, it can still be used to point you in a particular direction. Most Abends indicate memory corruption. Some will be disk related, while others are LAN related.
Often an Abend message will include a function name. For example: "Abend: DeallocateMappedPage was supplied an invalid memory pointer." The words that appear without spaces (DeallocateMappedPage) are the name of a function in NetWare's code. In this case, a memory pointer was passed to the DeallocateMappedPage function. During a consistency check the function determined that the pointer was not a valid address.
When an Abend mentions words like "interrupted," take a look at the LAN simply because the LAN does more interrupting than anything else in the server. As is always the case, this becomes more intuitive through experience.
Clean and re-seat the cards and cables. Remember static protection.
How long has the server been installed? If it is a new install (less than a month old) you may still have configuration and setup issues, or you could have faulty hardware, even if it is new.
Verify that you have the most current BIOS revision for your hardware.
Run hardware diagnostics on the server machine, if available.
If the machine is having trouble with the LAN channel, try swapping out the LAN card for one from another manufacturer.
SERVER.EXE, like any other file, can become corrupt. A corrupt SERVER.EXE can be difficult to track down. If you can reduce the environment to not much more than a server, a disk and a LAN, and you still have the problem, try a fresh copy of SERVER.EXE. The same idea applies to any other NLM on the server. But be sure to copy the correct SERVER.EXE file.
Run DSREPAIR.
Run the Install utility and view partition or mirroring information. The Install program retrieves this information dynamically. There can be partition table corruption, for example, that is not surfacing as such. When you try to access partition information in install, you'll get a specific error because the table cannot be read.
Always double and triple check termination, interrupt settings, SCSI ID, drive translation (should be off), and so on.
Run VREPAIR. If VREPAIR runs clean (zero errors) and you still suspect disk corruption, change the VREPAIR options. Option 2 from the main menu will change the VREPAIR options. Set them to "Write changes immediately...," "Write all changes....," and "Purge deleted files." (This is not the exact syntax.)
Don't be alarmed if you see a lot of errors after these changes. Because of the nature of what VREPAIR is doing, Novell recommends that you always have a verified good backup before you run VREPAIR.
Try different workstations.
Try workstations on different segments.
Can you isolate the LAN completely by attaching a single workstation directly to the server?
Category 5 cabling is usually required on faster LAN cards. Running the wrong cable can cause problems.
Heavy I/O will sometimes stress the server enough to force an error. Try doing a COPY *.* or an XCOPY continuously. Try it from several workstations.
Use NCOPY to copy a file if the source and destination are on the same server. NCOPY will not send any packets across the LAN in this scenario. If you still have the problem, you know the LAN is not part of the problem. Any other form of copy (COPY, XCOPY), where the server is the source or destination, will send the file across the wire even if it is just going to be sent back a second time across the wire. If COPY has the problem but NCOPY does not, then the problem is probably on the LAN.
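The NCOPY versus COPY comparison in the last item above boils down to a small decision table. The Python sketch below only restates that reasoning; the function name and the wording of the results are illustrative, and the two inputs are simply whether you saw the problem during each test.

```python
def interpret_copy_tests(copy_fails: bool, ncopy_fails: bool) -> str:
    """Interpret a same-server copy test run once with COPY/XCOPY and once with NCOPY."""
    if ncopy_fails:
        # NCOPY keeps the data on the server, so the problem occurs without LAN traffic.
        return "LAN not involved: look at the server side"
    if copy_fails:
        # Only the copy that crosses the wire has the problem.
        return "LAN probable: problem appears only when data crosses the wire"
    return "inconclusive: neither test reproduced the problem"


print(interpret_copy_tests(copy_fails=True, ncopy_fails=False))
```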
* Originally published in Novell AppNotes