Basic Troubleshooting Steps for Critical Server Issues
Novell Technical Services
01 May 1998
Networking involves a wide variety of components working together to form a functional whole. Each component has one or more specific relationships to other components in the system. Over time, various components are added, removed, and reconfigured as the network adapts to changing business needs. While these characteristics of networks allow flexibility, they can make it difficult to pinpoint the exact cause of problems when something goes wrong.
The troubleshooting steps outlined in this NetNote can be followed for most ABEND (ABnormal END) errors and hang situations on NetWare servers. By following these steps, you can narrow down the possible causes and provide more accurate information for the support technician if needed.
1. Gather information about the problem.
2. Identify probable causes of the problem.
3. Test possible solutions.
4. Use debugging tools, if necessary.
5. Resolve the problem.
Step 1. Gather Information about the Problem
It is a good idea to record basic information about your server before any problems occur. This information will prove invaluable in your investigation of possible causes for an ABEND or hang condition.
Compile a list of all the server's hardware components. Include Novell testing and certification information on these components. Using certified hardware is highly recommended to reduce the chance of hardware-related problems.
Put together a complete listing of the LAN and disk drivers that are loaded on the server, along with their dates and versions. Again, many compatibility problems can be avoided by using drivers that have been certified by Novell.
It is helpful to have a complete list of the NetWare Loadable Modules (NLMs) that are running on the server, with date and version information for each one.
Obtain a listing of the server's STARTUP.NCF and AUTOEXEC.NCF files, which specify the server console commands to execute when the server boots.
Network administrators should maintain a log for each server to record both hardware and software changes. When a problem arises, these records can help determine if the server has a history of stable operation, and whether the problem could be related to a recent change to the system.
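As a rough sketch of how such a baseline record and change log could be kept in machine-readable form, the following Python example models one server's hardware list, loaded modules, startup files, and change history. The class names (ServerRecord, Module, ChangeEntry), the field names, and any sample values are hypothetical illustrations, not part of any Novell tool.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Module:
    """A LAN driver, disk driver, or NLM loaded on the server."""
    name: str
    version: str
    file_date: date
    certified: bool = False  # True if the module is known to be Novell certified


@dataclass
class ChangeEntry:
    """One hardware or software change made to the server."""
    when: date
    description: str


@dataclass
class ServerRecord:
    """Baseline information recorded before any problem occurs."""
    name: str
    hardware: List[str] = field(default_factory=list)
    modules: List[Module] = field(default_factory=list)
    startup_ncf: str = ""   # contents of STARTUP.NCF
    autoexec_ncf: str = ""  # contents of AUTOEXEC.NCF
    change_log: List[ChangeEntry] = field(default_factory=list)

    def log_change(self, when: date, description: str) -> None:
        """Record a change and keep the log in chronological order."""
        self.change_log.append(ChangeEntry(when, description))
        self.change_log.sort(key=lambda entry: entry.when)

    def uncertified_modules(self) -> List[str]:
        """Names of loaded modules that are not marked as certified."""
        return [m.name for m in self.modules if not m.certified]
```

Keeping the change log sorted by date makes it easy to see at a glance whether a failure follows closely after a recent change.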
When a critical server issue arises, it is helpful to note what activities were taking place on the network just prior to the ABEND or hang. These might include events such as system maintenance (backups or database rebuilds), system failures, error messages, and warnings. Also note user activities, such as a heavy workload due to month-end closing.
Any error messages that occurred near the time of the server problem need to be gathered and analyzed. Sources include:
The ABEND information screen and the ABEND.LOG file provide valuable information on the state of the server at the time of an ABEND.
Some console messages might still be displayed on the server console screen. (The CONSOLE.LOG file produced by loading CONSOLE.NLM captures important information and messages that would otherwise scroll off the main server console screen.)
The system error log (SYS$LOG.ERR), volume error log (VOL$LOG.ERR), and Transaction Tracking System (TTS) error log (TTS$LOG.ERR) provide information about errors relating to the server, a particular volume, or TTS, respectively.
Some NLMs have their own debug screens or write log files that can help in the debugging process.
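Once messages from these sources have been collected, it helps to narrow them to those logged close to the time of the failure. The following Python sketch shows one way to do that. It assumes the messages have already been transcribed into (timestamp, source, text) tuples; parsing the log files themselves is deliberately left out, since their formats are not described here. The function name messages_near and the 30-minute default window are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def messages_near(entries, failure_time, window_minutes=30):
    """Return the entries logged within window_minutes of failure_time.

    entries is an iterable of (timestamp, source, text) tuples, where
    timestamp is a datetime. The result is sorted oldest first.
    """
    window = timedelta(minutes=window_minutes)
    nearby = [e for e in entries if abs(e[0] - failure_time) <= window]
    return sorted(nearby, key=lambda e: e[0])


if __name__ == "__main__":
    # Made-up sample data; the message texts are illustrative only.
    sample = [
        (datetime(1998, 5, 1, 2, 50), "CONSOLE.LOG", "example console message"),
        (datetime(1998, 4, 28, 9, 0), "SYS$LOG.ERR", "example older message"),
    ]
    for when, source, text in messages_near(sample, datetime(1998, 5, 1, 3, 0)):
        print(when, source, text)
```

Merging the entries from all sources into one time-ordered list makes it easier to see the sequence of events leading up to the ABEND or hang.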
Step 2. Understand the Problem and Identify Probable Causes
Understanding the problem comes by asking questions about the information and facts gathered in Step 1. Some of the types of questions you might ask include the following:
Has the server's hardware configuration changed from one that has been tested and certified?
Which drivers and NLMs are loaded for this hardware configuration? Are all of these drivers and NLMs up to date?
When did this problem occur? For example, did it occur while trying to boot the file server, and if so, at what point did the failure occur?
If the server is in a hung state, what can you still do at the server? (For example, try toggling to different screens to determine whether the server is partially or totally locked up.)
What information from the error log files might relate to the ABEND message or hang?
Once you have a good understanding of the problem, try to identify some probable causes by drawing conclusions from the information gathered and forming one or more hypotheses.
As an example, suppose you just added a new network interface card to the server. The next time you bring up the server, it hangs. After going through the information-gathering suggestions listed above, you arrive at two possible causes:
Hypothesis 1. Since the last change made to the server was the addition of a new network card, there is a good chance that the card is the cause of the problem.
Hypothesis 2. The server might be experiencing file corruption resulting from a power outage or drive failure.
The more information you have, the easier it is to form your hypotheses. For example, if there was an error message in the CONSOLE.LOG file stating that the network card did not bind properly, this additional information helps you further isolate the problem.
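The reasoning behind Hypothesis 1 (suspect the most recent change first) can be sketched in a few lines of Python. The example assumes the change log is available as (date, description) pairs; the function name recent_changes and the seven-day default are hypothetical choices for illustration, not a rule from this NetNote.

```python
from datetime import date


def recent_changes(change_log, failure_date, days=7):
    """List changes made in the `days` before failure_date, newest first.

    change_log is an iterable of (date, description) pairs. Changes made
    after the failure are ignored, since they cannot have caused it.
    """
    candidates = [
        (when, description)
        for when, description in change_log
        if 0 <= (failure_date - when).days <= days
    ]
    return sorted(candidates, key=lambda c: c[0], reverse=True)
```

A change at the top of this list is only a candidate cause. It still has to be tested, and an empty list suggests looking at causes unrelated to configuration changes, such as the file corruption described in Hypothesis 2.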
Step 3. Test Possible Solutions
There are several methods or techniques you can use to test your hypotheses. Some of the most common are:
Component Swapping. This technique involves swapping or replacing the suspected faulty component with a similar component that is known to be good. This method is most effective when you are familiar with the expected behavior of each component and already have a good idea of what could be causing the problem. Replace only one component at a time so you can tell which one fixed the problem.
Divide and Conquer. To make it easier to isolate a problem, remove unneeded components (such as NLMs and hardware components) to simplify the system.
Discuss the Problem with Others. A good way to gain valuable feedback about a problem is to discuss possible solutions with other experienced CNEs and Novell support engineers. The Novell Support Connection Forums are a good resource for sharing ideas and information.
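The divide-and-conquer technique above can be made systematic by halving the set of suspect components on each test. The Python sketch below illustrates the idea under strong assumptions: exactly one component is at fault, the failure is reproducible, and it does not depend on a combination of components. The callback fails_with is hypothetical; in practice it stands for the manual step of bringing up the server with only the given components loaded and observing whether the problem recurs.

```python
def find_faulty(components, fails_with):
    """Narrow a list of components down to the single faulty one.

    fails_with(subset) must return True if the problem occurs when only
    the components in subset are present. Returns the faulty component,
    or None if the problem cannot be reproduced with any single one.
    """
    candidates = list(components)
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        # Assumes a single culprit: if the first half is clean,
        # the fault must be in the second half.
        candidates = first if fails_with(first) else second
    # Confirm the remaining candidate really reproduces the failure.
    if candidates and fails_with(candidates):
        return candidates[0]
    return None
```

With this approach, a list of 16 suspect components needs about five test cycles instead of up to 16. If the confirmation step returns None, the single-culprit assumption probably does not hold, and the problem may involve an interaction between components.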
If you are limited in the amount of information to work from, you will most likely have to repeat the process of gathering information and forming a hypothesis several times. Even if your first tests fail to resolve the problem, you will probably glean additional information to help you further isolate the problem. Pay particular attention to any new messages that appear after you have changed the system.
Step 4. Use Debugging Tools
If you have not been able to gather enough information to draw conclusions about the ABEND or hang, additional debugging tools, such as network analyzers and a memory image from the server, can help resolve the problem.
Network Analyzers. Network analyzers (such as Novell's LANalyzer and Network General's Sniffer) are great tools for gathering troubleshooting information. In many cases, knowing about the behavior of protocols and packets on the network can help speed up the resolution of the problem.
NetWare's Internal Debugger. This is a good debugging tool, especially for pre-4.x versions of NetWare.
Memory Image File. If the problem persists after you have taken all of the above steps, you can create a memory image, or "core dump," of the server and send it to Novell Technical Support for analysis. The memory image provides a snapshot of your server at the time of the ABEND. Memory images provide enough information to resolve over 90% of all OS escalations.
Note: Before sending in a memory image, make sure all the tested and approved NetWare patches have been applied to the server.
Step 5. Resolve the Problem
Once the problem has been isolated and you have proven your hypothesis correct, it is time to resolve the issue. Software problems can be resolved with patches, workarounds, new drivers, and so on. This may mean contacting the software engineering group responsible for fixing the problem. If an incident is not already open with Novell, open a support incident and present your information so that a fix can be generated. For a hardware problem, the options are repair or replacement.
If these steps do not resolve the problem, contact your Novell Authorized Dealer or Novell Technical Support for assistance.
* Originally published in Novell AppNotes