Basic Terminology Surrounding Critical NetWare Server Issues
Critical Problem Resolution Team
01 Apr 1998
Few events strike terror into the hearts of network administrators as much as a server shutting down unexpectedly. The severity of a server problem depends on the use of the server, what data it provides to the company, and who is using that data. If the affected user happens to be a Vice President or the head of the company, the extra pressure only adds to the stress and complexity that normally surround critical server problems. But with sufficient troubleshooting information and a few advance precautions, you can be better prepared to handle server problems when they occur.
Why NetWare Servers Abend
To many end-users and network administrators, operating system (OS) failures are shrouded in mystery. They wonder how the OS recognizes that there is a system problem and how it knows what to do to handle the problem. To shed some light on this subject, it is helpful to understand the intricate juggling act a NetWare server performs in supporting network communications and handling shared data requests from many users or clients.
In addition to the core NetWare operating system, a typical NetWare server loads a number of other programs to support various network functions. For example, at startup the server loads one or more LAN drivers to provide communication with the client workstations, and disk drivers that enable communication with the server's storage devices. Other NetWare Loadable Modules (NLMs) may also be loaded, such as the print server NLM to support network printing, or an AppleTalk protocol stack so that applications running on Apple Macintosh computers can access the network.
Once the server is up and running, most of its time is spent processing read and write requests to the file system and handling network communication. Occasionally it performs other functions such as printing files and updating a database. Suppose that several people are using the database concurrently, entering data or searching the file for specific information. All of a sudden the server fails, and everybody wonders why.
While it is difficult to pinpoint the exact cause of a server failure without some investigation, the general reason why a NetWare server stops running is to preserve the integrity of the operating system. If either the software or the hardware on a NetWare server detects a serious problem from which it cannot recover, the OS stops all server operations to prevent further corruption of the file system, the Directory Services database, or other vital system data. This halting of the server is called an Abend.
To understand why it is better for the server to stop than risk continued operation after such an error, consider the following example. A large accounting firm has a NetWare server that runs two applications: an account database that contains all records for all of their accounts, and a backup application that is used as a safeguard to ensure availability of the data in case anything happens to the server. Once a week, the backup program reads information from the server's disk and copies it into a buffer, from which the data is then written to a tape drive.
The system was working fine until the network administrator upgraded to a new tape drive that, unbeknownst to him, came with a buggy driver. While performing the next weekly backup, the backup application experiences a problem due to the bug in the tape driver. The pointer to the tape buffer is overwritten in memory and now points to a location inside the memory range where the database application is cached. So instead of writing data to the tape drive's buffer, the backup software is writing it to the accounting database's cache buffers. As a result, the accounting database buffer contains invalid data. Eventually this invalid information will be written to disk, overwriting the valid information previously stored on disk. At that point, one week's worth of work could be lost, and the firm would have to restore from the old backup.
To prevent this type of situation from happening, NetWare must recognize the problem and issue an Abend before the valid data becomes corrupted. In this case, the administrator was able to correctly identify a faulty tape driver as the culprit. After installing an updated driver, the file server was rebooted and the backup was rerun successfully.
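The stray-pointer scenario above can be sketched in a few lines of Python. This is a toy model only: the flat memory array, the region sizes, the `Abend` exception, and the range check are all hypothetical, and the sketch does not claim to show how NetWare itself detects such a condition. It simply contrasts letting a bad write proceed with halting before the write lands.

```python
# Toy model of server memory: one flat bytearray shared by two applications.
# Every name and size here is hypothetical and chosen only for illustration.

class Abend(Exception):
    """Raised to halt processing when memory integrity is in doubt."""

MEMORY = bytearray(64)
TAPE_BUFFER = range(0, 32)    # region owned by the backup application
DB_CACHE = range(32, 64)      # region owned by the database application

def write_backup_block(pointer: int, data: bytes, check: bool) -> None:
    """Copy data into memory at pointer, optionally validating the target."""
    end = pointer + len(data)
    if check and not (pointer in TAPE_BUFFER and end <= TAPE_BUFFER.stop):
        # Stop before any byte is written outside the tape buffer.
        raise Abend(f"write to {pointer}..{end - 1} is outside the tape buffer")
    MEMORY[pointer:end] = data

# The database caches a valid record.
MEMORY[32:40] = b"ACCT0001"

# A buggy driver has overwritten the buffer pointer: 0 has become 32.
corrupted_pointer = 32

# Without a check, the backup data silently replaces the cached record.
write_backup_block(corrupted_pointer, b"TAPEDATA", check=False)
unchecked_result = bytes(MEMORY[32:40])

# Restore the record and repeat with the check enabled.
MEMORY[32:40] = b"ACCT0001"
try:
    write_backup_block(corrupted_pointer, b"TAPEDATA", check=True)
    halted = False
except Abend:
    halted = True
checked_result = bytes(MEMORY[32:40])
```

In the unchecked run the cached record is overwritten and nothing signals the damage. In the checked run processing stops and the record survives, which is the trade the server makes when it Abends: lost availability in exchange for intact data.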
Coming to Terms with an Abend
With this background information in mind, let's define some general terms that are used to describe a server that unexpectedly ceases operation. It doesn't always happen the same way. The server may simply stop running, or applications may be unable to run at the server. The server may appear to be running but users suddenly become unable to perform any work with the server. Terms such as server crash, hang, or Abend are often used interchangeably to refer to such situations, but it is important to use the correct terminology to speed the resolution process.
Server Crash. This term is generally used to describe any type of critical computer issue. In networking, the meaning has evolved to encompass any time the server ceases operation. However, the word "crash" is not very descriptive, and it provides little or no help when support personnel are brought in to help resolve the issue. To the technician, it's like building a puzzle without all of the pieces.
The term "crash" seems to be used more by non-technical personnel and upper management, who use it mainly because they lack an understanding of what's really happening when a server goes down. While it might be okay to use this term when recounting your server-fixing exploits to a group of non-technical friends, it's not okay to use it in a technical troubleshooting session. When talking to technical support personnel, use more descriptive words such as hang or Abend to describe your critical server issues.
Server Hang or Lockup. The term "hang" is a slang expression used in everyday speech to mean waiting around in a mostly idle state, as in "I'm going to just hang around the house until a friend drops in." In networking, the term is often used to describe a server or application that no longer performs valid functions and that provides no error message to indicate what went wrong. The system may still be capable of performing a few functions, but it is not fully operational.
A NetWare server can hang when the system becomes "confused" and incapable of proceeding further without help. Typically, a hang occurs when the server is trying to perform a set of conflicting instructions that are passed to it. One example of this is known as a deadlocked condition in which one process is contending with another for the same server resource. Neither process is able to proceed because each is waiting for the other to do something, much like two boxers locking arms in the ring so neither one can throw a punch.
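A deadlock of this kind can be reproduced in miniature. The sketch below is ordinary Python threading, not NetWare code, and the process names, lock names, and timeout are assumptions made for the example. Each of two threads takes one lock and then asks for the lock the other is holding. The timeout exists only so the demonstration can end and report what happened; without it, both threads would wait forever.

```python
import threading

lock_a = threading.Lock()              # stands in for one server resource
lock_b = threading.Lock()              # stands in for a second resource
both_hold_one = threading.Barrier(2)   # makes the bad interleaving certain
both_gave_up = threading.Barrier(2)    # keeps each first lock held until the end
got_second = {}

def process(name, first, second):
    """Take one resource, then request the one the other process holds."""
    with first:
        both_hold_one.wait()                     # both now hold one lock each
        acquired = second.acquire(timeout=0.2)   # a real deadlock has no timeout
        got_second[name] = acquired
        if acquired:
            second.release()
        both_gave_up.wait()                      # hold `first` until both have tried

threads = [
    threading.Thread(target=process, args=("P1", lock_a, lock_b)),
    threading.Thread(target=process, args=("P2", lock_b, lock_a)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(got_second)   # neither process obtained its second resource
```

A common way to avoid this pattern is to have every process acquire shared resources in the same fixed order, so that no cycle of waiting can form.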
Another common cause of a server hang is like the proverbial Mexican standoff. A program that is trying to communicate with the server might find itself waiting for output from the server before it can send anything more to it. The server, on the other hand, is waiting for additional input from the client or application before it can output anything.
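The mutual wait described here can also be shown with two message queues. This is a minimal sketch with hypothetical names, not a model of any NetWare protocol: the client refuses to send until it hears from the server, and the server refuses to reply until it hears from the client. The timeouts are an addition for the demonstration so that it terminates; the hang in the text is what happens when no such timeout exists.

```python
import queue
import threading

to_server = queue.Queue()   # messages travelling from client to server
to_client = queue.Queue()   # messages travelling from server to client
outcome = {}

def client():
    """Waits for server output before sending its next request."""
    try:
        to_client.get(timeout=0.2)
        to_server.put("next request")
        outcome["client"] = "sent"
    except queue.Empty:
        outcome["client"] = "gave up waiting for output"

def server():
    """Waits for client input before producing any output."""
    try:
        to_server.get(timeout=0.2)
        to_client.put("response")
        outcome["server"] = "replied"
    except queue.Empty:
        outcome["server"] = "gave up waiting for input"

threads = [threading.Thread(target=client), threading.Thread(target=server)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(outcome)
```

Neither side ever sends first, so both give up. No resource is locked here; the stall comes purely from each party's expectation about who speaks next.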
Perhaps a more descriptive term for a server hang would be "server lockup." A lockup can be partial or full. In a partial lockup, one utility, application, or driver does not function, yet other processes are still running. For example, the console may be frozen while you are trying to load a particular NLM, yet you can switch to other console screens on the server. Another example of a partial lockup is a thread caught in a tight code loop waiting for a certain condition to be met. Because the code within this loop is written so that the thread relinquishes control of the CPU, other threads still get their turn until the original thread's condition is met and it can continue.
When a full server lockup occurs, no processes are allowed to run and no one can log in to do work on the server. Connections that are currently logged in or attached are dropped. Nothing can be done at the server console or other NLM screens. If interrupts are disabled there will be no response at all from the server keyboard.
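The difference between a partial and a full lockup is easier to see with a small scheduler. The sketch below assumes a cooperative, round-robin scheduler of the kind the discussion of relinquishing the CPU implies; it uses Python generators, where each `yield` hands control back. The function and task names are hypothetical, and this is not a model of NetWare's actual scheduler.

```python
from collections import deque

def run(tasks, max_turns):
    """Round-robin over generator tasks; each `yield` hands the CPU back."""
    ready = deque(tasks.items())
    turns = {name: 0 for name in tasks}
    for _ in range(max_turns):
        if not ready:
            break
        name, task = ready.popleft()
        try:
            next(task)                  # run the task until its next yield
            turns[name] += 1
            ready.append((name, task))  # back of the queue for another turn
        except StopIteration:
            pass                        # task finished; drop it
    return turns

def waiter(flag):
    """Polls for a condition, yielding on every pass (a partial lockup)."""
    while not flag["ready"]:
        yield

def worker(log):
    """Ordinary work that still gets CPU time while the waiter spins."""
    while True:
        log.append("request served")
        yield

flag = {"ready": False}   # the condition is never met in this run
log = []
turns = run({"waiter": waiter(flag), "worker": worker(log)}, max_turns=10)
print(turns, len(log))
```

The waiter never makes progress, yet the worker keeps serving requests because the waiter yields on every pass. If the `yield` were removed from the waiter's loop, the call to `next(task)` would never return, the scheduler would never reach the worker, and the result would resemble the full lockup described above.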
Server Abend. The term "Abend" is short for ABnormal END. In NetWare, an Abend is a routine or function that halts server operation, thereby stopping any further CPU processing. The thread running on the CPU is suspended when the server halts.
To illustrate what happens during a server Abend, suppose that a critical problem occurs while the server is running and the NetWare OS finds itself in a position where it can no longer continue without compromising the integrity of the system data or the data it is currently handling. NetWare issues an Abend error and halts all CPU processing on the server, which makes the server unavailable to clients and to any other services. The result is that applications such as backups are stopped before they have completed, client connections are dropped, and any unsaved information could be lost.
Keeping Your Head
When faced with a critical server problem, server administrators and support engineers are expected to resolve the problem quickly to minimize server downtime. They are expected to make correct choices based on the information that is available. It is too late to read up on server troubleshooting when the heat is on to resolve the problem. A good first step is to take the time now to educate yourself on what causes server problems and where to go to gather helpful information.
Another step that will help make your job easier is to keep up-to-date on the latest operating system versions and patches. Whereas NetWare 3.x provides little help in recovering from a server Abend, NetWare 4.11 provides automated Abend recovery techniques to minimize downtime and reduce the need for administrator involvement. It also generates an ABEND.LOG file containing information that is helpful for the resolution process. In the future, NetWare 5 will enhance these existing features by incorporating privilege-level protection and support for multiprocessors. (For more information about Abend recovery in NetWare 4.11, see the AppNote entitled "IntranetWare Server Automated Abend Recovery" in the March 1997 issue of Novell AppNotes.)
* Originally published in Novell AppNotes