Reducing Abend Resolution Time with Novell's Abend Analyzer
Software Engineer
CPR Team
01 Apr 1999
The process of analyzing ABEND.LOG files just got a whole lot easier, thanks to a new problem/resolution database developed by Novell's internal support engineers.
Introduction
Those who have worked with NetWare servers have probably encountered at least one Abend error message indicating that the server's operations have ABnormally ENDed. If the error recurs, the traditional solution involves capturing the contents of the server's memory (a "coredump") and sending it, along with the ABEND.LOG file, to Novell Technical Support for analysis. In some cases, this analysis can be time-consuming, and it may be days before the server is back up and running.
Novell engineers have been working on ways to expedite the process of analyzing server Abends. Now they have come up with a solution that allows Abend information to be analyzed much faster, even instantly in many cases.
This AppNote introduces the Abend Log Analyzer tool that is currently available to all internal support engineers. This tool makes it possible to compare an ABEND.LOG file or a coredump to those that have been previously analyzed and resolved, and then return suggested solutions.
This AppNote first reviews the Abend process, along with the system processes and components associated with an Abend. It then discusses the Abend Analyzer and presents scenarios that show the efficiency of the analyzer in resolving critical server issues.
Concepts to Understand
Before we explain how the Abend Analyzer works, it is helpful to understand the following:
What is an Abend?
What is a coredump?
What is an ABEND.LOG file?
What is a footprint?
What Is an Abend?
An Abend (ABnormal END) is the most critical problem that a NetWare file server can experience. Sometimes when errors occur on a NetWare server, vital data can become corrupted in such a way that it cannot be accessed. When a server can no longer access its vital data, that server's state is considered compromised. When NetWare detects a compromised state, the server's operations are brought to a halt to avoid further data corruption. All code execution stops and the contents of the server's memory are preserved as they are. The operating system then calls an Abend handling routine to handle the situation from there.
At this point, the server is unavailable to handle client requests. Critical processes on the server are not able to complete execution. Nothing else can be done on the server until the Abend is handled by the NetWare auto-recovery process (4.11 and later), or until the server is rebooted.
What Is a Coredump?
The term "coredump" comes from the mainframe world where RAM was referred to as core memory because of the way data was stored in ferrous magnetic cores (little doughnut-shaped pieces of iron). Although today's microcomputers no longer store data in this manner, a PC's system RAM is still occasionally referred to as the core. Copying an image of the system's memory is still referred to as dumping core memory or making a coredump.
A coredump is a byte-for-byte image of a NetWare server's memory, or a "snapshot" of a server's RAM at the time it Abended. Since memory does not change or refresh on the server when it is in an Abended state, the coredump will contain information about all of the following system activities and the state they were in when the server experienced the critical error:
Processes
Loaded Modules
Allocated Memory
Cache Memory
Screen Shots
Processes. A coredump contains all processes allocated on the server at the time of the Abend. This includes the process that was running when the server Abended, processes waiting to run, and processes that were not in use. For each process, a call stack (a history of what the process has done) is also preserved in the coredump.
Loaded Modules. A coredump also contains all of the modules that were loaded on the server at the time of the Abend. This includes the module information, code, and data.
Allocated Memory. As processes are run on the server, they allocate memory for various functions. They can set values in that allocated memory and then use it later. That allocated memory is also stored in a coredump.
Cache Memory. Memory that has not been allocated to a module or a process is called cache memory, and it is also included in a coredump.
Note: The latest version of DIAGxxx.NLM allows cache to be excluded from the coredump. This keeps the size of the coredump considerably smaller.
Screen Shots. Screen shots of every screen on the NetWare server are also preserved in the coredump. The console screen shows the Abend message, the server name, and any remaining errors. Other helpful screens include MONITOR and SERVMAN for statistical information, and application screens for application errors.
What Is an ABEND.LOG File?
In NetWare 4.11 and later, advanced features were added for handling critical server issues such as Abends and NLM (NetWare Loadable Module) lockups. One advanced feature is the automatic creation of ABEND.LOG, a file that contains a history of every critical situation the server experiences. The ABEND.LOG is essentially an abbreviated version of a coredump, with only the most vital pieces of information included to keep the size as small as possible.
The ABEND.LOG file is created as part of the auto-recovery process in NetWare 4.11 and later. In most circumstances, this process will simply suspend the thread responsible for the Abend and then allow the server to continue its operations. During the auto-recovery process, NetWare creates a summary log of the state of the file server at the time of the Abend. This is written to a file named ABEND.LOG on the DOS (C:) partition, and then later appended to a file with the same name on the SYS volume in the SYSTEM directory. (The ABEND.LOG file in the SYS:SYSTEM directory can be reset or deleted to save space.)
Figure 1 indicates the various types of information contained in a sample ABEND.LOG file.
Figure 1: NetWare automatically creates an ABEND.LOG file to provide helpful troubleshooting information.
Each piece of information in this sample ABEND.LOG file is described in more detail below.
File Server Name. The name of the file server is important, especially if you have multiple file servers that are experiencing the Abend. It is also important to keep track of which servers are experiencing the problem and which are not. Many times a simple comparison between one server that is Abending and one that is not will show the cause of the Abend or at least give a clue.
Date and Time of Abend. Because of NetWare's auto-recovery process, administrators may not notice that a file server has Abended more than once. Looking at the ABEND.LOG file may be the only way to know.
Abend Message. The Abend message itself can be one of the biggest hints regarding the cause of the error. An Abend message such as "Free detected modified memory beyond the end of the cell being returned" is fairly specific. This message indicates that the running process tried to free memory that had been overwritten. Often a combination of this type of Abend message, the running process, and the stack trace will be enough to provide a solution.
Registers. The contents of the registers become important when the above pieces of information do not provide enough to isolate the cause of the Abend or at least suggest a troubleshooting path. The registers are helpful if a pattern can be established from several Abends. If several registers contain the same value every time the server Abends, it is likely that a software bug or possibly a corrupt file is causing the Abend. This is especially true for the EIP register.
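As an illustration of looking for such a pattern, the following Python sketch counts how often each register value recurs across several Abends. It assumes the register values have already been read out of each ABEND.LOG entry into a dictionary; the function name, the dictionary layout, and the sample values are hypothetical and are not part of any Novell tool.

```python
from collections import Counter

def recurring_register_values(abends, min_count=2):
    """Return {register: (value, count)} for the most common value of each
    register, if that value was seen in at least min_count Abends."""
    recurring = {}
    registers = {name for regs in abends for name in regs}
    for name in sorted(registers):
        counts = Counter(regs[name] for regs in abends if name in regs)
        value, count = counts.most_common(1)[0]
        if count >= min_count:
            recurring[name] = (value, count)
    return recurring

# Hypothetical register values taken from three Abends on the same server.
abends = [
    {"EIP": 0xF0012A40, "EAX": 0x00000000, "EBX": 0x01A2B3C4},
    {"EIP": 0xF0012A40, "EAX": 0x00000000, "EBX": 0x0FFFF000},
    {"EIP": 0xF0012A40, "EAX": 0x00000010, "EBX": 0x00001234},
]

for name, (value, count) in recurring_register_values(abends).items():
    print(f"{name} = {value:08X} in {count} of {len(abends)} Abends")
```

A register such as EIP that holds the same value in every entry is the kind of pattern described above; a register whose value differs each time is simply left out of the result.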
Abended NLM. The Abended NLM is the module that owns the code that was running when the server stopped. However, this does not necessarily mean that this module caused the Abend. Often a function in the Abended NLM can be passed a bad value from another module. Knowing what module was being executed when the server Abended is useful to determine which NLM or product to troubleshoot.
Running Process. Often the running process belongs to the module that caused the Abend. Usually it is somehow related to the Abend. If the running process is something other than just a generic server process, the module that owns that process can be targeted for troubleshooting.
Stack Limit and Pointer. The stack limit and pointer are used to determine if severe memory corruption has taken place. The stack limit is simply the smallest size of the running process stack.
Stack Trace. The stack trace is a printout of the stack, one value at a time. If the value is an address located inside a module, the module name, function (if available), and offset are printed to the right of the address. The important values in the stack trace are those that fall inside a module. A pattern in the stack traces of several Abends will give clues to a possible cause. If the stack traces of several Abends are exactly the same, it is likely that a single code path leads to the Abend, and there is a good chance that a fix is already available.
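The comparison of stack traces can be sketched in Python as follows. This is only an illustration: it assumes each trace has already been parsed into (value, module, function, offset) tuples, with None where a value does not fall inside a module. The tuple layout, the function names, and the module and function names in the sample data are all hypothetical.

```python
from collections import Counter

def stack_signature(trace):
    """Keep only the stack values that resolve to a module, reduced to
    (module, function, offset) so raw addresses do not affect the comparison."""
    return tuple(
        (module, function, offset)
        for _value, module, function, offset in trace
        if module is not None
    )

def repeated_signatures(traces):
    """Return {signature: count} for signatures seen in more than one Abend."""
    counts = Counter(stack_signature(trace) for trace in traces)
    return {sig: n for sig, n in counts.items() if n > 1}

# Hypothetical traces from three Abends.
trace_1 = [
    (0x00000000, None, None, None),
    (0xF1000010, "MODULEA.NLM", "FuncOne", 0x10),
    (0x00000012, None, None, None),
    (0xF2000200, "MODULEB.NLM", None, 0x200),
]
trace_2 = [
    (0x0000FFFF, None, None, None),
    (0xF1000010, "MODULEA.NLM", "FuncOne", 0x10),
    (0xF2000200, "MODULEB.NLM", None, 0x200),
]
trace_3 = [
    (0xF3000040, "MODULEC.NLM", "FuncTwo", 0x40),
]

for signature, count in repeated_signatures([trace_1, trace_2, trace_3]).items():
    print(count, "Abends share the path:", signature)
```

Values that do not fall inside a module are dropped before comparing, which mirrors the point above that only the in-module values matter.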
Modules List. This is a listing of all modules that were loaded on the server, complete with version numbers and dates, so it is easy to tell which revisions were on the server when it Abended.
The first thing you should do with an Abended server is look at the modules list and compare it to the latest released files from Novell. If you find any modules that are outdated, update them, especially if one of them is the Abended NLM. Software updates and patches are readily available from Novell at:
http://support.novell.com/misc/patlst.htm
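The modules-list check described above amounts to a version comparison. The Python sketch below shows one way it could be done, assuming both the loaded modules and the latest released versions are available as dictionaries that map a module name to a (major, minor) version tuple. The data layout, the function name, and the module names are hypothetical; the real list of current versions would have to come from Novell's published patch list.

```python
def outdated_modules(loaded, latest):
    """Return {module: (loaded_version, latest_version)} for every loaded
    module that is older than the latest known release."""
    outdated = {}
    for name, version in loaded.items():
        newest = latest.get(name)
        # Modules with no known latest release are skipped, not flagged.
        if newest is not None and version < newest:
            outdated[name] = (version, newest)
    return outdated

# Hypothetical versions, written as (major, minor) tuples.
loaded = {"MODULEA.NLM": (1, 0), "MODULEB.NLM": (2, 5), "MODULEC.NLM": (3, 1)}
latest = {"MODULEA.NLM": (1, 2), "MODULEB.NLM": (2, 5)}

for name, (have, want) in outdated_modules(loaded, latest).items():
    print(f"{name}: loaded {have}, latest {want}")
```

Tuples are used so that versions compare numerically, field by field, rather than as strings.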
The Concept of a Footprint
Taken as a whole, a coredump provides a massive amount of information that is extremely tedious to sift through. Through years of experience, Novell engineers have collected specific pieces of information from coredumps that actually define what caused the Abend. This set of information, called a "footprint", is the most frequently used information in troubleshooting Abend issues. In fact, it is the same information that is contained in the ABEND.LOG file.
Novell has used these Abend "footprints" as the basis for the Abend Analyzer. Engineers took the information contained in various ABEND.LOG files and isolated the exact piece or pieces that caused each Abend. Once the cause had been isolated, the next step was to determine which pieces of information were always the same for the Abend in question. For example, if a "Free" Abend occurred in a specific NLM version, the footprint would contain the module name, version, and date as well as the Abend message.
Once the footprint for a specific Abend has been isolated, it is stored in a database along with the solution. If the Abend is encountered again, the same solution can be implemented again with very little analysis required.
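To make the idea concrete, the following Python sketch derives a footprint by keeping only the fields that are identical in every occurrence of an Abend, and then stores it with a solution. It is an illustration under assumptions, not a description of how Novell's database is built: the field names, the sample values, the solution text, and the use of a plain list as the "database" are all hypothetical.

```python
def derive_footprint(occurrences, candidate_fields):
    """Keep the candidate fields whose value is the same in every occurrence."""
    first = occurrences[0]
    return {
        field: first[field]
        for field in candidate_fields
        if field in first
        and all(o.get(field) == first[field] for o in occurrences)
    }

FREE_MESSAGE = "Free detected modified memory beyond the end of the cell being returned"

# Two hypothetical occurrences of the same Abend.
occurrences = [
    {"abend_message": FREE_MESSAGE, "module": "EXAMPLE.NLM",
     "module_version": "1.00", "module_date": "1999-01-01",
     "running_process": "Process A"},
    {"abend_message": FREE_MESSAGE, "module": "EXAMPLE.NLM",
     "module_version": "1.00", "module_date": "1999-01-01",
     "running_process": "Process B"},
]

footprint = derive_footprint(
    occurrences,
    ["abend_message", "module", "module_version", "module_date", "running_process"],
)

# The "database" here is just a list of (footprint, solution) pairs.
database = [(footprint, "Hypothetical solution: update EXAMPLE.NLM.")]
print(footprint)
```

In this sample the running process differs between the two occurrences, so it is left out of the footprint, while the Abend message and the module name, version, and date are kept.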
The Abend Analyzer
Once we had a database of Abend footprints and solutions, we needed a way to analyze coredumps and ABEND.LOG files and compare them with known issues. Novell Technical Support has implemented the Abend Analyzer in the form of a Web database. This database allows support engineers to quickly compare ABEND.LOG files and coredumps against known issues.
The analysis process includes several steps:
Taking the coredump and stripping it down to an ABEND.LOG file.
Comparing the information in the ABEND.LOG file to footprints already stored in the database.
Returning the solution(s) for footprints that match the current coredump or ABEND.LOG file.
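The comparison and lookup steps above can be sketched in Python as a simple field-by-field match. This is only an illustration of the idea: it assumes the ABEND.LOG entry has already been parsed into a dictionary (the first step, reducing a coredump to an ABEND.LOG, is not shown), and the field names, sample footprints, and solution texts are hypothetical rather than taken from the actual Abend Analyzer.

```python
def matching_solutions(abend, database):
    """Return the solutions whose footprint fields all match the Abend record.

    A footprint only lists the fields that must match, so any extra fields
    in the Abend record are ignored. Empty footprints never match.
    """
    return [
        solution
        for footprint, solution in database
        if footprint
        and all(abend.get(field) == value for field, value in footprint.items())
    ]

FREE_MESSAGE = "Free detected modified memory beyond the end of the cell being returned"

# Hypothetical database of (footprint, solution) pairs.
database = [
    ({"abend_message": FREE_MESSAGE, "module": "EXAMPLE.NLM", "module_version": "1.00"},
     "Hypothetical solution: update EXAMPLE.NLM."),
    ({"module": "OTHER.NLM", "module_version": "2.00"},
     "Hypothetical solution: apply the OTHER.NLM patch."),
]

# Hypothetical record parsed from an ABEND.LOG entry.
abend = {
    "server_name": "SERVER1",
    "abend_message": FREE_MESSAGE,
    "module": "EXAMPLE.NLM",
    "module_version": "1.00",
    "running_process": "Process A",
}

for solution in matching_solutions(abend, database):
    print(solution)
```

Because more than one footprint may match, the function returns a list, which corresponds to the "solution(s)" wording in the last step.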
Abend Resolution Example
The following is an example of how the Abend Analyzer could save an IS administrator time and trouble. A NetWare server at a company Abends. There are 500 users attached to that server who are now unable to work.
Without the Abend Analyzer, the IS administrator must do the following:
Make a call to Novell Technical Support.
Work with NTS over the phone, only to be told that an ABEND.LOG file must be submitted to NTS.
Send the ABEND.LOG file to NTS.
Wait while NTS evaluates the ABEND.LOG, determines that it is a known problem, and suggests a course of action.
While the administrator is waiting for NTS, he/she has to worry that the server might Abend again, causing more downtime and lost employee productivity. In many cases, nothing can be done to keep the server up until a resolution is found.
By comparison, look at the tasks an IS administrator would follow if an Abend Analyzer were available:
The administrator sends the ABEND.LOG file to the Abend Analyzer.
Upon inspection, the Abend Analyzer quickly determines that the issue is a known problem that exists in the database.
The Abend Analyzer returns the appropriate course of action to the system administrator in a matter of minutes.
Considerations for the Future
Novell Technical Support is currently using the Abend Analyzer to speed up response times to customers. In addition, the following scenarios suggest other ways such an analyzer could be used to resolve Abend issues faster.
Customer Access
One obvious enhancement is to provide customers with a direct way to access Novell's Abend database. This would allow many issues to be resolved without placing a support call to Novell and would minimize downtime by providing knowledge and insight faster than the traditional call-in method.
Resource Server
Another possibility is to make a resource server available for the Abending server to access. This would allow for a possible solution to be displayed on-screen along with the Abend message.
For example, if a NetWare server Abended, it could automatically access the resource server and check the database for possible solutions. The resource server could then return the solution, which the Abended server would then display on-screen and in the ABEND.LOG file.
Smart Server
A third suggestion is to have a smart server waiting to process a request from an Abended server. The smart server would have the database and available solutions. When a NetWare server Abends, it would contact the smart server and request not only the solution text but also the solution itself.
For example, if a NetWare server Abended, it could access the smart server and check the database for possible solutions. If a solution were found, the smart server could then return the solution. If a file needed to be replaced, the smart server would have the ability to send the appropriate file to the Abended server. The Abended server could then display on-screen and in the ABEND.LOG file that a solution had been found and applied to the server.
Conclusion
This AppNote has reviewed the basic concepts behind server Abends, their causes, and possible solutions. It has described how the "footprint" information provided in a coredump or ABEND.LOG file can be used to populate a database of problems and solutions. This has led Novell Technical Support engineers to develop an Abend Analyzer tool that can significantly reduce the time and effort required to resolve critical server issues.
* Originally published in Novell AppNotes