TUXEDO System: Recovery and Troubleshooting


HASHEM HEJAZI-FAR
Developer Support Engineer
Developer Support

01 Dec 1995

This DevNote discusses TUXEDO System failures, their causes, and the techniques used to recover from these failures. The areas of application failures are also covered.

This DevNote assumes that you have prior knowledge of TUXEDO System and of how clients and servers function in a TUXEDO System environment, as well as some familiarity with the UNIX operating system.

Introduction
Partitioned Networks
Troubleshooting
Glossary

Introduction

When an application fails, users often call the administrator to complain that they cannot reach a service, or that their system hangs when they try to run it. In order to restore service, administrators tend to shut down servers or systems, kill processes, reboot, and so on, without letting TUXEDO System try to recover.

There are some time-dependent operations involved behind the scenes in the recovery procedure that most users seem to overlook. These operations depend on parameters such as:

Parameter	Explanation
SCANUNIT	Time in seconds between scans by the BBL to look for old transactions and timed-out blocking calls.
SANITYSCAN	A multiplier of SCANUNIT that sets the time for a sanity check of the application. At this interval each BBL reports to the DBBL on its validity with an "I'm OK" message.
BBLQUERY	A multiplier of SCANUNIT that sets the frequency of status verification by the DBBL. The DBBL sends a message to any BBL that has not filed an "I'm OK" message within the period. If no response is received, the node represented by the non-responding BBL is partitioned.

To put it all together, the BBLQUERY interval (SCANUNIT multiplied by BBLQUERY) is approximately 300 seconds, or 5 minutes, if the administrator has not increased these values. I have heard of cases where it is set as high as 1800 seconds (half an hour). You can see how an administrator can run into more problems by trying to recover before TUXEDO has had a chance to correct any problems automatically. I suggest that the administrator be aware of the parameters mentioned above and allow TUXEDO to do the housekeeping.
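The arithmetic behind these intervals can be sketched in a few lines of Python. This is only an illustration: the ScanSettings class is a hypothetical helper, and the values 10, 12, and 30 are assumed for the example rather than read from a real configuration.

```python
from dataclasses import dataclass


@dataclass
class ScanSettings:
    """Hypothetical holder for the three timing parameters in the table."""
    scanunit: int    # seconds between BBL scans
    sanityscan: int  # multiplier of SCANUNIT
    bblquery: int    # multiplier of SCANUNIT

    def sanity_interval(self) -> int:
        """Seconds between sanity checks (SCANUNIT * SANITYSCAN)."""
        return self.scanunit * self.sanityscan

    def query_interval(self) -> int:
        """Seconds between DBBL status checks (SCANUNIT * BBLQUERY)."""
        return self.scanunit * self.bblquery


# Illustrative values only (assumed, not taken from a real configuration).
settings = ScanSettings(scanunit=10, sanityscan=12, bblquery=30)
print(settings.sanity_interval())  # 120
print(settings.query_interval())   # 300
```

With these assumed values, an administrator would wait about five minutes before concluding that TUXEDO has not detected a silent BBL on its own.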

Use TUXEDO's rich set of administrative commands to diagnose the problem before restarting or killing processes or looking outside of TUXEDO for answers.

At the first sign of a problem use the tmadmin(1) bbclean (bbc) command:

> bbclean (bbc) {machine}

This command will check the integrity of all accessors of the bulletin board residing on machine {machine}, and the DBBL as well. If there is a problem, this command should allow TUXEDO to recognize it, and begin trying to correct it.

Partitioned Networks

One of the most significant sources of failures is partitioned networks. If one or more nodes cannot access the MASTER node, then the network is partitioned. Following are the major causes for this problem:

Node failure
Network failure
BRIDGE failure

Note: For definitions of the TUXEDO terms used in this article, refer to the glossary.

Detecting and recovering from such failures requires appropriate action, which is different in each case. When there are problems with the network, the TUXEDO System administrative servers will send messages to the user log.

Note that if the user log is set up on a remote file system, that file system may no longer be available. If there is no connectivity to the system that holds the user log, the administrator will not be able to access it.

Partitioned Networks Detection

There are several ways to detect a partitioned network:

User Log
tmadmin(1)

Look at the user log. It will have some messages similar to:

112110.persia!DBBL.11007: ... :ERROR: BBL partitioned, machine=SITE2
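Scanning the user log for such messages can be automated. The following Python sketch assumes that partition messages follow the shape of the sample line above; the pattern, the function name, and the second sample line are illustrative assumptions, and real logs may differ.

```python
import re

# Pattern modelled on the sample message above; real logs may differ.
PARTITION_RE = re.compile(r"ERROR: BBL partitioned, machine=(\S+)")


def partitioned_machines(log_lines):
    """Return the machine names reported as partitioned, in first-seen order."""
    seen = []
    for line in log_lines:
        match = PARTITION_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


sample = [
    "112110.persia!DBBL.11007: ... :ERROR: BBL partitioned, machine=SITE2",
    # Hypothetical unrelated line, included only to show it is ignored.
    "112115.persia!BBL.11010: ... : some unrelated message",
]
print(partitioned_machines(sample))  # ['SITE2']
```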

Use the tmadmin(1) printnet (pnw) command:

> pnw SITE2

Could not retrieve status from SITE2

Use the tmadmin(1) printserver (psr) command:

> psr -m SITE1

a.out Name	Queue Name	Grp Name	ID	RqDone	Load Done	Current Service
--------------	-----------------	--------------	---	-----------	---------------	--------------------
BBL	30002.00000	SITE1	0	-	-	( - )
DBBL	123456	SITE1	0	121	6050	MASTERBB
simpserv	00001.00001	GROUP1	1	-	-	( - )
BRIDGE	16900672	SITE1	1	-	-	( DEAD )
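Picking the dead servers out of a captured printserver listing might look like the Python sketch below. It assumes the listing is tab-separated, as it appears in this article; actual tmadmin output is column-aligned and may need different splitting. The function name is hypothetical.

```python
def dead_servers(psr_output):
    """Return a.out names whose Current Service column reads ( DEAD )."""
    dead = []
    for line in psr_output.splitlines():
        columns = [col.strip() for col in line.split("\t")]
        # Skip blank lines, the header, and the dashed separator row.
        if len(columns) < 7 or columns[0].startswith(("a.out", "-")):
            continue
        if columns[-1].replace(" ", "") == "(DEAD)":
            dead.append(columns[0])
    return dead


captured = "\n".join([
    "a.out Name\tQueue Name\tGrp Name\tID\tRqDone\tLoad Done\tCurrent Service",
    "BBL\t30002.00000\tSITE1\t0\t-\t-\t( - )",
    "DBBL\t123456\tSITE1\t0\t121\t6050\tMASTERBB",
    "simpserv\t00001.00001\tGROUP1\t1\t-\t-\t( - )",
    "BRIDGE\t16900672\tSITE1\t1\t-\t-\t( DEAD )",
])
print(dead_servers(captured))  # ['BRIDGE']
```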

Use the tmadmin(1) printservice (psc) command:

> psc -m SITE1

ServiceName	Routine Name	a.out	Grp Name	ID	Machine	# Done	Status
-----------------	------------------	-----------	---------------	----	------------	-----------	----------
ADJUNCTADMIN	ADJUNCTADMIN	BBL	SITE1	0	SITE1	-	PART
ADJUNCTBB	ADJUNCTBB	BBL	SITE1	0	SITE1	-	PART
TOUPPER	TOUPPER	simpserv	GROUP1	1	SITE1	-	PART
BRIDGESVCNM	BRIDGESVCNM	BRIDGE	SITE1	1	SITE1	-	PART

Partitioned Networks Recovery

The partitioned network recovery procedure is different for each type of failure:

Master node failure
Non-master node failure

If the master node has failed, the application administrator will need to migrate the master to the backup node and run the tmadmin(1) pclean (pcl) command with the master node as an argument. If a non-master node has failed, the partitioned node is used as the argument.

If there are two LMIDs specified in the MACHINE section of the configuration file, the first LMID designates the master node and the second designates the backup nodes. There can be more than one backup node. The backup node generally comes into play when a network is partitioned. There are also situations where the administrator needs to shut down the master, in which case a migration should be done from the master to the backup node.

Cleaning up the master node is only a first step. The machine problem will need to be fixed and the failed node restored. To restore the failed node, the application administrator will need to boot the failed node from the acting master node. If the failed node was the master node, then the tmadmin(1) master (m) command may be used to change the master.

In addition, the application servers and data may need to be migrated back (if they were migrated off the failed node). Otherwise the application servers should simply be rebooted.

Restoring Failed Node

Use the tmadmin(1) pclean (pcl) command:

> pcl SITE2
     Cleaning the DBBL.
     Pausing 10 seconds waiting for system to stabilize.
     3 SITE2 servers removed from bulletin board

> boot -B SITE2
Booting admin processes ...

Exec BBL -A :
    on SITE2 -> process id=22923 ... Started.
1 process started.

> q

Network Failure Recovery

To recover from transient network failures, the application administrator will need to run the tmadmin(1) reconnect (rco) command from the master node. The names of the non-partitioned and partitioned nodes need to be passed as arguments to this command. Note that the BRIDGE will try to recover from transient network failures and reconnect automatically. In most cases a transient network failure will go unnoticed.

Transient network failure:

Corrects itself within minutes
BRIDGE is left unconnected

To recover from severe network failures, the application administrator will need to run the tmadmin(1) pclean (pcl) command from the master node. The name of the partitioned node needs to be passed as an argument to this command. The application administrator may then migrate the application servers or reboot the machine once the problem is corrected.

Severe network failure:

Does not correct itself
Partitioned node needs to be taken out of the network
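The choice between the two recovery commands can be summarized in a small Python sketch. The function name and its parameters are hypothetical; it only builds the tmadmin command text described above (rco with the non-partitioned and partitioned node names, pcl with the partitioned node name) and does not run anything.

```python
def recovery_command(corrects_itself, partitioned, non_partitioned=None):
    """Return the tmadmin command text suggested for a network failure.

    corrects_itself: True for a failure that clears within minutes and
    leaves the BRIDGE unconnected; False for one that does not clear.
    """
    if corrects_itself:
        if non_partitioned is None:
            raise ValueError("reconnect needs both node names")
        # reconnect takes the non-partitioned node first, then the partitioned one.
        return f"rco {non_partitioned} {partitioned}"
    # The partitioned node has to be cleaned out of the bulletin board.
    return f"pcl {partitioned}"


print(recovery_command(True, "SITE2", "SITE1"))  # rco SITE1 SITE2
print(recovery_command(False, "SITE2"))          # pcl SITE2
```

Both commands are entered at the tmadmin prompt on the master node.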

BRIDGE Failure. The easiest problem to deal with is a BRIDGE process failure: if the BRIDGE process fails, TUXEDO System restarts it automatically, reconnects to the other nodes in the network, and downloads new bulletin board information to the partitioned node. Some users may be tempted to kill the BRIDGE process using the UNIX kill(1) command. DO NOT kill the BRIDGE process manually. This may make matters worse! The application administrator can manually restart the BRIDGE process by using the tmadmin(1) bbclean (bbc) command.

Use the tmadmin(1) printserver (psr) command:

> psr -m SITE1

a.out Name	Queue Name	Grp Name	ID	RqDone	Load Done	Current Service
--------------	-----------------	--------------	---	-----------	---------------	--------------------
BBL	30002.00000	SITE1	0	-	-	( - )
DBBL	123456	SITE1	0	162	8100	MASTERBB
simpserv	00001.00001	GROUP1	1	-	-	( - )
BRIDGE	16900672	SITE1	1	-	-	( DEAD )

Use the tmadmin(1) bbclean (bbc) command:

> bbc
Cleaning the bulletin board on machine SITE2.
Cleaning the bulletin board on machine SITE1.
Cleaning the Distinguished Bulletin Board.

Use the tmadmin(1) printserver (psr) command:

> psr -m SITE1

a.out Name	Queue Name	Grp Name	ID	RqDone	Load Done	Current Service
--------------	-----------------	--------------	---	-----------	---------------	--------------------
BBL	30002.00000	SITE1	0	-	-	( - )
DBBL	123456	SITE1	0	182	9100	MASTERBB
simpserv	00001.00001	GROUP1	1	-	-	( - )
BRIDGE	16900672	SITE1	1	-	-	( - )

Note that Current Service for the BRIDGE has now changed from ( DEAD ) to ( - ).

Troubleshooting

The administrator needs to distinguish between various types of failures in order to close in on the problem. There are several areas that can cause failures and each area requires specific troubleshooting skills and knowledge in addition to communications with the appropriate administrators, e.g., network administrator.

The areas that can cause failure are:

The application itself.
The TUXEDO System software.
The Database Management System software.
The network.
The operating system.
The hardware.

Troubleshooting the Application

There are several places where the application administrator can look to try and find application failures:

Warnings and error messages will be placed in the user log by TUXEDO System.
The message manuals for TUXEDO System can be used to obtain a complete description of each problem, as well as the actions to be taken to recover from a failure.
Application warnings and error messages can also be located in the user log.
Warnings and errors in application clients are usually sent to a stdout and stderr file.
Warnings and errors in application servers are usually sent to a stdout and stderr file.
The application administrator should also look for core dumps and use a debugger to get a stack trace. Application developers will need to be notified if core dumps are found in the APPDIR.
System activity reports such as sar(1) can be used to determine why things are not working the way they should. The system may be running out of memory, or the kernel might not be tuned correctly.
Most UNIX operating systems provide the truss(1) utility, which can be used to produce traces of system calls, signals, and machine faults.

Replacing Application Components. Complete the following steps if it is necessary to replace application components.

Install the application software, which can consist of application clients, application servers and various files such as the FML field tables.
Shut down the application servers being replaced.
Build the new application servers if necessary.
Boot the new application servers.
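The shutdown, build, and boot steps above can be laid out as a command plan, sketched here in Python. The function name, the server name simpserv, the source file simpserv.c, and the service TOUPPER are illustrative assumptions; the sketch relies on the -s option of tmshutdown(1) and tmboot(1) and on the -o, -f, and -s options of buildserver(1), which should be checked against the manual pages for your release. Installing the software is application specific and is not represented.

```python
def replacement_plan(server, build_source=None, services=()):
    """Return the ordered command lines for replacing one application server."""
    steps = [f"tmshutdown -s {server}"]
    if build_source is not None:
        build = f"buildserver -o {server} -f {build_source}"
        if services:
            # Advertise the listed services in the rebuilt server.
            build += " -s " + ",".join(services)
        steps.append(build)
    steps.append(f"tmboot -s {server}")
    return steps


for command in replacement_plan("simpserv", "simpserv.c", ["TOUPPER"]):
    print(command)
```

The plan is only printed, not executed, so it can be reviewed before any server is stopped.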

TUXEDO System provides the Configuration Command Interpreter (tmconfig) that can be used to dynamically reconfigure an application. It can modify the TUXCONFIG file while the application is running. This means that existing parameters can be modified and new machines, servers, etc. can be added without shutting down the application.

Some changes, such as changes to the MACHINE section, require that the application be shut down on the specified machine. Servers also need to be rebooted for changes to take effect.

The TUXCONFIG file is updated on all nodes in the application that are currently booted, and will be propagated automatically to new machines as they are booted.

The tmconfig(1) command can be run from any machine in the configuration but only the administrator can make updates. A typical scenario might be to migrate a machine to the backup, shut the machine down, make the changes and migrate the backup back to the original.

Troubleshooting the TUXEDO System

The administrator can use the same resources as those used to troubleshoot application failures. In addition, the administrator can set the TMTRACE environment variable which enables tracing in the user log. Setting TMTRACE=on will write additional information to the user log. This provides additional information to the administrator about the TUXEDO System function calls and the parameters.

The error messages in the user log have many different identifiers. These identifiers are defined in the header files included with TUXEDO System. The header files can be found in the include directory located in $TUXDIR (the root directory of TUXEDO). As a first step, the two identifiers listed below will allow the administrator to distinguish between errors and narrow down which component is responsible for the failure.

TPEOS indicates Operating System Error
TPESYSTEM indicates TUXEDO System Error
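A first-pass triage based on these two identifiers can be sketched in Python as follows. The function name and the sample message text are hypothetical; only the meanings of TPEOS and TPESYSTEM come from the list above, and any other identifier is left unclassified.

```python
import re

# Meanings as listed above; other identifiers are left unclassified.
COMPONENT_BY_IDENTIFIER = {
    "TPEOS": "operating system",
    "TPESYSTEM": "TUXEDO System",
}


def responsible_component(log_line):
    """Return the component for the first known identifier in the line, or None."""
    for token in re.findall(r"\bTPE[A-Z]+\b", log_line):
        if token in COMPONENT_BY_IDENTIFIER:
            return COMPONENT_BY_IDENTIFIER[token]
    return None


# Hypothetical message text, used only to exercise the function.
print(responsible_component("tpcall failed: TPEOS"))      # operating system
print(responsible_component("tpcall failed: TPESYSTEM"))  # TUXEDO System
print(responsible_component("tpcall failed: TPETIME"))    # None
```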

Glossary

BBL	Bulletin Board Liaison.
BRIDGE	Specifies the device name used by the bridge process to access the network. It will likely be of the form BRIDGE="/dev/starlan" or BRIDGE="/dev/tcp". The IBM AIX systems do not have this parameter specified.
DBBL	Distinguished Bulletin Board Liaison.
LMID	Logical Machine ID.
MASTER	Specifies the machine on which the master copy of the TUXCONFIG is found. Also, if the application is being run in MP (multiprocessor) mode, MASTER names the machine on which the DBBL should be run.
TUXCONFIG	The binary version of the configuration file.
UBBCONFIG	An ASCII file that contains TUXEDO System configuration information.

* Originally published in Novell AppNotes
