Preparing a Disaster Recovery Plan for NetWare and eDirectory
Independent Consultant
rbender@telocity.com
01 Nov 2001
Protecting business-critical data from a disaster that affects an entire site is on many network administrators' minds these days. The key to being able to continue operations after such an event is having a disaster recovery plan that has been carefully conceived and documented. This AppNote presents an outline of such a document and provides recommendations for an effective recovery of a Novell NetWare/eDirectory network environment.
- Introduction
- The Disaster Recovery Plan Document
- Section 1: Overview
- Section 2: Document Maintenance
- Section 3: Disaster Recovery Action Plan
- Appendices
- Conclusion
Topics | disaster recovery, backup, restore
Products | NetWare, Novell eDirectory
Audience | network administrators, consultants, integrators
Level | intermediate
Prerequisite Skills | familiarity with NetWare and eDirectory
Operating System | NetWare
Tools | none
Sample Code | no
Introduction
Within many companies, data storage is growing at an exponential rate. As the amount of stored data grows, so does the potential consequence of losing data in a complete site disaster. For many businesses, the costs associated with the loss of critical data can be devastating. A detailed recovery plan may be crucial to the survival of any organization in the event of such a disaster.
This AppNote provides an outline that can be used for the creation of a disaster recovery plan for a NetWare/Novell eDirectory (NDS) environment. This plan can be used alone or as part of a complete IT recovery plan for your company. The procedures are valid whether you provide your own recovery facilities or decide to use a hot-site vendor, such as SunGard or Comdisco.
The goal of this recovery plan is to enable you to restore your NetWare and eDirectory environment. Recovery of applications, such as GroupWise and DNS/DHCP, is not covered but may be the topic of future AppNotes.
The Disaster Recovery Plan Document
The structure of this AppNote mirrors the sections of a recommended disaster recovery plan document. You can use this as a guide in preparing a similar document tailored to the needs of your organization.
Table of Contents
The following is the Table of Contents for the disaster recovery plan document. Each section listed will be discussed in more detail in the sections that follow.
Section 1: Overview
Executive Summary
Assumptions
Scope and Limitations
Section 2: Document Maintenance
Section 3: Disaster Recovery Action Plan
Flowchart
Step 1 - Preparation
1.1 - Verify Hardware
1.2 - Obtain Media
Step 2 - Recovery
2.1 - Prepare Hardware
2.2 - Configure NetWare and NDS on the Recovery Servers
2.3 - Restore NDS
2.4 - Restore File System
2.5 - Set Up Printing
Step 3 - Verification
3.1 - Verify NDS
3.2 - Verify File System
3.3 - Verify Login Scripts
3.4 - Verify Applications
Step 4 - Fail-back Plan
4.1 - Recover Original Server Environment
4.2 - Restore Changes Made During Recovery Mode
4.3 - Verification
Appendices
A: Vendor Support Contacts
B: Hardware Requirements
C: Software Requirements
D: Current Backup Scheme
E: Current Data Volume Configuration
F: Current Server Configuration
G: NDS Configuration
H: Troubleshooting Tips
I: Data Restore Flowchart
J: Login Scripts
Section 1: Overview
The overview contains summary information, plan objectives, assumptions, and project scope and limitations. This section should be brief enough to enable an executive to justify the project and understand the scope of the plan.
Executive Summary
The first section of the overview should be a brief summary of what this document is and what it includes. For example, this section might start with a paragraph such as this:
"The purpose of this document is to outline the disaster recovery procedures for <company>'s <site location> campus. At this site, the NetWare server infrastructure is currently providing file/print and authentication services. In the event of a site disaster, recovery of <number> NetWare servers will be required for business continuity."
Next, state the goals of the recovery plan and give a brief outline of the recovery cycle. This should be a preview of the actual recovery plan. You may also include a recovery time estimate if you feel it is necessary. Here is an example:
"The objective of this document is to provide a workable network disaster recovery plan for <company>'s NetWare file and print services. This document will provide a set of guidelines for a network technician to follow during a disaster recovery situation. These guidelines will streamline the recovery process and minimize the number of decisions required during the recovery. It is estimated that the recovery procedures outlined in this document will require thirty hours to be carried out. Fail-back procedures are also specified for reinstating the complete network environment after the disaster is over."
This section should also name the author of this recovery plan and indicate the initial creation date. If you are a consultant developing this plan on behalf of a customer, include your contact information.
Assumptions
This section lists all of the assumptions that will be made at the beginning of the recovery process. For example, these could include the following:
Access to hardware and software required for recovery (include a reference to a list of requirements)
Access to all required backup tapes
Replacement hardware is functional
Required connectivity and cabling is available
Network technician is familiar with the basics of networking in general, as well as the software and hardware contained within this plan
Scope and Limitations
This section specifies exactly what is and what is not contained within the plan. For example, the steps outlined in this document assume that this is a plan to recover whatever is necessary within the walls of the data center: servers, NDS, connectivity between the servers, and data.
In the event of a disaster, it is assumed that access to the main site is not available. The printing environment at the recovery site will therefore probably not be the same as it was in the production environment. Due to this, the recovery of the printing environment will be dependent on the equipment available at the recovery site. This document assumes that the specifics are not known and therefore the recovery printing environment will need to be configured manually.
If your organization has a hot site completely specified, you can include printing recovery in your plan. This topic will be discussed further in Section 3.
Section 2: Document Maintenance
This disaster recovery plan is designed to recover the current production system; any changes to the production system will alter the recovery procedures. In order for this recovery plan to be relevant at the time of a disaster, it must be continuously maintained. It is therefore important to schedule routine maintenance of this document.
This section outlines the items contained within this document that need to be routinely maintained. Most of the information that will need to be updated exists in the Appendices; however, some exists in the body of the document also. To maintain control over modifications, make the appropriate entry in a change log whenever this document is modified. Include a specific checklist of items to check on a routine basis. The schedule of this routine check should be determined by how frequently your network infrastructure is modified.
Be sure to schedule a review of this document after any significant change is made to the network. Depending on the magnitude of the change, you may want to schedule a complete dry run of the plan. Include verification of this list and any changes in the change log.
Note: It is recommended that you schedule a periodic pilot of the procedures within this document. This not only verifies the steps contained within, but also gives the network engineer practice performing the steps. This experience will be invaluable during the pressure of recovering from a real disaster.
When changes are made to this document, log changes to a table contained within the document. An example is given below.
Date | Name | Description
Oct. 12, 2001 | Randy Bender | Initial creation of document.
Oct. 19, 2001 | Randy Bender | Verification of plan. Changed flowchart of data restore procedure due to a modification to the production tape rotation.
Section 3: Disaster Recovery Action Plan
The disaster recovery action plan itself is divided into three sections: Preparation, Recovery, and Verification. Each section is represented by a number to signify the order: 1 for Preparation, 2 for Recovery, and 3 for Verification. Tasks under each heading are distinguished by adding a decimal place. For example, restoring NDS is the third major task in the recovery phase and is represented as task 2.3. These task numbers should be shown on the flowchart.
The final section of the recovery plan is a fail-back procedure. These instructions should be comprehensive, but not necessarily as detailed as the fail-over plan. Typically there isn't as much time pressure during fail-back as there is during fail-over. However, it is still important to recover the delta changes. These procedures are outlined in Step 4.
I prefer to organize these steps into a checklist format that can be followed during recovery. This way, the engineer can initial and timestamp the completion of each step. An example of this format is given in the Preparation section.
Also, at the end of each major task is a section to verify that things have gone according to plan up to this point. It is important to verify your progress so "molehills" don't turn into "mountains" later in the recovery procedure.
Flowchart
The first section of the action plan should be a flowchart that reflects the major sections of the recovery plan and illustrates their interdependence.
Note: I recommend that you wait to create this flowchart until after your plan is finalized. Otherwise, you will spend a lot of time changing it to match modifications of your plan. It is much easier to create the flowchart after the recovery plan is in place.
Step 1 - Preparation
The first step in recovering the NetWare server environment is to verify that you have all the components necessary before starting the procedures. The specific hardware and software that will be required are listed in Appendices B and C for ease of future maintenance.
The following table is a sample checklist format that can be used throughout the entire recovery plan. Once each step is complete, the network engineer can check the task as completed and record the time. This will help you adjust recovery time estimates for future recoveries.
Checklist for Step 1: Preparation | Done | Date/Time
Step 1.1 - Verify Hardware. Verify that the necessary hardware is available and in working condition. (See Appendix B for specific hardware requirements.) Verify that all servers complete POST (Power-On Self Test) properly and count the correct amount of memory. Verify the configuration of disk arrays on each server. Verify power to connectivity hardware (routers/switches). | |
Step 1.2 - Obtain Media. Verify that the necessary software media and backup tapes are on hand. (See Appendix C for specific software requirements and Appendix D for the current backup scheme.) | |
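If the checklist is kept electronically during a dry run, the sign-off and timing idea can be sketched in a few lines of Python. This is only an illustration of the format: the `ChecklistItem` class, the `elapsed_hours` helper, the initials, and the timestamps are all hypothetical and not part of any Novell tool.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ChecklistItem:
    step: str                                # step number, e.g. "1.1"
    task: str                                # short task description
    initials: Optional[str] = None           # who completed the step
    completed_at: Optional[datetime] = None  # when it was completed

    def sign_off(self, initials: str, when: Optional[datetime] = None) -> None:
        """Record who completed the step and when (defaults to now)."""
        self.initials = initials
        self.completed_at = when or datetime.now()


def elapsed_hours(items: list[ChecklistItem]) -> float:
    """Hours between the earliest and latest completed step."""
    times = [i.completed_at for i in items if i.completed_at is not None]
    if len(times) < 2:
        return 0.0
    return (max(times) - min(times)).total_seconds() / 3600


# Hypothetical dry run of the Preparation checklist.
checklist = [
    ChecklistItem("1.1", "Verify Hardware"),
    ChecklistItem("1.2", "Obtain Media"),
]
checklist[0].sign_off("RB", datetime(2001, 10, 19, 8, 0))
checklist[1].sign_off("RB", datetime(2001, 10, 19, 9, 30))
print(f"Time between first and last sign-off: {elapsed_hours(checklist):.1f} hours")
```

Recorded times like these are what allow the recovery time estimate in the Executive Summary to be adjusted after each pilot.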
Step 2 - Recovery
This section is the "meat" of your document. It gives the steps necessary to recreate the NetWare server environment, including the NDS tree, and to recover the data.
Note: In the final disaster recovery document, this section, along with the others, should be in the checklist format presented in the previous section. For ease of reading, I will discuss the issues presented in each phase without using the checklist format.
Step 2.1 - Prepare Hardware. The first step in the recovery phase is to prepare the recovery hardware. List all the steps necessary to ensure that the hardware is ready for the installation of NetWare: define the array, configure the hardware interrupts, and so on. Also include the configuration of any connectivity hardware such as switches and routers.
I recommend using a scriptable NetWare installation procedure to save time and eliminate possible data entry errors during the server configuration. (For details, see "Automating the NetWare 5 Installation with a Response File" in the December 1998 issue, and "More About Automating the NetWare 5 Installation with a Response File" in the February 1999 issue of Novell AppNotes.)
If you are recovering to Compaq hardware, you should investigate Compaq's Scripting Toolkit at http://www.compaq.com/manage/toolkit.html. This toolkit contains utilities which allow you to configure the hardware resources, configure the array, create and populate the configuration partition, create and format the DOS partition, and start the installation.
At the completion of this step, your hardware should be configured and ready for NetWare to be installed. If you decide to use the scriptable installation, you will automatically start Step 2.2.
Step 2.2 - Configure NetWare and NDS on the Recovery Servers. This step contains the instructions necessary to recreate the server environment. At the end of this step, you will have the NetWare servers installed into a skeleton NDS tree that matches your production environment.
Regardless of whether or not you are using a scriptable installation, recreate the first server alone. Be sure to create the tree with the original name and locate the server in the original context. This server will hold the Master NDS replica.
The remainder of the servers can be recreated simultaneously. Be aware that the next two servers installed will automatically receive Read/Write replicas of NDS. The response file contains a key that lets you specify whether a server will get a replica. However, this key refers to a replica of the [Root] partition, as no other partitions have been created yet.
Note: It is important to recreate each server with exactly the same name and Server ID. If not, you will not be able to recover trustee assignments later. Also, it is very important to reinstall servers into the same context as in the production system.
Remember to install all of the server and connection licenses at this time. The license installation may also be automated with the scriptable installation. It is important to install the licenses now because if you forget to do so later, you may encounter application authentication problems. Before you know it, you could spend hours troubleshooting what you believe to be a communications problem only to find out that your GroupWise POA can't authenticate to your domain because there are no connection licenses available.
At this point, you will need to patch the servers to the latest level approved by your organization. To speed the patch installation, have available a copy of the expanded service pack on a CD for each server. This way you can install the patches to all the servers simultaneously. During a disaster recovery, you'll want to save as much time as possible wherever you can.
Next, recreate all data volumes on each server. The configuration for these should be listed in Appendix E. Don't forget to assign all name spaces that were assigned to the original server.
Before moving on to the next step, verify the current environment. At this point, you should have all the servers recreated and located in the same context as they were in the production environment. Verify that all of the volumes have been recreated and the proper name spaces have been assigned. Also triple-check the server names, server IDs, and addresses. It is easy to make a typo while under the pressure of a disaster recovery. This is a big reason why I recommend taking the time to develop a response file to automate the server installation.
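The triple-check described above amounts to comparing what was documented against what was actually built. A minimal Python sketch of that comparison follows. The field names, server values, and the `config_mismatches` helper are hypothetical; in practice the documented values would come from Appendix F and the rebuilt values would be read off each server's console by hand.

```python
def config_mismatches(documented: dict, rebuilt: dict) -> list[str]:
    """Describe every documented field whose rebuilt value differs."""
    problems = []
    for field, expected in documented.items():
        actual = rebuilt.get(field)
        if actual != expected:
            problems.append(f"{field}: expected {expected!r}, found {actual!r}")
    return problems


# Hypothetical values as they might be recorded in Appendix F.
documented = {
    "server_name": "FS1",
    "server_id": "1A2B3C4D",
    "ip_address": "10.1.1.10",
    "nds_context": "OU=HQ.O=ACME",
}

# Values noted from the rebuilt server, here with a typo in the server ID.
rebuilt = dict(documented, server_id="1A2B3C4E")

for problem in config_mismatches(documented, rebuilt):
    print(problem)
```

Catching a mistyped Server ID at this point is far cheaper than discovering it later as missing trustee assignments.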
Step 2.3 - Restore NDS. Before NDS or file system data can be restored, the same User ID that submitted the backup job must exist in this new tree, with the same password and rights as it had before. Be sure to record this User ID, password, and any security equivalences in the maintenance section of this document. This reference must be maintained up-to-date if any aspect of the User ID is changed in the production environment.
Next, restore your SMS-compliant backup program on the server with the master NDS replica. Verify that the TSA agents (TSANDS and TSA500) are loaded and that SMDR is configured (SMDR NEW).
Most backup programs require you to re-catalog the tape before you can restore. Veritas BackupExec v8.5 stores the catalog on each tape, which can be restored instead of rescanning the entire tape. This will save you critical time during the recovery phase. Insert and restore the catalog of the tape that contains the backup of NDS.
The first step to NDS recovery is to restore the schema extensions. To verify that the schema has been expanded, use the Schema Manager to record the number of object classes and attributes. Submit the schema restore job. Compare the number of classes before and after the schema restore. If the schema recovery fails and your recovery environment has applications that are dependent on specific classes, you will have to reinstall the application programs to extend the schema.
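Counting classes tells you whether the schema grew, but not which applications are affected if it did not. One way to reason about that is to keep a list of the classes each application depends on and compare it with the class names present after the restore. The Python sketch below illustrates the idea. The application names, the class names, and the `apps_needing_reinstall` helper are hypothetical; the list of classes present would have to be noted manually from Schema Manager.

```python
def apps_needing_reinstall(present: set[str],
                           required_by_app: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each application to the schema classes it needs but that are missing."""
    gaps = {}
    for app, required in required_by_app.items():
        missing = required - present
        if missing:
            gaps[app] = missing
    return gaps


# Hypothetical class names noted from Schema Manager after the schema restore.
present_after_restore = {"User", "Group", "Organizational Unit"}

# Hypothetical dependencies, as they might be documented in the plan.
required_by_app = {
    "backup software": {"Backup Job Queue"},
    "file services": {"User", "Group"},
}

gaps = apps_needing_reinstall(present_after_restore, required_by_app)
for app, missing in gaps.items():
    print(f"Reinstall {app}: missing {sorted(missing)}")
```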
The next step is restoring the actual NDS objects. It is very important to choose only those object types that you will need during the fail-over phase. You may only want to restore the specific classes that you need during the recovery, such as User and Group objects. Do not restore objects that already exist in the recovered environment. Below is a list of some of the objects that you should not restore.
Security container
OU containers that already exist
Server objects
Volume objects
License containers and objects
Printer objects (unless they are required)
SMS objects
Application objects that are recreated during installation (BackupExec and NetShield are two examples)
Admin and Backup user objects
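The selection rule above (restore only the classes you need, and skip anything that already exists in the rebuilt tree) can be expressed as a simple filter. The following Python sketch assumes you have some listing of the objects on the backup tape; the object names, the dictionary layout, and the `select_for_restore` helper are hypothetical, and a real backup program will have its own selection interface.

```python
def select_for_restore(backed_up: list[dict],
                       wanted_classes: set[str],
                       existing_names: set[str]) -> list[dict]:
    """Keep objects of a wanted class that are not already in the new tree."""
    return [
        obj for obj in backed_up
        if obj["class"] in wanted_classes and obj["name"] not in existing_names
    ]


# Hypothetical listing of objects found on the NDS backup.
backed_up = [
    {"name": "CN=Admin.O=ACME", "class": "User"},
    {"name": "CN=JSmith.OU=HQ.O=ACME", "class": "User"},
    {"name": "CN=Sales.OU=HQ.O=ACME", "class": "Group"},
    {"name": "CN=FS1.OU=HQ.O=ACME", "class": "NCP Server"},
]

# Objects created when the recovery servers were installed.
existing = {"CN=Admin.O=ACME", "CN=FS1.OU=HQ.O=ACME"}

selected = select_for_restore(backed_up, {"User", "Group"}, existing)
for obj in selected:
    print(obj["name"])
```

Here the Admin user and the server object are skipped, and only the remaining User and Group objects are selected.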
Once NDS is restored, verify that there are no unknown or renamed objects. These may be present if you restored an object that already exists. Verify time synchronization and NDS replication on all servers that have replicas.
Once the NDS restoration has been verified, you may set bindery contexts on servers that require them. Your instructions should also reference Appendix G with this information.
Step 2.4 - Restore File System. To restore the file system data, first restore your backup programs on the remainder of the servers and verify that the TSA agents are loaded. Then proceed with the data restoration.
To decrease the overall restore time, your plan may specify that each recovery server be configured with its own tape drive. This will allow you to restore the servers in parallel, dramatically reducing the time required. In a perfect world, each server's data would be contained on one tape and all the server restores could be submitted simultaneously. Obviously, this won't always be the case. Your plan must analyze the current backup rotation and determine the most appropriate restore cycle. Identify the most critical data and create the first job to restore this data.
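To see why a tape drive per server matters, it helps to run the arithmetic. The Python sketch below compares a serial restore through one drive with a parallel restore through one drive per server. The data sizes and the throughput figure are made-up assumptions; substitute numbers measured during your own pilot.

```python
def restore_hours(data_gb: dict[str, float], gb_per_hour: float, parallel: bool) -> float:
    """Estimate total restore time for a set of servers.

    Serial: one drive restores the servers one after another (times add up).
    Parallel: one drive per server (the slowest server sets the total).
    """
    per_server = [gb / gb_per_hour for gb in data_gb.values()]
    return max(per_server) if parallel else sum(per_server)


# Hypothetical data volumes per server and an assumed tape throughput.
data_gb = {"FS1": 20.0, "FS2": 10.0, "FS3": 30.0}
throughput = 10.0  # GB per hour, an assumption for illustration only

print("Serial:  ", restore_hours(data_gb, throughput, parallel=False), "hours")
print("Parallel:", restore_hours(data_gb, throughput, parallel=True), "hours")
```

With these assumed figures the serial estimate is 6 hours and the parallel estimate is 3 hours. The estimate ignores tape changes and cataloging time, so treat it as a lower bound.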
Note: I recommend presenting the restore job submission in flowchart format. This way you will be able to show the relationships and dependencies of backup jobs.
It would ease recovery efforts to do full backups of each server each night; but again, this isn't always the case. Your plan will have to include instructions for restoring differential or incremental data. Differential backups include all data that has changed since the last full backup. Incremental backups include all data that was changed since the last full or incremental backup. During a recovery, it is easier to restore from a differential backup because it will be contained on one tape (or tape set). To restore from an incremental backup, you will need all the incremental tapes made since the last full backup that you restored.
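The difference between the two schemes shows up most clearly in which tapes you have to pull. The Python sketch below works that out from a chronological backup history. The tape labels and the `tapes_for_restore` helper are hypothetical, and the sketch deliberately refuses rotations that mix differential and incremental jobs after the same full backup, since the correct order for those depends on how your backup software is configured.

```python
def tapes_for_restore(history: list[tuple[str, str]]) -> list[str]:
    """Return tape labels to restore, in order, from a chronological history.

    Each history entry is (tape_label, kind), where kind is "full",
    "differential", or "incremental". Raises ValueError if there is no
    full backup or if the rotation mixes differential and incremental jobs.
    """
    last_full = max(i for i, (_, kind) in enumerate(history) if kind == "full")
    full_label = history[last_full][0]
    later = history[last_full + 1:]
    kinds = {kind for _, kind in later}

    if not later:
        return [full_label]
    if kinds == {"differential"}:
        # Only the most recent differential is needed on top of the full.
        return [full_label, later[-1][0]]
    if kinds == {"incremental"}:
        # Every incremental since the full is needed, oldest first.
        return [full_label] + [label for label, _ in later]
    raise ValueError("mixed rotation: work out the restore order by hand")


# Hypothetical week of backups under each scheme.
differential_week = [("TAPE-01", "full"), ("TAPE-02", "differential"), ("TAPE-03", "differential")]
incremental_week = [("TAPE-01", "full"), ("TAPE-02", "incremental"), ("TAPE-03", "incremental")]

print(tapes_for_restore(differential_week))
print(tapes_for_restore(incremental_week))
```

The differential week needs two tapes (the full and the latest differential), while the incremental week needs all three.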
Step 2.5 - Set Up Printing. As mentioned previously, this document assumes a complete site disaster. If this is the case, the printing environment during the recovery is probably not going to be the same as it was in the production environment. The restored configuration will be determined by the equipment on hand at the recovery site. As the printing environment will have to be manually recreated, specific instructions are not included in this plan.
I recommend recreating a Novell Distributed Printing Services (NDPS) environment since it can be easily configured to deploy the printer definitions to the workstations (assuming that the desktops are unavailable as well).
Be aware that some legacy applications, usually those that run within a DOS window, may require LPT ports to be captured. If this is the case, it's a good idea to make a note of these applications and the capture settings in this document.
At this point, all of your servers should be properly thriving in the recovered NDS tree and have their data restored. But before you stop to celebrate, you need to verify that everything is golden. Move on to Step 3.
Step 3 - Verification
Now that you've made it this far, please don't abandon your discipline. All of your planning and efforts may be for naught if the right users can't access the right information. It is important to check the accuracy of the recovered environment.
Step 3.1 - Verify NDS. First, verify that time is synchronized on all servers in the tree. Next, verify that NDS is healthy by running DSREPAIR's "Repair Local Database" option on each server that holds a replica. Verify that there are no errors.
Step 3.2 - Verify File System. Verify that the file system has been recovered by spot-checking file system and trustee assignments. If you have the tools to do so, include a trustee report with this plan. In the event of an error during the NDS restore, you may need this report to manually recreate trustee assignments. Also, check backup logs for restore errors.
Step 3.3 - Verify Login Scripts. Verify that login scripts are correctly mapping drives. It is a good idea to record the login scripts in Appendix J. They are a critical component that may be overlooked in the event of an NDS restore problem.
Remember to comment out references to resources that do not exist in the recovered environment. For example, remove references to printers and mappings to servers that may not be part of the recovery procedure.
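If the login scripts are kept as text in Appendix J, the commenting-out can be prepared ahead of time. The Python sketch below prefixes any line that mentions an unavailable resource with REM, one of the comment markers that NetWare login scripts accept. The script lines, server name, queue name, and the `comment_out` helper are hypothetical, and the match is a plain substring test, so a name such as FS1 would also match FS10; review the result before pasting it back into NDS.

```python
def comment_out(script: str, unavailable: list[str]) -> str:
    """Prefix lines that reference unavailable resources with REM."""
    result = []
    for line in script.splitlines():
        upper = line.strip().upper()
        already_comment = upper.startswith(("REM", ";", "*"))
        refers = any(name.upper() in upper for name in unavailable)
        if refers and not already_comment:
            result.append("REM " + line)
        else:
            result.append(line)
    return "\n".join(result)


# Hypothetical login script as it might be recorded in Appendix J.
script = "\n".join([
    "MAP ROOT P:=FS1/VOL1:USERS",
    "MAP ROOT S:=FS9/VOL1:SHARED",
    "#CAPTURE L=1 Q=HP4_Q",
])

# Hypothetical resources that are not part of the recovery environment.
cleaned = comment_out(script, ["FS9", "HP4_Q"])
print(cleaned)
```

Commenting lines out rather than deleting them makes it easy to reinstate them during fail-back.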
Step 3.4 - Verify Applications. If any applications were running on your servers, make a note to verify their proper functionality. List these applications in Appendix A and include a reference to this section. Examples of critical applications that may be running on a NetWare server are DNS/DHCP, GroupWise agents, and applications dependent on the Pervasive database.
Step 4 - Fail-back Plan
Once the disaster has been handled, you'll be ready to restore the original production environment in its entirety. This section outlines the general steps to take when restoring the production environment.
Step 4.1 - Recover Original Server Environment. Follow these steps if the disaster destroyed the original production servers; if the original servers are recoverable, move to Step 4.2. Otherwise, rebuild the servers and connectivity hardware following the same procedures as in Step 2. Use the same tapes that were used during the recovery to restore the file system to the new production servers. This will give you the starting point from which the recovery mode began.
Step 4.2 - Restore Changes Made During Recovery Mode. Unless major changes have been made to the NDS tree while in recovery mode, NDS does not have to be backed up and restored to the new production system. Perform differential backups of the recovery servers to capture all data that has been modified. Restore this differential data to the new production environment.
Step 4.3 - Verification. Follow the same procedures as in Step 3 above to verify the production environment's proper operation. The only exception may be the printer restoration. Since you are recovering the original production site, the printers should be recovered as they originally were. If the physical site is recoverable, you can restore the NDS printer objects from tape. If the physical site is not recoverable, you will have to redefine the printers manually.
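Your backup program will normally select the differential data for you. Still, it can be useful to preview how much changed during recovery mode before scheduling the fail-back. The Python sketch below lists files modified after a cutoff time by walking a directory tree, for example a drive mapped to a recovery server volume. The `modified_since` helper is hypothetical, and it relies on file modification times rather than the archive attribute a backup program would typically use, so treat the result as an estimate only.

```python
import os
from datetime import datetime


def modified_since(root: str, cutoff: datetime) -> list[str]:
    """Return paths under root whose modification time is after cutoff."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > cutoff.timestamp():
                changed.append(path)
    return sorted(changed)


# Example call (path and date are placeholders):
#   modified_since("P:\\", datetime(2001, 11, 1, 8, 0))
```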
Appendices
The information in the Appendices is to be used as reference during the recovery. It is best to include specific information about your site so it is easy to maintain this document. The following are recommendations for what you may want to include in your disaster recovery plan.
Appendix A: Vendor Support Contacts
List the contact information for vendors of all components in this recovery plan. Below is a sample of what your table may look like and what information to include.
Vendor | Product | Support Number | Support ID
Novell | NetWare | 1-800-858-4000 | PIN XXXXXX
Compaq | Proliant Servers | 1-800-231-9977 | XXXXXX
Veritas | BackupExec v8.5 | 1-800-634-4747; 1-800-531-7750; Express Routing Code: 020 | #XXXXXXXXX
Appendix B: Hardware Requirements
Detail what specific hardware is required for recovery. Make a checklist so it is easy to verify that everything is there. The last thing you want is to realize that you have all of your servers, but only one monitor and keyboard.
Be sure to specify the same form factor of the tape drive on the recovery server. Verify that the recovery tapes can be read in the emergency drives.
Don't forget cables!
Appendix C: Software Requirements
List the software requirements with the same diligence as the hardware list. In the event of a complete site disaster, software stored at the main location may not be accessible. For this reason, all of the items contained on this list should be stored offsite along with the backup tapes.
Be sure to include the following:
NetWare media - one physical copy for each server
The latest service pack on a CD, expanded (as with the OS media, include one physical copy for each server)
Server vendor hardware configuration CDs (Compaq SmartStart, for example)
Install scripts and scripting tools, if used
NetWare server and connection licenses
Backup software and licenses (if a tape drive is going to be installed on multiple servers, you may need more backup software licenses than you are currently using)
Appendix D: Current Backup Scheme
Provide a detailed spreadsheet of your current tape backup rotation scheme. This will help the engineer performing the recovery to quickly know which tape should be cataloged.
Appendix E: Current Data Volume Configuration
Include a table of the current volume configuration for each server. Be sure to include the volume name, size, and any name spaces that are loaded. You may also want to note whether the volume is in the traditional NetWare format or the new NSS format. Note any deviations from the default for block size, compression, and suballocation settings.
Server | Volume | Size | Name Spaces
FS1 | SYS | 5 GB | DOS, LONG
FS1 | VOL1 | 10 GB | DOS, LONG
FS1 | VOL2 | 5 GB | DOS, LONG
Appendix F: Current Server Configuration
Include specific information for each server being recovered. At minimum, include the following items:
Server Name
Server ID
IP address (including subnet mask, DNS server, and default gateway)
IPX frame type(s) and network address
Server's NDS context
Any bindery contexts on this server
Appendix G: NDS Configuration
List information about how NDS is partitioned and replicated. You may also want to document trustee information here if you are fortunate enough to have the tools to do so (such as BindView).
Appendix H: Troubleshooting Tips
Include any troubleshooting tips that you feel may be useful during the recovery process. Make notes of the problems you encountered during the pilot of this plan. Ideas for topics include:
How to add another name space to a volume
SMDR issues
How to catalog a BackupExec tape
Appendix I: Data Restore Flowchart
Depending on the complexity of your backup scheme, it may be useful to provide a flowchart for the data restore process. Show the order in which the restore jobs should be submitted and the interdependencies of the jobs. For example, if you have to wait until job 4 is complete before moving that tape to another server, a flowchart will show that more clearly than a restore job list.
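Before drawing the flowchart, it can help to write the job dependencies down and let a topological sort propose a valid submission order. The Python sketch below uses the standard library's `graphlib` module (Python 3.9 or later). The job names and dependencies are hypothetical; each entry lists the jobs that must finish before that job can start.

```python
from graphlib import TopologicalSorter

# Hypothetical restore jobs. Each key maps to the jobs it must wait for.
dependencies = {
    "job1_FS1_full": set(),
    "job2_FS2_full": set(),
    "job3_FS1_diff": {"job1_FS1_full"},
    "job4_FS2_diff": {"job2_FS2_full"},
    # Job 5 needs the tape that job 4 is using, so it waits for job 4.
    "job5_FS3_full": {"job4_FS2_diff"},
}

# static_order() yields one valid order; other valid orders may exist.
order = list(TopologicalSorter(dependencies).static_order())
for position, job in enumerate(order, start=1):
    print(position, job)
```

If the dependencies contain a cycle (two jobs each waiting on the other's tape), `TopologicalSorter` raises `CycleError`, which is a useful sanity check on the plan itself.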
Appendix J: Login Scripts
List each login script to be referenced in the event of an NDS restore problem. Login scripts are easy to overlook during a recovery. However, they can be critical to a successful recovery as far as users are concerned. You may have done everything right up to this point, but if the users can't see their P: drive, the recovery was unsuccessful in their eyes.
Conclusion
This AppNote has detailed the creation of a disaster recovery plan for a NetWare/eDirectory network environment. Having such a plan for your organization is a big step toward protecting your valuable business data from a complete site disaster.
For additional information, refer to the following resources:
TID #10012766, "How to Backup and Restore NetWare DS" dated May 15, 2001, available at http://support.novell.com
Disaster recovery articles in Network Computing magazine, March 5, 2001, available at http://www.networkcomputing.com/1205/1205toc.html
"A Disaster Recovery Strategy for Mixed NetWare 4/5 Networks" (http://support.novell.com/techcenter/articles/ana19990903.html)
"Backing Up and Restoring Novell Directory Services in NetWare 4.11" (http://support.novell.com/techcenter/articles/ana19961003.html)
"Backing Up and Restoring Novell Directory Services in NetWare 4" (http://support.novell.com/techcenter/articles/ana19950801.html)
* Originally published in Novell AppNotes