Losing Weight at COMDEX

Linda Kennard

01 Mar 2000

Although the name Novell Connecting Points (NCP) might not sound familiar to you, many of you have both seen and used NCP. NCP is the messaging system that the Novell Corporate Events Team (hereafter called "the team") designs and configures before packing it, shipping it, and setting it up at trade shows all over the world.

From BrainShare U.S. in Salt Lake City, Utah, to NetWorld+Interop in Tokyo, Japan, NCP travels the globe in various sizes and configurations. (For more information about NCP, see "Novell Connecting Points," NetWare Connection, June 1999, pp. 20-25.) Despite its changing locale and design, NCP's purpose remains the same: to provide individual e-mail accounts that trade-show attendees can use to exchange e-mail messages (with or without attachments) with anyone who has an e-mail address.

Of all of the trade shows that have featured NCP, COMDEX/Fall '99 was the largest. By the last day of this five-day show, more than 300,000 attendees had crowded the halls of the Las Vegas Convention Center (LVCC) and the neighboring Sands Expo and Convention Center and the Venetian Resort in Las Vegas, Nevada. During the show, attendees compared and tested new products and technologies that more than 2,100 IT vendors showcased. (To learn more about COMDEX, visit the COMDEX web site at http://www.zdevents.com/comdex.)

Interestingly, this largest show can also claim the smallest footprint for an NCP at COMDEX. NCP at COMDEX/Fall '99 required considerably less hardware than NCPs at any previous COMDEX. For example, for COMDEX/Fall '98 the team hauled 44 servers from Novell headquarters in Orem, Utah, to the show floor in Las Vegas. In contrast, for COMDEX/Fall '99, the team took only 14 servers for the backend--eliminating unwanted (and unnecessary) pounds.

In shrinking NCP's footprint, the team simultaneously (and surprisingly) managed to improve NCP's reliability and performance. During the show, the backend NCP servers experienced abends only twice, and both of these failures went unnoticed by the COMDEX attendees who used NCP. In fact, the team members themselves did not learn about the abends until after the fact and, even then, were unfazed. Brian Petersen, Novell field marketing specialist, explains that he corrected the problem "when [he] felt the need to," which was at the end of the day, long after attendees had gone home. In the meantime, NCP at COMDEX/Fall '99 continued offering its services--uninterrupted.

THE SECRET BEHIND THE HARDWARE DIET

At this point, you may well be asking, "How did NCP at COMDEX/Fall '99 lose weight and gain more reliability?" The answer is, by running--running NetWare Cluster Services for NetWare 5, that is.

NetWare Cluster Services for NetWare 5 is software that you install on Intel-based servers running NetWare 5 with Support Pack 3 or above to interconnect several servers. These servers, which are called nodes, then act as a single system, which is called a cluster. (For more information about NetWare Cluster Services, see "Uptime in Real Time With NetWare Cluster Services for NetWare 5," NetWare Connection, Sept. 1999, pp. 6-18. You can also test-drive NetWare Cluster Services by visiting Novell's democity at http://democity.novell.com, or you can visit Novell's web site at http://www.novell.com/products/clusters/ncs.html.)

The point of setting up a cluster is to make a system more reliable without having to purchase and install more hardware. By running NetWare Cluster Services on NCP at COMDEX/Fall '99, the team demonstrated this point well, dramatically reducing NCP's footprint while improving its reliability and performance.

THREE CLUSTERS AND A FISH TANK

NCP at COMDEX/Fall '99 featured a backend system consisting of three clusters. Six Compaq 1850R servers running NetWare Cluster Services formed Cluster 1. Clusters 2 and 3 were each built on four Compaq 1850R servers, which were also running NetWare Cluster Services. Do the math, and you'll arrive at a total of 14 nodes.

Fourteen nodes, says Petersen, who designed this backend system, wasn't a magic number; it was just the smallest number of nodes he "felt comfortable with." Mounted on one rack, these 14 nodes were centrally located in the NCP Network Operations Center (NOC), which the team calls the fish tank. (See Figure 1.)

Figure 1: To provide services to 300,000 attendees at COMDEX/Fall '99, NCP consisted of 14 NetWare Cluster Services nodes running on 14 Compaq 1850R servers, which were mounted on one rack.

Fiber optic cable and two microwave towers provided links from the fish tank to approximately 300 NCP workstations. (For more information, see "Connections to Connecting Points.") As you might expect, the workstations ran the Novell client software that ships with NetWare Cluster Services (that is, Novell Client 3.1 for Windows 95/98). Connection speeds between the NOC and the workstations ranged from 100 Mbit/s to 1 Gbit/s. (See Figure 2.)

Figure 2: NCP at COMDEX/Fall '99 included the Network Operations Center (NOC) and 300 workstations. The connections between the NOC and these 300 workstations consisted of fiber optic cable running as fast as 1 Gbit/s and a 100 Mbit/s microwave link.

Originally, Petersen planned to design one 14-node cluster. However, after speaking with Novell's clustering development group, Petersen and other members of the team decided that a lone cluster might not be able to handle some of the after-hour maintenance routines. For example, the team was particularly concerned that a single cluster might not perform satisfactorily on the first, second, and third nights of the show, when the team planned to distribute messages to each of the 300,000 NCP users.

"Our big concern was input/output," Petersen explains. "We only have a few hours at night to distribute mail and were concerned that the shared pipe from the cluster to the data stores would not be big enough to get the performance we needed."

The system's data stores to which Petersen refers were on shared volumes that Petersen set up on three Compaq RAID Array 4000 storage subsystems--one array for each cluster. These arrays were connected to the clusters by way of three Compaq Fibre Channel hubs--one for each array and cluster. The connection was made over fiber optic cable at a rate of 1 Gbit/s. Both the hubs and the arrays were mounted in a rack that sat next to the rack of cluster nodes. (See Figure 3.)

Figure 3: For NCP at COMDEX/Fall '99, the Novell Corporate Events Team set up three NetWare Cluster Services clusters with a total of 14 nodes. Each cluster was connected over fiber optic cable to a dedicated array by way of a dedicated hub.

The team then set up two 40-inch flat-plasma Pioneer screens, one on each side of the racks. These screens served as monitors that were both connected to a single workstation the team had set up to manage NCP. The monitors displayed various screens from the ConsoleOne snap-in module for NetWare Cluster Services. The view on one monitor changed periodically, while the view on the other monitor continuously displayed the clusters' Cluster View screens. (See Figure 4.)

Figure 4: In the event of a failover, the Cluster View screen from the NetWare Cluster Services ConsoleOne snap-in module shows the state and location of resources that were affected during that failover. (The screen shown here is only an example; it is not the screen that members of the Novell Corporate Events Team saw during COMDEX/Fall '99.)

NEW TECHNOLOGY, OLD-FASHIONED SERVICE

Running NetWare Cluster Services was relatively new to the team, who had run the clustering software only once before, at the fall 1999 NetWorld+Interop held in Atlanta, Georgia. The pair of microwave towers that established the wireless, Fast Ethernet connection between the NOC and NCP workstations in the Sands and the Venetian Resort was also new to NCP. (For more information about this microwave connection, see "Cool Communications With Hot New Microwave.") Although these and other technologies that kept NetWare Cluster Services connected and running were new to NCP, the services NCP provided at COMDEX/Fall '99 were essentially the same services that NCP has provided for other trade shows.

For example, NCP at COMDEX/Fall '99 ran Novell Directory Services (NDS), as have all NCPs in the past. However, NCP at COMDEX/Fall '99 ran NDS 8 on Nodes 1, 2, 3, 4, 5, and 6. (See Figure 3.) Running NDS 8, which can support one billion (and possibly more) objects in a single NDS tree, enabled the team to place User objects for all 300,000 users in one Organizational Unit (OU) object.

Earlier versions of NDS could not support 300,000 User objects in one container object. For example, for NCP at COMDEX/Fall '98, which ran NDS 7, the team had to create 36 OU objects, each of which held only 7,500 User objects.

Like other NCP users, NCP users at COMDEX/Fall '99 could use NCP print services. The team ran Novell Distributed Print Services (NDPS) 2.0 on Node 3. NDPS enabled NCP users to print to any one of the eight Hewlett-Packard 4000tn printers available on the show floor. (Each NCP location had two printers.)

Of course, the star of NCP, as always, was GroupWise. For COMDEX/Fall '99, the team ran GroupWise 5.5 Enhancement Pack on Nodes 8, 9, 10, 11, 12, 13, and 14. As its name suggests, the GroupWise 5.5 Enhancement Pack enhances a GroupWise 5.5 system. (For more information about GroupWise 5.5 Enhancement Pack, visit http://www.novell.com/groupwise/gw55ep.html.)

Specifically, GroupWise 5.5 Enhancement Pack enables you to install one or all of the following enhancement components:

Windows Client Enhancement
WebAccess Enhancement
Agent Enhancement
Administration Enhancement

For COMDEX/Fall '99, the team ran all of these components except the Windows Client Enhancement, which the team did not have adequate time to test before the show.

The Agent and Administration Enhancement components offer behind-the-scenes enhancements, such as new monitoring capabilities and support for NetWare Cluster Services. These enhancements benefited the team members; the WebAccess Enhancement component, in contrast, directly affected NCP users (although they probably didn't notice).

WORLD-WIDE WEBACCESS

The team ran the WebAccess Enhancement component on Nodes 9 through 14. As its name suggests, the WebAccess Enhancement component adds new features to the GroupWise 5.5 WebAccess Gateway. The GroupWise 5.5 WebAccess Gateway (with or without the WebAccess Enhancement component) enables users to access their GroupWise mailbox over the web using an HTML-compliant browser. Specifically, the GroupWise WebAccess Gateway on NCP at COMDEX/Fall '99 enabled NCP users to access their GroupWise account using Netscape Navigator 4.6 running on NCP workstations or using a browser running on a laptop or other remote computer.

When users launched Netscape Navigator running on an NCP workstation, the browser opened automatically to the NCP home page. From this page, users could click the GroupWise WebAccess icon to open the WebAccess home page.

When users accessed this home page, the cluster node running the WebAccess gateway requested a username and password. The browser responded by checking the Windows registry. Because the user was using an NCP workstation, the browser found the username and password in the registry and returned them to the node, which subsequently logged the user in to the system. Consequently, the browser did not need to prompt users at NCP workstations for a username or password but instead automatically opened the user's GroupWise account.

The secret behind this seeming magic was a Java-based login program. To log in to NCP at COMDEX/Fall '99, users swiped their COMDEX badge through the swipe-card readers attached to the NCP workstations. The login program then listened on the COM port and captured the login data that passed through when the card was swiped. After capturing the username and password, the login program passed that information to another program, which placed the login information for that session in the Windows registry.

Users who accessed the GroupWise WebAccess home page from a laptop or other remote computer naturally did not have this handy program and so were prompted for their username and password. Users learned their NCP username and password by clicking the Remote Access Information icon on an NCP workstation desktop. Clicking this icon launched a small Visual Basic application (called RADINFO.EXE), which a team member wrote for NCP. The purpose of this application was to retrieve and display a user's NCP username and password.
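
The two login paths described above can be summarized in a short sketch. The Python below is an illustration only: the actual login program was Java-based and stored its data in the Windows registry, whereas this sketch uses an in-memory object as a stand-in for the registry entry, and every name in it (SessionStore, on_badge_swipe, webaccess_login) is hypothetical.

```python
from typing import Callable, Optional, Tuple

Credentials = Tuple[str, str]


class SessionStore:
    """Hypothetical stand-in for the per-session registry entry."""

    def __init__(self) -> None:
        self._entry: Optional[Credentials] = None

    def save(self, username: str, password: str) -> None:
        self._entry = (username, password)

    def load(self) -> Optional[Credentials]:
        return self._entry


def on_badge_swipe(store: SessionStore, username: str, password: str) -> None:
    # At an NCP workstation, the badge swipe supplied the login data,
    # which was then stored for the length of the session.
    store.save(username, password)


def webaccess_login(store: Optional[SessionStore],
                    prompt: Callable[[], Credentials]) -> Credentials:
    # NCP workstation: the credentials are already stored, so no prompt.
    if store is not None:
        found = store.load()
        if found is not None:
            return found
    # Laptop or other remote computer: nothing is stored, so ask the user.
    return prompt()
```

Passing None for the store models a remote computer, where the swipe-card program was not available and the prompt was the only path.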

Of course, enabling users to access their GroupWise accounts over the web is possible with or without the WebAccess Enhancement component. However, the WebAccess Enhancement component offers a few things that an unenhanced GroupWise 5.5 WebAccess Gateway cannot. For example, the WebAccess Enhancement component provides several patches for memory leaks in the GroupWise 5.5 WebAccess Gateway. According to Petersen, these patches made the NCP WebAccess gateways more reliable.

The WebAccess Enhancement component also enhances the functionality of the GroupWise 5.5 WebAccess Gateway. For example, by installing the WebAccess Enhancement component, the team enabled NCP users at COMDEX/Fall '99 to delete messages without having to open them.

In addition, the WebAccess Enhancement component enabled new items, such as messages, to open in separate, overlapping browser windows. This feature enabled users to toggle between messages without having to use their browsers' Forward and Back buttons.

SMALL FOOTPRINT, BIG SYSTEM

By running NetWare Cluster Services, the team was able to provide these same services--without interruption--using far fewer servers than NCP implementations have required in the past. Before the team ran NetWare Cluster Services, Petersen explains, "we managed to failsafe our system only with hardware-heavy designs." To ensure that NCP services remained available for the duration of a trade show, the team set up "spare servers," which were kept "online, ready to do whatever [the team] needed them to do."

More specifically, the team ran system services, processes, and applications on separate servers. For example, the team set up dedicated servers that ran only the data stores, other servers that ran only the GroupWise Message Transport Agent (MTA) and Post Office Agent (POA) NetWare Loadable Modules (NLMs), other servers that ran only the GroupWise WebAccess Gateway, and still other servers that ran only the GroupWise Internet Access (GWIA) Gateway. By using this design, Petersen explains, "if a server running the GroupWise agent NLMs failed, the servers running the data stores would remain up."

Furthermore, for each dedicated server, the team configured another identical server. Having to set up dedicated servers and identical twin servers created a hardware-heavy system.

THE FAILOVER PLAN

With NetWare Cluster Services, "there's no need for such redundancy," Petersen says. A NetWare Cluster Services node can run several services and applications without the threat of one of them downing the system. If a node fails, NetWare Cluster Services detects the failure and begins a failover process: NetWare Cluster Services moves the failed node's resources and associated IP addresses to one surviving node or distributes them among several surviving nodes.

Petersen's failover plan centered around the GroupWise system, which comprised a total of seven domains: one primary domain, named NCP-GW, and six user domains, named NCP-DO00, NCP-DO17, NCP-DO34, NCP-DO57, NCP-DO64, and NCP-DO88.

The numbers included in the domain names were arbitrarily generated by a user import utility. This utility divided the post offices so that each of the six domains had six post offices, for a total of 36 post offices. The utility then divided the user accounts equally among these 36 post offices. Each post office held approximately 7,500 user accounts.
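
The division described above can be sketched in a few lines of Python. This is not the team's import utility, whose internals the article does not describe. It is a minimal illustration that assumes a total of 270,000 accounts (36 post offices times the approximate 7,500 accounts each) and a simple even split; the function name distribute_accounts is hypothetical.

```python
from typing import Dict, List

DOMAINS = ["NCP-DO00", "NCP-DO17", "NCP-DO34", "NCP-DO57", "NCP-DO64", "NCP-DO88"]
POST_OFFICES_PER_DOMAIN = 6


def distribute_accounts(total_accounts: int) -> Dict[str, List[int]]:
    """Spread accounts as evenly as possible over every post office."""
    office_count = len(DOMAINS) * POST_OFFICES_PER_DOMAIN
    base, extra = divmod(total_accounts, office_count)
    # The first `extra` post offices take one additional account each.
    sizes = [base + (1 if i < extra else 0) for i in range(office_count)]
    return {
        domain: sizes[d * POST_OFFICES_PER_DOMAIN:(d + 1) * POST_OFFICES_PER_DOMAIN]
        for d, domain in enumerate(DOMAINS)
    }


# Assumed total: 36 post offices x 7,500 accounts.
layout = distribute_accounts(36 * 7_500)
```

With the assumed total, each of the 36 post offices receives exactly 7,500 accounts; with any other total, the sizes differ by at most one account.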

For the NCP clusters at COMDEX/Fall '99, Petersen configured NetWare Cluster Services to move all of a failed node's resources and associated IP addresses to a prenamed surviving node. (See Figure 3.) As you would expect, the resources included the GroupWise agents and gateways within the domain for which the failed node was responsible.

As Figure 3 shows, Petersen configured NetWare Cluster Services to fail over resources (including the GroupWise agents and gateways) running on Cluster 1, Nodes 8, 9, and 10 to Cluster 1, Nodes 1, 2, and 3. Similarly, Petersen configured NetWare Cluster Services to fail over the resources on Cluster 2, Nodes 11 and 12 to Cluster 2, Nodes 4 and 5. Finally, Petersen configured NetWare Cluster Services to fail over the resources on Cluster 3, Nodes 13 and 14 to Cluster 3, Nodes 6 and 7.

As part of his failover plan, Petersen also configured the User Login resource so that NCP clients would attach to NCP nodes based on the nodes' IP addresses rather than node names (that is, server names). Thus, when a node failed, NCP clients would automatically reconnect to the failover node and its resources using the same IP addresses they had used to connect to the failed node and its resources. By configuring the User Login resource in this way, Petersen ensured that users would have uninterrupted service in the event of a failure.
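
The plan above, in which a failed node's resources and IP addresses move to a prenamed partner, can be modeled with a small lookup table. The Python sketch below is not NetWare Cluster Services configuration. It assumes that the nodes pair up in the order listed (8 to 1, 9 to 2, and so on), and the IP addresses and function names are hypothetical.

```python
from typing import Dict, List

# Failover partners from the plan above. Pairing the nodes in listed order
# (8 -> 1, 9 -> 2, ...) is an assumption.
FAILOVER_TARGET: Dict[int, int] = {8: 1, 9: 2, 10: 3, 11: 4, 12: 5, 13: 6, 14: 7}


def fail_over(failed: int, resources: Dict[int, List[str]],
              addresses: Dict[int, List[str]]) -> int:
    """Move a failed node's resources and IP addresses to its prenamed partner."""
    target = FAILOVER_TARGET[failed]
    resources.setdefault(target, []).extend(resources.pop(failed, []))
    # The IP addresses travel with the resources, so clients that connect
    # by address reach the surviving node without reconfiguration.
    addresses.setdefault(target, []).extend(addresses.pop(failed, []))
    return target


def node_for_address(ip: str, addresses: Dict[int, List[str]]) -> int:
    """Return the node currently answering on the given IP address."""
    for node, ips in addresses.items():
        if ip in ips:
            return node
    raise LookupError(ip)


# Hypothetical example data: Node 9 fails and Node 2 takes over.
resources = {9: ["WebAccess gateway"], 2: ["NDS 8"]}
addresses = {9: ["192.0.2.9"], 2: ["192.0.2.2"]}
survivor = fail_over(9, resources, addresses)
```

Because clients look up services by address rather than by server name, the same address resolves to the surviving node after the failover.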

Finally, Petersen configured NetWare Cluster Services so that all nodes had access to the system's data stores, which were stored on shared volumes set up on the system arrays. Anything users were working on (and had saved) at the time of a failure would be on the shared volumes to which all nodes had access. Consequently, if one node failed, the failover node accessed the data on the shared volume and simply picked up where the failed node left off.

A FAILURE? REALLY? WHEN?

During COMDEX/Fall '99, the team twice witnessed the success of Petersen's failover plan because two of the servers experienced abends during the show. Two abends during a five-day show is a "very low" number of abends, Petersen says. Petersen attributes the low number of abends to the memory patches provided by the WebAccess Enhancement component for the GroupWise 5.5 WebAccess Gateway.

When the two nodes experienced abends, the resources on those nodes failed over as planned. Consequently, team members noticed the abends only later, when they checked the Cluster View screen. Because the Cluster View screen showed that resources from the failed nodes were still running, there was little cause for worry, let alone action. Petersen casually investigated the failed nodes to determine the cause of the abends; he then restored these nodes and returned their resources only at the end of the day, when he had the time to do so.

When abends occurred on backend servers before NCP ran NetWare Cluster Services, the sequence of events that followed was not nearly as simple--and the team's response was not nearly as casual. In the days before NetWare Cluster Services, users who were accessing their GroupWise account by way of the WebAccess gateway saw an error message when the server running that gateway abended. "Then they'd turn around and look at us," Petersen begins. "And at that point, I would throw my hands in the air and start trying to figure out what happened."

Determining what happened was a lot more difficult than simply glancing at a screen displayed on a fish-tank monitor. Instead, Petersen or another team member would open the console for each of the servers, guessing at the possible failure and its probable cause. After determining which server failed, this team member could not just smile and wait until the end of the day to act on this knowledge. He or she downed the server immediately, fixed the problem, and restarted the server--all the while knowing users were waiting. Finally, the team would spread the word among users to restart Netscape Navigator.

USERS' BLISSFUL IGNORANCE

At COMDEX/Fall '99, the team didn't have to spread the word to restart Netscape Navigator because users never lost their sessions in the first place. When one of the servers running the GroupWise WebAccess Gateway abended, the NCP users using that gateway noticed little if any effect.

For example, suppose user Cheryl had been writing a new message when the abend occurred. At the moment of the abend and the subsequent failover, Cheryl wouldn't have seen an error message. In fact, she probably wouldn't have noticed anything at all unusual until she clicked to send the new message. At this point, Cheryl would have seen a pop-up window prompting her to reenter her username and password. After entering the information, Cheryl would have clicked to reconnect, the pop-up window would have disappeared, and the GroupWise WebAccess gateway would have refreshed Cheryl's mailbox.

In part, this ease of use during failovers was due to the fact that the team was running the WebAccess Enhancement component from GroupWise 5.5 Enhancement Pack. The WebAccess Enhancement component can display multiple screens at once, which enabled it to present the login prompt requesting that the user reenter his or her username and password. Without the WebAccess Enhancement component, the GroupWise 5.5 WebAccess Gateway cannot display multiple screens at once. As a result, when an abend occurs on servers running unenhanced gateways, users of those gateways are returned to the main login screen.

TIME TO SPARE

In addition to making server failures inconsequential, NetWare Cluster Services helped decrease the time required for the team's nightly maintenance routine. Every night during a show, the team backs up both the e-mail system and the NDS files. In the past, this backup could take anywhere from six to eight hours. At COMDEX/Fall '99, backing up the e-mail system and the NDS files took only about three hours.

On three of the show's five nights, the team distributed e-mail messages to each of the show's attendees. For example, on the first night--the night of the largest mail distribution--the team distributed five e-mail messages to each of the show's then expected 265,000 attendees in only 1 1/2 hours. At COMDEX/Fall '98, distributing the same number of messages took more than five hours.
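
To put these figures in perspective, the short Python calculation below converts them to average delivery rates. It uses only the numbers given above; because the COMDEX/Fall '98 distribution took "more than" five hours, the 1998 figure is an upper bound rather than a measured rate.

```python
MESSAGES_PER_ATTENDEE = 5
EXPECTED_ATTENDEES = 265_000


def messages_per_second(hours: float) -> float:
    """Average delivery rate for the first night's distribution."""
    total = MESSAGES_PER_ATTENDEE * EXPECTED_ATTENDEES  # 1,325,000 messages
    return total / (hours * 3600)


rate_1999 = messages_per_second(1.5)   # about 245 messages per second
ceiling_1998 = messages_per_second(5)  # about 74 messages per second at most
```

By this estimate, the clustered system at COMDEX/Fall '99 delivered mail at least 3.3 times as fast as the system at COMDEX/Fall '98.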

Why did the nightly maintenance routine take so much longer in the past? Without NetWare Cluster Services, the team did not use arrays for GroupWise data stores because the arrays, without Novell's clustering software, offered little more than a large storage area. Instead, the GroupWise data stores were distributed among server volumes throughout the system. As a result, the speed of backups and mail distributions depended upon the speed of the network and the speed of the servers running the volumes.

At COMDEX/Fall '98, the server speeds were as fast as the server speeds at COMDEX/Fall '99. At both trade shows, the team ran Compaq 1850Rs with 450 MHz processors. However, the network speed at COMDEX/Fall '98--which up until that point was as fast as NCP had ever been--paled in comparison to the network speed at COMDEX/Fall '99. The connection speed between servers at COMDEX/Fall '98 was only 100 Mbit/s compared to the 1 Gbit/s connection between cluster nodes at COMDEX/Fall '99.

NetWare Cluster Services sped things up because it enabled the team to incorporate the arrays into the backend system and to use them in a failover plan. NetWare Cluster Services also supported the Fibre Channel pipe that provided the 1 Gbit/s connection between the arrays and the cluster nodes. A fast pipe between the clusters and their arrays meant a fast pipe between source and destination locations for backups and mail distributions.

REVVING UP WITH RAID 1

NetWare Cluster Services also supports Redundant Array of Independent Disks (RAID) 1, which played a noteworthy role in minimizing the time required for nightly maintenance at COMDEX/Fall '99. RAID 1 enabled the team to spend less time maintaining NCP (and more time checking out COMDEX) by reducing the number of times information had to be read from and written to disk during backups and e-mail distributions.

RAID is a disk subsystem architecture in which two or more physical drives act as a single logical drive, thus providing a backup in the event of a drive failure. RAID 1 is the second of six RAID levels (RAID 0 through RAID 5) and incorporates disk mirroring (or disk duplexing).

At COMDEX/Fall '99, Petersen configured the eight 18-GB drives in each of the Compaq arrays as mirrored pairs: The first drive was mirrored by the second drive, the third drive was mirrored by the fourth drive, the fifth drive was mirrored by the sixth drive, and so on. Each time data had to be read from or written to the arrays, the RAID 1 process required only one read and one write. "We wrote to the first drive," Petersen says by way of explanation, "and [RAID 1] wrote to the second. It was a one-to-one ratio."
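
The pairing described above can be illustrated with a toy model. The Python sketch below is not how the Compaq arrays were configured in practice. It simply shows which drives pair up, how much of an eight-drive array's capacity remains usable under mirroring, and why a mirrored write costs one extra copy. The names mirrored_pairs and MirroredPair are hypothetical.

```python
from typing import Dict, List, Tuple

DRIVE_COUNT = 8
DRIVE_SIZE_GB = 18


def mirrored_pairs(drive_count: int) -> List[Tuple[int, int]]:
    """Pair drive 1 with 2, 3 with 4, and so on (1-based numbering)."""
    if drive_count % 2:
        raise ValueError("mirroring needs an even number of drives")
    return [(n, n + 1) for n in range(1, drive_count, 2)]


class MirroredPair:
    """Toy model of one RAID 1 pair: every write lands on both drives."""

    def __init__(self) -> None:
        self.primary: Dict[int, bytes] = {}
        self.mirror: Dict[int, bytes] = {}

    def write(self, block: int, data: bytes) -> None:
        self.primary[block] = data
        self.mirror[block] = data  # the duplicate copy

    def read(self, block: int, primary_failed: bool = False) -> bytes:
        # Either copy can satisfy a read, so data survives one drive failure.
        return (self.mirror if primary_failed else self.primary)[block]


pairs = mirrored_pairs(DRIVE_COUNT)
usable_gb = len(pairs) * DRIVE_SIZE_GB  # half of the 144 GB raw capacity
```

Under these assumptions, each eight-drive array yields four mirrored pairs and 72 GB of usable space.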

Prior to COMDEX/Fall '99, the team had used an array for storing GroupWise data only once, at the fall 1999 NetWorld+Interop in Atlanta. However, in Atlanta, the team had not had time to test RAID 1 on the arrays. Instead the team used a RAID 5 configuration, which they had already tested.

In RAID 5, data is striped in blocks across several hard disks, and parity data is distributed among those disks. Thus, if the team had used RAID 5 on the eight-drive arrays at COMDEX/Fall '99, each write to the arrays during a backup or e-mail distribution would, by Petersen's account, have involved all eight of the array's drives. In other words, Petersen says, "there would be eight I/Os per request." Clearly, the one-to-one ratio afforded by a RAID 1 configuration was considerably faster than a RAID 5 configuration would have been.

THEY'VE RUN IT BEFORE; THEY'LL RUN IT AGAIN

The NCP featured at COMDEX/Fall '99 was not the first NCP to run NetWare Cluster Services--and will not be the last. As you now know, the team first ran NetWare Cluster Services on the NCP at the fall 1999 NetWorld+Interop, which (as you don't yet know) holds the record for having the smallest footprint of an NCP at any show. This NCP's backend system consisted of one nine-node cluster, which provided the usual NCP services for 60,000 attendees.

"I think it's safe to say," Petersen concludes, based on this and other observations, "that NetWare Cluster Services will be required at all of the larger shows we do." When pressed to be more specific, Petersen says that by "larger" he means shows that have more than 30,000 attendees as opposed to smaller shows such as BrainShare, which has 5,000 to 7,000 attendees. Then Petersen reconsiders and revises this statement. On second thought, says Petersen, "Maybe we will run NetWare Cluster Services for every show we do."

Linda Kennard works for Niche Associates, an agency that specializes in writing and editing technical documents. Niche Associates is located in Sandy, Utah.

* Originally published in Novell Connection Magazine
