Identifying Test Workloads in LAN Performance Evaluation Studies
Systems Research Department
01 May 1992
This AppNote divides LAN performance evaluation studies into two categories: component tests and system tests. It then examines a variety of test workloads in relation to their effectiveness in meeting these goals.
Performance evaluation (PE) studies of LAN systems are extremely complicated. Far too often, the tests and their results are misunderstood and misapplied, many times by the testers themselves. It's no wonder the majority of LAN industry customers are confused by so many differing, sometimes diametrically opposed results.
In this AppNote I examine the LAN performance evaluation process. Beginning with the process itself, I divide PE tests into two important categories: those that test network components, and those that test systems. I then take a look at how a variety of test workloads meet these goals.
Component Tests versus System Tests
PE studies require a great deal of precision if the resulting data are to be interpreted and applied correctly. Texts on the subject emphasize the importance of carefully selecting the goals for the PE study before designing any test.
Within the LAN industry, all of the goals for PE studies can be categorized into one of the five lifecycle phases of the network:
Product development and testing
Procurement
Design
Optimization
Capacity planning
The majority of LAN PE tests have been built by vendors for the purpose of product development and testing. During development, vendors find it difficult to test every conceivable error condition, so they develop tests that saturate an individual network component with a set of random events, hoping to flush out as many error conditions and boundary problems as possible before the product is released for sale. I'll call these "component tests" to distinguish them from system tests.
Component Tests Defined
By design, good component tests will never be good system tests. Component tests:
Are designed to load or saturate an individual component
Produce an unnatural workload volume and pattern
Provide no known correlation to a production workload
Provide no known correlation to any number of real users
Cannot be used for system design, optimization, or capacity planning
The big difference between component tests and system tests is the test workload specification. Component test workloads are designed to stress an individual component.
PERFORM2, PERFORM3, IPXLOAD, MASTER/TEST, and DSK311 are all examples of component tests developed by Novell. Results from these tests are useful to Novell and other developers during the product development and testing process. But because these results describe component performance under unnatural conditions, they are irrelevant to the rest of the network lifecycle. Results from these tests don't provide a good indication of how the product will perform in a production environment.
On the other hand, system test workloads should be designed to emulate a customer's production environment. If the emulation is accurate, the results can be used to make decisions during the procurement, design, optimization, and capacity planning phases of the lifecycle. System tests and their results reveal the projected performance in the customer's environment.
Component tests, then, are appropriate only for the product development and testing phase. For all other phases of the network lifecycle, system tests are required (see Figure 1).
Figure 1: The majority of network lifecycle phases and PE goals require system tests rather than component tests.
Most of the confusion propagated by vendors, industry publications, test labs, and customers in the past stems directly from using component tests as system tests (benchmarks) and then extrapolating the results into conclusions that are not valid. This practice has resulted in a lot of skepticism about benchmarks in general.
There is a place for component testing. Vendors who design and test new components should use component tests developed to rigid specification for that purpose. But using the results of these tests to make decisions regarding system performance doesn't serve anyone well. The majority of the industry - those who are involved every day in the procurement, design, optimization, and capacity planning of systems - should use tests developed to test systems.
Three Laws of System Testing
To aid in this important migration away from the use of component tests as system tests (benchmarks), I've coined three laws of system testing.
Law 1. If you saturate the hardware, you're not testing the software.
A network system combines hardware and software to create a complete solution. The ideal test would not load individual components of the system, but would instead allow the tester to measure the impact of a production workload on the entire system under test. But since most tests don't adequately emulate production environments, they often saturate one component of the system abnormally. That is, they load the system in a way that may never be encountered in the real world.
An example of this unnatural loading is seen in Novell's component test PERFORM3. PERFORM3 was designed to load the LAN channel of the network and measure the composite throughput from server cache to the workstation. It does this by isolating the LAN channel hardware from other important server components and saturating that channel with a traffic pattern that will never be seen in the real world. Although this kind of test result is useful to Novell, it would be a mistake to use it to design systems.
NetWare is designed to service production workloads, not the unreal workloads generated by component tests. If the test saturates a hardware component unnaturally, the results don't represent the full capacity of the system (hardware and software) to service a production workload.
Law 2. System bottlenecks migrate with the workload.
Like the narrow part of a bottle, a system bottleneck is the resource within the system with the least capacity - the resource that is most likely to be overloaded or cause an obstruction under load. In network systems, the bottleneck can move, or migrate, under different kinds of workloads.
Differences in the workload characteristics of production workloads have a great impact on which system component becomes the bottleneck. For instance, the read-write ratio might be 3:1 in one environment, and 30:1 in another. The packet size distribution, the volume of transactions, and the frequency of requests also have a great deal to do with which system component becomes the bottleneck. These differences in workload makeup are the determining factor that moves the bottleneck from one area of the system to another.
Differences in PE tests have the same effect on system bottlenecks. Component tests locate bottlenecks within an individual component. System tests that accurately emulate production workloads locate the system bottleneck - one among many components - for that particular workload.
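The workload characteristics that Law 2 turns on - read-write ratio, request-size distribution, and so forth - can be summarized in a short sketch. This is purely illustrative: the trace format, field names, and the two example environments are assumptions for the sake of the example, not output from any Novell tool.

```python
from collections import Counter

def characterize(trace):
    """Summarize a workload trace given as (operation, size_in_bytes)
    records: read-write ratio and request-size distribution."""
    ops = Counter(op for op, _ in trace)
    reads, writes = ops["read"], ops["write"]
    sizes = Counter(size for _, size in trace)
    ratio = reads / writes if writes else float("inf")
    return {"read_write_ratio": ratio, "size_distribution": dict(sizes)}

# Two hypothetical environments, as in the text: a 3:1 and a 30:1 mix.
office = [("read", 512)] * 30 + [("write", 512)] * 10
lookup = [("read", 1024)] * 30 + [("write", 1024)] * 1

print(characterize(office)["read_write_ratio"])  # 3.0
print(characterize(lookup)["read_write_ratio"])  # 30.0
```

Two workloads that differ this sharply in their read-write mix will stress different components, which is exactly why the bottleneck migrates with the workload.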
Law 3. System test workloads must emulate, as closely as possible, the production workload of the target system.
In light of Laws 1 and 2 above, system test workloads must accurately emulate the workload of the system being designed (the target system) if their results are to be used to design or manage that system.
Whether on the target system, in the lab, or on a pilot system, accurate system testing must include an analysis of the production workload of the target system before any PE study can begin. The designer must ask, "What kind of workload is the target system going to service?" Once the analysis is complete and the test workload is developed, the accuracy of the test workload's emulation of the production workload should then be verified before proceeding with any testing.
Understanding the Test Workload
A test workload is the network traffic generated by the PE test. The test workload specification is the description of the workload, including measurements that describe the volume and pattern characteristics of the traffic that make up the workload. The test workload specification and a measurement scheme are the two fundamental components of a PE test.
Figure 2 is a diagram taken from what some consider a classic text in this field: Computer Systems Performance Evaluation, by Domenico Ferrari of UC Berkeley. This book has proven helpful to me as I've come to realize the importance of workload to all LAN PE studies.
Figure 2: Test workloads can be classified as real, synthetic, or artificial.
Real Test Workloads
A real test workload is made up of all the original programs and data processed in a production environment. In fact, the ideal real workload is the same one the customer uses: the actual working environment and production workload. This is identical to the working environment and workload the system should be procured, designed, and optimized for. Tests using real workloads - or accurate representations of real workloads - provide the only accurate insight into these network management processes.
Customers who try to optimize their systems using rules of thumb or their own common sense most often use their production (real) workloads to test their newly optimized systems. After making changes to a production system with the hope of improving some aspect of the system performance, they allow their users to log into the system and proceed with their work. The problem with this method is the inability to distinguish between measurable improvements and perceived improvements.
Applying the Three Laws of System Testing, real test workloads don't saturate hardware components unnaturally (Law 1), bottlenecks that do occur are accurate (Law 2), and real workloads include the volume and patterns of the production workload (Law 3).
When applied to a system under test, real workloads provide one of two kinds of results: first, an assurance that the system is capable of handling the workload with bandwidth to spare; or second, one or more bottlenecks that accurately pinpoint areas needing attention.
Synthetic Test Workloads
A synthetic workload has some components of a production workload, such as real user applications or real data, but is devised and developed to meet certain criteria usually known only by the designer of the test. Synthetic test workloads lack the volume and pattern characteristics - the reality - of real test workloads.
Two Significant Trade-Offs. Synthetic workloads are built with two significant trade-offs in mind.
First, real workloads have been difficult, if not impossible, to develop with existing tools. Handmade, synthetic workloads are much easier to build than real ones. A synthetic workload can be built in less than a day using scripting tools shipped with most off-the-shelf applications, whereas a tool that generates real workloads takes considerable engineering resources and man-years of development. Developers of synthetic test workloads thus sacrifice realism for ease of development.
Second, it's difficult to saturate today's LAN systems with real workloads, given that many production NetWare servers are servicing upwards of 250 users - even 1000 users in the case of the 1000-user version of NetWare v3.11. Working with a real workload that takes 250 users to load a server may be difficult for some lab environments. In fact, a real test workload of 250 users might not even fully load the system under test - a result some competitive testers find uninteresting.
Synthetic workloads, on the other hand, are most often run full speed without any representation of real delay - process time, think time, and so forth. Therefore, they quickly saturate an individual component of the system. These types of tests are sometimes called "server beaters." In this way, the goal to saturate the hardware is reached much more quickly with fewer resources. Again, the element sacrificed is realism.
For example, one spreadsheet file might be opened, read entirely, and written back to disk 50 times in a row - much faster than a human user would execute the same procedure. This type of test does not generate the workload of a real user using a spreadsheet, but creates an entirely new kind of workload. In fact, the process itself is one that might never be encountered in a production environment. The real trouble begins when the results are used to design systems for real users.
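The contrast described above - identical requests issued back-to-back versus paced by human delay - can be sketched as follows. Everything here is illustrative: the function names, the stand-in request cycle, and the think-time value are assumptions, not part of any actual test suite.

```python
import time

def run_cycles(issue_cycle, repetitions=50, think_seconds=0.0):
    """Issue the same open/read/write/close cycle `repetitions` times,
    optionally separating the cycles by user think time."""
    start = time.monotonic()
    for _ in range(repetitions):
        issue_cycle()
        if think_seconds:
            time.sleep(think_seconds)  # stands in for the user's editing time
    return time.monotonic() - start

requests = []
cycle = lambda: requests.append(("open", "read", "write", "close"))

synthetic_elapsed = run_cycles(cycle)                      # back-to-back
realistic_elapsed = run_cycles(cycle, think_seconds=0.01)  # paced

# Identical requests, radically different arrival rate:
print(len(requests))                          # 100
print(realistic_elapsed > synthetic_elapsed)  # True
```

The request stream is the same in both runs; only the pacing differs. The unpaced run drives the server far faster than any user could, which is precisely what makes it a "server beater" rather than an emulation.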
Problems with Synthetic Results. One symptom of this kind of testing is the easy misinterpretation of the resulting chart. Figure 3 shows an example of some results from a synthetic test.
Looking at charts similar to the one in Figure 3, non-technical customers often wonder if perhaps eight users is the maximum capacity of the system under test. Two reports from Novell - the LAN OS Report (1986) and the LAN Evaluation Report (1985) - helped propagate this myth by publishing graphs displaying systems that appear saturated with similar numbers of workstations.
Unnatural workloads generate unnatural results, which creates havoc for anyone responsible for explaining to the industry what the results mean. Charts resulting from tests using synthetic workloads show wide margins of performance between competing products; in a production environment, such differences may not even be perceptible to the user community. Although these tests are rarely positioned as accurate predictors of production performance, this discrepancy is seldom discussed or understood by readers.
Figure 3: Sample chart of results from a synthetic workload.
Another problem with synthetic workloads is that anyone can create one that highlights a system's strengths and hides its weaknesses - hence the often diametrically opposed results from competing vendors, from field experience, or both. Tests using synthetic workloads are a favorite of hardware manufacturers because they so easily demonstrate the need for bigger and faster hardware.
Yet another problem with synthetic test workloads is their deceptive nature. Since synthetic workloads are often designed using real applications, interpretations of the results often include the word "realistic," leading many to believe the workload to be real. However, the script-driven workloads are anything but realistic. The bottlenecks pinpointed by these tests don't have any demonstrable relationship to those created by production uses of applications, nor do they allow system design or capacity planning based on a given number of users. This is because there is no way of knowing how many real users the synthetic workload represents.
Examples of tests using synthetic workloads are InfoWorld's unpublished LAN test suite, the LANQuest Application Benchmark (L.A.B.) test suite, and ZD Lab's LANtest.
Applying the Three Laws of System Testing, synthetic workloads do saturate hardware components unnaturally (Law 1), bottlenecks generated by the workload are not accurate (Law 2), and the test workload does not correspond in any realistic manner to a production workload (Law 3). Thus synthetic tests are not suitable for testing complete systems.
Synthetic test workloads developed with real applications and real data may be an improvement over some artificial test workloads because their instruction mix and sequencing are more accurate. The reading and writing of a file 50 times, for instance, is a more accurate representation of file service activity than 10,000 512-byte read requests. The read-write process includes file open and close requests, a file search request, and a mixture of reads and writes. Still, when you compare synthetic test workloads to real ones, there is a great deal lacking.
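The difference in instruction mix described above can be made concrete with a sketch of the two request streams. The generator names and the exact read-write split per cycle are illustrative assumptions; neither corresponds to an actual Novell test.

```python
def artificial_mix(count=10000, size=512):
    """Artificial workload: a run of identical raw read requests."""
    return [("read", size)] * count

def synthetic_mix(cycles=50, reads=8, writes=2, size=512):
    """Synthetic workload: each pass over the file carries the search,
    open, read/write mixture, and close a real application would issue."""
    requests = []
    for _ in range(cycles):
        requests.append(("search", 0))
        requests.append(("open", 0))
        requests += [("read", size)] * reads + [("write", size)] * writes
        requests.append(("close", 0))
    return requests

print(sorted({op for op, _ in artificial_mix()}))
# ['read']
print(sorted({op for op, _ in synthetic_mix()}))
# ['close', 'open', 'read', 'search', 'write']
```

The synthetic stream at least exercises the variety of request types a file server sees in production; the artificial stream exercises exactly one. Neither reproduces production volume or pacing.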
Artificial Test Workloads
Artificial workloads are those most often used by product developers to locate error conditions and boundary problems. The term "artificial" is appropriate because this type of workload uses no real workload components such as off-the-shelf applications or data.
Artificial workloads don't look anything like workloads produced by real users using real applications. They generate sequences of individual or mixes of instructions sequentially or randomly to load the component under test. Tests that use artificial workloads should be designed and developed carefully to meet the focused objectives of the tester. When the results are combined with workload calibrations and symptom monitoring, they make excellent component tests. But these are only component tests, and should never be used as system tests.
Novell's test suites of the recent past - IPXLOAD, PERFORM3, MASTER/TEST, DSK311 and so on - fall into this category. These tests all focus on the LAN channel of the network system under test. This direction was due to the early emphasis on the LAN channel by Novell and other industry leaders in the mid-1980s. Since then, the disk channel has become a more visible component, with a resulting increase in research and development within the industry.
IPXLOAD counts the number of transmit and receive packets per unit of time. It is intended to measure NetWare's Internetwork Packet Exchange (IPX) protocol data transfers. IPXLOAD's usefulness is confined to PE tests within the IPX protocol layers and related hardware.
PERFORM3 measures the composite throughput from server cache to the workstation by isolating the LAN channel and measuring only read operations. PERFORM3's usefulness is confined to the study of the LAN channel. Test results do not reflect performance under production workloads.
MASTER/TEST measures the composite response time for eleven types of DOS requests under an incremental load. MASTER/TEST's usefulness is confined to the study of individual DOS request response times in an artificial environment.
DSK311 loads the disk channel using brute force to bypass cache. DSK311's usefulness is confined to the study of disk channel components in isolation from the network system. It was designed by Novell to troubleshoot disk drivers during the development and testing process. In the hands of someone who doesn't understand the designer's intent, results from DSK311 can be extremely misleading. Comparisons of different systems using DSK311 are meaningless, since the results are relative to the component under test. Users who are led to believe that the differences in DSK311 results between two different drive subsystems represent meaningful information are mistaken.
Because they violate the Three Laws of System Testing, artificial test workloads should never be used to test complete systems.
The point is that all of today's component tests use artificial workloads and were never designed to be used as system tests (benchmarks). Artificial workloads make great development and quality assurance tests. Synthetic workloads are an improvement because they include some basic components of a real workload, but their results are often abused, misinterpreted, and misunderstood.
As a solution, Novell's Systems Engineering Division (SED) and Novell Research are working on PE tools that capture, model, and generate test workloads that accurately reproduce production workloads. Once these tools are available, Novell's customers will be a step ahead in their ability to design and manage NetWare-based systems more effectively.
Industry applications of a replicable, real test workload span all phases of the network lifecycle:
Product development and testing. Design and test network components using a variety of real workloads in tandem with component tests.
Procurement. Make procurement decisions by testing candidate systems with test workloads captured from the customer's environment.
Design and Optimization. Design and optimize a production system by relying more heavily on the pilot system. Make changes to the pilot system and test the optimal configurations with test workloads captured from the customer's environment.
Capacity Planning. Make capacity planning and upgrade decisions based on real workload tests rather than conjecture.
Research in this area is progressing with great success; so much so that the use - and abuse - of synthetic and artificial workloads may be reduced significantly in the near future.
More in-depth information on this subject can be found in the following books:
Ferrari, Domenico, Giuseppe Serazzi, and Alessandro Zeigner. Measurement and Tuning of Computer Systems. Englewood Cliffs, NJ: Prentice-Hall, 1983.
Fortier, Paul J., and George R. Desrochers. Modeling and Analysis of Local Area Networks. Boca Raton, FL: CRC Press, 1990.
Gray, Jim, ed. The Benchmark Handbook For Database and Transaction Processing Systems. San Mateo, CA: Morgan Kaufmann Publishers, 1991.
* Originally published in Novell AppNotes
The origin of this information may be internal or external to Novell. While Novell makes all reasonable efforts to verify this information, Novell does not make explicit or implied claims to its validity.