Readme for the TSATEST.NLM backup diagnostic tool.

(Last modified: 31May2006)

This document (10092890) is provided subject to the disclaimer at the end of this document.

goal

fact

Novell NetWare

note

TSATEST.NLM was designed to allow analysis of the behaviour of Novell's backup technology SMS. In particular, it executes backup jobs and performs statistical analysis of the behaviour of the backup APIs being used. As a consequence, it represents a useful tool for demonstrating what the SMS components are capable of delivering for many backup problems.

TSATEST does not actually save the data being backed up. While archive storage technology (e.g. tape) is often the limiting factor for backup, there are many techniques that can be used to mitigate that fact. Since Novell does not design or build archive devices, there is only a limited amount that Novell can do to make these devices faster. Novell's contribution to backup performance is therefore to make the SMS components deliver sufficient performance to be useful in real backups. Accordingly, TSATEST demonstrates what could be delivered from the SMS components if an infinitely fast, zero latency tape drive were being used. As unrepresentative as this sounds, it is actually the best way to test the SMS components since it makes them entirely the bottleneck. Consequently, if they deliver better performance via TSATEST than you get from your current backup system then the SMS components are unlikely to be the cause of the problem.

As a result of this design TSATEST can be useful in comparing SMS capabilities with actual backup performance. While it is likely that no commercial backup application could extract all that the SMS components can offer, backup applications should be able to extract most of it. If a comparison shows that real world backup performance is significantly less than the performance reported by TSATEST then the SMS components are spending long periods idle, waiting for the backup application to make requests of them. This will erode overall performance and it is outwith the control of Novell. It should be noted, however, that SMS capabilities that are lower than tape capabilities will result in aggregate performance that is lower still: the tape stalls and shoe-shines, and the time spent waiting on the tape drive effectively leaves the SMS components idle. However, no PC based operating system has a file system capable of keeping modern archive devices busy from a single data stream, and the fact that the tape is busy should not prevent the backup application from continuing to pre-read data from SMS in advance of the archive device's ability to consume it. Thus, even when SMS performance, as demonstrated by TSATEST, is below that of a particular tape drive, intelligent backup applications should have no trouble mitigating that problem. Even in such cases, TSATEST provides a reasonable basis for comparison that can be used to determine if the SMS components are a causal factor in poor backup performance.

TSATEST delivers a relatively complex analysis of backup behaviour and some effort is required to use it for complete diagnosis of a problem. The remainder of this document explains all the configuration options and how to interpret the output.

USAGE:

TSATEST is a NetWare Loadable Module (NLM) that can be launched from the server's System Console screen. All configuration is performed using command line parameters and there is a large number of supported parameters. This can make using TSATEST somewhat complicated but typical command lines only use a small subset of the supported parameters.

Because all configuration is performed at the command line and backup requires a username and password, the password will be displayed in clear text on the server console, and the command history could be used later to read back the password. TSATEST must only be used in circumstances where such a feature will not compromise overall network security.

In the following discussion, items enclosed in square brackets [ ] are optional. Items in boldface should be typed as written and items in italics represent text that should be replaced with something appropriate for your own environment, e.g. a volume name, a server name, etc.

To load TSATEST type the following at the server console prompt:

[LOAD ]TSATEST [Parameters as needed]

Parameters:

/S=Server	Specify a server to backup. This is intended for backing up data on a remote server. If this argument is not included then the local server on which TSATEST is loaded is assumed.
/V=Volume/Resource	Specify a resource to backup on the specified or implied server. For most cases this will be a volume, in which case the argument must include the colon, e.g. /V=VOL1: If this argument is not included then the SYS: volume is assumed. There are some special case values that can be specified: /V=ALL performs a backup of all volumes on the server; /V=NDS performs a full tree backup via the target server (TSANDS must be loaded for this option to function).
/PATH=Path	Allows backup to be performed on only a sub-tree of the specified resource. For example, using /V=SYS: /PATH=\System will cause the backup to be performed only for the SYS:\System directory and its sub-directories. The default for this parameter is no path, i.e. from the root down.
/B=BufferSize	Specifies a buffer size to be used for read operations. By default 65536 bytes is used and this is in line with what other backup applications would use but other values can be tested to see if they have an effect.
/U=Username	Specifies the username to be used for authentication with the TSA. The name must be a dot preceded, typeless fully qualified distinguished name. For example: .admin.myorg would be in the correct format but .CN=admin.O=myorg would not.
/P=Password	Specifies the password to be used when authenticating with the TSA. NB: This will appear in clear text on the server console and be visible to anyone with access to the server console either locally or remotely.
/LOG[=LogFile]	Create a log of statistical data gathered during job execution. If the optional argument is not included then the log file is written to SYS:\ETC\TSATEST.LOG on the server on which TSATEST was loaded. Any existing log file is over-written.
/I=Iterations	Specifies a number of times to execute the specified job in succession, e.g. /I=10 will cause the specified job to execute 10 times. This option can be useful when used with the /LOG option, but care should be taken to ensure that the job does not fit in server cache memory. If it does, the first execution shows realistic performance and the remaining iterations show the much faster performance of the file system cache.
/SIZE=DataSetSize	Specifies the total size of all data that will be backed up. This is only useful when used with the /PRES option because it only enables a progress bar that can only be displayed in /PRES mode.
/PRES	Enable "presentation" mode. The rolling log is not displayed, the aggregate performance is displayed in large characters and, if the /SIZE= option is used a progress bar is displayed.
/SHOWNAMES	Display filenames while enumerating the job.
/C=ScanTypeNumber	Specifies a value to be used in the scan type field of the job structure when creating a job. This option should only be used after referring to the SMS NDK documentation for appropriate values.
/ERR=ErrLogFile	Specifies a filename in which to list all errors reported during the backup. Any existing file is over-written.
/AGG	Aggregate statistical data across multiple iterations.
/FULLLOG	Cause the rolling log display to report the result of all operations rather than informational and error messages only.
/G=GrowAmount	Specifies a method for growing the read buffer size on each iteration. Two syntaxes are supported. If a number is specified then the buffer grows by the specified number of bytes on each iteration. If an X followed by a number is specified, e.g. /G=X2, then the buffer is multiplied by the specified number on each iteration.
/MS	Use millisecond rather than 0.1ms resolution timing. Only useful for comparing data gathered by other programs if they support millisecond timing.
/CLUSTER	Backup the cluster file system TSA rather than the standard file system TSA. This allows access to TSA resources exposed only via the cluster target service.
/NOWAITONEXIT	Causes TSATEST to unload when the job(s) specified by the other arguments are complete. Without this argument the NLM waits for a key to be pressed before unloading (thus permitting the review of the statistics on screen). The default behaviour makes it difficult to use TSATEST in unattended scenarios so this option makes such uses easier.
/AVE[=Tolerance[,Group Length[,Filename]]]	Enables moving average analysis. This is very useful for investigating how performance varies throughout a backup job. If used without arguments (/AVE) the default behaviour is the dumping of moving average statistics to SYS:\ETC\TSATEST.AVE using a tolerance of 10% and a group length of 64 files. The tolerance defines the range across which the moving average is considered to be unchanged. The group length is the number of files in the group used to calculate the moving average. The filename permits the specification of an alternative output location to the default. NB: Any existing file is over-written. See the moving average file format notes for more details.
/?	Display a help page with many of the above arguments outlined.
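
As a rough illustration of how /B=, /I= and /G= combine, the following Python sketch lists the read buffer size that each iteration would use. It assumes the first iteration uses the /B= value unchanged and that growth is applied before each later iteration. The function name and that ordering are assumptions made for illustration and are not taken from TSATEST itself.

```python
def buffer_sizes(start, iterations, grow):
    """Return the read buffer size used on each iteration.

    `grow` follows the two /G= syntaxes: "4096" adds that many bytes,
    "X2" multiplies by that factor. Assumes (hypothetically) that the
    first iteration uses `start` as given.
    """
    multiply = grow.upper().startswith("X")
    amount = int(grow[1:] if multiply else grow)
    sizes = []
    size = start
    for _ in range(iterations):
        sizes.append(size)
        size = size * amount if multiply else size + amount
    return sizes


print(buffer_sizes(65536, 4, "X2"))    # models /B=65536 /I=4 /G=X2
print(buffer_sizes(65536, 3, "4096"))  # models /B=65536 /I=3 /G=4096
```

A list like this can help when planning a /LOG run, since each logged row then corresponds to a known buffer size.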

Examples:

TSATEST /U=.admin.myorg /P=unsecure

Backup the SYS: volume using the supplied credentials

TSATEST /V=VOL1: /U=.admin.myorg /P=unsecure

Backup the VOL1: volume using the supplied credentials

TSATEST /B=131072 /U=.admin.myorg /P=unsecure

Backup the SYS: volume using the supplied credentials and a buffer size of 131072 bytes.

TSATEST /S=Serv2 /V=ABC: /U=.admin.myorg /P=unsecure

Backup the ABC: volume on server Serv2 using the supplied credentials. This will perform a remote backup if Serv2 is not the server on which TSATEST is loaded.

TSATEST DISPLAY:

TSATEST divides the display vertically into two regions. The top half contains the statistics for the current job. The bottom half contains a rolling log of status information or, if /PRES is used, the effective data rate and possibly a progress bar (if /SIZE= is also used). The log entries are somewhat self-explanatory but can be output faster than they can be read. If the output concerns you then use the /ERR= option to write the rolling log output to file.

The statistical information is displayed in two columns. In order to describe them it is easiest to refer to them by a co-ordinate. Thus, the first (i.e. on the highest displayed line) statistic on the left column is at Left-1, the second statistic on the right column at Right-2, etc.

Co-ordinate	Statistic	Description

Left-1	Read Count	The number of times the SMS read API has been used in performing the job. This can be useful with the Total Bytes Read statistic to calculate the mean read size.
Left-2	Last Read Size	The size in bytes of the last read performed. This effectively only usefully shows the size of the read of the tail block for a given file.
Left-3	Total Bytes Read	The number of bytes of data that have been read in performing the job.
Left-4	Raw Data MB/min	The rate at which data is being supplied if only the SMS read APIs are accounted for, i.e. the overheads of scan, open and close are ignored. This is useful if compared with the Effective MB/min statistic in that the difference is entirely accounted for by the overhead of using the other three APIs.
Left-5	Backup Sets	The number of objects that have been backed up. For file system backups an object is a file or directory. This can be used to calculate the mean "file" size by dividing it into the Total Bytes Read.
Left-6	Ave. Open Time	The mean execution time of the SMS open API. This API must be called once per object in the job in order to be able to read the data for that object.
Left-7	Ave. Close Time	The mean execution time of the SMS close API. This API must be used once per object in the job that was successfully opened.
Left-8	Effective MB/min	The data rate that the SMS components are able to deliver for the job as a whole. This is the statistic most useful in comparing with data rates from your actual backup system.
Right-1	Min. Read Time	The minimum time measured for execution of a SMS read API.
Right-2	Last. Read Time	The time measured for execution of the last SMS read API.
Right-3	Max. Read Time	The maximum time measured for execution of a SMS read API.
Right-4	Ave. Read Time	The mean time for all measured executions of the SMS read API.
Right-5	Ave. Scan Time	The mean time for execution of the SMS scan API. This API is used to parse through the objects to be backed up.
Right-6	Total Read Time	The total time spent performing reads. This is the time used to calculate the Raw Data MB/min statistic.
Right-7	Elapsed Time	The time during which the job has been running. This incorporates the overhead of displaying the screen output and is not, therefore, used for either of the data rate statistics on the grounds that the screen output of TSATEST is not representative and not necessary for normal backup tasks so is irrelevant overhead in terms of statistical analysis.
Right-8	Total TSA Time	The total time spent in SMS APIs. This is effectively the time it took to perform the backup and is the time used to calculate the Effective MB/min.
Right-9	Max. Error	The timer resolution is 0.1ms which means that any timed operation actually took anywhere between the reported time and the next 0.1ms interval. Thus the reported results are in error by a maximum of 0.1ms for each timed operation. This statistic calculates how the time would have differed had the results all been 0.1ms longer then expresses the resultant Effective MB/min as a percentage difference from the displayed result. Typically the error is only significant if the mean execution time for all APIs (especially read) is very small.
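
The relationships between these statistics can be sketched in a few lines of Python. This is only an illustration of the descriptions above, not TSATEST's own code: the function names, the definition of 1 MB as 1,048,576 bytes and the sample figures are all assumptions.

```python
MB = 1024 * 1024  # assumption: TSATEST may define a megabyte differently


def mb_per_min(total_bytes, seconds):
    """Convert a byte count and a duration in seconds to MB/min."""
    return (total_bytes / MB) / (seconds / 60.0)


def derived_stats(total_bytes, read_count, backup_sets,
                  total_read_s, total_tsa_s, timed_ops):
    """Derive the secondary figures from the displayed statistics."""
    effective = mb_per_min(total_bytes, total_tsa_s)
    # Max. Error: assume every timed operation really took 0.1 ms longer.
    worst_case = mb_per_min(total_bytes, total_tsa_s + timed_ops * 0.0001)
    return {
        "mean_read_size": total_bytes / read_count,
        "mean_object_size": total_bytes / backup_sets,
        "raw_mb_min": mb_per_min(total_bytes, total_read_s),
        "effective_mb_min": effective,
        "max_error_pct": 100.0 * (effective - worst_case) / effective,
    }


# Hypothetical job: 1000 MB in 16000 reads over 2000 objects, with one
# scan, open and close per object (22000 timed operations in total).
stats = derived_stats(1000 * MB, 16000, 2000,
                      total_read_s=50.0, total_tsa_s=60.0, timed_ops=22000)
for name, value in stats.items():
    print(name, round(value, 2))
```

The gap between raw_mb_min and effective_mb_min in the output is the scan, open and close overhead described for Left-4.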

Log File Format:

The log file produced by the /LOG argument is a comma-delimited text file that can be loaded into almost any spreadsheet application. The first line of the log file has the column headings for each row's data entries. The first columns map to the prime statistics described above. They are followed by a set of histogram data entries that require independent explanation. The prime statistics are as follows:

Column Name	Description

Volume	The resource name for which backup statistics are provided on this row. NB: Does not include the value of any /PATH= argument.
Bytes Read	The total number of data bytes in the job
Reads	The number of times the SMS read API was used to retrieve data.
Data Sets	The number of objects that were backed up
Mean Scan Time	The mean execution time for the SMS scan API.
Mean Open Time	The mean execution time for the SMS open API
Mean Close Time	The mean execution time for the SMS close API
Mean Read Time	The mean execution time for the SMS read API
Min Read Time	The minimum time recorded for execution of the SMS read API
Max Read Time	The maximum time recorded for execution of the SMS read API
Total Read Time	The total time spent performing reads
TSA Time	The total time spent in SMS APIs
Elapsed Time	The total time (including screen I/O overhead) during which the job was executing
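
For those who prefer a script to a spreadsheet, the prime statistics can be read with any CSV parser. The sketch below uses Python's csv module to compute the mean read size and mean object size per row. It assumes the heading text in the file matches the column names listed above exactly. The sample data is invented and omits the timing and histogram columns.

```python
import csv
import io

# Hypothetical log excerpt; a real log has further columns after these.
SAMPLE_LOG = """Volume,Bytes Read,Reads,Data Sets
SYS:,1048576000,16000,2000
VOL1:,2097152000,32000,1000
"""


def mean_sizes(log_file):
    """Return (volume, mean read size, mean object size) for each row."""
    results = []
    for row in csv.DictReader(log_file):
        total = int(row["Bytes Read"])
        results.append((row["Volume"],
                        total / int(row["Reads"]),
                        total / int(row["Data Sets"])))
    return results


for volume, read_size, object_size in mean_sizes(io.StringIO(SAMPLE_LOG)):
    print(volume, read_size, object_size)
```

To use it on a real log, pass an open file object instead of the io.StringIO wrapper. Columns that the function does not name are ignored.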

The remaining statistics are the histogram "buckets" for the primary SMS APIs: scan, open, read and close. Each histogram has 32 buckets, each representing a set of time values for execution of the API it represents. The value in each bucket is the number of executions of that API that fell within the time interval represented by that bucket. The time intervals are on a semi-binary, semi-exponential scale such that most buckets cover a time interval twice as long as the previous bucket for that API. Some buckets are unused in order to minimise bucket selection time. The precise time interval for each bucket is as follows:

Bucket Number	Start Time	End Time

0	>= 0.0 ms	< 0.1 ms
1	>= 0.1 ms	< 0.2 ms
2	>= 0.2 ms	< 0.3 ms
3	>= 0.3 ms	< 0.4 ms
4	>= 0.4 ms	< 0.8 ms
5	unused
6	>= 0.8 ms	< 1.6 ms
7	unused
8	>= 1.6 ms	< 3.2 ms
9	unused
10	>= 3.2 ms	< 6.4 ms
11	unused
12	>= 6.4 ms	< 12.8 ms
13	unused
14	>= 12.8 ms	< 25.6 ms
15	unused
16	>= 25.6 ms	< 51.2 ms
17	unused
18	>= 51.2 ms	< 102.4 ms
19	unused
20	>= 102.4 ms	< 204.8 ms
21	unused
22	>= 204.8 ms	< 409.6 ms
23	unused
24	>= 409.6 ms	< 819.2 ms
25	unused
26	>= 819.2 ms	< 1638.4 ms
27	unused
28	>= 1638.4 ms	< 3276.8 ms
29	unused
30	>= 3276.8 ms	< 6553.6 ms
31	>= 6553.6 ms

The bucket selection method is still being tweaked and is likely to change in a future release. Its unusual nature is designed to allow bucket selection in the fastest possible time.
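
One way to reproduce the table above is to work in 0.1ms timer ticks and use the position of the highest set bit, which would explain both the doubling intervals and the unused odd-numbered buckets. The Python sketch below is a reconstruction that agrees with the table. It is an assumption about the approach, not the actual TSATEST selection code, and the function name is hypothetical.

```python
def bucket_for(ticks):
    """Map a duration in 0.1 ms timer ticks to a histogram bucket number."""
    if ticks < 4:
        return ticks      # buckets 0-3 are each one tick (0.1 ms) wide
    if ticks >= 65536:
        return 31         # 6553.6 ms and above
    # From 0.4 ms upwards each used bucket spans one power of two, so the
    # bucket number is twice the index of the highest set bit.
    return 2 * (ticks.bit_length() - 1)


for ticks in (0, 3, 4, 7, 8, 16, 65535, 65536):
    print(ticks / 10.0, "ms ->", bucket_for(ticks))
```

Working in whole ticks avoids floating point rounding at the bucket boundaries.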

Moving Average File Format:

The moving average file produced by the /AVE argument is also a comma-delimited text file that can be loaded into almost any spreadsheet application. The first line of the file has the column headings for each row's data entries. The file contains a variable number of rows of data. As the job is executed the average performance for the most recent n files is tracked (where n is the group length). As each new file is added to the tracking one is dropped from tracking. If the performance of the new group is within the tolerance a counter is incremented. If the performance falls outside the tolerance then the moving average statistics to that point are dumped to the file and the statistics are reset using the new performance as the baseline (and, thus, with a new range of values representing the tolerance).

For example, assuming the default values of 10% tolerance and a 64 file group length, let's say that the average performance for the first 64 files is 1000MB/min. The range of tolerance values is 900MB/min through 1100MB/min. The next file is read, the statistics for the very first file are dropped and the latest file's statistics replace them. Maybe the average performance is now 1001MB/min. This is still within the tolerance so the sample count is incremented and nothing else happens. On the next pass the second file recorded is dropped and its data replaced with that for the next file. Maybe the average performance drops to 850MB/min. Now we dump the statistics for the data so far and reset with a new baseline average performance of 850MB/min and a sample count of 1. The new tolerance range is 765MB/min through 935MB/min (10%).

Three statistical values are dumped to the file:

Column Name	Description

Average MB/min	The average data rate in MB/min for the sample set this row represents
Files	The number of files over which the moving average remained within the tolerance around the Average MB/min performance in the first column. Note that if you sum the Files column you will find the result to be higher than the number of data sets from the primary job statistics. This is because this set of statistics is for a moving average, so some files will appear in more than one row of the moving average output.
Mean File Size	The mean file size for the files in the sample this row represents
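
The moving average procedure described above can be sketched as follows. This Python version is an interpretation of the description, not the TSATEST implementation. It assumes the group rate is total bytes over total time for the files in the group, that 1 MB is 1,048,576 bytes, and that the number of files a row spans is the in-tolerance count plus the group length minus one. The Mean File Size column is left out for brevity.

```python
from collections import deque

MB = 1024 * 1024  # assumption: 1 MB = 1,048,576 bytes


def moving_average_rows(files, group_length=64, tolerance=0.10):
    """files: (size_bytes, seconds) per file, in backup order.

    Returns rows of (baseline_mb_min, groups_in_tolerance, files_spanned).
    """
    window = deque(maxlen=group_length)  # the oldest file drops out itself
    rows = []
    baseline = None
    samples = 0

    def dump():
        # Assumption: consecutive groups overlap, so `samples` groups
        # span samples + group_length - 1 distinct files.
        rows.append((baseline, samples, samples + group_length - 1))

    for entry in files:
        window.append(entry)
        if len(window) < group_length:
            continue  # not enough files yet for a first average
        total_bytes = sum(size for size, _ in window)
        total_seconds = sum(seconds for _, seconds in window)
        rate = (total_bytes / MB) / (total_seconds / 60.0)
        if baseline is None:
            baseline, samples = rate, 1
        elif abs(rate - baseline) <= tolerance * baseline:
            samples += 1
        else:
            dump()  # outside the tolerance: write a row and start again
            baseline, samples = rate, 1
    if baseline is not None:
        dump()
    return rows


# Hypothetical job: four fast 1 MB files followed by three slow ones.
job = [(MB, 0.06)] * 4 + [(MB, 0.12)] * 3
for row in moving_average_rows(job, group_length=2):
    print(row)
```

With this counting the files_spanned values add up to more than the number of files in the job, which is in line with the note on the Files column.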

Interpreting TSATEST results:

The most important thing to do when interpreting TSATEST output is to avoid over-reacting or jumping to conclusions. It is very unlikely that any commercial backup engine will be capable of extracting everything from SMS that TSATEST can. However, a commercial application should get 80% or higher without too much difficulty. Where the tape drive is faster than TSATEST reports for the SMS components, a commercial backup application should be able to deliver more than TSATEST by using multiple streams to tape (multiplexing). Modern tape devices are so fast that this is necessary for any backup application to hope to keep up.

The second most important thing to realise is that care should be taken when apportioning blame for poor performance. Clearly a 15 disk RAID-0 system is going to deliver orders of magnitude greater throughput than a single disk IDE system. A small mean file size makes the scan, open and close a more significant proportion of the total time spent on each file because the read time gets smaller, and that is a problem for any file system. By the same token it is important not to use TSATEST to abuse your backup vendor. The vendor may sell many backup products for NetWare, some aimed at the enterprise, some at the workgroup. Using a workgroup solution for enterprise storage will produce poor results.

One simple thing worth considering is that a PCI bus has limits as well. Assuming you have 32 bit, 33MHz storage controllers and one PCI bus in your server then the maximum throughput you can reasonably expect is about 2.9GB/min, half the PCI bus capacity. This is because you have to suck the data from disk across the PCI bus then spit it back out to tape on the same PCI bus. The theoretical best performance is where each part of the operation uses exactly 50% of the bus capacity. So, if you see 5.3GB/min (as the author has) from TSATEST on a NetWare server then you know that SMS can go as fast as the hardware is capable of.
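
For reference, the theoretical peak of a 32 bit, 33MHz PCI bus is one 4 byte transfer per clock. The Python sketch below works that out in GB/min and compares half of it with the 2.9GB/min quoted above. It assumes decimal gigabytes. Reading the gap as a difference between theoretical peak and sustained throughput is an assumption rather than something the text states.

```python
BUS_WIDTH_BYTES = 4        # 32 bit PCI
BUS_CLOCK_HZ = 33_000_000  # 33MHz, as quoted in the text


def peak_gb_per_min(width_bytes=BUS_WIDTH_BYTES, clock_hz=BUS_CLOCK_HZ):
    """Theoretical peak transfer rate in decimal GB per minute."""
    return width_bytes * clock_hz * 60 / 1e9


peak = peak_gb_per_min()
per_direction = peak / 2  # disk reads and tape writes share the one bus
# Fraction of the theoretical figure that the quoted 2.9GB/min represents.
implied_fraction = 2.9 / per_direction

print(round(peak, 2), round(per_direction, 2), round(implied_fraction, 2))
```

The same function can be called with other widths and clock rates, for example peak_gb_per_min(8, 66_000_000) for a 64 bit, 66MHz bus.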

A critical issue to consider is the extent to which file system caching affects performance. It is possible to get extraordinary results from TSATEST due to file system caching. NB: The 5.3GB/min mentioned above was not due to caching; that was actual disk throughput. The author has seen nearly 7GB/min due to caching. The best way to prevent caching from being a factor is to define a backup job that is larger than the installed memory used for cache on the server. So, for a server with 1GB of memory, test with a backup job larger than 1GB. NB: NetWare currently only uses up to 4GB for file caching even if the server has more than 4GB of installed memory. This is because PCs still only have 32 bit DMA controllers even if they have 36 bit memory controllers, and file caching in non-DMA accessible memory is very slow.

Your real goal with TSATEST should be to analyse the behaviour of your backups and to use that analysis to identify the areas where your backup system is inadequate with a view to replacing or repairing them in time. A single pass of TSATEST over a given SYS: volume is not sufficient analysis to make such judgements and care should be taken to perform a widespread analysis.

Trusting TSATEST?

TSATEST was designed to make it possible to correct performance problems in SMS. If it was not trustworthy then it would be of little use in achieving that goal. The Max. Error statistic was added to increase the level of trust in its analysis. However, there are several areas that may be of concern.

If it concerns you that the Elapsed Time is not used for the Effective MB/min calculation then compare the Elapsed Time with the Total TSA Time and consider what the difference would be if the screen I/O was included in the Effective MB/min. The screen I/O time rarely exceeds a second or two per minute and is frequently much lower. So, even if it was incorporated, the effect would be very small.

If it concerns you that there is no tape I/O time included in the TSATEST output then consider this from a different perspective. If TSATEST reports 700 MB/min and your existing backup solution only delivers 300 MB/min from a tape drive capable of 1200 MB/min then how can it be that SMS is the cause? SMS would happily deliver 700 MB/min according to TSATEST. This leaves only one inescapable conclusion: the backup application is allowing poor tape utilisation to propagate back to the server, causing the SMS utilisation to be lower than it is capable of. In such cases it is actually possible that buying a slower tape drive would make backup faster! However, the proper solution would be to use backup software capable of aggregation of multiple data streams to tape.
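
The comparison in the example above can be written down as a simple check. The Python sketch below uses the 80% figure from the interpretation notes as its threshold. The function name and the exact wording of the verdict are illustrative assumptions, and the result should be treated as a hint rather than a diagnosis.

```python
def sms_may_be_limiting(actual_mb_min, tsatest_mb_min, threshold=0.8):
    """Return True if the real backup already achieves most of the rate
    that TSATEST reports, in which case the SMS components may be the
    limit. False suggests they are spending long periods idle."""
    return actual_mb_min >= threshold * tsatest_mb_min


# The example from the text: TSATEST reports 700 MB/min, backup gets 300.
print(sms_may_be_limiting(300, 700))
# A hypothetical backup that gets 650 MB/min from the same server.
print(sms_may_be_limiting(650, 700))
```

For the 300 MB/min case the check returns False, matching the conclusion that the cause lies outside the SMS components.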

If you are concerned about TSATEST reliability then you should invest the time to study its behaviour using test servers. In general, software like TSATEST should not be used in production server environments at all. However, it should be noted that TSATEST is used to stress test SMS components and it has to be reliable for that to be effective.

Other uses for TSATEST

If you are suffering from reliability problems with backup and are concerned that the SMS components are the cause then TSATEST can be used to quickly eliminate the SMS components. Simply use TSATEST with parameters that mimic the failing job and, if it works without failure, then SMS is unlikely to be the cause. This type of test is most useful for cases such as servers abending during backup, but it is only useful in eliminating the SMS components. The abend could be related to the tape I/O, for example. To eliminate the tape I/O path SBCON can be used instead.

In the author's experience backup performance is often proportional to the mean file size. This would be an expected result in most file systems on most platforms. The reason is fairly simple: there are fixed overheads associated with scanning the file system for files to read and with opening and closing those files, and there is a fixed overhead in performing any read. Therefore, there is a relatively constant overhead per file. Now, assume we read a 1 byte file. That fixed overhead is distributed across only one byte. If the file was 1000 bytes then the constant overhead would be distributed across a thousand bytes. To convince yourself of this, simply use the average option in TSATEST (/AVE), load the output file into a spreadsheet application and create a graph using the Average MB/min column and the Mean File Size column. You may need to use two y-axis scales since the mean file size is likely to have a significantly larger range than the performance scale. In the author's experience the shape of the performance graph tracks the shape of the mean file size graph, thus illustrating the original point. This makes the /AVE option very useful for cases where, for example, you have two identical systems that have drastically different backup performance.
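
The fixed overhead argument can be illustrated with a very simple model in which each file costs a constant overhead plus a transfer time proportional to its size. The Python sketch below uses invented figures (2ms of overhead per file and 50MB/s of raw bandwidth) purely for illustration. They are not measurements of any NetWare system.

```python
def throughput(file_size, overhead_s=0.002, bandwidth=50e6):
    """Bytes per second achieved for one file under a fixed-overhead model.

    overhead_s: hypothetical scan + open + close + per-read cost in seconds
    bandwidth:  hypothetical raw transfer rate in bytes per second
    """
    return file_size / (overhead_s + file_size / bandwidth)


for size in (1, 1000, 1_000_000, 100_000_000):
    print(size, round(throughput(size)))
```

In this model throughput grows almost in proportion to file size while files are small and levels off towards the raw bandwidth for large files, which is the shape that the /AVE graphs described above tend to show.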

Conclusion

A document of this size and a tool as simple as TSATEST cannot hope to provide sufficient detail to resolve every conceivable backup problem. However, TSATEST adds another weapon to your arsenal and it covers a wide variety of circumstances such that it should assist in diagnosing many current and future backup problems. At the end of the day, though, there is no better tool than your own experience.

fix

TSATEST.NLM can be found in the SMSSRVR.ZIP file located in the \PRODUCTS\SMS\ directory on the current NetWare 6.5 and NetWare 6.0 support pack CDs and images. It will be included in future releases.

document

Document Title:	Readme for the TSATEST.NLM backup diagnostic tool.
Document ID:	10092890
Solution ID:	NOVL96997
Creation Date:	14May2004
Modified Date:	31May2006
Novell Product Class:	Netware

disclaimer

The Origin of this information may be internal or external to Novell. Novell makes all reasonable efforts to verify this information. However, the information provided in this document is for your information only. Novell makes no explicit or implied claims to the validity of this information.
Any trademarks referenced in this document are the property of their respective owners. Consult your product manuals for complete trademark information.