NWSM: (AV) -- The average number of working sets in memory during the
interval. If this number is significantly larger than NRUN or
NBAL, then working sets are not being forced out of memory when
processes go into a wait state (like terminal input wait) and
consequently response times should not be greatly affected by
paging.
BSWT: (AV) -- The average number of processes in the Balance Set that
are waiting for the completion of some event. Normally this
number reflects the number of processes waiting for a page to be
read in from the disk. If NBAL - BSWT is less than one, then
there are not enough runnable processes in memory to keep the CPU
busy 100% of the time. In this case, SWPW or FILW will be
significant.
DSKR: (%) -- Percentage of the processes in Balance Set Wait (BSWT)
that are waiting for a file page to be read into memory.
DSKW: (%) -- Percentage of the processes in Balance Set Wait (BSWT)
that are waiting for file pages to be written back to the disk.
SWPR: (%) -- Percentage of the processes in Balance Set Wait (BSWT)
that are waiting for a page to be swapped into memory from the
swapping area of the disk.
NLOD: (PM) -- The average number of working sets loaded into memory
per minute.
CTXS: (PS) -- The average number of context switches performed by the
scheduler per second. A context switch is made whenever the
scheduler decides to stop running one process and start running
another process. This happens when the running process
voluntarily blocks, or it faults on a page that is not in memory,
or when a higher priority process is ready to run. Since it
takes CPU time to perform a context switch, CTXS directly affects
SKED.
UPGS: (AV) -- The average number of pages assigned to processes with
loaded working sets. These processes may or may not be in the
Balance Set, but they are allocated memory.
FPGS: (AV) -- The average number of physical memory pages that are
currently available for swapping in user processes. The monitor
normally keeps between 20 and 100 free pages. The monitor uses
these pages (and the rest of memory not in use by balance set
processes) as a page cache. For example, if a process reenters
the balance set after waking up from a blocked state and it still
WATCH Page 12
has some of its pages in memory in the free page pool, then those
pages are used directly without requiring any disk I/O. It has
been demonstrated that this cache plays an important part in
overall system performance. Therefore, if FPGS is very small,
the system performance has most likely been degraded.
DMRD: (PS) -- The number of reads per second made to the swapping
area.
DMWR: (PS) -- The number of writes per second made to the swapping
area.
DKRD: (PS) -- The number of reads per second made to the file system.
DKWR: (PS) -- The number of writes per second made to the file
system.
TTIN: (PS) -- The number of terminal input characters received per
second from all terminals on the system. This includes real and
pseudo-terminals as well as echoed characters.
TTOU: (PS) -- The number of terminal characters output per second by
all jobs on the system. This includes real and pseudo terminals.
WAKE: (PS) -- The number of process wakeups per second. Some of the
types of wakeups that fall into this category are:
IPCF
ENQ
Terminal Input
Terminal Output
Process Termination
DISMS
TIMER
IIC
TTCC: (PS) -- The number of terminal interrupt characters (e.g.
control-c) typed per second.
TDIO: (PS) -- The aggregate number of disk pages read or written per
second to both the file system area and to the swapping area.
Normally 60 pages per second for a one channel and 100 pages per
second for a two channel system are saturation levels. This
variable is the summation of DMRD, DMWR, DKRD and DKWR.
RPQS: (PS) -- The average number of pages per second which were
retrieved from the replaceable queue in order to satisfy page
faults. Page faults satisfied in this way do not require disk
I/O.
GCCW: (PS) -- The average number of pages per second which were freed
by global garbage collections. See TCOR and NCOR for more
information.
XGCW: (PS) -- The average number of pages per second which were freed
by local garbage collections on specific processes. These
garbage collections remove pages from a process's working set
which have not been used in a long time.
KNOB: (value) -- This is the setting of the "bias control" knob. The
settings of this knob range from 1 to 20 and provide a mechanism
for the administrator to favor interactive or computational
users. Not all 20 settings are implemented (only 5 are unique in
release 4 of TOPS-20). However, a setting of 1 will cause the
scheduler to heavily favor interactive jobs, a setting of 20 will
cause the scheduler to heavily favor computational jobs, and a
setting of 11 (the normal default) will provide a balanced
approach.
QUEUE DISTRIBUTION PERCENTAGE
Each active process is placed by the scheduler in one of 6 queues
depending on the process's recent history. The values in this
display represent the portion of USED time allocated to processes
which were in the respective scheduling queues.
The first queue is only used by Job 0 and by jobs in the special
high priority category. Normally the percentage of run time
accumulated in this queue is small.
The second and third queues are the interactive queues. If the
sum of these two values is high, then there is a high interactive
load on the system.
The last three queues are the computational queues. Processes
only move onto these queues if they have entered a compute bound
phase. If the sum of these three values is high, then the system
load is primarily computational.
When the class scheduler is turned on, interactive users are
scheduled in queue order while processes in the lower three
(computational) queues are given priority on the basis of their
class's distance from its target share.
4.3 The Load Averages Section
The term "Load Average" refers to the average number of processes
simultaneously demanding service over some interval of time. The load
average display looks like the following:
LOAD AVERAGES: 5.29 4.06 3.39
HIGH QUEUE AVERAGES: 3.76 2.86 2.25
LOW QUEUE AVERAGES: 1.54 1.20 1.14
CLASS LOAD AVERAGES
CLA SHR UTIL
0 80.00 0.00 0.00 0.00 0.00
1 15.00 0.00 0.00 0.00 0.00
2 5.00 0.00 0.00 0.00 0.00
LOAD AVERAGES
The system keeps three exponential load averages. These values
represent the average load over the last 1 minute, the last 5
minutes, and the last 15 minutes. These numbers can be used to
estimate the expected elongation of the elapsed time required to
run a program. If the system load average equals X, then the
approximate elapsed time required to run an additional program on
the system is at least (1+X)*Y, where Y is the stand alone
elapsed time required to run this program. If the system is
swapping heavily, it may require a great deal more time.
By comparing the three load averages, it can be determined
whether the system load is rising or falling.
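The (1+X)*Y estimate and the rising/falling comparison can be sketched as follows. This is only an illustration: the function names are hypothetical, and the rule used to classify the trend (strictly increasing or decreasing across the three averages) is an assumption, since the text only says that the three values are compared.

```python
def estimated_elapsed(load_average: float, standalone_seconds: float) -> float:
    """Lower bound on elapsed time from the (1+X)*Y rule in the text."""
    return (1.0 + load_average) * standalone_seconds


def load_trend(one_min: float, five_min: float, fifteen_min: float) -> str:
    """Classify the trend (assumed rule: strict ordering of the three averages)."""
    if one_min > five_min > fifteen_min:
        return "rising"
    if one_min < five_min < fifteen_min:
        return "falling"
    return "mixed"


# Values from the sample LOAD AVERAGES line: 5.29 4.06 3.39
# The 60-second stand-alone time is a made-up figure for illustration.
print(estimated_elapsed(5.29, 60.0))
print(load_trend(5.29, 4.06, 3.39))
```

In the sample display the 1 minute average is the largest of the three, so the load is classified as rising.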
HIGH QUEUE AVERAGES
These values are the components of each of the load average
values that are attributable to interactive jobs.
LOW QUEUE AVERAGES
These values are the components of each of the load average
values that are attributable to computational jobs. The sum of
the high queue average and the low queue average equals the load
average.
CLASS LOAD AVERAGES
CLA SHR UTIL
When class scheduling is being utilized, this section will be of
interest. The following information is presented for each class
defined for the system.
1. CLASS NUMBER -- Under the column CLA, the class number is
displayed.
2. SHARE -- For each class, its share of the processor is
displayed under the column SHR. This share corresponds to
the percentage of the CPU which the monitor will try to
distribute among the jobs in this class.
3. UTILIZATION -- For each class, the utilization of the CPU
actually achieved is displayed under the UTIL column. The
value here should be less than the share unless the class
received "windfall" because other classes did not use their
entire share.
4. Load Averages -- The 1 minute, 5 minute, and 15 minute load
averages are presented for each of the classes. These load
averages may appear to be very large because they are
computed as follows:
CLASS LOAD AVERAGE = (NUMBER OF PROCESSES IN CLASS MAKING
SIMULTANEOUS DEMANDS) / MAXIMUM (SHARE, UTILIZATION)
Thus if there are 5 processes making simultaneous demands in
a class with a share of 20% (which has achieved 15%
utilization), then that class's load average is 25 (5/.20).
Thus a user should expect a compute bound task to take at
least 25 times as long as it would on a stand alone system
with 100% of the CPU.
If the class scheduler is not running then the utilization and
load averages will all be zero.
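The class load average computation described under item 4 can be sketched as follows. The function name and the use of percentages as inputs are assumptions made for illustration; WATCH itself may compute the value differently in detail.

```python
def class_load_average(demanding_processes: int,
                       share_pct: float,
                       utilization_pct: float) -> float:
    """Processes making simultaneous demands / MAXIMUM(SHARE, UTILIZATION).

    Share and utilization are given as percentages of the CPU.
    """
    fraction = max(share_pct, utilization_pct) / 100.0
    if fraction <= 0:
        raise ValueError("share and utilization are both zero")
    return demanding_processes / fraction


# Example from the text: 5 processes, 20% share, 15% utilization.
print(class_load_average(5, 20.0, 15.0))
```

With the values from the text the result is 25, so a compute bound task in that class should be expected to take at least 25 times its stand alone time.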
4.4 Directory Cache Statistics
The directory cache statistics are only available when an enabled user
requests "ALL" to the question "Print monitor statistics ?". There are
three pieces of information in this display which are illustrated as
follows:
Directory Cache hits: 175
Directory Cache Misses - Cache Full: 0
Directory Cache Misses - New Entry Added: 321
The directory cache mechanism in the monitor holds retrieval information
on a number of directories. Whenever a directory access is made, the
cache is first interrogated to determine if the directory is there. If
it is not, the retrieval information is "cached". If the accessed
directory is there, then the retrieval information from the cache is
used. The statistics indicate the effectiveness of this cache.
Directory Cache hits: -- This is the number of times an accessed
directory was found in the cache.
Directory Cache Misses - Cache Full: -- This is the number of times
an accessed directory was not found in the cache and all the
cache slots were filled with active directories. In this case
the most recently accessed directory cannot be put into the
cache.
Directory Cache Misses - New Entry Added: -- The number of times the
accessed directory was not found in the cache, but room was
available to add it (possibly in lieu of an inactive entry).
The "hit ratio" is computed by HITS/(HITS+MISSES) and provides a
good indication of the cache's effectiveness. For instance, the
hit ratio in the example is 175/(175+0+321) or 35% (which isn't
all that great).
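As a small worked check of HITS/(HITS+MISSES), the sketch below counts both miss counters as misses. The function and parameter names are hypothetical.

```python
def directory_cache_hit_ratio(hits: int,
                              misses_cache_full: int,
                              misses_new_entry: int) -> float:
    """HITS / (HITS + MISSES), where MISSES is the sum of both miss counters."""
    total = hits + misses_cache_full + misses_new_entry
    if total == 0:
        return 0.0  # no directory accesses were recorded
    return hits / total


# Counters from the sample display: 175 hits, 0 cache-full misses,
# 321 misses where a new entry was added.
ratio = directory_cache_hit_ratio(175, 0, 321)
print(f"{ratio:.0%}")
```

For the sample counters this is 175 out of 496 directory accesses, or about 35%.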
4.5 Normal Per-job Information
The normal per-job information is available to all users running
WATCH (not just enabled users). It consists of a line for each
job which had an active process during the interval. The statistics
reported for each job include the following.
JOB TTY USER        PROGRAM DELTA RT  %RT   JU   CSH
  0 DET OPERATOR    SYSJOB      0.01  0.2 0.00 80.00
  2 207 OPERATOR    BATCON      0.01  0.2 0.00 80.00
  3 210 OPERATOR    NETCON      1.62 44.9 0.00 80.00
  7 143 MCKIE       WATCH       0.63 17.5 0.00 80.00
 17  21 D.SCHEIFLER EXEC        0.10  2.9 0.00 80.00
 18 154 SEARS       EMACS       0.00  0.1 0.00 80.00
 20 121 ORAN        EMACS       0.09  2.5 0.00 80.00
 21  51 HARRELSON   EXEC        0.03  0.7 0.00 80.00
 22 DET OPERATOR    PERF        0.08  2.4 0.00 80.00
 23 170 HALLYBURTON VDIREC      0.30  8.4 0.00 80.00
JOB -- This is the job number assigned by the system when the user
logged in.
TTY -- This is the number of the terminal that is being used by the
user running this job. "DET" means that the job is not attached
to a controlling terminal (i.e. DETACHED).
USER -- This is the name of the directory that the user logged into.
PROGRAM -- This is the name of the program being run or the name of
the EXEC command being used. Please note that the program name
is obtained at the time the sample is taken. It is not possible to
tell if the program or command was running during the entire
interval.
DELTA RT -- This is the incremental amount of runtime (CPU time) which
the job used during the interval. This is in seconds.
%RT -- The percentage of the interval represented by the DELTA RT.
The sum of all %RT values is used to compute the SUSE.
JU -- Job Utilization. When the class scheduler is running this will
normally be a non-zero value. It represents the CPU utilization
accumulated by this job and charged to the job's class share.
Because the class scheduler tries to divide the class share
equitably among all active users in the class, computational jobs
within the same class should normally receive nearly the same job
utilization.
CSH -- Class Share. When the class scheduler is running, this will
reflect the class's share divided by the number of active jobs in
that class. This then is the target share for the job. Normally
the job utilization (JU) will be less than a job's class share.
4.6 Expanded Per-job Information
Most of the information presented in this section is obtained by
setting break points in the monitor with the SNOOP JSYS. Thus, this
information is only available to users who are running WATCH with either
WHEEL or OPERATOR privileges enabled.
The presented information occupies a full 132 character line. For
explanation purposes it will be broken up as follows:
Job identification Information
JOB TTY USER PROGRAM....
Job Utilization Information
....%RT DEMD USED GRDY BRDY SWPR DSKR DSKW RPQW OTHR....
Memory, Response, and Disk Information
....IMEM NLD NRSP RESP SR WSS UPGS SWPR DSKR TPF IFA
The "...." indicates that there are other variables preceding or
following.
4.6.1 Job Identification Information -
JOB TTY USER        PROGRAM ......
  0 DET OPERATOR    SYSJOB
  2  21 D.SCHEIFLER ICP
  4 210 OPERATOR    NETCON
  7  72 LIBMAN      AID
  9 117 LEAPLINE    EXEC
 10   2 OPERATOR    BASIC
 12  50 GUNN        EMACS
 15 DET OPERATOR    PERF
 16 215 OPERATOR    IBMSPL
 19  37 ACARLSON    PTYCON
These variables have the same definition as in the normal per-job
display.
JOB -- This is the job number assigned by the system when the user
logged in.
TTY -- This is the number of the terminal that is being used by the
user running this job. "DET" means that the job is not attached
to a controlling terminal.
USER -- This is the name of the user account of the logged in job.
PROGRAM -- This is the name of the program being run or the name of
the EXEC command being used. Please note that the program name
is obtained at the end of the interval, and it is not possible
to tell if the program or command was running the entire time.
4.6.2 Job Utilization Information -
JOB .... %RT DEMD USED GRDY BRDY SWPR DSKR DSKW RPQW OTHR ....
0 1.0 14.3 8.6 72.9 5.7 4.0 8.8
2 0.1 0.7 18.0 82.0
4 0.6 3.6 19.3 34.8 38.4 5.8 1.8
7 0.0 0.3 15.5 19.3 65.2
9 0.5 1.2 43.0 15.7 35.6 5.7
10 9.7 49.8 20.7 63.5 2.9 11.9 0.9 0.5
12 0.1 0.4 29.4 70.6
15 1.0 6.0 19.4 20.2 1.1 16.0 43.3
16 22.4 38.7 61.5 38.5
19 1.1 5.7 27.7 72.3
%RT (%) -- The percentage of the interval during which this job
actually received CPU time.
DEMD (%) -- Summation of the percentage of the interval during which
each process in the job was active. A process is active if it is
competing for usage of the CPU, the disk, or a resource for which
the wait time is reasonably short (less than 100 milliseconds).
If only one process in the job is simultaneously active during
the interval (the normal case) then the DEMD percentage will vary
from 0% to 100%. If more than one process was simultaneously
active, then the percentage could exceed 100% (if all processes
in the job were active for the interval, the DEMD would be
n*100%).
The rest of the variables in this section indicate what the job
was doing during its "active" period. Only the USED portion
indicates actual resource consumption. The others represent the
amount of time spent in various short term wait states. These
statistics are all expressed as percentages of DEMD and thus they
sum to 100%. When assessing the importance of the statistics for
a specific job, you should multiply these percentages by DEMD to
get the percentage of the interval time.
USED (%) -- The percentage of the DEMD time that the processes in
this job spent using the CPU.
GRDY (%) -- The percentage of DEMD that processes in this job
were runnable but could not fit in the balance set. A
process must be in the balance set before it will be chosen
by the scheduler to run. The most common cause for
processes to be on this list is that there is not enough
memory to hold all runnable jobs.
BRDY (%) -- The percentage of DEMD that processes in this job
were in the balance set but were not being run. Usually
processes in this state are waiting for their turn to use
the CPU.
SWPR (%) -- The percentage of DEMD that processes in this job
waited on page faults from the swapping area to be
satisfied.
DSKR (%) -- The percentage of DEMD that processes in this job
waited for file pages to be read in from the disk.
DSKW (%) -- The percentage of DEMD that processes in this job
waited for file pages to be written to the disk.
RPQW (%) -- The percentage of DEMD that processes in this job
waited for a physical memory page to become available for
swapping into. Usually when time is accumulating here,
there is a shortage of memory on the system.
OTHR (%) -- This last category is the percentage of DEMD that
processes in this job spent in any of the other wait states.
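Because USED and the wait-state columns are percentages of DEMD, they have to be scaled by DEMD before they can be compared with the interval as a whole. The sketch below shows that scaling; the function name is hypothetical, and reading the first value after DEMD in the sample row for job 16 as USED is an assumption, since blank columns are not visible in the sample display.

```python
def share_of_interval(demd_pct: float, state_pct_of_demd: float) -> float:
    """Convert a percentage of DEMD into a percentage of the interval."""
    return demd_pct * state_pct_of_demd / 100.0


# Job 16 in the sample display: DEMD 38.7, USED 61.5 (assumed column).
used_of_interval = share_of_interval(38.7, 61.5)
print(round(used_of_interval, 1))
```

The result, roughly 24% of the interval, is of the same order as the 22.4 shown in the %RT column for that job.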
4.6.3 Memory, Response, And Disk Information -
JOB....IMEM NLD NRSP RESP SR WSS UPGS SWPR DSKR TPF IFA
0 917.1 1 165 0.06 6 410.0 7.7 20 17 45 40
2 99.9 0 12 0.07 6 8.0 7.0 0 0
4 192.5 3 28 0.11 6 37.0 7.2 29 4 57 25
7 78.2 0 1 0.39 6 27.0 25.0 8 0 32 7
9 99.9 0 12 0.12 2 56.0 54.8 0 9 57 69
10 198.1 4 95 0.12 5 124.0 54.2 32 109 62 87
12 199.8 0 17 0.03 3 18.0 7.6 0 0
15 99.9 0 12 0.61 5 27.0 17.2 3 24 45 51
16 99.9 0 357 0.13 2 22.0 20.6 0 0
19 99.9 0 329 0.02 4 15.0 13.1 0 0
IMEM (%) -- The percentage of the time that the working sets for the
processes which make up the job are in memory. This number is the
summation of the percentages for each process and thus may exceed
100%.
NLD (CNT) -- The number of times a working set for a process in this
job was loaded into memory. If this number is zero, then no
working sets were loaded during the interval. Because a process
had to be active before it could be reported on, this can only
occur if the working sets were in memory for the whole interval.
NRSP (CNT) -- The number of responses that the job had during the
interval. A response is counted whenever a process wakes up
for one of the reasons specified under WAKE:.
RESP (AV) -- The average response time in seconds during the interval.
The time for each response is defined as the elapsed time from
when an event for which a process is waiting has completed, to
the time that the process goes back into a wait state after
having responded to the event. Responses that require more than
2 seconds of CPU time to finish are not counted in this column.
SR () -- The "stretch ratio" for each response (which was represented
in RESP). The stretch ratio is obtained by dividing the elapsed
time of each response by the compute time required to satisfy it.
(SR = elapsed time/ CPU time). The only responses counted are
those which require less than 2 seconds of CPU time to complete.
Thus the stretch ratio is the elongation perceived by the
interactive user and not the computational user.
WSS (SUM) -- The sum of the maximum working set size demanded by each
active process in the job.
UPGS (AV) -- The average number of pages actually in memory when a
process from the job is in the balance set.
This is obtained by integrating the instantaneous number of
assigned pages (FKWSP) over all time that the process is in the
balance set. The average used pages is then the integral divided
by the accumulated time, i.e.
UPGS = SUM all processes (integral FKWSP dt / T in balset)
SWPR (CNT) -- The number of times a process in the job waited for a
faulted page to be read in from the swapping area. This does not
include pages which were preloaded by the working set manager.
DSKR (CNT) -- The number of times a process waited for a disk
input to complete. Because many programs prefault pages, this
count will be different from the actual number of pages read.
TPF (AV) -- The average number of milliseconds that it took to satisfy
each page fault for this job during the interval.
IFA (AV) -- The "inter-fault average". This value represents the
average compute time in milliseconds between page faults for the job.
A large "IFA" means that the working sets of processes in this
job are very stable.
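The per-job UPGS formula above is a time-weighted average. A minimal sketch follows, assuming each process is described by a list of (duration, pages) segments during which it was in the balance set with a constant number of assigned pages. That representation and the function names are illustrative assumptions; WATCH accumulates the integral inside the monitor.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (seconds in balance set, assigned pages)


def time_weighted_pages(segments: Sequence[Segment]) -> float:
    """Integral of assigned pages over time, divided by time in the balance set."""
    total_time = sum(duration for duration, _ in segments)
    if total_time == 0:
        return 0.0  # the process was never in the balance set
    integral = sum(duration * pages for duration, pages in segments)
    return integral / total_time


def job_upgs(processes: Sequence[Sequence[Segment]]) -> float:
    """Sum the per-process averages, as in the UPGS formula."""
    return sum(time_weighted_pages(segments) for segments in processes)


# Made-up data: one process held 10 pages for 2 seconds, then 30 pages for 6.
print(job_upgs([[(2.0, 10.0), (6.0, 30.0)]]))
```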
4.7 Description Of The System Utilization Statistics
There are two parts to the system utilization statistics. The first
part consists of summaries for expanded per-job statistics. These
include summaries for the job utilization information and the memory,
response and disk information. The second part consists of additional
system variables and several computations.
4.7.1 System Summary Of Per-job Variables -
These statistics are listed under the per-job statistics on the line
which begins:
System Summary .......
The values might appear as follows:
DEMD USED GRDY BRDY SWPR DSKR DSKW RPQW OTHR ....
339.2 28.6 51.9 5.0 11.5 3.0
....IMEM NLD NRSP RESP SR WSS UPGS SWPR DSKR TPF IFA
4102.9 37 1772 0.13 3 2168. 231.8 370 911 52 90
DEMD (%) -- The summary value for the DEMD column is the sum of each
item in the column. This represents the total demand put on the
system over the interval.
USED GRDY BRDY SWPR DSKR DSKW RPQW OTHR (%)
These values represent the average percentage of the DEMD time
that the jobs were in these states.
IMEM (SUM) -- The summary is the summation of all the per-job values.
It is significant as an indicator of how many working sets
belonging to processes which were active during the interval were
simultaneously in memory. For instance the value 4102.9%
indicates that approximately 41 working sets belonging to active
processes were simultaneously in memory. This number can be
compared with NWSM:.
NLD (SUM) -- This is the number of working sets loaded during the
interval. It should correspond to the rate given by the variable
NLOD:.
NRSP (SUM) -- The summation of the number of responses counted for
each job during the interval.
RESP (AV) -- Summary value for the RESP column is the average response
time for those responses measured (requiring less than 2 seconds
of CPU time) during the interval.
SR (AV) -- The summary value for the SR column is the average stretch
ratio for interactions which required less than 2 seconds of CPU
time.
WSS -- This is the arithmetic sum of the WSS value for each job. This
represents the maximum amount of memory which would have been
required during the interval if all active processes achieved
their largest size at the same time and were all in memory.
SYS WSS = SUM all jobs (JOB WSS)
UPGS (AV) -- This summary value represents the average number of pages
needed by the active processes at any specified point in time.
Specifically it is the working set page integrals for all jobs
divided by the interval time.
SYS UPGS = SUM all jobs (integral FKWSP dt) /T interval
SWPR (SUM) -- This is the total number of swap reads done by jobs on
the system in response to page faults. This does not include
pages preloaded by the working set manager.
DSKR (SUM) -- This is the total number of disk pages read which caused
processes to wait.
TPF (AV) -- The average time required to wait for a page fault (swap
or disk) to be resolved.
IFA (AV) -- The average amount of compute time each job spends between
page faults.
4.7.2 Additional System Variables And Computations -
Other information in this section includes additional system statistics
not available in the system statistics section and computations of
various other variables. It looks like the following:
TOTRC: 1992 LOKPGS: 104 SHR PGS: 245 AVAIL MEM: 1888
NRUN MIN,MAX: 1 11
SUMNR MIN,MAX: 1879 2073
NRPLQ MIN,MAX: 28 170
SYS MEM DMD = 255.2
SWAP RATIO (SUM WSS / AV MEM) = 1.15
ACTIVE SWAP RATIO (DMD/AVMEM) = 0.14
MEM UTILIZATION ((UPGS+SHRPGS)/AVMEM) = 0.25
AV WS SIZE = 28.84
AV CPU TIME (MS) PER INTERACTION = 40.57
THINK TIME (SEC) PER INTERACTION = 1.43
TOTRC (CNT) -- The number of physical memory pages available. This is
the total physical memory minus the number required by the
resident monitor.
LOKPGS (CNT) -- The current number of pages locked down by the monitor
beyond the resident monitor pages. Out of this set of pages
comes the terminal buffers, magtape buffers, line printer
buffers, and other pages locked down during certain file system
operations.
SHR PGS (CNT) -- This is the number of physical memory pages being
shared by more than one process at the end of the interval. It
is included in the count "AVAIL MEM".
AVAIL MEM (CNT) -- This is the difference between "TOTRC" and "LOKPGS".
This is the actual number of pages available for use by user
programs.
NRUN MIN, MAX -- These values are the minimum and maximum number of
simultaneously active processes during the interval.
SUMNR MIN, MAX -- These values are the minimum and maximum number of
pages belonging to working sets in memory during the interval.
NRPLQ MIN,MAX: -- The minimum and maximum number of pages on the
replaceable queue during the interval.
SYS MEM DMD = -- The system average memory demand derived by computing
the integrals of the memory forecast for each process during its
active period, summing over all processes and dividing by the
interval time. Whereas the "system summary UPGS" is the average
amount of memory actually in use at any point in time, this value
is the average amount forecast at any point in time.
SYS MEM DMD = SUM all jobs (integral FKNR dt) / T interval
SWAP RATIO (SUM WSS / AV MEM) = -- The Swap Ratio is the system WSS
divided by the amount of available main memory. If this is
greater than one, it represents the amount by which main memory
would have to be increased to avoid any swapping.
ACTIVE SWAP RATIO (DMD/AVMEM) = -- The active swap ratio is the system
average core demand divided by the amount of available main
memory. If this number is greater than one, it represents the
amount by which main memory would have to be increased to hold
all jobs wanting to run simultaneously.
MEM UTILIZATION ((UPGS+SHRPGS)/AVMEM) = -- The memory utilization is the
system used pages divided by the amount of available main memory.
For active swap ratios greater than 1, this indicates how well
the monitor is doing in keeping memory used.
AV WS SIZE = -- The average working set size is computed from the
integrals computed from the working set demands over the active
period of each process divided by the sum of the active periods
of each process.
AV CPU TIME (MS) PER INTERACTION -- This is the average amount of CPU time
which a job spends between each response.
THINK TIME (SEC) PER INTERACTION -- This is the average time spent by
the user between the time the system requests a response and when
that response is received.
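The derived values in this display can be recomputed from the other figures WATCH prints. The sketch below does so for the sample output, taking SUM WSS (2168) and the system summary UPGS (231.8) from the System Summary line. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MemorySample:
    totrc: int          # TOTRC: physical pages minus resident monitor
    lokpgs: int         # LOKPGS: pages locked down by the monitor
    shr_pgs: int        # SHR PGS: pages shared by more than one process
    sum_wss: float      # system summary WSS
    sys_upgs: float     # system summary UPGS
    sys_mem_dmd: float  # SYS MEM DMD

    @property
    def avail_mem(self) -> int:
        return self.totrc - self.lokpgs

    @property
    def swap_ratio(self) -> float:
        return self.sum_wss / self.avail_mem

    @property
    def active_swap_ratio(self) -> float:
        return self.sys_mem_dmd / self.avail_mem

    @property
    def mem_utilization(self) -> float:
        return (self.sys_upgs + self.shr_pgs) / self.avail_mem


sample = MemorySample(totrc=1992, lokpgs=104, shr_pgs=245,
                      sum_wss=2168.0, sys_upgs=231.8, sys_mem_dmd=255.2)
print(sample.avail_mem)
print(round(sample.swap_ratio, 2),
      round(sample.active_swap_ratio, 2),
      round(sample.mem_utilization, 2))
```

Rounded to two places, these agree with the AVAIL MEM, SWAP RATIO, ACTIVE SWAP RATIO and MEM UTILIZATION values in the sample display.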
4.8 Disk I/O Statistics
DISK I/O
CHN,UNIT SEEKS READS WRITES
0,6 380 485 300 PS #1
0,7 REL4 #0
1,0 49 185 34 SNARK #0
1,1 2 6 LANG #0
1,2
2,3 57 68 32 MISC #0
2,4
2,5 602 652 449 PS #0
These statistics display the following information:
CHN,UNIT -- The channel number and the unit number on that channel to
which the disk is connected.
SEEKS -- The number of times the disk heads had to be moved to get to
the next request during the interval. If multiple requests can
be answered on the same cylinder, then no seek will take place.
READS -- The number of pages read on this unit during the interval.
WRITES -- The number of pages written on this unit during the
interval.
--no title-- The name of the structure and its relative unit number
within the structure.
4.9 Tune Mode Statistics
Tune Mode is designed to display some of the more interesting
statistics on one line so that a system programmer can easily monitor
the changes in load during a test period. This mode is useful when a
very short interval is desired (around 10 seconds). The information is
abstracted from the "system statistics" and "system utilization
statistics" sections and includes the following information:
USED SWPW SKED CTXS WAKE TDIO NRUN NWSM NLOD USED....
52.8 2.4 6.8 41.2 16.6 13.1 2.2 62.5 6.87 34.8
....GRDY BRDY SWPR DSKR DSKW RPQW OTHR IMEM NLD NRSP RESP SR
41.7 2.3 19.5 1.8 2207.9 0 198 0.05 2
The definition of the variables on the first row can be obtained from
the system statistics section and the definitions on the second row from
the system utilization statistics section.
5.0 USING WATCH TO HELP ANALYZE THE SYSTEM LOAD
Thus far the discussion in this document has centered around
explanation of the variables in the WATCH output. This section presents
some heuristics with which systems can be evaluated. The heuristics are
computations which normally provide some meaningful insight into the
effectiveness of a specific configuration in satisfying the demands of
the workload. They also help to determine which users are consuming
more of the system resources than they should. These heuristics are not
guaranteed to be accurate for all systems. They are useful, however, in
that they serve as checks or indicators by which the pattern of system
resource usage can be examined for potential improvements.
5.1 Some Good Numbers
When variables have the values indicated in this list, the system
is usually in balance. Since it is quite possible for one variable to
appear in balance, while others are not, this information is only a
guideline.
1. NCOR = 30 PER MINUTE .... expensive if higher
2. NRUN/NBAL = 1 .... All processes wanting to run can fit in
memory.
3. SWPR less than 20% .... Swap reads/writes are overhead and
thus should (ideally) be a small component of disk usage.
4. SWPW close to 0 .... Since this variable represents processes
waiting for memory when no others can run, utilization of the
system can normally be increased by adding memory until this is
a small value. However, if the load has a large I/O component,
additional memory may merely shift the CPU idle time from SWPW
to FILW.
5. SKED = 15% .... Scheduler overhead detracts from cycles going
to user programs. Programs which do very little work each time
they are scheduled generally drive this value up. If this
value is high, then it is important to determine if there is a
set of applications which could be reprogrammed in order to do
more work between interactions. Programs which become active
as each character is typed (like some screen formatting
software) should be viewed with suspicion.
6. FILW close to 0 .... This is CPU idle time caused by processes
waiting on disk I/O to complete. Often more memory will permit
larger numbers of programs to be resident and thus absorb some
of the CPU. Other times, reconfiguring the disk access
patterns to spread the disk I/O more evenly across the disks
and channels will lower this value.
7. BSWT/NBAL small .... If a large proportion of processes in
memory are waiting on the disk the CPU will not be utilized.
When this relationship is small, the CPU can be well utilized.
8. NREM = 0 .... Since this counts the times when runnable
processes are removed from the balance set, performance is best
when it is zero and can degrade rapidly otherwise.
9. FPGS large... If the number of free pages drops below 50 then
the system will probably begin spending resources to garbage
collect more often. This statistic along with NREM can be used
to indicate a system overload.
10. DMRD+DMWR less than 20 per second ... Because drum
reads/writes utilize a percentage of the disk system's
bandwidth, higher throughput is possible when swapping is low.
Normally swapping of less than 30 pages per second does not
cause any visible effect. If the normal load contains a large
amount of user disk I/O, then swapping at rates higher than 20
will decrease the system throughput. If the normal load is
mostly interactive or computational, then higher swapping rates
can be sustained.
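The items in this list that carry an explicit number can be turned into a simple check, sketched below. Several points are assumptions: "SKED = 15%" and "NCOR = 30 PER MINUTE" are read as upper bounds, the items described only as "close to 0", "small" or "= 1" are left out because the text gives no threshold for them, and the dictionary keys and function name are hypothetical.

```python
def check_good_numbers(stats: dict) -> list:
    """Return a list of warnings for values outside the listed guidelines."""
    warnings = []
    if stats["NCOR"] > 30:
        warnings.append("NCOR above 30 per minute")
    if stats["SWPR"] >= 20:
        warnings.append("SWPR not below 20%")
    if stats["SKED"] > 15:
        warnings.append("SKED above 15%")
    if stats["NREM"] != 0:
        warnings.append("NREM is not zero")
    if stats["FPGS"] < 50:
        warnings.append("FPGS below 50 free pages")
    if stats["DMRD"] + stats["DMWR"] >= 20:
        warnings.append("DMRD+DMWR not below 20 per second")
    return warnings


# Made-up values for illustration only, not taken from a real WATCH run.
sample = {"NCOR": 12, "SWPR": 25.0, "SKED": 6.8, "NREM": 0,
          "FPGS": 35, "DMRD": 4.0, "DMWR": 3.0}
for warning in check_good_numbers(sample):
    print(warning)
```

As the text notes, one variable can appear in balance while others are not, so a clean result from such a check is only a guideline.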
5.2 Per Job Computations
The following are helpful in understanding the characteristics of
various jobs on the system.
----LOOK AT EACH JOB----
If the job HAS A LARGE (100 P) WORKING SET
THEN .......
If the job wakes up more than once every 3 - 5 seconds
o It may cause thrashing
o It should be smaller during interactions
o It should run longer between interactions
If the job is larger than 1/2 user memory
o There may be a mismatch between jobs
o You should check for other large jobs
o Make the job smaller
o Schedule it when the load is light
If the job WAKES UP FREQUENTLY (more than once every 3 seconds)
THEN ....
o This job IS loading the system
o Understand why it wakes up frequently
- high speed terminal doing output ?
(Increase terminal buffer size)
- Waking on every terminal input character ?
(weigh cost versus benefit)
- Doing Magtape I/O ?
(does not load system)
o If possible, reduce this wakeup rate
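The per-job checks above can be sketched as follows. The function and
its argument names are hypothetical. The sketch assumes that "large"
means 100 pages or more, and it uses the 5 second end of the "3 - 5
seconds" range for the thrashing check.

```python
LARGE_WORKING_SET = 100   # pages, from the text
THRASH_PERIOD = 5.0       # seconds; assumed: upper end of "3 - 5 seconds"
FREQUENT_PERIOD = 3.0     # seconds, from the text


def job_findings(working_set_pages, seconds_between_wakeups, user_memory_pages):
    """Return a list of observations about one job."""
    findings = []
    if working_set_pages >= LARGE_WORKING_SET:
        if seconds_between_wakeups < THRASH_PERIOD:
            findings.append("large job wakes often: may cause thrashing")
        if working_set_pages > user_memory_pages / 2:
            findings.append("job exceeds half of user memory")
    if seconds_between_wakeups < FREQUENT_PERIOD:
        findings.append("frequent wakeups: job is loading the system")
    return findings


# A 150 page job that wakes every 2 seconds on a system with 2000 user pages.
print(job_findings(150, 2.0, 2000))
```

The inputs correspond to values that can be read or derived from the
expanded per-job display. How they are derived is left to the analyst.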
----LOOK AT SYSTEM WIDE VALUES----
If there are LARGE DIFFERENCES IN I/O between disk units
o Move I/O bound application to lightly loaded disk.
o Move PS to 2 pack structure
If OVERALL DISK RATE is more than
o 60 pages per second for one channel
o 100 pages per second for two channels
then another channel normally increases system throughput.
If "SWPW" is HIGH
THEN .....
o The system may need more memory
o The system may need more swapping bandwidth
If "FILW" is HIGH
THEN .....
o The system needs more I/O bandwidth
o Organize the packs to get more out of channels
o Consider getting more channels or disk units
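The system wide checks above can be sketched in the same way. The names
are hypothetical, and the 10% cutoff used for a "HIGH" SWPW or FILW is
an assumption. The text gives no figure. The channel limits are the
ones stated above.

```python
CHANNEL_LIMITS = {1: 60, 2: 100}   # pages per second, from the text
HIGH_WAIT = 10.0                   # percent; assumed meaning of "HIGH"


def system_findings(disk_pages_per_second, channels, swpw, filw):
    """Return a list of observations about the system as a whole."""
    findings = []
    limit = CHANNEL_LIMITS.get(channels)
    if limit is not None and disk_pages_per_second > limit:
        findings.append("disk rate high: another channel may help")
    if swpw > HIGH_WAIT:
        findings.append("SWPW high: more memory or swapping bandwidth")
    if filw > HIGH_WAIT:
        findings.append("FILW high: more I/O bandwidth")
    return findings


# One channel carrying 75 pages per second, with low SWPW and FILW.
print(system_findings(75, 1, swpw=0.0, filw=0.8))
```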
6.0 USING WATCH IN YOUR INSTALLATION
6.1 Running WATCH To Obtain Trend Information
It is recommended that all sites run WATCH as part of their normal
operation. It can be started under PTYCON in the following manner.
PTYCON>Define 10 as W
PTYCON>Connect W
@login ...........
@append WATCH.TXT(TO) WATCH.HISTORY
@delete WATCH.TXT
@WATCH
Output to file: WATCH.TXT
Print monitor statistics ? Y
Print job summary ? Y
Time period (MM:SS): 5:00
This will collect a sample every 5 minutes. This time period is
recommended in that it is small enough to be able to see most usage
changes, yet long enough that an excessive amount of data is not
collected. In the example, watch data is appended to a history file
which can be rolled off on tape once a week and then used manually or
via software to correlate changes in the workload.
6.2 Running WATCH To Analyze A Workload
Whereas the normal information is adequate for spotting trends in the
workload, it generally does not contain enough detail to be able to
isolate dominant components.
The process of running WATCH to collect the expanded information is
sufficiently expensive that it is normally done only when specific
information is being requested in order to determine how to tune an
application or the configuration. When WATCH is collecting "ALL" the
information, it should be run with about a 2 minute interval in order to
be able to see very short term changes in the workload. Shorter
intervals are possible, but WATCH itself may become a more dominant part
of the workload than desired (especially on a 2020).
6.3 Accounting For The "TRAP:" Time
It was stated in the document that normally the TRAP: variable
represents the percentage of USED: time which is spent handling page
faults. This allows the installation to bill users for this activity
and has the effect of penalizing users who cause excessive page
faulting. Such users will find their programs more expensive to run
when there are other users on the system than when the system is fairly
idle. Over the long run, this has the effect of pushing the larger page
faulting programs into periods of time when the system would otherwise
be dormant in order to capitalize on lower costs.
In some environments, this is not a desired effect. The site may elect
to absorb the page faulting revenue (up to 10%) in order to provide
users with more reproducible run times. This is often desired in
service bureaus. This can be done at the time the monitor is built by
setting the following flag in the PARAM0.MAC file.
IPTIMF==0
The normal setting for this flag is 1.
[End of Document]
WATCH
1.0 INTRODUCTION
WATCH is a TOPS-20 data collection tool that can be used to gather
the information necessary to analyze both system and job performance.
WATCH periodically samples many system variables, writing them in a
format which is usable for analysis. It is advantageous to collect
watch statistics anytime the system is running. These statistics are
often useful when usage trends are being analyzed in order to plan
system growth. Any user can run WATCH and obtain most of the system
information and some of the job information. These statistics are
normally sufficient for determining overall system performance and for
spotting short and long term usage trends. Expanded system and job
information is available for users who are running with WHEEL or
OPERATOR privileges enabled. This expanded set of statistics provides
the much more detailed information which is often required to observe
and tune the workload or an individual application.
The WATCH output consists of 9 different display sections:
1. Heading -- This section contains the date, time, number of jobs
logged in, and the time interval over which the data sample was
collected.
2. System Statistics -- This section contains system wide
statistics which reflect the resource utilization of the CPU,
disk, and memory.
3. Load Averages -- Load averages indicate the number of runnable
processes over specified intervals. This section indicates the
load average for the system, for the interactive and
computational queues, and for each class (when class scheduling
is in use).
4. Directory Cache -- A cache of the most recently used
directories is kept by the monitor. This section displays
statistics which indicate the usefulness of this cache.
5. Normal Per-Job Information -- Per job statistics are displayed
which relate the amount of CPU resource distributed to each job
along with statistics concerning class utilization (when the
class scheduler is in use).
6. Expanded Per-Job Information -- In addition to the CPU
information, this display presents many statistics which show
the states in which the job spent time, which show how large
the job is, and which provide disk and swapping information.
7. System Utilization Statistics -- The system utilization
statistics include a summary of the expanded per-job section,
additional system statistics, and computations of several key
variables.
8. Disk I/O -- Disk I/O statistics are displayed on a per-drive
basis. Included in these statistics are the number of seeks,
reads, and writes performed by each drive.
9. Tune Mode Display -- This display is a single line and contains
some of the more revealing System Statistics and summary
statistics from the system utilization section. It is a useful
"quick and dirty" display for users who are monitoring changes
in the system load.
2.0 RUNNING WATCH
WATCH can be run by all users. Users with OPERATOR or WHEEL
privileges enabled can obtain more information than other users. WATCH
is run by typing either "WATCH" or "R WATCH". When WATCH starts, it
will identify itself with a message like:
WATCH 4(3), /H for help.
Information will then be requested in the following order:
Output to file:
Print monitor statistics?
Print job summary ?
Tune mode?
Time period (MM:SS):
Some of these requests will not be made if previous answers indicate
that the information is unnecessary. The following sections will
illustrate how to answer these prompts to get combinations of the
displays outlined in the introduction.
2.1 Output File And Interval Time
WATCH requests the name of the output file and the time between
samples with the following prompts.
Output to file:
Time period (MM:SS):
When the prompt "Output to file:" is displayed, the user may enter
either a filename (including TTY: if output to the terminal is desired)
or a "/H". The /H will cause WATCH to display help information
containing short descriptions of each of the variables in the System
Statistics section, the Load Averages section, and the Normal per-job
section. It will then request the output file name again.
The response to the request "Time period (MM:SS):" may either be
answered with an interval of time (in minutes and seconds) or with a
carriage return. In the latter case, WATCH will take a sample and write
the output each time the user types an additional carriage return.
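The two kinds of response to "Time period (MM:SS):" can be illustrated
with a short sketch. The function is hypothetical and is not WATCH's
own input routine. It assumes that the response is either empty (a bare
carriage return) or strictly of the form MM:SS.

```python
def parse_time_period(response):
    """Return the sampling interval in seconds, or None for manual sampling."""
    response = response.strip()
    if not response:
        # A bare carriage return: sample each time the user types another one.
        return None
    minutes, seconds = response.split(":")
    return int(minutes) * 60 + int(seconds)


print(parse_time_period("5:00"))   # interval used in section 6.1
print(parse_time_period(""))       # manual sampling
```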
2.2 Responses And The Resulting Displays
This section will show the legal responses to the prompts and the
resulting displays. It will also indicate how the various displays
might be used.
-----------------------------
Print Monitor Statistics ? YES
Print Job Summary ? YES
All users may request this display which includes:
1. Heading
2. System Statistics
3. Load Averages
4. Normal Per-Job Statistics
This is normally used by a site collecting data for analyzing
the long term trend in the workload.
-----------------------------
Print Monitor Statistics ? ALL
Only ENABLED users with WHEEL or OPERATOR privileges may
request this display which includes:
1. Heading
2. System Statistics
3. Load Averages
4. Directory Cache Statistics
5. Extended Per-Job Statistics
6. System Utilization Statistics
7. Disk I/O Statistics
This display is used by a person who is analyzing short term
changes in a workload or application.
-----------------------------
Print Monitor Statistics ? YES
Print Job Summary ? NO
All users can request this display which includes:
1. Heading
2. System Statistics
3. Load Averages
This display is often used when WATCH data is being written
to the TTY and changes in system statistics are being monitored.
-----------------------------
Print Monitor Statistics ? NO
Print Job Summary ? YES
All users can request this display which includes:
1. Heading
2. Load Averages
3. Normal Per-Job Statistics
This display is normally used when a job is being monitored
on a TTY.
-----------------------------
Print Monitor Statistics ? NO
Print Job Summary ? NO
Tune Mode ? NO
All users can request this display which includes:
1. Heading
2. Load Averages
This display is normally used on a TTY to monitor changes
in the load average.
-----------------------------
Print Monitor Statistics ? NO
Print Job Summary ? NO
Tune Mode ? YES
Only ENABLED users with WHEEL or OPERATOR privileges may
request this display which includes:
1. Tune Mode statistics lines with column headings every
page.
This display is used by an analyst engaged in tuning either
the system or an application.
-----------------------------
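The combinations listed above can be summarized as a lookup table. The
sketch below is only a restatement of this section. The names are
hypothetical, and raising an error for an unprivileged request is an
assumption, since the text does not say how WATCH responds in that case.

```python
HEADING, SYSTEM, LOAD = "Heading", "System Statistics", "Load Averages"
NORMAL_JOB = "Normal Per-Job Statistics"

# (monitor statistics, job summary, tune mode) -> (needs privileges, displays)
# None means the prompt is not shown for that combination.
DISPLAYS = {
    ("YES", "YES", None): (False, [HEADING, SYSTEM, LOAD, NORMAL_JOB]),
    ("ALL", None, None): (True, [HEADING, SYSTEM, LOAD,
                                 "Directory Cache Statistics",
                                 "Extended Per-Job Statistics",
                                 "System Utilization Statistics",
                                 "Disk I/O Statistics"]),
    ("YES", "NO", None): (False, [HEADING, SYSTEM, LOAD]),
    ("NO", "YES", None): (False, [HEADING, LOAD, NORMAL_JOB]),
    ("NO", "NO", "NO"): (False, [HEADING, LOAD]),
    ("NO", "NO", "YES"): (True, ["Tune Mode"]),
}


def displays_for(monitor, job_summary=None, tune=None, privileged=False):
    """Return the display sections produced by a set of responses."""
    needs_privileges, sections = DISPLAYS[(monitor, job_summary, tune)]
    if needs_privileges and not privileged:
        # Assumed behavior; the text only says privileges are required.
        raise PermissionError("requires WHEEL or OPERATOR privileges enabled")
    return sections


print(displays_for("NO", "YES"))
```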
2.3 WATCH Operation
After the user enters the necessary information, WATCH will display
the message:
WATCH IN OPERATION --
and will then take its first sample. This message will be seen
immediately after entering the interval time unless the user is
requesting displays which require privileges (extended and tune mode).
In that case a 10-30 second pause will be noticed (depending on system load)
while the SNOOP break points necessary to collect the information are
inserted into the TOPS-20 monitor.
When the user desires to stop collecting statistics, the following
procedure should be followed:
^C
^C
CLOSE
RESET
At this point, the output file can be printed. Examples of the
information included in the output file are used in the rest of the
document.
3.0 A SHORT GLOSSARY
The following terms are used throughout this document and should be
interpreted in light of the following definitions.
3.1 EXEC
The TOPS-20 command EXECutive is a program which prompts the user for
and interprets the TOPS-20 command language. A copy of the EXEC is
started in response to the control-c typed by the user to get the
system's attention before LOGIN. The EXEC is a process which remains
part of the user's job until after the LOGOUT command is processed.
3.2 Job
A job consists of the sequence of commands and programs which are
initiated by a user. It begins at the time the user logs into the
system and terminates when that user logs out. Associated with each job
on the system is a record of information which is accessed by a "job
number". This record will contain the accounting statistics, the
resource ownership information, and the control information associated
with the job. Because the job number is the means by which a job is
identified and referenced, a term like "job 12" is often used in place
of the more accurate phrase "the job which can be identified with the
number 12".
A user logging in is given an available job number which is used by the
system to keep track of that user. The job number is from 1 to the
maximum built into the monitor and is recycled when the user logs out.
This maximum number of jobs which can be executing simultaneously is
fixed at the time the monitor is compiled and linked. At this time the
memory required to keep track of the specified number of jobs and their
resources is computed and reserved. Most often the number of jobs that
a specific monitor can handle has been chosen so that the system
resources are not over committed under normal use.
3.3 Process
The term "process" is used in this document instead of the terms "task"
and "fork" which are sometimes used in other TOPS-20 literature. This
term refers to a program running in its own virtual address space. A
job may consist of one or more processes. A process is created when a
virtual address space is allocated for the program and is terminated
when that address space is deallocated.
When a job begins, it will consist of one process, the TOPS-20 EXEC.
The EXEC will create additional processes whenever the user initiates a
function which cannot be performed by the EXEC. Thus when the user
types "RUN FOOBAR", a process is created, and the program FOOBAR is
started. When FOOBAR finishes and the user types RESET, the memory for
the additional process will be deallocated and the process terminated.
Thus it is common for a job to consist of two processes, the EXEC and a
user program.
One of the capabilities which a program may exercise is to create and
control an "inferior" process (or processes). An "inferior" process
can be stopped (frozen) and deleted by its creating (superior) process.
Each process created by the EXEC or a user program has its own
virtual address space, assigns its own devices, and contains its
own program. The maximum number of processes which may be
created within a single job is fixed at the time the monitor is built.
3.4 Active Process
An active process is one which is competing for usage of the CPU, the
disk, or a resource for which the wait time is reasonably short (less
than 100 milliseconds). Active processes are often spoken of as being
"runnable". Processes which are not active are waiting on events like
terminal input, ENQ/DEQ locks, and other unpredictable events. Such
processes are sometimes said to be "blocked".
Normally, when the EXEC creates a process, loads a program into that
process's address space and passes control to that process, it (the
EXEC) waits until the program completes. Thus even though the job
consists of two processes, only one at a time is competing for
resources. This is not always the case. Typing a CONTROL-T often
causes the EXEC and the user program to compete for the CPU at the same
time. In analyzing the impact of a complex job with several processes,
it is important to determine if the job consists of processes which
compete for resources simultaneously. The expanded user information
will help you determine the impact of jobs with competing processes. A
job which consists of several competing processes can place as much
demand on the system as an equivalent number of jobs with non-competing
processes.
3.5 Balance Set
All active processes in the system are on the "go list". These
processes need to be in memory in order to use the CPU, or to terminate
their short term wait state. The subset of active processes that are
scheduled to be simultaneously in memory form the BALANCE SET. Once
selected for the balance set, the working set manager (part of the
TOPS-20 scheduler) will try to find room for a process in memory by
swapping out processes that are either no longer in the balance set or
which are of lower priority than the incoming one.
3.6 Swapping
The term "swapping" is used to describe the disk I/O which is initiated
by the monitor as part of its memory management service. Swapping is
contrasted to user initiated disk I/O which is performed in order to
move data to and from application programs.
3.7 Page Fault
When a user program or the monitor references a virtual address which is
not currently allocated a place in physical memory, a page fault will
occur. The monitor will resolve this page fault by finding the desired
page and making it part of the process's address space. The procedure
for finding the page may require that the monitor bring the page in from
the disk. However, if the page is already in memory, the monitor will
not perform a disk read, but will merely make the page part of the
process's address space. This will occur if the page is on the
replaceable queue (available to be used for some other purpose) or is
being used by another process.
4.0 DESCRIPTION OF VARIABLES USED IN WATCH
4.1 The Heading Section
The heading section is mostly a time stamp for other information
displayed. It is in all displays except for the "tune mode" display.
The following is an example:
SUMMARY at 10-May-80 11:29:26
for an interval of 2:01.5 with 64 active jobs.
The first line contains the date and time when the sample was taken.
The second line includes the time interval between this sample and the
previous one. If the user had specified a time interval (rather than
merely entering a carriage return), then the interval here should be
nearly equal to the time specified. Under a heavy load, the actual
interval may be longer.
In this example, the interval was 2 minutes, 1.5 seconds and there
were 64 users logged in.
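When WATCH output is processed by software, the heading can be picked
apart with a regular expression. The following is a minimal sketch that
assumes the heading always has exactly the layout shown in the example
above. The function name and the pattern are illustrative and are not
part of WATCH.

```python
import re
from datetime import datetime

HEADING = re.compile(
    r"SUMMARY at (\d{1,2}-[A-Za-z]{3}-\d{2}) (\d{1,2}:\d{2}:\d{2})\s+"
    r"for an interval of (\d+):(\d{2}(?:\.\d+)?) with (\d+) active jobs"
)


def parse_heading(text):
    """Return (sample time, interval in seconds, job count) from a heading."""
    match = HEADING.search(text)
    if match is None:
        raise ValueError("no WATCH heading found")
    date, clock, minutes, seconds, jobs = match.groups()
    # %b expects English month abbreviations (the default C locale).
    when = datetime.strptime(date + " " + clock, "%d-%b-%y %H:%M:%S")
    interval = int(minutes) * 60 + float(seconds)
    return when, interval, int(jobs)


example = """SUMMARY at 10-May-80 11:29:26
for an interval of 2:01.5 with 64 active jobs."""
print(parse_heading(example))
```

For the example heading this yields an interval of 121.5 seconds and a
job count of 64.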
4.2 System Statistics Section
Example from the system statistics section.
USED: 87.6 IDLE: 0.0 SWPW: 0.0 SKED: 10.0
SUSE: 80.0 TCOR: 0.1 FILW: 0.8 BGND: 1.8
NTRP: 22.3 NCOR: 2.02 AJBL: 58.44 NREM: 0
TRAP: 3.2 NRUN: 5.7 NBAL: 5.7 NWSM: 61.0
BSWT: 2.8 DSKR: 23.0 DSKW: 4.8 SWPR: 6.4
NLOD: 16.12 CTXS: 64.1 UPGS: 1992. FPGS: 147.
DMRD: 4.1 DMWR: 4.7 DKRD: 10.7 DKWR: 4.3
TTIN: 11.0 TTOU: 401. WAKE: 18.3 TTCC: 1.51
TDIO: 23.8 RPQS: 4.4 GCCW: 6.4 XGCW: 0
KNOB: 11
QUEUE DISTRIBUTION PERCENTAGE: 0.19 20.48 21.28 11.04 2.69 31.74
The statistics in this section are expressed either as percentages
(%), as averages (AV), or as rates (Px). The rates are in units per
minute (PM) or in units per second (PS).
The display includes "CPU usage statistics" from which the
distribution of the CPU resource can be determined. The usage
statistics are all percentages based on the wall clock time of the
interval. They should sum to 100% (+/- roundoff).
CPU Usage Statistics
USED , IDLE , SWPW , SKED , TCOR , FILW , BGND
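The statement that the CPU usage statistics sum to 100% can be checked
against the example display. The sketch below assumes each statistic
appears as a four letter name followed by a colon and a number, as in
the example. The names and the 0.5 roundoff tolerance are assumptions.

```python
import re

# A four letter name, a colon, then a number such as 87.6, 1992. or 0.
PAIR = re.compile(r"\b([A-Z]{4}):\s*([0-9]+(?:\.[0-9]*)?)")
CPU_USAGE = ("USED", "IDLE", "SWPW", "SKED", "TCOR", "FILW", "BGND")
ROUNDOFF = 0.5   # percent; assumed tolerance


def parse_stats(text):
    """Return a dictionary of statistic name -> value."""
    return {name: float(value) for name, value in PAIR.findall(text)}


def cpu_usage_total(stats):
    """Sum the seven CPU usage percentages."""
    return sum(stats[name] for name in CPU_USAGE)


# The first two lines of the example display.
example = """USED: 87.6 IDLE: 0.0 SWPW: 0.0 SKED: 10.0
SUSE: 80.0 TCOR: 0.1 FILW: 0.8 BGND: 1.8"""
total = cpu_usage_total(parse_stats(example))
print(round(total, 1), abs(total - 100.0) <= ROUNDOFF)
```

For the example the seven values sum to 100.3, which is within the
assumed tolerance. SUSE is parsed but is not one of the seven CPU usage
statistics, so it is left out of the sum.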
The description of each of the variables in the example follows.
Following the example, they are presented line by line in left to right
order.
USED: (%) -- Percentage of interval during which the CPU was
executing instructions on behalf of some user. This includes
user, JSYS, page fault, and interrupt processing.
IDLE: (%) -- Percentage of the interval during which the CPU was idle
because there were no active processes on the system. If this
number is non-zero, then the system can accommodate some
additional