Trailing-Edge
-
PDP-10 Archives
-
SRI_NIC_PERM_SRC_3_19910112
-
stanford/5-swskit/scheduler.memos
There are 6 other files named scheduler.memos in the archive. Click here to see a list.
+---------------+
! d i g i t a l ! I N T E R O F F I C E M E M O R A N D U M
+---------------+
TO: TOPS20 Monitor Meeting List
cc: Dave Braithwaite
Norm Valentine
DATE: 11 Apr 78
FROM: Dan Murphy
DEPT: LCG S.E.
EXT: 6356
LOC: MR1-2/E37
FILE: SBIAS
SUBJ: SCHEDULER CONTROLS FOR INTERACTIVE/COMPUTATIONAL BIAS
One of the administrative scheduler controls which has been
discussed for TOPS20 is a way to bias the system more toward
interactive jobs or compute-bound jobs. The behaviour of the
system in this regard is the result of a number of somewhat
arbitrary decisions as well as such basic limitations as machine
speed, swap speed, job load etc. The decisions which are made at
present are the result of nothing significantly more scientific
than a coin flip, but generally based intuitively on the goal of
making the CPU resource distribution fair and the response time
adequate to good. Each release has a somewhat different feel of
responsiveness, although it was never the specific intent to
change it.
Therefore, recognizing that the choice of an operating point on
the continuum between maximum responsiveness and maximum
computational throughput properly rests with the system
administrator, we intend to provide a means of globally affecting
certain scheduling policies. At the extremes, some systems wish
to ensure good response even in the presence of many compute-bound
jobs, and are willing to see the computational jobs get little or
no service in order that the system remains responsive. Other
sites consider the interactive jobs (e.g. program development)
merely background and wish to minimize context switching and
swapping overhead. Some sites may wish to run in different modes
during different parts of the day.
This machanism does not provide any absolute priorities, rather it
biases the system toward one class of jobs or the other. It is
not a substitute for a class scheduling system, but can work with
such a system by affecting the handling of jobs within a class or
which do not otherwise fall within any class rules.
The most primitive level of control is effected by specifying the
value of each of a set of flags which in turn affect a scheduler
decision. The exact interface to the system administrator has not
been determined as yet, but the following are some possibilities:
1. Document the typical effect of each of the flags and allow
Page 2
the operator to specify an arbitrary combination.
2. Define a set of numeric values representing a scale from
most interactive to most computational, one of which would
be selected by the operator.
I suspect that the latter approach is better suited to most
customers, but that a few, perhaps with SWS help, would want to
determine their own setting. Listed below are the current default
values and several typical values on either side.
The following flags and controls are defined: (note that a 1
value is always the more interactive setting regardless of the
default value)
SKCYTF - bit 18 (default = 1)
This flag modifies the length of the two basic scheduler cycles.
Normally, the scheduler has a 20 ms cycle during which it
processes terminal input and may switch processes if a page fault
has completed. It also has a 100 ms. cycle during which it
checks various queues and may swap processes in or out if
necessary to conform to scheduling parameters. Setting this flag
to 0 (more computational) will change these cycles to 80 ms. and
400 ms. respectively, thus reducing the context switching and
scheduling overhead, and giving longer uninterrupted processing
time to compute-bound jobs.
SKIOCF - bit 19 (default = 0)
This flag is not strictly relevant to computational vs.
interactive jobs, but rather controls the effect of disk file IO.
Normally, a process is charged a certain amount against its
current quantum (but not against its actual run time) for each
block because of disk IO. This tends to prevent an IO-bound jobs
from remaining on high-priority queues for too long while
generating high file traffic. Setting this flag to 1 eliminates
the quantum charge; hence the job is charged for only such CPU
time as it actually uses. This favors IO-bound jobs whether they
be short-lived (interactive) or long-lived (computational).
SKHTF - bits 20-21 (default = 10)
This field allows one of four possible values to be used for
balance set hold time. Balance set hold time is that quantum time
(as distinct from actual runtime) which must be used by a process
when once loaded into the balance set. If the hold time has not
been used, the process will not be preempted and swapped out in
favor of a higher priority process. The default setting of 1
second is sufficient to prevent thrashing without causing undue
degradation of response. Less hold time will improve response,
while more will improve throughput, but only under circumstances
where not all runnable jobs can fit in core simultaneously.
Page 3
Defined values are: 00 = infinite hold time (i.e. process runs
to completion); 01 = 5 seconds; 10 = 1 second; 11 = no hold
time.
SKHQRF - bit 22 (default = 1)
SKLQRF - bit 23 (default = 0)
These two flags affect the operation of the "balance set
partitioning" algorithm of AJBALS which normally acts to prevent
either large computational jobs or interactive jobs from excluding
the other. This algorithm computes a "reserve" of pages for each
of the two classes as a function of the number of processes in
that class, up to a maximum depending on the size of physical
core. Setting SKHQRF to 0 eliminates the high-queue process
reserve thus favoring large computational jobs, while setting
SKLQRF to 1 eliminates the low-queue process reserve thus favoring
interactive jobs. The default values include both reserves and
thus cause the partitioning to be fully operative. The complement
of the default values eliminates all effects of balance set
partitioning.
SKBQEF - bit 24 (default = 1)
SKBQRF - bit 25 (default = 1)
These two flags affect the handling of processes in the balance
set. Scheduling of processes in the balance set is done on the
basis of two queues which are distinct from the normal process
priority queue. The balance set scheduler always selects the
first runnable process found by scanning these queues in the order
head to tail, higher priority to lower priority. A selected
process is run for a short quantum and then requeued within the
balance set. A low-queue process (one on the lowest regular
priority queue) is always placed on the lower priority balance set
queue. A high-queue process is normally placed on the higher
balance set queue when entering the balance set, but setting
SKBQEF to 0 will cause it to be placed on the lower queue. After
having completed a balance set quantum, a high-queue process is
normally placed on the higher balance set queue. Setting SKBQRF
to 0 will cause it to be placed on the lower queue. The release 3
monitor as shipped effectively has the default values 10, i.e.
high-queue processes enter the balance set on the high balance set
queue, but are moved to the lower balance set queue after the
first quantum. At least one site has felt this to be a decrease
in responsiveness from release 2, hence I propose to make the
default be as indicated, i.e. high-queue processes continue to
get balance set priority until they drop to low queue.
SKRQ1F - bit 26 (default = 1)
Setting this flag to 0 eliminates any normal use of queue 0 thus
giving the effect of 3-queue scheduling with quanta of 2 seconds,
2 seconds, and 4 seconds. (Queue 0 is 0.3 seconds.) Under mixed
Page 4
loads, this is mildly favorable to computational jobs but has one
other property which is important in certain applications. With
this modification, a process will be given at least 2 seconds of
computation after a wakeup before possible rescheduling and
swapout, rather than 0.3 seconds for the normal queue 0. This has
been shown to significantly reduce overhead in systems which are
running many identical jobs which are interactive but which may
require substantial compute time for some interactions (e.g. a
data base query/update application). In general, the effect of
this modification is to increase the response time for trivial
interactions, but decrease it for those requiring more than 0.3
seconds of compute time.
SKTTPF - bit 27 (default = 1)
Normally, a process which blocks for terminal input for 1 second
or more will be placed on the highest queue regardless of system
load. This means that properly timed terminal interactions can
give a job more than its fair share of the machine. Setting this
flag to 0 eliminates any special consideration for terminal input
blocks and causes the normal requeue algorithm to be used.
SKWCF - bit 28 (default = 0)
A process unblocked after a low-speed IO wait is placed on a
higher priority queue than it was on when it blocked. The longer
the wait, the higher the queue, and a long enough wait will give
the highest queue. (See discussion in SCHED.MAC before NEWST.)
This procedure tends to maintain fairness among compute-bound,
short interaction, and long interaction processes. Normally, the
actual process wait time is divided by the short-term load average
so as to require longer waits for high queue when the system is
more heavily loaded. Setting SKWCF to 1 eliminates this divide
and thus causes interactive jobs to be favored, particularly as
the load average increases.
STANDARD VALUES
The following five standard values should provide a sufficient
degree of control for most systems. The system administrator
should be able to select an integer on the scale 0 (most
interactive) to 4 (most computational) to produce the indicated
flag settings.
Value B18-28 (binary)
0 101 111 111 111
1 101 011 111 110
2 101 010 111 100 (system default)
3 101 000 100 100
4 000 100 100 000
Page 5
All of the flags described above have been implemented at present,
but some may have to be deleted or added in the course of other
planned work on scheduler controls, e.g. group scheduling.
+---------------+
! d i g i t a l ! I N T E R O F F I C E M E M O R A N D U M
+---------------+
DATE: 5-May-78
TO: TOPS20 Meeting List
FROM: Dan Murphy
DEPT: LCG S.E.
EXT: 6356
LOC: MR1-2/E37
SUBJ: Working Set Swapping
This document describes the implementation of "Working Set
Swapping" in TOPS20. "Working Set Swapping" means that
process working sets are treated as units to be swapped into
or out of core as determined by scheduler decisions. It is
in contrast to "Demand Page Swapping", wherein a process
working set will be loaded into core page by page as each
page is referenced. For additional discussion and
background, see "WORKING SET SWAPPING" D. Murphy 29-Jul-76,
attached.
EXISTING ALGORITHMS
TOPS20 at present (through Release 3) operates as follows:
1. When a process is selected for the balance set by the
scheduler, an amount of physical core equal to the
estimated size of its working set is reserved.
Initially, only the PSB, JSB, UPT, and (in release 3
only) stack page are swapped in from the swapping
device. No attempt is made to run the process until
these pages have all been read into core. Then, the
process context is setup and the process started.
Obviously, the first page referenced is that to
which the PC points. If that page is not already in
core, a page fault immediately occurs, and the
process is blocked while the page is read. A
similar process occurs on subsequent references
until all of the pages needed are in core.
2. A process leaves the balance set upon any normal
dismiss or when selected by the scheduler in favor
of a higher priority process. No pages are
explicitly swapped out at that time however.
Asynchronously, when the count of available physical
core pages (NRPLQ) drops below a certain value, the
core status tables are scanned and pages are written
out (GCCOR). Generally, the pages selected for
writing during this scan are those which are owned
by processes no longer in the balance set. The
Page 2
process use bits in CST0 serve to prevent a page
being removed from core if at least one process
using it is in the balance set. The scan of the CST
is continued only until a sufficient number of
pages, typically less than 1/4 of available core
have been removed. Note also that any selected page
which has not been modified according to CORMB in
CST0 is not actually written but rather placed
immediately on the replaceable queue.
The result of this is that process pages tend to
stay in core for some time after the process has
left the balance set. Hence, if it should reenter
the balance set within that time, the pages are
immediately available and need not be read again.
Even pages which have been selected will spend some
time in the disk write queues and some additional
time in the replaceable queue and may be reused
during this time. This saves both the elapsed time
that the swapin reads would otherwise take plus the
compute time that would otherwise be required to
scan the CST, process the writes and reads, etc.
3. The management of the process working set occurs
asynchroneously to the above. A routine (XGC) is
invoked periodically in process time to check all
pages assigned to the running process. If,
according to the age field in CST0, the page has not
been referenced in the recent past (the working set
"window"), it is swapped out and the working set
size estimate reduced.
A 512-bit (15-word) mask is maintained to represent
the actual process working set. This mask maps the
512-page process address space, and a 1 bit
indicates that the corresponding page is present in
the working set. The mask is not otherwise used by
the monitor, and working set pages in the monitor
address space are not represented at all.
REPRESENTATION OF THE WORKING SET
The first problem to be solved is to find a viable
representation of the process working set. The mask
approach described above does not include the monitor
address space, nor does it gracefully expand to support
extended addressing. Merely to map 32 user sections would
take 480 words. Any alternate representation must be
processor-efficient for two specific operations:
1. Adding specific pages to the data base as they are
referenced.
2. Scanning to produce a list of all pages in the working
Page 3
set for swapin or swapout.
These two requirements, plus any reasonable size limitation,
quickly eliminate all simple structures. The proposed
representation is as follows: (those familiar with the KL10
cache and page table organization may note some
similarities)
A virtual address consists of 23 bits plus the user/monitor
bit. For purposes of page management, the low-order 9 bits
(the word-within-page) are uninteresting, and we have 15
which must be remembered in the data base.
--- -------------------------------------
! ! ! ! !
! ! ! ! !
! ! ! ! !
--- -------------------------------------
U SECTION PAGE
The low-order 7 bits are used as an index to select an entry
in a 128-entry table. Each table entry is 9 bits, and is
used to hold the remaining 8 bits of the 15-bit virtual page
address plus a valid bit.
! !
INDEX (7) ! !
! !-------------------------------!
! ! ! ! ! !
!------>! ! ! ! !
! ! ! ! !
!-------------------------------!
!V U SECTION (5) PG !
! (2)!
! !
Hence, a given address is in the data base if the entry
selected by its low-order 7 bits is valid and equal to the
user, section, and high-order page bits. The list of pages
in the data base may be produced by scanning the table and,
for each valid entry, concatenating the data bits from the
table with its index.
There is potential conflict however, in that all virtual
addresses with the same low-order 7 page bits will attempt
to use the same table entry. To reduce this problem, we use
four parallel tables, each formatted as above. A particular
virtual address may be represented in any of the four
fields; hence, up to four addresses with identical
low-order bits may be remembered without conflict.
This data base is implemented as a 128-word table, with each
word holding the four entries for that index value. This
table is located in the second PSB page along with the
monitor process stack.
Page 4
This structure supports a maximum of 32 user sections and 32
monitor sections as described. The parameters may readily
be changed however to support more sections. For example,
the full 12-bit sections field could be supported by
increasing the field width from 9 to 16 bits to hold 12
section + 2 page + 1 user + 1 valid bits. The division
between the index and the associative fields can also be
moved to vary the table length and number of associative
entries. These modifications are invisible to users and
have no effect on any external data structures; hence they
may be done at any time.
Even with four associative entries, some conflicts will
occur. As described above, the data base can hold any set
of addresses in a single user (or monitor) section with no
conflicts. The possibility of conflicts arises with sets of
addresses from multiple sections, e.g. user 0, 200000,
400000, monitor 200000, 400000. The action taken when there
is no free entry to store a new address is to shift the 4
entries at that index position left by one entry thus losing
the leftmost entry, and store the new address in the
rightmost entry. If this occurs several times, the effect
is to have the most recently used entries in the data base.
The lost entries represent pages which are presumably still
in the working set but are not represented in the data base.
The process may continue to use them, but they will not be
seen for swapout, swapin, or XGC operations. This will tend
to reduce the efficiency of these operations, but not
significantly if the number of lost entries is small, and it
has no effect on the behavior of programs in any respect
other than time. We should occasionally monitor the number
of lost entries as an indication of the effectivness of the
working set cache.
BASIC OPERATIONS
PAGE FAULTS - The virtual address of each page fault will be
added to the working set cache.
XGC - XGC will be completely revised. Rather than scanning
the CST to find physical core pages owned by the running
process, it will scan the working set cache to find virtual
addresses in the working set. It will trace and resolve the
page pointers, and check the age field on physical core
pages found. "Old" pages will be swapped out, and any
virtual addresses which do not resolve to "current" assigned
physical core pages will be removed from the WS cache.
SETMPG, etc. - SETMPG and MSETMP will remove the given
addresses from the WS cache if no new pages are being
mapped. This should reduce the load on the WS cache from
various monitor functions such as directory mapping. It
does not appear to be necessary to do the same for user PMAP
operations.
Page 5
SWAPOUT - When a process leaves the balance set, it will be
placed on a list of processes awaiting swapout. When GCCOR
runs, it will select processes from this list (FIFO) and
swap each working set out. The swapout of a working set is
similar to XGC and is implemented with most of the same code
accessed via a second entry point, WSSWPO. During this
operation, the working set cache is scanned, and non-current
pages are swapped out as with XGC. A list is constructed of
current pages which need writing, one or more sets of
sequential swap addresses are assigned, and the pages are
given to the swap device driver as a group.
Unmodified pages are placed immediately on the replaceable
queue and are not written. This appears to be more
efficient than re-writing the pages with sequential swap
addresses even through it does not achieve as much swap
device locality.
Note that pages are not removed from core unless core is
needed, and then only as many processes from the swapout
list are removed as may be necessary to bring the
replaceable queue up to the desired level. This maximizes
the probability that a process may return to the balance set
while its pages are still in core and avoid the overhead of
swapping them out and in again.
During the swapout scan, some pages may be discovered in the
"read completed" state. This is generally because the pages
were read at swapin but not subsequently referenced. If
significant process time has elapsed since the swapin (or
last XGC), or if the process is leaving the balance set
because of a dismiss, the pages are assumed to have left the
working set. Otherwise, we assume that the process has just
not had time to reference them and they are retained.
SWAPIN - When a process is selected for entry into the
balance set, its PSBs, UPT, and JSB are swapped in if
necessary. The process is removed from the swapout list if
present, and the fact of its presence remembered. When the
overhead pages are all in core, the process begins execution
immediately if it was on the swapout list since its working
set is still in core. If not, the preload routine is called
to scan the WS cache and initiate reads on all pages. Since
the WS cache was scanned and checked at swapout, only valid
current pages should be loaded during this operation. The
scan will quickly place all necessary reads in the swap
device queues, so maximum optimization efficiency should
occur. No explicit test is made for the completion of these
reads; the process is merely allowed to run and will page
fault on any pages not yet in core. After one or two
faults, sufficient time should have elapsed for the
requested reads to have completed.
GCCOR - Part of the GCCOR operation has been described above
in connection with swapout. That operation will remove from
Page 6
core all pages which are in process working sets, but will
miss those which are not in the WS cache because of
conflicts. It will also miss others which are assigned to
no process but are locked in core, etc. Therefore, after
each swapout of a particular process, the contents of FKWSP
(number of assigned pages in CST3) will be checked. If it
is still non-0, a scan of CST3 (using FKGC) will be done to
remove the remaining pages. Also, the traditional GCCOR
scan will be performed occasionally to remove unassigned
pages and update the balance set share count. This scan
will not look at any pages assigned to processes however,
since these should always be handled by WSSWPO or FKGC.
PAGE AGING - A swapout-swapin cycle will lose all of the age
information for the working set pages, although, as
explained above, all non-current pages will be removed at
swapout. Since a swapout is equivalent to a full XGC cycle,
the next XGC will not occur until significant process time
after the subsequent swapin. This time is comparable to the
working set window, so the process will have referenced all
pages which are still in the working set. The XGC scan will
remove any read-complete (but unreferenced) pages found. In
principle, it would be of some value to retain all of the
age information and write it onto the swap device along with
the working set pages, but in practice it does not appear to
be worth the processor time and storage.
+---------------+
! d i g i t a l ! I N T E R O F F I C E M E M O R A N D U M
+---------------+
TO: Dave Braithwaite
Jim Flemming
Dave Kiarsis
Norm Valentine
TOPS20 Monitor Memo List
DATE: July 17, 1978
FROM: Arnold Miller
DEPT: DEC20 S. E. Dept.
LOC: MR1-2 EXT:6473
PDM: ASM-78-001-01
SUBJ: TOPS20 Release 4 Scheduler Changes
This is a brief description of the monitor changes that will be
made to support the scheduler controls project. The changes
described herein represent a baseline from which to begin
testing and tuning of the scheduler controls. The
user-interface to the features, including EXEC commands and new
JSYSes, will be specified in a subsequent document.
New scheduling queues
Two new queues have been added to the scheduler. In previous
releases of the monitor, the scheduler maintains four queues.
Every process resides on one of the queues, representing its
activity in the recent past. Queues zero to two represent
"interactive" activity and queue three represents
"compute-bound" activity. The transitions among the queues are
controlled by several factors including: amount of continuous
computing, dismissing and several heuristics.
A need has been identified for reserving two scheduler queues
representing the two extremes of priority. A high priority
process needs to always be given special consideration so that
its response and service is prompt and predictable. By the
same token, a low priority process must not interfere with the
response and service of other "normal" processes on the system.
The two new queues are zero and five. All "normal" processes
reside on one of the queues from one to four inclusive. The
transition rules for these "normal" queues are identical to the
rules for the queues in previous monitor releases.
Queue zero is the "high priority" queue. Besides tending to
Page 2
reside at or near the top of the GOLST when runnable, processes
on queue zero always enter the balance set on the balance set
high priority queue. Note that a process on queue zero will
not run to the exclusion of all other processes in the system,
but will tend to be run more often than the processes that are
selected by the round-robin scheduler.
Queue five is the low priority queue. Processes on this queue
tend to reside at or near the bottom of the GOLST when
runnable. When a low priority process is selected for
inclusion in the balance set, it is treated as if it resided on
one of the "normal" queues.
The SJPRI and SPRIW JSYSes are used to set a job or a process
on one of the "special" queues.
Class Scheduler
The class scheduler is an extension to the "CPU runtime
guarantee" that exists in previous monitors. The existing
feature allows the system administrator to specify that a given
job is to receive a certain percentage of the processor. This
mechanism is limited to specifying one job per guarantee, and
the guarantee is over a short interval with no regard to the
history of the job. The major limitation of the present
facility is that it is not possible to designate a "class" of
jobs as having a guarantee. Normally, this is what is required
and cannot be achieved with the present capability.
The monitor support for the class scheduler allows any number
of jobs to be placed in a class. The monitor will support at
least eight classes, and will allow the "guarantee" per class
to be assigned by the user. The guarantees are in percentages
just like the present mechanism. Percentages can be assigned
in units of .5% from .5% to 100%, however the sum of all
assigned percentages may not exceed 100%.
The class scheduler code uses the assigned class and class
percentage to influence a process's priority. That is, a
process that is ahead of its percentage will tend to have its
priority reduced and a process that is behind its percentage
will tend to have its priority increased. Therefore, a
process's relation to its "target" percentage is a new metric
that becomes part of the heuristic that dictates the transition
rules among the scheduler's queues. Following is a description
of this new metric and how it is employed.
The scheduler periodically computes a job's "utilization" as a
decaying average of the CPU time used by the processes of the
job. The computation performed is:
U(I+1)=U(I)*e^(-t/C)+F(1-e^(-t/C))
where:
Page 3
U(I) is the current process utilization
F is the fraction of the CPU used by the process in the
recent interval, "t"
t is the time since the last computation of U
C is the "decay" interval. After time C, the
utilization decays by e (approximately one-half). This
number represents the amount of "history" that the
monitor uses in determining a process's behavior. The
number is chosen to insure as uniform a behavior as
possible and may be set by the user.
The computation is performed periodically for all jobs. The
utilization function for each of the classes is the sum of the
utilizations of the jobs in the class. The class utilization,
CU, is used to compute the "distance" of each job from its
ideal or target utilization. The "distance" is the metric that
is used in adjusting process priorities. The distance, CD, of
each job is expressed as:
CD = (CP/N-CU)/(CP/N)
where:
CD is the "distance" of a job from its target
utilization
CP is the class's ideal utilization. CP is specified
by the user when the class is defined.
N is the "population" of the class. N is currently the
number of logged-in jobs belonging to the class.
This function is designed to be nearly exponential in
character. That is, CD increases rapidly as the job's
utilization becomes insufficent and decreases rapidly as the
job's utilization increases. With these characterisitcs, CD is
an accurate measure of the priority of each job.
The priority of a process at any time is a function of the
process's queue number and of its job's CD value. The priority
is expressed as a single number whose value is used to position
the process on the GOLST. Also, CD is used to compute the
process's balance set quantum.
The intention of the class scheduler is to create a number of
independent scheduling groups that behave according to the
scheduler's micro-controls. That is, within each class, the
jobs compete with one another as if they were running alone on
Page 4
their own computer that is CP percent as fast as the actual
processor. Each class is allowed to consume CP percent of the
processor by running its jobs in an order prescribed by the
scheduler's micro-controls.
In summary, the class scheduling action can be expressed as a
simple set of changes that creates individual scheduling
partitions according to the class guarantees. These changes
are:
1. The priority function of each process is affected
by its job's CD function. The application of CD to the
priority is to raise a process's priority if its job is
behind its "target" and to lower the priority if its
job is ahead of "target".
2. The balance set quantum assigned to a process on
the balance set queue is also affected by the process's
associated CD. The effect is to give porportionately
higher quanta to processes who belong to behind target
jobs and to give proportionately lower quanta to
processes who belong to ahead of target jobs. The
result of this is to favor running processes that are
behind target over processes that are on or ahead of
target.
As a final note on the class scheduler, something should be
said about "windfall". Windfall is the term adopted to
describe unused portions of the machine. If a particular class
is unrepresented (i.e. has no logged in jobs), or has
insufficient jobs to use its share of the machine, the
remaining CPU is windfall. The scheduler algorithm is designed
to distribute windfall proportionately over the active classes.
That is, higher guarantee classes will receive proportionately
more of the available windfall than lower guarantee classes.
It is assumed that a user-mode control program will be doling
out windfill according to administrative fiat and will not rely
on the monitor's allocaton of unused class shares.
Other Scheduler Changes
In trying to address the problems expressed by customers at
DECUS, we will be adding numerous data collecting routines to
the monitor in release 4. A large number of these routines
will be in the scheduler as the progress of a program in the
system depends heavily on its treatment by the scheduler. The
model for these routines is a SNOOP program now being developed
that gathers information from release 3 and release 3A
monitors.
Finally, the attendees of the session on the scheduler controls
at DECUS requested that the scheduler be able to provide
guaranteed and consistent service. This implies not only class
Page 5
guarantees, which we have addressed with the class scheduler,
but also assurances that a user or class of users receives no
more than a certain level of service. In fact, the consensus
was that it was better to throw away machine cycles rather than
allow certain programs to run beyond their class guarantee.
While this is contrary to our beliefs and philosophies, the
release 4 scheduler allows individual jobs to be designated as
not receiving more than their class share of the machine. The
specific algorithm used is that any process of a job so
designated will not be run if its class is now ahead of its
target utilization and the process is compute-bound (i.e. on
scheduler queues four or five).
+---------------+
! d i g i t a l ! I N T E R O F F I C E M E M O R A N D U M
+---------------+
TO: Monitor Memos List
Don Allen (BBN)
Dave Braithwaite
Norm Valentine
DATE: Sept. 15, 1978
FROM: Arnold Miller
DEPT: DEC20 S. E. Dept.
LOC: MR1-2 EXT: 6473
PDM: ASM-78-002-03
SUBJ: Scheduler Controls User Interface
Abstract
Two schemes for setting up a class scheduling environment are
presented. The first uses the accounting system of TOPS20 to
define the valid classes for users. The second uses the newly
implemented GIVOK/GETOK mechanism to define the valid classes.
New commands for the n-CONFIG file are presented that define
class percentages and that specify if class scheduling is to be
used. New TOPS20 command language commands are not presented
in this document but are alluded to in several places. A new
TOPS20 monitor call is defined for setting and reading the new
scheduler controls parameters.
Page 2
Glossary
Bias Controls A set of indicators that influence the
scheduler's decisions. In general, the
bias controls make the scheduler favor
either compute bound or interactive
programs. The bias controls apply only
if the class scheduler is not being
used.
Class A newly defined unit of scheduling. A
class consists of an ideal percentage
of the processor and a collection of
users. A class may also be identified
by a set of valid accounts.
Class percentage The amount of the processor that the
members of the class are entitled to.
This is also known as the "target
percentage".
"Dregs" queue As described in reference [2], this is
a special queue whose occupants are
allowed to run only when no normally
scheduled processes are available to
run.
Job Utilization The average amount of the processor
that the job has accrued. This
quantity is computed periodically by
the scheduler and is used to determine
the priority of the job.
Policy program A user mode program responsible for
handling GETOK requests from the
monitor. This program may handle a
wide range of requests and decisions at
the system manager's whim. One set of
requests that it may be directed to
handle is the assignment of classes to
jobs.
Windfall The amount of the processor not claimed
by active classes. Windfall occurs if
one or more classes has no logged in
jobs or if one or more classes cannot
use all of its class percentage.
Windfall may be manifested either as
idle time or as excess class percentage
awarded to the active classes.
Windfall Policy program A program that uses a complex series of
input data to determine which groups
Page 3
are to receive portions of the
available windfall. This program may
be distinct from the Policy program
that handles GETOK requests from
TOPS20.
0.0 References
1. Scheduler Controls for Interactive/Computational Bias
Dan Murphy
April 11, 1978
Document available from author
2. TOPS20 Release 4 Scheduler Changes
Arnold Miller
July 17, 1978
PDM: ASM-78-001-01
1.0 Introduction
The TOPS20 class scheduler allows the system manager to divide
his/her computer into several individually scheduled classes.
Each class is assigned a percentage of time to use the
processor. The scheduling of other system resources (e.g. I/O
channels) are unaffected by the class scheduler. The number of
users in a class is also determined by the manager.
This scheme parallels the "project control" concept that was
introduced with release 3 of TOPS20. The class scheduler
allows the system administrator to allocate portions of the
processor to individual groups. These groups can then be
administered and controlled by individual project managers.
Furthermore it allows the system administrator to set up a
separate class for batch jobs so that they can be controlled
and scheduled independently of time-sharing jobs.
The basic element in the class scheduler is a class. A class
consists of a percentage (or target) and a number of users. A
class's percentage indicates the ideal amount of the processor
that it is to receive. Many factors can affect the monitor's
ability to provide this level of service such as:
1. The class does not make sufficient demand for the
processor. This can be caused by two conditions. One,
the class is under-represented; i.e. its current
population cannot use the amount of the processor
allocated to it. Two, the class's jobs require another
resource (such as disk I/O) that is not available.
This latter case might be solved by partitioning all of
the system's resources (I/O, memory and the processor)
acccording to the class targets. This extension of
partitioning all of the system's resources according to
a common formula is a consideration for a future
release of TOPS20.
Page 4
2. The class's demand exceeds its target and there are
excess shares available on the system. This can arise
in several ways: the administrator has not allocated
all of the processor to existing groups, or one or more
groups is not using all of its shares (see item 1
above). The occurrence of unused portions of the
processor is known as windfall. In order to allocate
windfall in a controlled fashion, a specially written
program must exist to determine the amount of available
windfall, determine each active group's needs (this may
be a function of time) and distribute the available
windfall accordingly. If no such program is present,
the monitor will either distribute windfall
proportionately among the active classes or not at all.
2.0 The User Interface
2.1 New monitor call SKED
The monitor's data base can be read and modified by means of a
new monitor call, SKED. This monitor call allows a program to
set and read all of the items affecting the class scheduler.
Also, it allows a program to set and read the "bias control"
settings that influence scheduling decisions when the class
scheduler is not being used. The calling sequence is:
SKED
AC1/ function
AC2/ address of function-dependent argument block where the
first word is the length of the block.
Returns: +1 always
Illegal Instruction trap on error.
The available functions include both privileged and
non-privileged operations.
Functions (specific offsets and formats will be specified in
the near future):
.SKRBC Read bias control setting. This function
returns a single number indicating the current
bias control setting. AC2 points to a word to
receive the value.
.SKSBC Set bias control setting. This function sets
the bias control to the specified value. This
function is privileged.
Page 5
.SKSCS Set the percentage or target of a class. The
arguments are a floating-point number in the
range 0 to 1.0 and a class number. The sum of
the defined class percentages may not exceed
1.0. This funtion is privileged.
.SKRCS Read class parameters. This function returns
the class percentage and utilization for the
designated class.
.SKSCJ Set the class of a job. This function takes a
pair of numbers, the job to set and the desired
class. If setting the class of the calling
job, this function is not privileged. If
setting the class of another job it is
privileged. In either case, the job must be
allowed to reside in the selected class. The
calling job may be designated as -1.
.SKRJP Read all of the class scheduler parameters for
a given job. This function returns: the class
the job is in, the job's current utilization,
the job's target utilization, the class's
percentage, and the class's utlilization.
.SKRRA Read class run or load averages. The monitor
maintains the load averages for each active
class and will make them available through this
function. The returned data are four words per
class: the class number, the one minute
average, the five minute average, and the
fifteen minute average.
.SKBCR Read the class setting for batch jobs. A -1
indicates that there is no special class for
batch jobs.
.SKBCS Set batch class. This function allows the
specification of the class in which all batch
jobs will run. A -1 indicates no special class
for batch jobs. If this is set, it overrides
the valid classes for any user running a batch
job.
.SKBBG Run all batch jobs on the "dregs" queue. This
function applies only if the class scheduler is
not being used. The argument is either a zero
(clear) or a non-zero (set). A non-zero
indicates that batch jobs should be run on the
"dregs" queue.
.SKDDC Set system class default. The argument is the
class to be used for all users that are not
otherwise assigned classes (either within the
accounting data base or within the the policy
Page 6
program's data base).
.SKICS Start or stop the class scheduler. If the
class scheduler is being started, this function
also specifies the mode in which class-to-user
assignments are made and whether windfall is to
be allocated to the active classes or withheld
from the active classes. See the following
discussion for details.
2.1 Setting a job's class
A job's class is set in one of two ways: by the account to
which the job is charging its use of the processor or by the
GIVOK/GETOK policy program. The selection is made when the
.SKICS of SKED is executed to begin class scheduling. If
accounts are being used, then the accounting data base contains
the default class for each account. When a job selects an
account (either by logging in or by executing a CACCT), the
accounting data base provides the class for the job. There is
only one class for each account in the accounting data base.
Therefore a job's class may not be changed with the .SKSCJ
function of SKED. Only by changing the account will the job's
class be changed.
If the GIVOK/GETOK policy program is being used to select
classes, then each time the .SKSCJ function is executed the
monitor will request that the policy program verify the
specified class. Futhermore, in this mode, the default class
assigned at LOGIN must also be specified by the policy program.
Regardless of the manner in which class assignments are made
and verified, each time a job changes its class a USAGE entry
will be made indicating the end of a session.
The two methods are provided so as to be as flexible as
possible. In an installation using account validation, it will
be most convenient to associate accounts with classes. This
way one, static data base suffices for all of the
administrative requirements. In an installation in which
account validation is not being used or in which the user
population changes, the GIVOK/GETOK mechanism may be called
for.
2.2 Establishing a policy data base
2.2.1 Using Account Validation
2.2.1.1 ACTGEN changes
Since there are two modes of establishing job-to-class
correspondence, there are two methods for establishing a class
Page 7
scheduling data base. If accounts are being used to identify
valid classes, then the accounting data base must contain the
valid class for each account. This is done with a new switch
on the ACCOUNT command. The switch, CLASS, specifies the class
that is valid for a particular account. Therefore:
ACCOUNT 341/CLASS:3
would declare that class 3 is the valid class when using
account 341. ACTGEN will place this information in the
system's accounting data base for use by the class scheduling
routines.
2.2.1.2 n-CONFIG commands
For any account that is not given a valid class, the system
default class (class 0) will serve as the valid class.
Once the accounting data base is properly updated, the class
scheduler is invoked via the n-CONFIG file. Three new commands
are provided:
BATCH-CLASS n
ENABLE CLASS-SCHEDULING (using) ACCOUNTS (windfall to be) m
CREATE (class) n (to have percentage) k
The first command specifies the class to be used for batch
jobs. This may be omitted, in which case batch jobs are
treated identically to time-sharing jobs.
The second command directs the system to do class scheduling.
The argument ACCOUNTS indicates that the accounting data base
contains the class assignments. The last argument m is one of
the following keywords: "WITHHELD" or "ALLOCATED". If
windfall is WITHHELD, no class will be allowed to accumulate
more than its target percentage of the processor. This may
imply that the processor will be idle even though some jobs are
ready to run. The default is ALLOCATED.
The last command is a set of commands that defines the class
percentages. Any class not specified in a CREATE command is
assumed to have a 0% share of the processor. Jobs in such a
class will occasionally be allowed to run, but their relative
priority is low (note that if windfall is WITHHELD, any job in
a 0% class will not be able to run a compute-bound program
since all of the compute-bound use of the processor by a 0%
class is derived from available windfall).
2.2.2 Using a policy program
The alternate way of assigning classes to jobs is with the
GIVOK/GETOK policy program. Using this scheme, the .SKSCJ
Page 8
function of SKED will perform a GETOK function to verify the
class change. Also, the LOGIN JSYS will execute a GETOK to
obtain the default class for the newly logged-in job. As with
the scheme employing accounts, each class change will generate
a USAGE file entry. Given the time constraints of release 4,
it is unlikely that we will produce a GIVOK/GETOK program that
implements a class assignment scheme. However, this paper will
attempt to outline a method of establishing such a data base as
follows:
2.2.2.1 n-CONFIG commands
To begin using class scheduling, the n-CONFIG comand is as
follows:
ENABLE CLASS-SCHEDULING (using) POLICY-PROGRAM (windfall to be)
m
If this method is selected and no policy program is running on
the system, all jobs will be placed in the same class (class 0)
and no .SKSCJ functions will be allowed. See section 2.2.1.21
for a more complete description of the n-CONFIG commands.
3.0 Windfall Allocation
As noted earlier, in the absence of any controls the class
scheduler will distribute the available windfall
proportionately to the active classes. However, it may be
desirable to alter this scheme so that windfall is given to
classes that are "needy" or "favored". A "needy" class is one
that has not used its share of the processor in the recent past
and that requires the additional percentage now. A "favored"
class is one singled out by the system administrator as being
the recepient of a large portion of the available windfall.
Both of these examples have valid application. The example of
the needy class would be a consideration in a system that sells
shares of its machine and wants to guarantee that over one
billing period (say a month) each of its subscribers is
guaranteed its exact percentage of the available computing. A
favored class might be one that consists of people doing
critical work that by its nature takes precedence over other
projects using the computer.
From the above description, the components of a windfall policy
maker are: medium or long-term usage data, favored class
status, and a set of heurisitic functions that prescribe the
desired long-term behavior of the system. Note that the class
scheduler implements relatively short-term policy (several
minutes at most) and is oblivious to the needs and requirements
of long-term business strategy. Therefore an independent
windfall policy program is required that can digest the
components, apply the desired functions, and adjust the initial
class percentages so as to produce a "fair" scheduling policy.
Page 9
At present this is considered beyond the scope of release 4.
However, there is nothing to prevent customers from
implementing their own policy programs and from our learning
from their experiences.
4.0 Additional Features
If the class scheduler is not being used, one may still elect
to limit the scheduling of batch jobs. The n-CONFIG command:
BATCH-BACKGROUND
specifies that all batch jobs will run on the "dregs" queue.
If it is required that batch jobs receive favorable treatment,
the class scheduler must be used.
Also, if the class scheduler is not being used, the bias
control value may be set by the n-CONFIG command:
BIAS-CONTROL n
where n is a decimal number representing the desired setting (
see reference [1] for the values of n).
THOUGHTS ON SWAPPING STRATEGY
Dan Murphy - 20-Oct-78
We traditionally think of forks as being in one of three
major states, 1. blocked, 2. on GOLST but not in balset,
3. on GOLST and in balset. There is another state which
exists but which the scheduler does not explicitly
recognized, that is not in balset but working set in core.
Within that state, there are two sub-states, on GOLST and
BLOCKED.
This state exists in fact for some time after the fork has
nominally been removed from the balset. There is the
complementary state, in balset but working set not yet in
core which may also be of interest.
The routines AJBALS, LOADBS, and REMBSJ/REMBSF control the
movement of forks into and out of the balset. They
implicitly assume that their decisions are implemented
instantly. E.g., AJBALS will remove one fork and load
another into the space "released" by the first in one pass.
The movement from the state of BSBAR+WS (BS = fork in
balset, BSBAR = fork not in balset, WS = fork working set in
core, WSBAR = fork working set not in core) to BSBAR+WSBAR
is implicitly controlled by GCCOR. There is increasing
evidence that the decisions relative to this transition have
a major impact on system performance (response and
throughput). What happens here determines the probability
that a fork coming into the balset will find some or all of
its working set pages in core already.
We know that this probability is already quite high from
available data. For example, FRIDAY output for a typical
1-minute period on 2102 indicates that the average fork
working set size was 54 pages and that there were 900
wakeups during the interval. Hence, if every wakeup
required that the full working set be swapped in, there
would have been 48,600 disk reads during the period, or
about 810 pages/second. The actual reported disk read rate
was 68 pages/second.
Over the years, various changes to GCCOR have been made
which improved performance. It appears that these all had
the effect of increasing this probability.
1. Originally, GCCOR scanned all of the CST and threw out
everything not needed. It was changed (circa 1972) to stop
after collecting some reasonable number of pages.
2. Early in the working set swapping project, I implemented
a queue of fork which had left the balset. When swapping
out working sets, GCCOR would use this queue on a FIFO
basis.
Page 2
3. Recently, I tried doing GCCOR on the basis of a global
value in the age field of CST0. This produced a significant
improvement in some cases relative to the working set
swapout list. However, it tended to remove pages from forks
still in the balset which hadn't been run for a while thus
making it even more difficult for them to run.
I am now looking for a procedure which will do a better job
of picking the right fork to swapout (and swapin) while at
the same time generally reflecting scheduler policy
decisions and neither thrashing nor wasting memory, etc. It
occurred to me at a seminar on paging techniques last week
that one could use the same kinds of algorithms to select a
whole working set for removal as have been traditionally
used for page replacement, e.g. LRU, FIFO, etc. That is,
for any fork with a working set now in core, there is some
probability that that fork will want to run in the next
interval T, and if we have to swap out a fork, we should
pick the one with the lowest probability.
Assuming we have a reasonable probability function, the next
question is how to make use of it in relation to AJBALS
decisions. In particular, what happens under heavy load
when there are always more forks wanting to run than can fit
in core. At present under these conditions, GCCOR is busy
throwing out everything possible and the probability of
finding a page still in core is non-0 only because of the
finite time it takes the disk to write the changed pages and
the time for clean pages to percolate up the replacable
queue.
Here is the basic operation of this approach:
1. A swapin/swapout manager (which I will call GCCOR since
it now does the swapout part of that function) would
periodically perform the following algorithm
1. IF there are no forks waiting for swapin THEN done,
ELSE
2. IF there is sufficient available memory for the
best fork (i.e. fork size .le. NRPLQ+IOIP) THEN
swapin the fork and go back to step 1, ELSE
3. compare the probability value of the working sets
in memory to that of the working sets waiting to be
swapped in.
4. IF the minimum P in memory is less than the maximum
P not in memory THEN swapout the working set with the
minimum P and go back to step 1, ELSE done.
2. The routine AJBALS works generally as it does now,
except that it does not actually cause swaps into or out of
memory, it merely determines at any instant what the balance
Page 3
set should be. The probability value for each fork working
set is then maintained by integrating BS over time (BS = 1
if fork in balset, 0 if fork not in balset). Keep in mind
that being in the balset or not does not instantly determine
whether the fork's working set is in memory or not, nor
whether the fork can be selected for running. A fork may be
selected for running IFF its working set is in memory and if
it is runnable (on the GOLST). Unlike at present, a fork
may be run even if it is not in the balset! For example, a
fork which blocks for a short time and then wakes up before
its working set has been swapped out may be run immediately
even if AJBALS does not decide to put it into the balset.
If, because of scheduling policy, it remains out of the
balset, its probability value will decrease to the point
where GCCOR will swap it out and it will be unable to run.
It seems to me that this approach has the following
characteristics:
1. It eliminates the need for the balset hold quantum which
we now use to prevent thrashing when forks block and unblock
frequently. With the new approach, a high priority fork
which unblocked would not immediately force out a lower
priority fork, since the former fork would have initially a
0 or small probability value while the latter one would have
a high value. AJBALS would immediately declare the high
priority fork to be in the balset and the lower priority
fork out, but the time constants of the probability
calculation would determine how soon GCCOR would actually
swap.
2. Poorly behaved forks which wake up frequently appear to
cause considerable swap activity and system degradation at
present. Under this new approach, these forks would have a
high probability value and so would not be swapped out as
frequently. In particular, we can distinguish between a
fork which has run 1 second out of 5 for the last minute
from one which has run 1 second out of 30 for the last
minute, even though both just blocked. We will then swap
out the 1/30 fork based on the probability that it will not
need to run again as soon as the 1/5 fork.
3. This algorithm would appear to be of particular value
with working set swapping, since with WSS a decision to swap
in or out a fork may be much more costly in terms of swap
channel capacity than under straight demand paging.
I am interested in comments on the above, including
pathological cases, other situations where it might win or
lose, any statistical analysis, etc. One way or the other,
I will probably try it experimentally in the next few weeks.
New Swapping Logic Implementation (Preliminary)
D. Murphy 8-Dec-78
The new swapping logic (along the lines of my 20-Oct memo)
has been merged into <4.MONITOR>. Some of the traditional
things in the scheduler data base are now different. Most
importantly:
1. There are no balance set tables (BALSET, NBW, etc.). Data
formerly contained therein now lives in fork tables or
does not exist.
2. The LH of FKPT does not mean what it used to, i.e.
whether the fork is on the GOLST or WTLST. That
is determined as follows.
3. There is a new fork table called FKSWP. It contains
various bits and fields relating to fork state,
including:
B0 (FKWSL) - "working set loaded", on if fork's PSB, etc.
are locked in memory; off otherwise. Much like
what used to be "in the balance set".
B1 (FKBLK) - "fork blocked", on if fork is blocked and on
the waitlist; off if fork is on GOLST.
B2 (FKIBS) - "in balance set", as indicated, keeping in mind
however, that in balance set does not necessarily mean
in memory.
B3 (BSWTB) - "balset wait", on if fork is in short-term wait,
e.g. on page fault. Wait test is in FKPGST as before.
B4 (BSNSK) - "no sked", on if fork is nosked but only when
fork is not actually running.
B5 (BSCRSK) - "critical section", on if fork is in critical
section and fork is not actually running.
B6 (FKIBSH) - "IBS hold", on if fork entered balset since
last update to history
B18-35 (FKHST) - "history", value which is a function of
fork's recent history, in particular whether fork
has been in balset or not.
To a first approximation, determining the state of a fork from a
crash dump or with EDDT goes as follows:
1. Look at FKPT to see if fork exists; B0 on means that it
does not.
2. Look at FKSWP to see if fork is blocked or runnable (B1);
3. If blocked, check wait test (FKSTAT);
4. If runnable, see if in memory (B0 of FKSWP);
5. If in memory, see if waiting (B3 of FKSWP);
6. If waiting, check wait test (FKPGST).