+---------------+
! d i g i t a l ! I N T E R O F F I C E M E M O R A N D U M
+---------------+
TO: TOPS20 List A
DATE: June 19, 1978
FROM: Arnold Miller
DEPT: DEC20 S. E. Dept.
LOC: MR1-2 EXT: 6473
SUBJ: Release 3A MOS memory project
1.0 Overview
The project known as MOS memory scheduled for release 3A of
TOPS20 encompasses a range of memory-related changes to the
monitor. The catalyst for scheduling the project was the need
to support MF20 MOS-type memories on the DECsystem20. The
required work to achieve this support in software was minimal.
In fact, the TOPS20 monitor will run successfully on a machine
with MF20 memories even without the stipulated changes.
However, the engineering goals designed into the MF20 project
emphasized reliability and availability. In order to achieve
these goals, some amount of monitor support was required.
The MF20 memories are designed to detect and correct single-bit
errors and to detect double-bit errors and treat them as memory
parity errors. The correction is performed by an ECC
correction algorithm in the hardware that computes a seven-bit
ECC whenever data is stored into a word of memory and that
verifies the correctness of the retrieved data on a read. In
order to predict and forestall a serious memory failure, it is
important that any errors be reported and analyzed. Since the
hardware will correct single-bit errors, it is not necessary to
take any corrective action when single-bit errors occur.
However, should a more serious error occur, it is important to
know where all faulty areas in the memory are so that the best
possible correction can be applied.
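The detect-and-correct behavior described above can be illustrated with a
textbook extended Hamming (SEC-DED) code. This is only a sketch: the memo
does not give the MF20's actual ECC layout, so the bit positions and the
names encode and decode below are assumptions chosen for illustration. A
36-bit word needs six Hamming check bits plus one overall parity bit,
which happens to match the seven-bit ECC size mentioned above.

```python
# Textbook extended Hamming (SEC-DED) code for a 36-bit word.
# Assumption: this is NOT the MF20's real ECC layout, only a code of the same size.
CHECK_POSITIONS = [1, 2, 4, 8, 16, 32]   # six Hamming check bits
DATA_POSITIONS = [p for p in range(1, 43) if p not in CHECK_POSITIONS]  # 36 data bits
# Position 0 holds the seventh check bit: overall parity of the codeword.


def encode(word):
    """Return a 43-element bit list for a 36-bit data word."""
    bits = [0] * 43
    for i, p in enumerate(DATA_POSITIONS):
        bits[p] = (word >> i) & 1
    for c in CHECK_POSITIONS:
        # Each check bit covers every position whose index has bit c set.
        bits[c] = sum(bits[p] for p in range(1, 43) if p & c and p != c) % 2
    bits[0] = sum(bits[1:]) % 2
    return bits


def decode(bits):
    """Return (status, word); word is None for an uncorrectable error."""
    bits = list(bits)
    syndrome = 0
    for p in range(1, 43):
        if bits[p]:
            syndrome ^= p
    odd_parity = sum(bits) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity and syndrome <= 42:
        bits[syndrome] ^= 1   # syndrome 0 means the overall parity bit itself
        status = "corrected"
    else:
        return "double", None  # even parity with a nonzero syndrome
    word = sum(bits[p] << i for i, p in enumerate(DATA_POSITIONS))
    return status, word
```

Flipping any one bit of an encoded word yields "corrected" with the
original data; flipping any two yields "double", which corresponds to the
case the MF20 reports as a memory parity error.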
Therefore the principal requirement of the monitor was to
report the relevant information whenever an MF20 encountered
either a single-bit correctable error or a double-bit error.
This information will be gathered and delivered to a user-mode
diagnostic, TGHA, that will use the raw error data to compile a
history file of memory failures. This history file will be
used for several purposes: as an ongoing compilation of
failures that field-service can use to assess the performance
of the hardware and as a set of statistical data that TGHA can
use to determine when software intervention is justified to
reduce the memory error rate. The former use is a traditional
function of error data and parallels the use of the information
put into the SYSERR file by the monitor. The latter use of
this file is peculiar to the recovery procedures available with
the MF20 and is the central mechanism in the planned
high availability of the MF20.
1.1 TGHA error recovery
Each 64K section of MF20 memory has a "spare" MOS RAM. In
fact, the spare RAM is actually a series of RAMs that can
substitute for any other series of RAMs in the same 64K
section. The intention is that when a given type of failure
gives errors at or above a predetermined threshold, the
software will instruct the MF20 to replace the defective RAMs
with the spare RAMs. Since there is not sufficient spare
storage to replace all of the RAMs in a 64K section of memory,
TGHA must decide how best to apply the spare RAMs to achieve
the maximum effect. The failures that a memory is experiencing
are reported to TGHA by the monitor and saved in the memory
history file by TGHA. Therefore, at any time TGHA can assess
the most serious failure in the memory and can apply the spare
bit to correct that error. The goal is to keep as much memory
as possible on-line and functioning with the minimum possible
error rate. Therefore, if a given MF20 experiences no
double-bit errors but a pattern of single-bit errors that
indicates that a particular RAM is malfunctioning, TGHA could
replace the malfunctioning RAM with the spare RAM (note that
this type of correction is largely perfunctory and serves only
to minimize the amount of error reporting the monitor is
obligated to perform). If subsequently another RAM or pair of
RAMs begins producing double-bit errors, TGHA would be required
to use the spare bit RAM to eliminate this more serious
failure.
If TGHA determines that it is not possible to use the spare bit
to eliminate all serious errors it may resort to disabling
pages of memory that exhibit high error rates. Due to the
physical layout of the MF20 boards, it is possible to encounter
MOS failures that result in one of the following situations:
1. Row error- One bit out of every fourth word
throughout one page will be bad.
2. Column error- One bit per page throughout 64 K of
memory will be bad. This type of error is the one most
likely to be corrected by use of the spare bit RAM.
3. Bit error- A single bit in a RAM is bad. This will
result in a single error in a single word of memory.
Multiple bit errors per controller will result in a
scattering of bad pages throughout a 64K range of
memory.
4. Matrix error- An entire RAM goes bad, producing a
single bad bit in every fourth word in a 16K block of
memory.
Therefore, it is fairly evident that if the spare bit RAM is in
use to eliminate a row or matrix error, and one or more column
or bit errors is detected, the only recourse open is to disable
the individual pages that contain the errors.
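The size of each failure pattern can be worked out from the descriptions
above. The following is a rough sketch, assuming the standard 512-word
TOPS20 page, K = 1024 words, a page-aligned 16K block, and that "one bit
per page" means one bad word in each page. The function name footprint is
hypothetical.

```python
PAGE_WORDS = 512   # words per TOPS20 page
K = 1024


def footprint(kind):
    """Return (bad_words, pages_touched) for a single failure of the given kind."""
    if kind == "row":       # one bit in every fourth word of one page
        return PAGE_WORDS // 4, 1
    if kind == "column":    # one bad word in each page of a 64K section
        pages = 64 * K // PAGE_WORDS
        return pages, pages
    if kind == "bit":       # a single bad word
        return 1, 1
    if kind == "matrix":    # every fourth word in a 16K block
        return 16 * K // 4, 16 * K // PAGE_WORDS
    raise ValueError(kind)
```

Under these assumptions a column error touches all 128 pages of the 64K
section, while a matrix error touches 32 pages and a row or bit error
touches one. This is consistent with the remark that a column error is
the one most likely to be handled with the spare bit RAM.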
1.2 TOPS20 monitor support of TGHA
In order to allow TGHA to achieve the highest possible error
detection and correction, some specific support was added to
TOPS20. In summary, the changes are:
1. The memory error handling in APRSRV was enhanced to
gather information about MOS-specific errors and to
save this information for TGHA. The monitor also
corrects MOS single-bit errors by writing the corrected
data back into memory, but is not required to either
compute the ECC or correct the retrieved data (the MF20
itself has the complete ECC computation and correction
algorithm built into its logic). The error information
gathered by the monitor is saved on a special memory
queue until TGHA requests it.
2. The kernel routines in PAGEM used by PMCTL to place
pages on-line and off-line were enhanced to include a
much larger collection of memory than previously
available. The code in PAGEM now has the ability to
place most of the pages in physical memory off-line.
Furthermore, it is possible to predict the pages that
cannot be disabled. Also, PAGEM and SCHED were
modified to allow previously nonexistent memory to be
placed on-line and correctly accounted for. In
particular, the appearance of previously nonexistent
memory will allow process working sets to increase and
will affect the scheduling and partitioning of the
balance set.
3. The PLOCK JSYS was implemented to permit a
user-mode program to gain exclusive control of one or
more pages of physical memory and to load a portion of
its address space in that memory. While this was
specifically intended to facilitate the analysis and
correction of MOS-related errors, the feature is
available for core memory pages as well.
4. Several new functions were added to the DIAG JSYS
to enable TGHA to perform its spare-bit substitution.
These functions are provided to ensure that the desired
spare bit RAM can be substituted in the desired portion
of memory. It is likely that TGHA can perform the
substitution without the use of these functions, but
only through these functions is the operation always
guaranteed to succeed.
5. A new SYSERR entry was defined and implemented for
memory parity errors. This new entry type permits more
information to be gathered and reported and permits
easy extension for future needs.
6. All DTE protocols were modified to allow TOPS20 to
inhibit each of the front-ends from examining or
depositing memory. This change is necessary to ensure
that TGHA can complete its spare-bit substitution
without any portion of memory being referenced.
7. The monitor runs TGHA at system start-up and
whenever a MOS memory error is detected.
1.3 Supported configurations
It is possible to mix core and MOS memory on the same
DECsystem20. However, it is not possible to mix core and MOS
on the same memory bus. A single SBUS will support a minimum
of 64K of MB20 memory and a maximum of 256K of MB20 memory. A
single FBUS (the extended form of the SBUS required to support
MF20 MOS memory) can support a minimum of 256K of MOS memory
and a maximum of 1024K of MOS memory. It is possible to have a
MOS-only configuration with two FBUSes and up to 1024K of MOS
memory on each FBUS. However, this maximum configuration
requires that another bay be added to the machine to hold the
memory logic. (Note that configuration data is subject to
engineering changes. The data given here is based on the most
recent expectations).
2.0 Design of new monitor features
This section describes the monitor changes done for release 3A.
Where applicable, some discussion is included to indicate
features specific to the support of MF20 memories.
2.1 New SYSERR entry for memory parity errors
The MF20 memory controllers respond to a significantly larger
number of SBUS diag requests than do MA20 or MB20 controllers.
An SBUS diag is used to communicate information to and from the
memory controllers and is the tool used to configure memory.
The monitor executes SBUS diag functions whenever a memory
parity or SBUS error occurs so it may record the state of
memory in the SYSERR file. In order to accommodate the new
controller numbers, and in order to be able to record all of
the important controller information, the format of the "memory
parity error" SYSERR entry has been changed. The new block
type is 162 and is structured as follows:
!---------------------------------------!
! !
! APR status word !
!---------------------------------------!
! !
! error register !
!---------------------------------------!
! !
! PC flags !
!---------------------------------------!
! !
! PC !
!---------------------------------------!
! !
! count of bad words found !
!---------------------------------------!
! !
! logical AND of bad addresses !
!---------------------------------------!
! !
! logical OR of bad addresses !
!---------------------------------------!
! !
! logical AND of bad data words !
!---------------------------------------!
! !
! logical OR of bad data words !
!---------------------------------------!
! -# of controllers,,offset to !
! SBUS data area !
!---------------------------------------!
! !
! first ten bad addresses !
!---------------------------------------!
! !
! first ten bad data words !
!---------------------------------------!
! memory references for first !
! ten bad words !
!---------------------------------------!
and the format of the SBUS DIAG area, located at the offset
pointed to by the tenth word is:
!---------------------------------------!
! !
! -# of SBUS diags,,controller !
!---------------------------------------!
! SBUS diags for controller !
.
.
.
!---------------------------------------!
! -# of SBUS diags,,controller !
!---------------------------------------!
! SBUS diags for controller !
.
.
.
!---------------------------------------!
This new format allows for any number of controllers and any
number of SBUS diags for each controller. TOPS20 will no
longer produce the old format of SYSERR entry (entry type 161)
since this new type completely supersedes it.
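A reader of the new entry could walk the SBUS DIAG area roughly as
follows. This is a sketch under stated assumptions, not the SYSERR
program's actual code: the entry is taken to be a list of 36-bit integers,
the offset in the tenth word is taken to be relative to the start of the
entry, and each SBUS diag is taken to occupy one word. The helper names
(xwd, halves, neg_count, parse_sbus_area) are hypothetical.

```python
MASK18 = 0o777777


def xwd(left, right):
    """Build a 36-bit word from two 18-bit halves (left,,right)."""
    return ((left & MASK18) << 18) | (right & MASK18)


def halves(word):
    """Split a 36-bit word into its (left, right) 18-bit halves."""
    return (word >> 18) & MASK18, word & MASK18


def neg_count(left):
    """Turn an 18-bit two's-complement negative count into a positive int."""
    return (-left) & MASK18


def parse_sbus_area(entry):
    """Return {controller: [diag words]} for a block type 162 entry."""
    left, offset = halves(entry[9])          # tenth word of the entry
    result = {}
    pos = offset                             # assumed relative to entry start
    for _ in range(neg_count(left)):
        left, controller = halves(entry[pos])
        n = neg_count(left)
        result[controller] = entry[pos + 1:pos + 1 + n]  # assumed 1 word per diag
        pos += 1 + n
    return result
```

Because every controller block carries its own negative count, the same
loop handles any number of controllers and any number of diags per
controller, which is the extensibility the new format is meant to give.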
2.2 Recording MOS-related errors
Previously, the monitor did not distinguish an SBUS error from
an MB parity error and treated both of these as parity
errors. The MF20 controllers report single-bit correctable
errors as SBUS errors and, therefore, it is important that the
monitor distinguish an SBUS error from a hard
memory error.
An SBUS error now causes the monitor to poll all the memory
controllers (both SBUS and FBUS controllers) looking for the
occurrence of MOS correctable or double-bit errors. For each
controller that responds with one or both of these error
conditions, the monitor will acquire a block of resident free
space and store the relevant SBUS diag information in this
block. The free block is then placed on a special memory queue
until TGHA requests the information (See PMCTL changes in
section 2.5.3).
In order to prevent an onslaught of errors from the same memory
fault, the monitor directs the offending controller to suppress
the reporting (but not the correction) of single-bit errors
occurring in the section of memory in which the last error was
detected. The controller remains this way until TGHA can be
run to analyze the fault. TGHA will reenable error reporting
once it has completed its analysis and applied whatever
correction is indicated. The hope is that TGHA can determine
the full extent of the failure from the information recorded by
the monitor. However, if further research is warranted, TGHA
could use the PLOCK JSYS to acquire the offending memory pages
and complete the analysis itself.
If the monitor detects a hard error in any of the controllers,
it will assume that a parity error has occurred and force the
parity error recovery scan. Whether or not an SBUS error
occurs, an MB parity error is treated as it always has been.
2.3 Placing pages off-line
The kernel routines in PAGEM that maintain the states of
physical memory pages have been enhanced to allow a privileged
user program to acquire or disable most pages of physical
memory. It is theoretically possible to provide the tools for
manipulating any arbitrary page of memory, but the cost to
include those pages now excluded is disproportionately high. A
reasonably accurate list of the pages not available is: the
EPT, the monitor's UPT, any page containing a CST0 entry, any
page containing an SPT entry, any page containing the DTE
protocol communications region, the page containing MMAP and
any page belonging to the resident free space pool. In certain
specialized monitors, e.g. TOPS20AN, there are additional pages
that cannot be controlled. A reasonable estimate of the size
of the areas listed is: CST0- one word for every page of
memory supported, 2 to 4 pages; SPT, 4 pages; MMAP, one page;
resident free pool, at least 2 pages; DTE communications
region- 1 page.
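The CST0 estimate follows from the page size. As a worked sketch,
assuming 512-word pages, one page each for the EPT and the monitor's UPT,
and the minimum figures quoted above (the function names are
hypothetical):

```python
PAGE_WORDS = 512   # words per TOPS20 page


def cst0_pages(memory_k_words):
    """Pages of CST0 needed at one word per physical page of memory."""
    pages_of_memory = memory_k_words * 1024 // PAGE_WORDS
    return -(-pages_of_memory // PAGE_WORDS)   # ceiling division


def min_reserved_pages(memory_k_words):
    """Lower bound on pages that cannot be placed off-line."""
    ept, upt, spt, mmap, free_pool, dte = 1, 1, 4, 1, 2, 1
    return ept + upt + cst0_pages(memory_k_words) + spt + mmap + free_pool + dte
```

With 512K of memory this gives 2 pages of CST0 and at least 12 reserved
pages; with 1024K it gives 4 pages of CST0 and at least 14 reserved
pages, matching the "2 to 4 pages" range given above.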
It is possible that a page not within one of the stated regions
will be unavailable when requested. While this should be rare,
the manner in which pages are allocated and used within TOPS20
can create situations that would make certain pages
unavailable. It is extremely likely that failure to acquire a
page on request is due either to the page being one of the ones
excluded by the list above, or its being off-line or already
assigned by some other privileged program.
Since these kernel routines are used by the PMCTL and the PLOCK
JSYSes, the restrictions described here apply to these JSYSes
as well.
2.4 PLOCK JSYS (Previously LOCK JSYS)
This is a newly-implemented JSYS that allows a privileged
program to acquire physical memory and place a designated
section of its address space in the memory. The JSYS allows
the program to specify the memory pages to be used, or it will
select the pages under a "free choice" directive. The
specification is:
PLOCK
AC1/ starting page address if locking, -1 if unlocking
AC2/ fork handle,,starting page number
Presently, only .FHSLF is accepted
AC3/ Flags,,count
Where FLAGS are:
LK%PHY If locking, the value in AC1 is the first physical page
desired. If this bit is off and AC1 is not -1, then it
is "free choice".
LK%CNT The right-half of AC3 is a repeat count
LK%AOL Allow locking into off-line pages
LK%NCH Don't cache the pages
Requires WHEEL, OPERATOR, or MAINTENANCE capability.
Returns: +1 Always
Illegal instruction trap on error.
The following operational points should be noted:
1. If PLOCK is unable to honor any one of the pages
within the repeat count, it will unlock the entire
region specified by the arguments.
2. It is permissible to "unmap" a page that was locked
with PLOCK. This will unlock the process' page and
return the now unlocked physical page to its previous
state.
3. PLOCK uses the kernel routines in PAGEM to acquire
the designated physical page. The page selected by the
user must be capable of being placed off-line in order
for PLOCK to acquire it.
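The argument words for a lock request could be assembled as in the
following sketch. The octal value used for .FHSLF is the conventional
one, but the bit positions given for the LK% flags are placeholders
assumed for illustration only; this memo does not state them, and the
real values should be taken from the monitor symbol definitions. The
helper names are hypothetical.

```python
MASK18 = 0o777777
FHSLF = 0o400000        # .FHSLF, the fork handle for the calling process

# Placeholder left-half flag bits: ASSUMED values, not taken from this memo.
LK_CNT = 0o400000       # right half of AC3 is a repeat count
LK_PHY = 0o200000       # AC1 holds the first physical page desired
LK_NCH = 0o100000       # don't cache the pages
LK_AOL = 0o040000       # allow locking into off-line pages


def xwd(left, right):
    """Build a 36-bit word from two 18-bit halves (left,,right)."""
    return ((left & MASK18) << 18) | (right & MASK18)


def plock_lock_args(first_physical_page, first_process_page, count):
    """Return (AC1, AC2, AC3) for locking `count` specific physical pages."""
    ac1 = first_physical_page
    ac2 = xwd(FHSLF, first_process_page)
    ac3 = xwd(LK_PHY | LK_CNT, count)
    return ac1, ac2, ac3
```

For example, plock_lock_args(0o1000, 0o20, 4) describes a request to lock
physical pages 1000 through 1003 (octal) into process pages 20 through 23
(octal). Per operational point 1, a failure on any one of the four pages
leaves none of them locked.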
2.5 DIAG JSYS changes
In order to ensure that TGHA can use the spare bit in the MF20
memories, the DIAG JSYS has acquired two new "memory control"
functions. Before describing the JSYS change, a brief
discussion of the problem being solved is called for.
2.5.1 TGHA and the spare bit (a fable)
When TGHA decides to use the spare bit in one of the MF20
controllers, it must ensure that the pages affected by the
change do not cause parity errors. If one or more of the
affected pages is in use by the monitor (i.e. neither assigned
nor on the replaceable queue), then the bit to be replaced has
a value that is significant. Besides being part of a
thirty-six bit data word that some program has generated, it
was also used as a component in the computation of the word's
parity and ECC. Therefore, if TGHA merely exchanges the spare
bit for one of the active bits in the word, there is a 50%
chance that the new bit disagrees with the one replaced. If
all other bits in the word are good, then the worst that will
happen is that a single-bit correctable error has been
introduced into the word and the hardware will correct the
error the next time the word is referenced (very clever, you
say?). However, if the word has one other faulty bit, then the
spare bit may have introduced an uncorrectable double-bit
error. It is likely that this latter case will occur since a
very important use of the spare bit will be to eliminate
chronic double-bit errors by replacing one of the faulty bits.
All of this theory having been digested, the obvious solution
is to disable all of the pages that are to be affected by the
spare bit substitution, perform the substitution, and then
place the pages on-line again. If this can be achieved, then
the revitalized pages will not contain useful data and cannot
cause parity errors. However, as we have observed, TOPS20
cannot guarantee that any arbitrary page can be placed
off-line. Furthermore, it is unacceptable for these few pages
to suddenly develop uncorrectable errors.
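The risk described in this fable can be enumerated directly. The sketch
below assumes that the spare RAM's initial content is unrelated to the
bit it replaces and that the stored ECC is left unchanged by the swap.
The function names are hypothetical.

```python
def errors_after_swap(stored_bit, spare_bit, other_faulty_bits):
    """Count wrong bits in a word once the spare RAM replaces one bit position."""
    introduced = 1 if spare_bit != stored_bit else 0
    return introduced + other_faulty_bits


def classify(wrong_bits):
    """Map an error count onto what the MF20 hardware can do about it."""
    if wrong_bits == 0:
        return "clean"
    if wrong_bits == 1:
        return "correctable"
    return "uncorrectable"


def outcomes(other_faulty_bits):
    """Classify all four (stored bit, spare bit) combinations."""
    return [
        classify(errors_after_swap(stored, spare, other_faulty_bits))
        for stored in (0, 1)
        for spare in (0, 1)
    ]
```

With no other faulty bit, half of the combinations are clean and half are
correctable. With one other faulty bit, half are correctable and half are
uncorrectable, which is the case the pages in use by the monitor cannot
be allowed to hit.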
Several alternate techniques exist for performing the spare bit
substitution error-free. However, they all require that TGHA
be able to reference the memory in question, save the current
data, and replace it after the spare bit has been substituted.
While the techniques vary in complexity and direction, they all
require that TGHA can gain control of the memory to be
affected, read the data from the memory, perform some number of
SBUS diag functions, and restore the memory contents, all without any
other program or hardware device modifying the memory. It is
this capability that is provided by the new DIAG "memory
control" functions.
2.5.2 New DIAG functions
The two new DIAG functions are designed to provide the exact
functionality required by TGHA and are not provided as general
"memory control" functions.
2.5.2.1 Get memory
AC1/ -count,,address
At address:
word 0 .DGGEM
word 1 First page in user address space
word 2 First physical memory page number
word 3 number of pages
word 4 user address of AR/ARX parity trap routine
Returns: +1 Always. Note conditions below
Upon a successful return from this function, the following
conditions hold:
1. TOPS20 has requested that all of the front ends
refrain from accessing common memory. This request is
made through a newly devised DTE protocol state that
allows primary protocol to remain in effect but that
"freezes" all DTE interactions.
2. The hardware PI system has been turned off. No
scheduling can occur.
3. The time base and the interval timer have been
turned off. This is to ensure that the meter board
does not modify the EPT.
4. All DTE byte transfers have completed. This is to
ensure that the DTE/microcode does not change memory or
update any values in the EPT.
5. All RH20 activity has ceased. This is to ensure
that no memory is modified and the channels do not
store into their logout areas in the EPT.
6. The designated pages of the process's address space
have been set up to address the designated physical
memory. Note that this is not the same as having
requested the pages with PLOCK. By using this
function, the data in the physical memory pages have
been retained and the ownership of the pages remains
unchanged. This is theoretically illegal within the
rules of TOPS20 paging, but the fact that TGHA is now
running with complete control of the computer obviates
the exception.
7. The CST0 entries for each of the designated
physical pages have been saved and set as follows:
a. The age is set to the present age of the
requesting process
b. The "process use" field is set to all ones.
c. The "modified" bit is set to "one".
The effect of this is that should the microcode attempt
to store either the page's age or the current "use"
bit, the data in the CST0 entry will be unchanged.
8. The entire address space of the requesting process
has been locked in memory. Actually, only pages that
exist at the time of the JSYS call are locked.
Therefore, the process must ensure that all of the
pages it needs exist and are private when the JSYS is
executed.
9. If TGHA specified an AR/ARX trap address, the
monitor has set up the proper dispatch. In the event
of an AR or ARX parity error trap during TGHA's
execution, the monitor will immediately trap to the
user address given in the DIAG call.
This function is available to TGHA for performing its spare bit
substitution. As noted earlier, it is likely that TGHA can
place the affected physical pages off-line and therefore perform
its spare bit substitution without the aid of the DIAG JSYS.
2.5.2.2 Release memory
This is the only JSYS that should be executed following the
successful execution of a .DGGEM DIAG JSYS. It is intended to
undo the restrictions imposed by its companion function and
allow timesharing to proceed. The calling sequence is:
AC1/ -count,,address
at address:
word 0 .DGREM
no other arguments are required. The monitor has "remembered"
the values that need to be restored.
The amount of time that the system is unavailable following a
.DGGEM function and up to a .DGREM function is beyond the
monitor's control. TGHA is in complete control of the system
and will direct the monitor when it may proceed with normal
operation.
2.5.3 PMCTL changes
2.5.3.1 .MCRME function
The monitor collects information about MOS memory errors. This
information is held in resident free space buffers until TGHA
is scheduled to read it. TGHA will use a new function of the
PMCTL JSYS, .MCRME, that will provide the memory information.
The format of the new function call is:
AC1/ .MCRME
AC2/ length of argument block
AC3/ address of block
The monitor will store the error information into the block
addressed by AC3 and update AC2 on return. If not all of the
outstanding error information will fit, the monitor will return
as much as it can. It will be TGHA's responsibility to reissue
the JSYS until the monitor indicates that no more error
information is available. It is important to note that the
error information is retained by the monitor only until TGHA
reads it after which the monitor will discard the information.
Returns: +1 always
error:
PMCLX4 No more error information
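The read-until-empty protocol can be sketched with a stand-in for the
monitor. Everything here is a model for illustration: the FakeMonitor
class, the NoMoreErrorInfo exception (standing in for the PMCLX4 error
return), and the drain function are hypothetical names, and the model
assumes the monitor returns only whole entries.

```python
class NoMoreErrorInfo(Exception):
    """Stands in for the PMCLX4 error return."""


class FakeMonitor:
    """Toy model of the monitor's queue of MOS error entries."""

    def __init__(self, entries):
        self._queue = [list(e) for e in entries]   # each entry is a list of words

    def mcrme(self, block_len):
        """Return as many whole entries as fit in block_len words."""
        if not self._queue:
            raise NoMoreErrorInfo
        if len(self._queue[0]) > block_len:
            raise ValueError("argument block too small for one entry")
        out = []
        while self._queue and len(out) + len(self._queue[0]) <= block_len:
            out.extend(self._queue.pop(0))   # the monitor discards what was read
        return out


def drain(monitor, block_len):
    """Reissue the call until the monitor reports no more information."""
    words = []
    while True:
        try:
            words.extend(monitor.mcrme(block_len))
        except NoMoreErrorInfo:
            return words
```

Since the monitor discards each entry once it has been read, the caller
has to record everything it receives before issuing the next call.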
The format of the returned data is:
!---------------------------------------!
.PMMTP ! type (9) ! count (9)! controller # !
!---------------------------------------!
.PMMRG ! error register at error !
!---------------------------------------!
The remainder of the block is dependent on the type of entry.
For type .PMMER, MOS memory error, the format is:
!---------------------------------------!
.PMMSY ! syndrome of the error !
!---------------------------------------!
.PMMBN ! block number of the error !
!---------------------------------------!
.PMMSB ! spare bit number !
!---------------------------------------!
.PMMEA ! error address !
!---------------------------------------!
.PMMSN ! 4 words containing the four !
! 32-bit serial numbers from !
! the PROMs. !
!---------------------------------------!
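A reader of the returned data could unpack a MOS memory error entry
roughly as follows. This sketch assumes the header fields are packed left
to right (type in the high nine bits, count in the next nine, controller
number in the right half) and that the symbolic offsets are consecutive
words in the order drawn above. Neither assumption is stated in this
memo, and the function names are hypothetical.

```python
def decode_header(word):
    """Split the .PMMTP word into its three fields."""
    return {
        "type": (word >> 27) & 0o777,       # high 9 bits
        "count": (word >> 18) & 0o777,      # next 9 bits
        "controller": word & 0o777777,      # right half
    }


def parse_mos_entry(words):
    """Unpack one .PMMER entry from a list of ten 36-bit words."""
    entry = decode_header(words[0])
    entry["error_register"] = words[1]          # .PMMRG
    entry["syndrome"] = words[2]                # .PMMSY
    entry["block"] = words[3]                   # .PMMBN
    entry["spare_bit"] = words[4]               # .PMMSB
    entry["error_address"] = words[5]           # .PMMEA
    entry["serial_numbers"] = list(words[6:10]) # .PMMSN, four words
    return entry
```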
2.5.3.2 Extension to .MCRPS function
In order to ease TGHA's job, PMCTL has been extended to provide
the status of many contiguous memory pages with one monitor
call. The new format of the .MCRPS function is:
.MCPPN -count,,first page
or
0,,first page
.MCPST status of "first page"
.
.
.
If the contents of the .MCPPN offset is negative, then the
left-half is assumed to be the negative count of the number of
pages for which status is required. In order for the JSYS to
succeed, the count given in AC2 must be greater than or equal
to the number of pages for which status is required plus
one.
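The length rule can be checked with a small sketch that builds the
argument block for a multi-page request. It assumes one status word per
page following the .MCPPN word; the helper names are hypothetical.

```python
MASK18 = 0o777777


def xwd(left, right):
    """Build a 36-bit word from two 18-bit halves (left,,right)."""
    return ((left & MASK18) << 18) | (right & MASK18)


def mcrps_block(first_page, n_pages):
    """Return (block, minimum AC2 count) for a status request on n_pages pages."""
    header = xwd(-n_pages, first_page)     # .MCPPN: -count,,first page
    block = [header] + [0] * n_pages       # one .MCPST slot per page (assumed)
    return block, len(block)
```

For a request covering eight pages starting at page 100 (octal), the
block is nine words long, so AC2 must be at least 9.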
2.6 Running TGHA
The monitor will run TGHA under job 0 as part of the system
start-up procedure. This is done so that TGHA can verify that
all double-bit errors have been "mapped" out and that the MOS
memories are properly configured. It is possible that TGHA
will attempt to remove double-bit errors by use of the spare
bit. When TGHA is run at system initialization, it will be
started at offset one of its entry vector.
Any time the monitor detects a correctable or uncorrectable MOS
error, it will ensure that TGHA is running. If TGHA is not
running, it will be run and started at offset zero of its entry
vector.
TGHA can be started manually from a time-sharing terminal to
perform certain specific functions. Running TGHA from a
terminal does not affect the monitor's decision to run TGHA to
analyze memory errors or reconfigure the MOS memories.
3.0 References
1. Monitor Calls Reference Manual
Order No. AA-4166B-TM
2. MF20 Software Plan
Jeff Barry
3. MF20 Functional Specification, May 18, 1977
Derrick Chin, Jeff Barry, Peter Lawrence
Paul Guglielmi, Bob Hamel
4. MF20 MOS RAM error analyzer (TGHA)
Program Specification
Jeff Barry