
! d i g i t a l !   I N T E R O F F I C E  M E M O R A N D U M

TO:     TOPS20 List A

                                       DATE: June 19, 1978

                                       FROM: Arnold Miller

                                       DEPT: DEC20 S. E. Dept.

                                       LOC:  MR1-2   EXT: 6473

SUBJ: Release 3A MOS memory project

1.0     Overview

The project known as MOS memory scheduled  for  release  3A  of
TOPS20  encompasses  a  range  of memory-related changes to the
monitor.  The catalyst for scheduling the project was the  need
to  support  MF20  MOS-type  memories  on the DECsystem20.  The
required work to achieve this support in software was  minimal.
In  fact, the TOPS20 monitor will run successfully on a machine
with  MF20  memories  even  without  the  stipulated   changes.
However,  the  engineering goals designed into the MF20 project
emphasized reliability and availability.  In order  to  achieve
these goals, some amount of monitor support was required.

The MF20 memories are designed to detect and correct single-bit
errors and to detect double-bit errors and treat them as memory
parity  errors.   The  correction  is  performed  by   an   ECC
correction  algorithm in the hardware that computes a seven-bit
ECC whenever data is stored into a  word  of  memory  and  that
verifies  the  correctness of the retrieved data on a read.  In
order to predict and forestall a serious memory failure, it  is
important  that any errors be reported and analyzed.  Since the
hardware will correct single-bit errors, it is not necessary to
take  any  corrective  action  when  single-bit  errors  occur.
However, should a more serious error occur, it is important  to
know  where all faulty areas in the memory are so that the best
possible correction can be applied.
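
As an illustration of how a seven-bit code can correct single-bit
errors and detect double-bit errors in a 36-bit word, the sketch
below implements a generic extended Hamming code in Python: six
check bits plus one overall parity bit.  It is not the MF20
algorithm, which this memo does not specify; the bit layout and
the names encode and decode are assumptions made for the example.

```python
# Generic extended-Hamming (SEC-DED) sketch for a 36-bit word.
# NOT the MF20's actual code: layout and names are assumptions.

DATA_BITS = 36
CHECK_BITS = 6                       # Hamming checks at positions 1,2,4,8,16,32
CODE_BITS = DATA_BITS + CHECK_BITS   # positions 1..42; position 0 = overall parity

# Data bits occupy every position that is not a power of two.
DATA_POSITIONS = [p for p in range(1, CODE_BITS + 1) if p & (p - 1)]


def encode(word):
    """Return 43 bits: index 0 is overall parity, 1..42 the Hamming codeword."""
    bits = [0] * (CODE_BITS + 1)
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (word >> i) & 1
    for c in range(CHECK_BITS):
        mask = 1 << c
        covered = (p for p in range(1, CODE_BITS + 1) if p & mask and p != mask)
        bits[mask] = sum(bits[p] for p in covered) % 2
    bits[0] = sum(bits[1:]) % 2      # makes the total number of ones even
    return bits


def decode(bits):
    """Return (status, word); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(bits)
    syndrome = 0
    for pos in range(1, CODE_BITS + 1):
        if bits[pos]:
            syndrome ^= pos          # XOR of the positions holding a one
    parity_even = sum(bits) % 2 == 0
    if syndrome == 0 and parity_even:
        status = "ok"
    elif not parity_even and syndrome <= CODE_BITS:
        bits[syndrome] ^= 1          # syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return "uncorrectable", None # even number of flips: detected only
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        word |= bits[pos] << i
    return status, word
```

Flipping any one of the 43 stored bits yields "corrected" with the
original word; flipping any two yields "uncorrectable", which
corresponds to the case the memo says is treated as a parity error.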

Therefore the principal  requirement  of  the  monitor  was  to
report  the  relevant  information whenever an MF20 encountered
either a single-bit correctable error or  a  double-bit  error.
This  information will be gathered and delivered to a user-mode
diagnostic, TGHA, that will use the raw error data to compile a
history  file  of  memory  failures.  This history file will be
used for  several  purposes:   as  an  ongoing  compilation  of
failures  that  field-service can use to assess the performance
of the hardware and as a set of statistical data that TGHA  can
use  to  determine  when  software intervention is justified to
reduce the memory error rate.  The former use is a  traditional
function of error data and parallels the use of the information
put into the SYSERR file by the monitor.   The  latter  use  of
this file is peculiar to the recovery procedures available with
the MF20 and is the central mechanism in the planned high
availability of the MF20.

1.1     TGHA error recovery

Each 64K section of MF20 memory has  a  "spare"  MOS  RAM.   In
fact,  the  spare  RAM  is  actually  a series of RAMs that can
substitute  for  any other  series  of  RAMs  in  the  same  64K
section.   The  intention  is that when a given type of failure
gives  errors  at  or  above  a  predetermined  threshold,  the
software  will  instruct the MF20 to replace the defective RAMs
with the spare RAMs.   Since  there  is  not  sufficient  spare
storage  to replace all of the RAMs in a 64K section of memory,
TGHA must decide how best to apply the spare  RAMs  to  achieve
the maximum effect.  The failures that a memory is experiencing
are reported to TGHA by the monitor and  saved  in  the  memory
history  file  by TGHA.  Therefore, at any time TGHA can assess
the most serious failure in the memory and can apply the  spare
bit  to correct that error.  The goal is to keep as much memory
as possible on-line and functioning with the  minimum  possible
error   rate.   Therefore,  if  a  given  MF20  experiences  no
double-bit errors but  a  pattern  of  single-bit  errors  that
indicate  that  a  particular RAM is malfunctioning, TGHA could
replace the malfunctioning RAM with the spare  RAM  (note  that
this  type of correction is largely perfunctory and serves only
to minimize the  amount  of  error  reporting  the  monitor  is
obligated  to perform).  If subsequently another RAM or pair of
RAMs begins producing double-bit errors, TGHA would be required
to  use  the  spare  bit  RAM  to  eliminate  this more serious
error.

If TGHA determines that it is not possible to use the spare bit
to  eliminate  all  serious  errors, it may resort to disabling
pages of memory that exhibit high  error  rates.   Due  to  the
physical layout of the MF20 boards, it is possible to encounter
MOS failures that result in one of the following situations:

        1.  Row  error-  One  bit  out  of  every  fourth  word
        throughout one page will be bad.

        2.  Column error- One bit per page throughout 64  K  of
        memory will be bad.  This type of error is the one most
        likely to be corrected by use of the spare bit RAM.

        3.  Bit error- A single bit in a RAM is bad.  This will
        result  in  a  single error in a single word of memory.
        Multiple bit errors per controller  will  result  in  a
        scattering  of  bad  pages  throughout  a  64K range of
        memory.

        4.  Matrix error- an entire RAM goes  bad  producing  a
        single  bad  bit in every fourth word in a 16K block of
        memory.

Therefore, it is fairly evident that if the spare bit RAM is in
use  to eliminate a row or matrix error, and one or more column
or bit errors are detected, the only recourse open is to disable
the individual pages that contain the errors.
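
The priority described in this section (double-bit failures
outrank single-bit patterns, and single-bit patterns qualify only
at or above a threshold) can be sketched in Python as follows.
This is a hypothetical model, not TGHA's actual policy: the
function name, the history record format, and the threshold value
are all assumptions made for illustration.

```python
from collections import Counter


def choose_spare_target(history, threshold):
    """Pick the RAM that should receive the spare, or None.

    history   -- iterable of (ram_id, kind) records, kind being
                 "single" or "double" (hypothetical record format)
    threshold -- minimum single-bit error count that justifies
                 substitution (hypothetical tuning value)
    """
    doubles = Counter()
    singles = Counter()
    for ram, kind in history:
        if kind == "double":
            doubles[ram] += 1
        else:
            singles[ram] += 1

    # A RAM involved in double-bit errors is the most serious failure.
    if doubles:
        return doubles.most_common(1)[0][0]

    # Otherwise use the spare only for a pattern at or above threshold.
    over = {ram: n for ram, n in singles.items() if n >= threshold}
    if over:
        return max(over, key=over.get)
    return None
```

When the function returns a RAM other than the one currently
served by the spare, any failures left uncovered would have to be
handled by disabling the affected pages, as described above.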

1.2     TOPS20 monitor support of TGHA

In order to allow TGHA to achieve the  highest  possible  error
detection  and  correction,  some specific support was added to
TOPS20.  In summary, the changes are:

        1.  The memory error handling in APRSRV was enhanced to
        gather  information  about  MOS-specific  errors and to
        save this  information  for  TGHA.   The  monitor  also
        corrects MOS single-bit errors by writing the corrected
        data back into memory, but is not  required  to  either
        compute the ECC or correct the retrieved data (the MF20
        itself has the complete ECC computation and  correction
        algorithm built into its logic).  The error information
        gathered by the monitor is saved on  a  special  memory
        queue until TGHA requests it.

        2.  The kernel routines in PAGEM used by PMCTL to place
        pages  on-line  and off-line were enhanced to include a
        much  larger  collection  of  memory  than   previously
        available.   The  code  in PAGEM now has the ability to
        place most of the pages in  physical  memory  off-line.
        Furthermore,  it  is possible to predict the pages that
        cannot  be  disabled.   Also,  PAGEM  and  SCHED   were
        modified  to allow previously nonexistent memory to be
        placed  on-line  and  correctly  accounted   for.    In
        particular,  the  appearance  of previously nonexistent
        memory will allow process working sets to increase  and
        will  affect  the  scheduling  and  partitioning of the
        balance set.

        3.   The  PLOCK  JSYS  was  implemented  to  permit   a
        user-mode  program  to gain exclusive control of one or
        more pages of physical memory and to load a portion  of
        its  address  space  in  that  memory.   While this was
        specifically intended to facilitate  the  analysis  and
        correction   of  MOS-related  errors,  the  feature  is
        available for core memory pages as well.

        4.  Several new functions were added to the  DIAG  JSYS
        to  enable  TGHA to perform its spare-bit substitution.
        These functions are provided to ensure that the desired
        spare bit RAM can be substituted in the desired portion
        of memory.  It is likely  that  TGHA  can  perform  the
        substitution  without  the  use of these functions, but
        only through these functions is  the  operation  always
        guaranteed to succeed.

        5.  A new SYSERR entry was defined and implemented  for
        memory parity errors.  This new entry type permits more
        information to be gathered  and  reported  and  permits
        easy extension for future needs.

        6.  All DTE protocols were modified to allow TOPS20  to
        inhibit  each  of  the  front-ends  from  examining  or
        depositing memory.  This change is necessary to  ensure
        that  TGHA  can  complete  its  spare-bit  substitution
        without any portion of memory being referenced.

        7.  The  monitor  runs  TGHA  at  system  start-up  and
        whenever a MOS memory error is detected.

1.3     Supported configurations

It is  possible  to  mix  core  and  MOS  memory  on  the  same
DECsystem20.   However,  it is not possible to mix core and MOS
on the same memory bus.  A single SBUS will support  a  minimum
of  64K of MB20 memory and a maximum of 256K of MB20 memory.  A
single FBUS (the extended form of the SBUS required to  support
MF20  MOS  memory)  can support a minimum of 256K of MOS memory
and a maximum of 1024K of MOS memory.  It is possible to have a
MOS-only  configuration  with two FBUSes and up to 1024K of MOS
memory on  each  FBUS.   However,  this  maximum  configuration
requires  that  another bay be added to the machine to hold the
memory logic.  (Note that  configuration  data  is  subject  to
engineering  changes.  The data given here is based on the most
recent expectations).
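
The per-bus limits above can be expressed as a small check.  The
Python sketch below encodes only the figures quoted in this
section (which the memo notes are subject to engineering change);
the function name and the input format are assumptions.

```python
# Limits quoted in this section: (memory type, minimum K, maximum K).
BUS_RULES = {
    "SBUS": ("MB20", 64, 256),    # core memory
    "FBUS": ("MF20", 256, 1024),  # MOS memory
}


def check_bus(bus_type, memories):
    """Return a list of problems for one bus.

    memories -- list of (memory_type, size_in_k) pairs (assumed format)
    """
    allowed, low, high = BUS_RULES[bus_type]
    problems = []
    # Core and MOS may not be mixed on the same bus.
    if any(kind != allowed for kind, _ in memories):
        problems.append(f"{bus_type} accepts only {allowed} memory")
    total = sum(size for _, size in memories)
    if not low <= total <= high:
        problems.append(f"{bus_type} total {total}K outside {low}K-{high}K")
    return problems
```

For example, check_bus("FBUS", [("MF20", 256)] * 4) returns an
empty list, while an SBUS carrying both MB20 and MF20 memory is
reported as a mixed-bus error.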

2.0     Design of new monitor features

This section describes the monitor changes done for release 3A.
Where  applicable,  some  discussion  is  included  to indicate
features specific to the support of MF20 memories.

2.1     New SYSERR entry for memory parity errors

The MF20 memory controllers respond to a  significantly  larger
number  of SBUS diag requests than do MA20 or MB20 controllers.
An SBUS diag is used to communicate information to and from the
memory  controllers  and  is the tool used to configure memory.
The monitor executes SBUS  diag  functions  whenever  a  memory
parity  or  SBUS  error  occurs  so  it may record the state of
memory in the SYSERR file.  In  order  to  accommodate the  new
controller  numbers,  and  in order to be able to record all of
the important controller information, the format of the "memory
parity  error"  SYSERR  entry  has been changed.  The new block
type is 162 and is structured as follows:

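        !---------------------------------------!
        !         APR status word               !
        !---------------------------------------!
        !          error register               !
        !---------------------------------------!
        !          PC flags                     !
        !---------------------------------------!
        !              PC                       !
        !---------------------------------------!
        !        count of bad words found       !
        !---------------------------------------!
        !       logical AND of bad addresses    !
        !---------------------------------------!
        !       logical OR of bad addresses     !
        !---------------------------------------!
        !       logical AND of bad data words   !
        !---------------------------------------!
        !       logical OR of bad data words    !
        !---------------------------------------!
        !       -# of controllers,,offset to    !
        !                       SBUS data area  !
        !---------------------------------------!
        !       first ten bad addresses         !
        !---------------------------------------!
        !       first ten bad data words        !
        !---------------------------------------!
        !       memory references for first     !
        !       ten bad words                   !
        !---------------------------------------!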

and the format of the SBUS DIAG area,  located  at  the  offset
pointed to by the tenth word is:

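        !---------------------------------------!
        !       -# of SBUS diags,,controller    !
        !---------------------------------------!
        !       SBUS diags for controller       !
        !---------------------------------------!
        !       -# of SBUS diags,,controller    !
        !---------------------------------------!
        !       SBUS diags for controller       !
        !---------------------------------------!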

This new format allows for any number of  controllers  and  any
number  of  SBUS  diags  for  each  controller.  TOPS20 will no
longer produce the old format of SYSERR entry (entry type  161)
since this new type completely supersedes it.
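
To show how the variable-length layout can be walked, the Python
sketch below decodes an entry held as a list of 36-bit integers.
Several details are assumptions that the memo does not state: the
offset is taken relative to the start of the entry, each of the
three ten-item tables is taken to occupy ten words, each SBUS diag
is taken to occupy one word, and all field names are invented for
the example.

```python
MASK18 = 0o777777

# Hypothetical names for the first nine words of the entry.
FIELDS = [
    "apr_status", "error_register", "pc_flags", "pc", "bad_word_count",
    "and_of_bad_addresses", "or_of_bad_addresses",
    "and_of_bad_data", "or_of_bad_data",
]


def halves(word):
    """Split a 36-bit word into (left half, right half)."""
    return (word >> 18) & MASK18, word & MASK18


def neg_count(left_half):
    """Convert an 18-bit two's-complement negative count to a positive int."""
    return (-left_half) & MASK18


def parse_entry(words):
    entry = dict(zip(FIELDS, words[:9]))
    left, offset = halves(words[9])          # -# of controllers,,offset
    entry["bad_addresses"] = words[10:20]
    entry["bad_data"] = words[20:30]
    entry["memory_references"] = words[30:40]

    controllers = {}
    pos = offset                             # assumed relative to entry start
    for _ in range(neg_count(left)):
        diag_left, controller = halves(words[pos])   # -# of diags,,controller
        n = neg_count(diag_left)
        controllers[controller] = words[pos + 1:pos + 1 + n]
        pos += 1 + n
    entry["controllers"] = controllers
    return entry
```

Because every controller block carries its own count, a reader of
the entry needs no fixed table of controller numbers or diag
function counts.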

2.2     Recording MOS-related errors

Previously, the monitor did not distinguish an SBUS error  from
an MB parity error and treated both as parity errors.  The MF20
controllers report single-bit correctable errors as SBUS errors
and, therefore, it is important that the monitor distinguish an
SBUS error from a hard memory error.

An SBUS error now causes the monitor to  poll  all  the  memory
controllers  (both  SBUS  and FBUS controllers) looking for the
occurrence of MOS correctable or double-bit errors.   For  each
controller  that  responds  with  one  or  both  of these error
conditions, the monitor will acquire a block of  resident  free
space  and  store  the  relevant  SBUS diag information in this
block.  The free block is then placed on a special memory queue
until  TGHA  requests  the  information  (See  PMCTL changes in
section 2.5.3).

In order to prevent an onslaught of errors from the same memory
fault, the monitor directs the offending controller to suppress
the reporting (but not the  correction)  of  single-bit  errors
occurring  in the section of memory in which the last error was
detected.  The controller remains this way until  TGHA  can  be
run  to  analyze the fault.  TGHA will reenable error reporting
once  it  has  completed  its  analysis  and  applied  whatever
correction  is  indicated.  The hope is that TGHA can determine
the full extent of the failure from the information recorded by
the  monitor.   However, if further research is warranted, TGHA
could use the PLOCK JSYS to acquire the offending memory  pages
and complete the analysis itself.

If the monitor detects a hard error in any of the  controllers,
it  will  assume that a parity error has occurred and force the
parity error recovery scan.   Whether  or  not  an  SBUS  error
occurs, an MB parity error is treated as it always has been.

2.3     Placing pages off-line

The kernel routines  in  PAGEM  that  maintain  the  states  of
physical  memory pages have been enhanced to allow a privileged
user program to acquire  or  disable  most  pages  of  physical
memory.   It is theoretically possible to provide the tools for
manipulating any arbitrary page of  memory,  but  the  cost  to
include those pages now excluded is disproportionately high.  A
reasonably accurate list of the pages not  available  is:   the
EPT,  the  monitor's UPT, any page containing a CST0 entry, any
page containing an SPT  entry,  any  page  containing  the  DTE
protocol  communications  region,  the page containing MMAP and
any page belonging to the resident free space pool.  In certain
specialized monitors, e.g. TOPS20AN, there are additional pages
that cannot be controlled.  A reasonable estimate of  the  size
of  the  areas  listed  is:   CST0-  one word for every page of
memory supported, 2 to 4 pages;  SPT, 4 pages;  MMAP, one page;
resident  free  pool,  at  least  2  pages;  DTE communications
region- 1 page.

It is possible that a page not within one of the stated regions
will be unavailable when requested.  While this should be rare,
the manner in which pages are allocated and used within  TOPS20
can   create   situations   that   would   make  certain  pages
unavailable.  It is extremely likely that failure to acquire  a
page on request is due either to the page being one of the ones
excluded by the list above, or its being  off-line  or  already
assigned by some other privileged program.

Since these kernel routines are used by the PMCTL and the PLOCK
JSYSes,  the  restrictions described here apply to these JSYSes
as well.

2.4     PLOCK JSYS (Previously LOCK JSYS)

This is a  newly-implemented  JSYS  that  allows  a  privileged
program  to  acquire  physical  memory  and  place a designated
section of its address space in the memory.   The  JSYS  allows
the  program to specify the memory pages to be used, or it will
select  the  pages  under  a  "free  choice"  directive.    The
specification is:


AC1/    starting page address if locking, -1 if unlocking

AC2/    fork handle,,starting page number

        Presently, only .FHSLF is accepted

AC3/    Flags,,count

Where FLAGS are:

LK%PHY  If locking, the value in AC1 is the first physical page
        desired.  If this bit is off and AC1 is not -1, then it
        is "free choice".

LK%CNT  The right-half of AC3 is a repeat count

LK%AOL  Allow locking into off-line pages

LK%NCH  Don't cache the pages

Requires WHEEL, OPERATOR, or MAINTENANCE capability.

Returns:        +1 Always

Illegal instruction trap on error.

The following operational points should be noted:

        1.  If PLOCK is unable to honor any one  of  the  pages
        within  the  repeat  count,  it  will unlock the entire
        region specified by the arguments.

        2.  It is permissible to "unmap" a page that was locked
        with  PLOCK.   This  will unlock the process's page and
        return the now unlocked physical page to  its  previous
        state.

        3.  PLOCK uses the kernel routines in PAGEM to  acquire
        the designated physical page.  The page selected by the
        user must be capable of being placed off-line in  order
        for PLOCK to acquire it.
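
The all-or-nothing behavior in operational point 1 can be pictured
as acquisition with rollback.  The Python sketch below is only a
model of that behavior, not monitor code: the acquire and release
callbacks, the exception class, and the function name are
hypothetical.

```python
class PageUnavailable(Exception):
    """Raised by acquire() when a page cannot be locked (hypothetical)."""


def lock_range(first_page, count, acquire, release):
    """Lock `count` pages starting at `first_page`, or lock none of them.

    acquire(page) and release(page) stand in for the kernel page
    routines; acquire raises PageUnavailable on failure.
    """
    locked = []
    try:
        for page in range(first_page, first_page + count):
            acquire(page)
            locked.append(page)
    except PageUnavailable:
        # One page failed: undo everything acquired so far, then
        # report the failure to the caller.
        for page in reversed(locked):
            release(page)
        raise
    return locked
```

A caller therefore never has to clean up a partially locked
region; either the whole repeat count is honored or nothing is
left locked.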

2.5     DIAG JSYS changes

In order to ensure that TGHA can use the spare bit in the  MF20
memories,  the  DIAG JSYS has acquired two new "memory control"
functions.   Before  describing  the  JSYS  change,   a   brief
discussion of the problem being solved is called for.

2.5.1   TGHA and the spare bit (a fable)

When TGHA decides to use the spare  bit  in  one  of  the  MF20
controllers,  it  must  ensure  that  the pages affected by the
change do not cause parity errors.   If  one  or  more  of  the
affected pages is in use by the monitor (i.e.  neither assigned
nor on the replaceable queue), then the bit to be replaced  has
a   value  that  is  significant.   Besides  being  part  of  a
thirty-six bit data word that some program  has  generated,  it
was  also  used as a component in the computation of the word's
parity and ECC.  Therefore, if TGHA merely exchanges the  spare
bit  for  one  of  the  active bits in the word, there is a 50%
chance that the new bit disagrees with the  one  replaced.   If
all  other  bits in the word are good, then the worst that will
happen  is  that  a  single-bit  correctable  error  has   been
introduced  into  the  word  and  the hardware will correct the
error the next time the word is referenced  (very  clever,  you
say?).  However, if the word has one other faulty bit, then the
spare bit  may  have  introduced  an  uncorrectable  double-bit
error.   It  is likely that this latter case will occur since a
very important use of  the  spare  bit  will  be  to  eliminate
chronic double-bit errors by replacing one of the faulty bits.

All of this theory having been digested, the  obvious  solution
is  to  disable all of the pages that are to be affected by the
spare bit substitution,  perform  the  substitution,  and  then
place  the  pages on-line again.  If this can be achieved, then
the revitalized pages will not contain useful data  and  cannot
cause  parity  errors.   However,  as  we have observed, TOPS20
cannot  guarantee  that  any  arbitrary  page  can  be   placed
off-line.   Furthermore, it is unacceptable for these few pages
to suddenly develop uncorrectable errors.

Several alternate techniques exist for performing the spare bit
substitution  error-free.   However, they all require that TGHA
be able to reference the memory in question, save  the  current
data,  and replace it after the spare bit has been substituted.
While the techniques vary in complexity and direction, they all
require  that  TGHA  can  gain  control  of  the  memory  to be
affected, read the data from the memory, perform some number of
SBUS  diag  functions,  and  replace the memory all without any
other program or hardware device modifying the memory.   It  is
this  capability  that  is  provided  by  the  new DIAG "memory
control" functions.
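
The difference between a bare substitution and the
save, substitute, and rewrite sequence can be demonstrated with a
toy model in Python.  The model is an assumption made for
illustration: it treats a memory section as one bit column per
data bit plus one spare column with arbitrary initial contents,
ignores ECC entirely, and counts words that read back wrong in
place of ECC errors.  All names are hypothetical.

```python
import random


class Section:
    """Toy memory section: one column per data bit plus a spare column."""

    def __init__(self, nwords, nbits, rng):
        self.nwords = nwords
        self.nbits = nbits
        self.columns = [[0] * nwords for _ in range(nbits)]
        # The spare column starts out holding arbitrary values.
        self.spare = [rng.randint(0, 1) for _ in range(nwords)]
        self.replaced = None      # index of the bit served by the spare

    def _column(self, bit):
        return self.spare if bit == self.replaced else self.columns[bit]

    def write(self, addr, word):
        for bit in range(self.nbits):
            self._column(bit)[addr] = (word >> bit) & 1

    def read(self, addr):
        return sum(self._column(bit)[addr] << bit for bit in range(self.nbits))

    def substitute(self, bit):
        """Route `bit` to the spare column from now on."""
        self.replaced = bit


def careful_substitute(section, bit):
    """Save every word, switch in the spare, then rewrite the saved data."""
    saved = [section.read(addr) for addr in range(section.nwords)]
    section.substitute(bit)
    for addr, word in enumerate(saved):
        section.write(addr, word)


def wrong_words(section, expected):
    """Count addresses whose contents differ from the expected data."""
    return sum(section.read(a) != w for a, w in enumerate(expected))
```

Calling substitute alone on a section full of data leaves roughly
half of the words reading back wrong, while careful_substitute
leaves none.  The sequence is safe only if nothing else touches
the memory between the save and the rewrite.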

2.5.2   New DIAG functions

The two new DIAG functions are designed to  provide  the  exact
functionality  required by TGHA and are not provided as general
"memory control" functions.

Get memory

AC1/    -count,,address

At address:

word 0          .DGGEM
word 1          First page in user address space
word 2          First physical memory page number
word 3          number of pages
word 4          user address of AR/ARX parity trap routine

Returns:        +1 Always.  Note conditions below

This function, upon a successful return, performs the following:

        1.  TOPS20 has requested that all  of  the  front  ends
        refrain  from accessing common memory.  This request is
        made through a newly devised DTE  protocol  state  that
        allows  primary  protocol  to remain in effect but that
        "freezes" all DTE interactions.

        2.  The hardware PI system has  been  turned  off.   No
        scheduling can occur.

        3.  The time base and  the  interval  timer  have  been
        turned  off.   This  is  to ensure that the meter board
        does not modify the EPT.

        4.  All DTE byte transfers have completed.  This is  to
        ensure that the DTE/microcode does not change memory or
        update any values in the EPT.

        5.  All RH20 activity has ceased.  This  is  to  ensure
        that  no  memory  is  modified  and the channels do not
        store into their logout areas in the EPT.

        6.  The designated pages of the process's address space
        have  been  set  up  to address the designated physical
        memory.  Note that this  is  not  the  same  as  having
        requested   the   pages  with  PLOCK.   By  using  this
        function, the data in the physical  memory  pages  have
        been  retained  and  the ownership of the pages remains
        unchanged.  This is theoretically  illegal  within  the
        rules  of  TOPS20 paging, but the fact that TGHA is now
        running with complete control of the computer  obviates
        the exception.

        7.   The  CST0  entries  for  each  of  the  designated
        physical pages have been saved and set as follows:

                a.  The age is set to the present  age  of  the
                requesting process

                b.  The "process use" field is set to all ones.

                c.  The "modified" bit is set to "one".

        The effect of this is that should the microcode attempt
        to  store  either  the  page's age or the current "use"
        bit, the data in the CST0 entry will be unchanged.

        8.  The entire address space of the requesting  process
        has  been  locked in memory.  Actually, only pages that
        exist  at  the  time  of  the  JSYS  call  are  locked.
        Therefore,  the  process  must  ensure  that all of the
        pages it needs exist and are private when the  JSYS  is
        executed.

        9.  If TGHA  specified  an  AR/ARX  trap  address,  the
        monitor  has  set up the proper dispatch.  In the event
        of an  AR  or  ARX  parity  error  trap  during  TGHA's
        execution,  the  monitor  will  immediately trap to the
        user address given in the DIAG call.

This function is available to TGHA for performing its spare bit
substitution.   As  noted  earlier,  it is likely that TGHA can
place the affected physical pages off-line and therefore perform
its spare bit substitution without the aid of the DIAG JSYS.  

Release memory

This is the only JSYS that should  be  executed  following  the
successful  execution of a .DGGEM DIAG JSYS.  It is intended to
undo the restrictions imposed by  its  companion  function  and
allow timesharing to proceed.  The calling sequence is:

AC1/    -count,,address

at address:

word 0          .DGREM

no other arguments are required.  The monitor has  "remembered"
the values that need to be restored.

The amount of time that the system is unavailable  following  a
.DGGEM  function  and  up  to  a  .DGREM function is beyond the
monitor's control.  TGHA is in complete control of  the  system
and  will  direct  the  monitor when it may proceed with normal
operation.

2.5.3   PMCTL changes

.MCRME function

The monitor collects information about MOS memory errors.  This
information  is  held in resident free space buffers until TGHA
is scheduled to read it.  TGHA will use a new function  of  the
PMCTL  JSYS,  .MCRME, that will provide the memory information.
The format of the new function call is:

AC1/    .MCRME
AC2/    length of argument block
AC3/    address of block

The monitor will store the error  information  into  the  block
addressed  by  AC3 and update AC2 on return.  If not all of the
outstanding error information will fit, the monitor will return
as much as it can.  It will be TGHA's responsibility to reissue
the JSYS  until  the  monitor  indicates  that  no  more  error
information  is  available.   It  is important to note that the
error information is retained by the monitor  only  until  TGHA
reads it, after which the monitor will discard the information.

Returns:        +1 always


PMCLX4  No more error information
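
The reissue-until-empty protocol can be modeled as a drain loop.
The Python sketch below simulates the monitor's queue rather than
calling any real JSYS.  It assumes that records are returned
whole, that the end of the data is signalled by an error standing
in for PMCLX4, and that a block too small for a single record is
the caller's mistake; none of these details are confirmed by this
memo, and all names are hypothetical.

```python
from collections import deque


class NoMoreErrorInfo(Exception):
    """Stands in for the PMCLX4 error in this model."""


class MonitorQueue:
    """Simulated queue of MOS error records held by the monitor."""

    def __init__(self, records):
        self._queue = deque(records)     # each record is a list of words

    def mcrme(self, block_len):
        """Return as many whole records as fit in block_len words."""
        if not self._queue:
            raise NoMoreErrorInfo
        out = []
        while self._queue and len(out) + len(self._queue[0]) <= block_len:
            out.extend(self._queue.popleft())   # discarded once read
        return out


def drain(monitor, block_len):
    """Reissue the call until the monitor reports no more information."""
    data = []
    while True:
        try:
            chunk = monitor.mcrme(block_len)
        except NoMoreErrorInfo:
            return data
        if not chunk:
            raise ValueError("argument block too small for one record")
        data.extend(chunk)
```

Because the monitor discards each record as it is read, the caller
must save everything it receives before reissuing the call; there
is no way to ask for the same record twice.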

The format of the returned data is:

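        !---------------------------------------!
 .PMMTP ! type (9) ! count (9)! controller #    !
        !---------------------------------------!
 .PMMRG !       error register at error         !
        !---------------------------------------!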

The remainder of the block is dependent on the type  of  entry.
For type .PMMER, MOS memory error, the format is:

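        !---------------------------------------!
 .PMMSY !       syndrome of the error           !
        !---------------------------------------!
 .PMMBN !       block number of the error       !
        !---------------------------------------!
 .PMMSB !       spare bit number                !
        !---------------------------------------!
 .PMMEA !       error address                   !
        !---------------------------------------!
 .PMMSN !       4 words containing the four     !
        !       32-bit serial numbers from      !
        !       the PROMs.                      !
        !---------------------------------------!

Extension to .MCRPS function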

In order to ease TGHA's job, PMCTL has been extended to provide
the  status  of  many  contiguous memory pages with one monitor
call.  The new format of the .MCRPS function is:

        .MCPPN          -count,,first page
                                or
                        0,,first page

        .MCPST          status of "first page"

If the contents of the .MCPPN  offset  is  negative,  then  the
left-half  is assumed to be the negative count of the number of
pages for which status is required.  In order for the  JSYS  to
succeed,  the  count given in AC2 must be greater than or equal
to the number of pages for which status  is  required  plus

2.6     Running TGHA

The monitor will run TGHA under job 0 as  part  of  the  system
start-up  procedure.  This is done so that TGHA can verify that
all double-bit errors have been "mapped" out and that  the  MOS
memories  are  properly  configured.   It is possible that TGHA
will attempt to remove double-bit errors by use  of  the  spare
bit.   When  TGHA  is  run at system initialization, it will be
started at offset one of its entry vector.

Any time the monitor detects a correctable or uncorrectable MOS
error,  it  will  ensure  that TGHA is running.  If TGHA is not
running, it will be run and started at offset zero of its entry
vector.

TGHA can be started manually from a  time-sharing  terminal  to
perform  certain  specific  functions.   Running  TGHA  from  a
terminal does not affect the monitor's decision to run TGHA  to
analyze memory errors or reconfigure the MOS memories.

3.0     References

1.      Monitor Calls Reference Manual
        Order No.  AA-4166B-TM

2.      MF20 Software Plan
        Jeff Barry

3.      MF20 Functional Specification, May 18, 1977
        Derrick Chin, Jeff Barry, Peter Lawrence
        Paul Guglielmi, Bob Hamel

4.      MF20 MOS RAM error analyzer (TGHA)
        Program Specification
        Jeff Barry