Trailing-Edge - PDP-10 Archives - BB-H311B-RM - swskit-documentation/dumps.mem
*** Editor's note:
	This material was provided too late to be sufficiently reviewed,
	corrected and edited as desired.  We apologize for the spelling
	errors and form and some of the content level.  In the short time
	available I have added some few notes to the text on things that
	came up on a first reading.  Hopefully, this article can be
	improved for future SWSKITs (along with everything else), and
	included in the HANDBOOK portion.
				   -JGZ-




        1.0     Introduction . . . . . . . . . . . . . . . . . . . . 1
        2.0     Necessary Preparations . . . . . . . . . . . . . . . 1
        2.1       Creating The Dump File . . . . . . . . . . . . . . 1
        2.2       The BOOT . . . . . . . . . . . . . . . . . . . . . 2
        2.3       Getting A Front-end Dump . . . . . . . . . . . . . 3
        3.0     General Information  . . . . . . . . . . . . . . . . 3
        3.1       The Basic Materials  . . . . . . . . . . . . . . . 4
        3.2       Identifying The Type Of Crash  . . . . . . . . . . 5
        3.2.1       The BUGHLT Itself  . . . . . . . . . . . . . . . 5
        3.2.2       The Monitor's Stacks . . . . . . . . . . . . . . 6
        4.0     JSYS Context (UPDL)  . . . . . . . . . . . . . . . . 7
        5.0     PAGER Context (TRAPSK) . . . . . . . . . . . . . . . 9
        6.0     PSI Context (PIPDB)  . . . . . . . . . . . . . . .  11
        7.0     Scheduler Context (SKDPDL) . . . . . . . . . . . .  12
        8.0     DTE Interrupt Context (DTESTK) . . . . . . . . . .  13
        9.0     I/O Queueing (PHYPDL)  . . . . . . . . . . . . . .  14
        10.0    Device Interrupt Context (PHYIPD)  . . . . . . . .  14
        11.0    APR Interrupt Context (MEMPP)  . . . . . . . . . .  16
TOPS-20 Crash Dump Analysis                                       Page 1


     1.0  Introduction

          The  purpose  of  this  article  is  to  provide  some   basic
     guidelines  for those who have never analyzed a TOPS-20 crash dump.
     The information contained in this  article  refers  exclusively  to
     Version  4  of  the  TOPS-20  Monitor,  although  most of the basic
     principles will also apply to  earlier  versions  of  the  Monitor.
     None  of  the  concepts  included in this article can be considered
     highly advanced;  indeed, it  is  doubtful  that  there  exists  an
     "advanced" methodology in crash dump analysis.  Such techniques are
     the result of nothing more than the continual exercise of the basic
     skills.   In  all  cases, the person who is to perform the analysis
     must be familiar with the internal structures of the Monitor, which
     requires  their  attendance  at  one of the TOPS-20 Monitor courses
     offered by Educational Services.  Obviously, one must know where to
     look  for  a potential problem before hoping to solve it.  For this
     reason, this article  assumes  that  the  reader  has  an  in-depth
     knowledge  of  the  basic  structures  of the TOPS-20 Monitor.  Any
     comments or suggestions to improve the  content  of  this  material
     would be most welcome.



     2.0  Necessary Preparations

          Obviously enough, dumps do not simply appear as a result of  a
     crash.   There are certain prerequisites to obtaining a dump, which
     will be discussed in this section.



     2.1  Creating The Dump File

          TOPS-20 will not, as a rule, create  a  dump  of  the  Monitor
     unless  the  system is properly prepared to do so.  This means that
     there must first exist a file called PS:<SYSTEM>DUMP.EXE that  will
     accommodate the dump.   This file can be found on the  distribution
     tape for TOPS-20, or it can be created by using the MAKDMP program,
     which  will  accept  the  memory size from the user, and create the
     proper sized file.  The file must contain at least as many pages as
     there are pages of physical memory in the Decsystem-20.  For
     example, a system that has 1024K words of memory, let's say a 2060,
     should have a DUMP.EXE file that is 2048 pages long.  Since a page
     is 512 words, the number of pages in the dump file must be twice
     the size of the machine's memory capacity in K words.  In addition,
     unless this file already exists before the crash that we wish to
     capture, we will be unable to save the image of the system, because
     the BOOT program hasn't the ability to create such a file on its
     own.
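
          The sizing rule above can be checked with a few lines of
     Python.  This is only a sketch of the arithmetic, not a
     replacement for MAKDMP;  the function name is invented for
     illustration.

```python
WORDS_PER_K = 1024      # "K words" counts memory in units of 1024 words
WORDS_PER_PAGE = 512    # a TOPS-20 page holds 512 (decimal) 36-bit words


def dump_pages_needed(memory_k_words: int) -> int:
    """Return the minimum number of pages DUMP.EXE must hold.

    Hypothetical helper: it only restates the "twice the K words" rule.
    """
    total_words = memory_k_words * WORDS_PER_K
    # Ceiling division, so that a partial page would still get a page.
    return -(-total_words // WORDS_PER_PAGE)


if __name__ == "__main__":
    for k in (256, 512, 1024):
        print(f"{k}K words of memory -> {dump_pages_needed(k)} page dump file")
```

     For the 1024K example in the text this gives the 2048 pages
     quoted above.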


*** Ed. note - not quite right; can dump to other files and note that the
	right size is that of DUMP.CPY.  The DUMP.EXE on the distribution
	tape should be large enough for every one to start with.
	BOOT accepts a filename, and under bad circumstances, a dump can
	be written to a pack other than the PS: at the time of the crash.


     2.2  The BOOT

          Normally, when the system has crashed for whatever reason,  it
     will  reload  itself  using  the  BOOT  program.   This Auto-reload
     feature can be suppressed by giving the "SET NOT RELOAD" or  "CLEAR
     RELOAD"  command  to  the  PARSER.  The PARSER must first be set in
     PROGRAMMER mode, via the "SET CONSOLE PROGRAMMER"  command.   These
     commands  do  not apply to 2020's, of course;  there is no clear-cut
     method of preventing the 2020 from auto-reloading  after  a  crash.
     BOOT has a limited file system capability when writing the file
     that contains the dump, and in this manner avoids complicating a
     possibly compromised file structure during the reload.  It is for
     this reason that the DUMP.EXE file must already exist on the public
     structure, for BOOT can find it there, but it cannot create it if
     it does not already exist.  Also, because BOOT resides in main
     memory of the host (KL10 or KS10) processor, small portions of the
     Monitor will be overwritten when BOOT is loaded into memory.
     Currently, BOOT is written into that area of the resident Monitor
     that normally contains pure code, and as such is not usually of
     much consequence.  When one needs to refer to this portion of the
     code, either the listings or fiche should be used.

*** Ed. note - 2020 no reload requires putting a magic number in a magic
	location in the 8080 memory that varies with the rev.  Perhaps
	the easiest way to get the same sort of effect is to change the
	BUGHLT code not to request the reload.


          If for some reason the system fails to auto-reload, then it is
     still possible to obtain a copy of the dump.  To do this, the front
     end must have at least loaded the BOOT  program,  and  the  console
     will display the BOOT prompt:

                        BOOT>

     BOOT has a number of commands that may be used  to  manipulate  the
     contents  of  the  processor  memory;  in this case, the command we
     will use will cause BOOT  to  copy  the  contents  of  memory  into
     PS:<SYSTEM>DUMP.EXE:

                        BOOT>/D
                        BOOT>


*** Ed. note - First, if BOOT is not there, you are not doomed yet, if you
	can get it loaded by the normal means (boot from switches, MCR KLI,
	etc.).  Second, if the /D doesn't seem to take (IO ERROR, etc.),
	you could even put up a KLAD pack or other to get the dump written
	somewhere if it is important.

          At this point the system may be brought up normally,  and  the
     analysis of the dump may begin.

          Similarly,  a  KL-10  system  may  be  set  to  suppress   the
     auto-reload facility, and the CTY will prompt with the KLI> prompt.
     Simply typing the word "BOOT"  will  load  the  BOOT  program  into
     memory.   There  are cases where the system may be completely hung,
     and it is  unclear  how  to  best  initiate  an  orderly  shutdown.
     Obviously, it is always possible to type the control-backslash (^\)
     character at the CTY to get into the  front-end  parser,  but  then
     what  can  be  done?   The  front-end parser allows the operator to
     force the processor to jump to a specified  location,  and  in  the
     case  described  above, this feature may be used to force a BUGHLT.
     This can be done after typing ^\, with the following commands:

                PAR>SET CONSOLE PROGRAMMER
                 CONSOLE MODE: PROGRAMMER
                PAR>JUMP 71
                PAR>

*** Ed. note - Equivalent stuff is available for the 2020 after the ^\
	gets you to KS10> mode.  See the KS console information section
	of the SWSKIT HANDBOOK for figuring out exact typein.

     causing the console to return to USER mode, connected to the KL-10.
     This  will  be  followed immediately by a KPALVH BUGHLT (Keep Alive
     Halt), and the system will perform  the  usual  BUGHLT  procedures.
     The  above  command  forces  the  processor to jump to location 71,
     which in turn will cause the BUGHLT, sweeping the cache  to  ensure
     that  the dump taken will contain valid data.  Simply forcing the
     processor to halt, and then reBOOTing and getting a dump, will cause
     the  cache to be invalidated, and random locations in the dump will
     not contain valid data.



     2.3  Getting A Front-end Dump

          The front-end will generally create a crash dump  file  called
     PS:<SYSTEM>0DUMP11.BIN,  containing  the  core image of the PDP-11.
     If the front-end is hung, and none of the terminals are usable,  it
     is  still  possible  to  obtain  a dump of the -11.  By setting the
     HALT/ENABLE switch of the -11 to the HALT position, and  then  back
     to the ENABLE position, the KL-10 will force the -11 to reload.  In
     the process of reloading the -11, the KL will indicate to  the  -11
     that  it has reloaded, and send the necessary information to set up
     the terminals, and unit record devices connected to the  -11.   The
     -11 will, in the process of reloading, dump the old core image into
     the 0DUMP11.BIN file mentioned earlier.   In  the  event  that  the
     problem  will  be  the  subject of an SPR, the front-end crash dump
     should also be included on the DUMPER tape with the SPR.



     3.0  General Information

          It would not be practical to define a  method  of  approaching
     each  BUGHLT in the system, but the state of the system at the time
     of the crash may be defined in terms of the data structures that it
     accesses.   By  looking  at  the Monitor's stack, the status of the
     current job, and process, and the condition of the Monitor's tables
     that were in use by the code that BUGHLTed, we can define a limited
     number of "types" of crashes, e.g.,  a  scheduler  crash,  a  pager
     crash,  an  APR  or  device interrupt crash.  Each crash will occur
     while the Monitor is using a specific subset of the  internal  data
     structures  of  the system.  We will attempt to limit the number of
     "types" of crashes based upon the function being performed  by  the
     Monitor  at  the time of the crash.  In the sections following this
     general information, we will suggest some of  the  areas  to  check
     when  looking  at  each  type  of  crash.   This information is not
     complete, but  contains  some  of  the  information  that  is  more
     significant in each particular context.


     3.1  The Basic Materials

          The most important materials  in  looking  at  dumps  are  the
     source listings of the Monitor.  Either in the form of fiche, or in
     machine-readable format, it is absolutely essential to have  access
     to  listings of the Monitor to be able to analyze any dump, because
     without these listings you would simply be working in darkness.  In
     order   to   understand   the   significance  of  any  BUGHLT,  the
     circumstances of the BUGHLT must be known, as well  as  the  reason
     the  Monitor  could not continue.  To find out this information, we
     must look in the listings.  After the system has re-BOOTed,  it  is
     always  a  good  idea to take note of the console output, including
     the name of the BUGHLT, and any other  associated  console  output.
     Try  to  be  sure  that  no unusual messages, other than the BUGHLT
     itself, appeared on the console within a reasonable period of  time
     before  the  system  crashed.   BUGCHK's, BUGINF's, and "Problem on
     device..." type messages are always significant.  Similarly, a copy
     of  the output from the SYSERR program will be helpful in revealing
     any failing hardware that should be investigated first.  Always try
     to eliminate the possibility of a hardware problem FIRST, especially
     if the site has had any recent problems in this area.   These  last
     two  points  are  significant in determining the environment at the
     time of the crash, and, in the event that the  dump  will  be  made
     part of an SPR, the information will become essential.

          Naturally, it  will  be  necessary  to  have  a  copy  of  the
     MONITR.EXE  file  that  was  running when the crash occurred, and a
     copy  of  FILDDT  to  look  at  the  dump.   With  these  materials
     collected, we can hopefully make a valid analysis of the dump.

          Here is a list, then, of the necessary and  helpful  materials
     needed to look at dumps:

      o  The MONITR.EXE file

      o  The DUMP.CPY file from the crash

      o  A copy of FILDDT.EXE from the distribution tape

      o  A copy of the SYSERR output

      o  A complete set of Monitor and Exec Fiche or listings

      o  The CTY output from the crash

      o  The Monitor Calls Reference Manual

      o  A copy of the SWSKIT tape

      o  Any other TOPS-20 Manuals that may be appropriate, such as  the
         Operator's Guide, or the Installation Guide.

      o  The TOPS20.BWR file


     3.2  Identifying The Type Of Crash

          The Monitor performs several basic operations, each  of  which
     has  its  own  set of tables and data structures.  These operations
     can be defined as:

      o  JSYS processing

      o  Page faults

      o  PSI Service

      o  Scheduling

      o  DTE interrupt Service

      o  Initiating I/O transfers (queueing)

      o  Device interrupt Service

      o  APR interrupt Service




     3.2.1  The BUGHLT Itself

          There are specific  areas  in  any  crash  dump  that  can  be
     examined  to  determine the status and context of the system at the
     time of the crash.  The most  obvious  of  these  is  the  location
     called  BUGHLT,  which  will  contain the address whence the BUGHLT
     code was called.  It is good practice to remember when  looking  at
     this  address  that  there  are  portions  of the monitor that were
     overwritten by the BOOT program,  when  the  dump  was  taken,  and
     therefore, the contents of the address that called the BUGHLT code,
     that is, the  location  whose  address  is  contained  in  location
     "BUGHLT",  may  not  point  to  the same code that the fiche or the
     listings indicate.  A good example of such a BUGHLT  is  a  PTNIC1,
     one  that  is  a  part  of the APRSRV code, which is overwritten by
     BOOT.

          As of Release 4, all of the BUGHLT's, as well as the  BUGCHK's
     and  BUGINF's  in  the  Monitor are defined and documented in a new
     module called BUGS.MAC.  This module not only  contains,  for  each
     BUGHLT,  etc.,  the  name and a string describing the type of halt,
     but also a description of the circumstances that cause the halt, or
     check,  etc.,  to occur.  There is a new argument to the macro that
     creates the BUGHLT's, etc., that is supposed  to  indicate  whether
     the  problem is hardware or software related.  You will find either
     the word "HARD" or "SOFT" in this location of the Macro  call.   In
     addition,  the  additional  information  supplied  in  BUGCHK's and
     BUGINF's now has a string associated with it  that  indicates  what
     the  additional  information  actually  represents.   Finally,  one
     argument to the  BUGDEF  (bug  definition)  Macro  is  a  narrative
     documentation  of  circumstances  that  can cause the problem being
     seen.  Needless to say, this sort of information is  invaluable  to
     anyone  looking  at  a  crash  dump.  Unfortunately, not all of the
     documentation of the BUG's was completed, and as a result, many are
     indicated  as  being  "HARD"  problems, when actually they are not.
     Those BUGDEF's that include the narrative description  of  the  BUG
     have  been  completed,  but  those that do not may indicate falsely
     that the problem is hardware related.

          The BUGHLT's are performed by using the XCT instruction  of  a
     location  that  contains a JSR BUGHLT instruction.  In the location
     following the JSR BUGHLT, is the name  of  the  BUGHLT,  in  SIXBIT
     format,  such  as  "PTNIC1".   Finally,  in  the  event of multiple
     BUGCHK's, BUGINF's or even nested BUGHLT's, the  location  "BUGNUM"
     contains  the  number of BUGHLT's, BUGCHK's, and BUGINF's since the
     last system start-up.  This location is most helpful in obtaining a
     clearer  view  of  the circumstances of the crash.  The case of the
     BUGHLT code itself causing a BUGHLT is extremely  unusual,  but  in
     certain  cases of extreme degradation of the system's data bases or
     "pure" code pages, this is a possibility.
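
          Since the BUGHLT name is stored in SIXBIT, a raw word read
     from the dump is not immediately readable.  The following Python
     sketch shows the conversion, assuming the standard SIXBIT encoding
     (six 6-bit characters per 36-bit word, each equal to the ASCII
     code minus 40 octal).  The function names are invented for
     illustration;  FILDDT can of course type out SIXBIT directly.

```python
def sixbit_to_text(word: int) -> str:
    """Decode one 36-bit word holding six SIXBIT characters."""
    chars = []
    for shift in range(30, -1, -6):        # leftmost character first
        code = (word >> shift) & 0o77      # isolate one 6-bit code
        chars.append(chr(code + 0o40))     # SIXBIT code + 40 (octal) = ASCII
    return "".join(chars).rstrip()         # trailing blanks are padding


def text_to_sixbit(name: str) -> int:
    """Encode up to six characters as one SIXBIT word (inverse of the above)."""
    word = 0
    for ch in name.upper().ljust(6)[:6]:   # pad with blanks, keep six chars
        word = (word << 6) | (ord(ch) - 0o40)
    return word


if __name__ == "__main__":
    word = text_to_sixbit("PTNIC1")
    print(f"{word:012o} -> {sixbit_to_text(word)}")
```

     The word following the JSR BUGHLT can be passed to the decoding
     function to recover the name that should also appear on the CTY.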



     3.2.2  The Monitor's Stacks

          The next piece of valuable information  is  contained  in  the
     stack  pointer,  P.   This  location  will  point  to  one of eight
     possible monitor stacks, and will give a  strong  indication  about
     the  context  of the monitor at the time of the crash.  Identifying
     the type of BUGHLT will usually be a  direct  indication  of  which
     stack  will  be  in  use;  however, under certain circumstances, the
     monitor may crash while changing from one  stack  to  another,  and
     such  a  circumstance could provide a useful insight into the state
     of the system just before the crash.  The following are  the  names
     of  the  eight possible monitor stacks, and the context under which
     each of them is used:


     UPDL      This  is  the  user  stack,  in  that  it  is  used  when
               processing  a  user's  JSYS  in  exec mode.  Whenever any
               user executes a JSYS, this area in his PSB  is  used  for
               the  stack.   Those processes under job 0 which run in
               exec mode also use this stack.

     TRAPSK    This stack is used by the paging code whenever a  process
               page  faults.   Normally a page fault will occur while in
               the midst of performing some other function,  such  as  a
               JSYS, and the stack pointer at the time of the page fault
               will be in location TRAPAP, which in turn  will  in  this
               case point to UPDL.

     PIPDB     This is used by the software interrupt handler.

     SKDPDL    This stack is used by the scheduler.

     DTESTK    This stack is used by the DTE interrupt service routines.

     PHYPDL    This stack is used by  PHYSIO  code  in  the  process  of
               queueing I/O request blocks (IORB's).  These IORB's are the
               means by which RH20 data transfers are initiated.

     PHYIPD    This stack  is  used  by  the  PHYSIO  interrupt  service
               routines, and therefore is the interrupt-level equivalent
               of PHYPDL.  It is important to remember  that  these  two
               stacks  are  independent of each other, and should not be
               confused.

     MEMPP     This stack is used when processing APR interrupts.


          The stack that is being used, and the  section  of  code  that
     executed  the  BUGHLT,  will  indicate  the type of BUGHLT that has
     occurred.  File system BUGHLT's,  for  example,  will  be  observed
     while performing a JSYS, servicing an interrupt, or otherwise
     attempting to access a file system that has been corrupted to the
     point of being unusable.
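
          Matching the contents of P against the eight stacks can be
     pictured with the Python sketch below.  It assumes that P is an
     ordinary PDP-10 stack pointer with the address in the right half.
     The stack addresses and lengths shown are invented for
     illustration;  the real values must be taken from the symbols of
     the MONITR.EXE that crashed.

```python
from typing import Dict, Optional, Tuple

RIGHT_HALF = 0o777777

# Context associated with each monitor stack, as listed in the text.
CONTEXT = {
    "UPDL": "JSYS processing", "TRAPSK": "page fault",
    "PIPDB": "PSI service", "SKDPDL": "scheduler",
    "DTESTK": "DTE interrupt", "PHYPDL": "I/O queueing",
    "PHYIPD": "device interrupt", "MEMPP": "APR interrupt",
}

# HYPOTHETICAL (base address, length) pairs, for illustration only.
EXAMPLE_STACKS: Dict[str, Tuple[int, int]] = {
    "UPDL": (0o1000, 0o200), "TRAPSK": (0o2000, 0o100),
    "PIPDB": (0o3000, 0o100), "SKDPDL": (0o4000, 0o100),
    "DTESTK": (0o5000, 0o100), "PHYPDL": (0o6000, 0o100),
    "PHYIPD": (0o7000, 0o100), "MEMPP": (0o10000, 0o100),
}


def classify_stack(p: int, stacks: Dict[str, Tuple[int, int]]) -> Optional[str]:
    """Return the name of the stack that P points into, or None."""
    addr = p & RIGHT_HALF                  # stack address is in the right half
    for name, (base, length) in stacks.items():
        # An empty stack points one word below its first location,
        # so base - 1 is accepted as well.
        if base - 1 <= addr < base + length:
            return name
    return None


if __name__ == "__main__":
    p = 0o777700_002005                    # made-up contents of P
    name = classify_stack(p, EXAMPLE_STACKS)
    print(name, "->", CONTEXT.get(name, "unknown context"))
```

     A result of None is itself informative:  it suggests that P was
     clobbered, or that the crash came while switching stacks.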



     4.0  JSYS Context (UPDL)

          When a process executes a JSYS, the Monitor performs the  JSYS
     by  dispatching through a table called JSTAB to the proper routine.
     These routines are named by convention as the JSYS  name,  preceded
     by  a  ".",  thus  the  routine  to perform the JSYS PMAP is called
     ".PMAP::".  This name is always a global  symbol.   The  last  JSYS
     executed  in  user  context is saved in the PSB for the process, in
     location KIMUU1, and KIMUU1+1.  The second of these locations  will
     contain  the  dispatch offset in JSTAB;  this number, when combined
     with the JSYS opcode (104000,,0), is the last JSYS performed by the
     user.  This, then, will point indirectly through the JSTAB table to
     the place where the user JSYS began processing.  By  following  the
     code,  and examining the stack, it is often possible to reconstruct
     the events leading to the crash.  The stack will contain two copies
     of  the  user's  program  counter  (PC) and flags in the first four
     locations of UPDL.  The PSB location MPP  will  contain  the  stack
     pointer  at  the  time  of the last JSYS.  Each time the Monitor
     performs a JSYS internally, the old contents of MPP are pushed onto
     the stack, and MPP is set to the current value of P.

     Initial JSYS stack set-up:

        UPDL/   PC
        UPDL+1/ flags
        UPDL+2/ PC
        UPDL+3/ flags


     JSYS in Monitor context (nested JSYS):

        UPDL+n/ INTDF           ;old interrupts-deferred flag
              / MPP             ;previous PC, or level of nesting
              / PC of JSYS
              / PC flags


          Some other useful locations in JSYS context are:



                             JSB Locations


     USRNAM    This contains the name of the user, in ASCII.


                             PSB Locations

     JOBNO     Contains the number of the job for this process.

     FORKN     Contains the fork number for the top fork of the  job  in
               the  left  half  of  the word, and the fork number of the
               current fork in the right.

     INTDF     Contains -1 if process is OKINT, 0 or greater if NOINT.

     NSKED     Contains 0 if process is OKSKED, 1 or greater if NOSKED.


        Monitor Fork Table - indexed by the current fork number

     FKCNO     Contains the SPT offset that points to the second page of
               the  PSB  in the left half of this word, and the process'
               core number in the right half.  The core number  is  used
               to  set  a  bit  in  the Process Use Register in the CST0
               table for each page that the process owns.

     FKINT     Contains the  pseudo-interrupt  communications  register,
               with  flags  in  the  left  half  describing  the type of
               request, and the channel number of  the  request  in  the
               right half.

     FKINTB    Contains the pseudo-interrupt  channel  requests  pending
               since the fork's last PSI interrupt.

     FKJOB     Job number of the fork in the left half,  and  SPT  index
               for the JSB in the right half.

     FKJTQ     Part of a doubly linked list of forks  that  are  waiting
               program  software interrupt the Monitor.  JTLST points to
               the top fork on the list.

     FKNR      Contains in bits 0-8 the age stamp value at the last time
               local  garbage collection was performed, in bits 9-17 the
               current age stamp, to be  loaded  into  the  pager's  age
               register  when  the  fork  runs,  and  in  bits 18-35 the
               reserve working set size.

     FKPGS     Contains the SPT indices for the process page  table,  in
               the left half, and the PSB in the right half.

     FKPGST    Contains the address of the routine to test  for  balance
               set  wait  satisfied in the right half, with test data in
               the left.  If the fork is not in the  balance  set,  this
               contains  the  time  of  day that the fork entered a wait
               list.

     FKPT      Part of a linked list of forks on a particular  scheduler
               list,  such  as GOLST, WTLST, etc.  The right half of the
               word contains the address of  the  next  element  in  the
               list,  and  the  left  half  contains  the address of the
               beginning of the list, i.e., GOLST, WTLST, TTILST, etc...

     FKQ1      Contains the fork's  remaining  run  quantum.   When  the
               quantum  expires, the fork is moved to a lower run queue,
               and given the appropriate new quantum.

     FKQ2      Contains the fork's scheduler queue level number  in  the
               left  half,  and  the  list  address, i.e.  GOLST, WTLST,
               etc., in the right.

     FKSTAT    Contains the address of the scheduler test routine  which
               will determine when the fork is available to be placed on
               the GOLST.

     FKTIME    Contains the time of day, in internal  format,  that  the
               fork was placed on its current run queue.

     FKWSP     Contains the number of physical  pages  assigned  to  the
               fork  in  the right half, and the working set size of the
               fork when the fork entered the balance set in the left.
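
          Most of the fork table entries above pack several fields
     into one 36-bit word.  The Python sketch below shows one way to
     pull such fields apart, using the PDP-10 convention that bit 0 is
     the leftmost bit and bit 35 the rightmost.  The helper names and
     the sample word are invented for illustration;  the FKNR field
     boundaries are the ones given in the text.

```python
def field(word: int, first_bit: int, last_bit: int) -> int:
    """Extract bits first_bit..last_bit of a 36-bit word (bit 0 is the MSB)."""
    width = last_bit - first_bit + 1
    return (word >> (35 - last_bit)) & ((1 << width) - 1)


def left_half(word: int) -> int:
    return field(word, 0, 17)


def right_half(word: int) -> int:
    return field(word, 18, 35)


def decode_fknr(word: int) -> dict:
    """Split an FKNR entry into the three fields described in the text."""
    return {
        "age_at_last_gc": field(word, 0, 8),     # bits 0-8
        "current_age": field(word, 9, 17),       # bits 9-17
        "reserve_ws_size": field(word, 18, 35),  # bits 18-35
    }


if __name__ == "__main__":
    sample = 0o123456000777                      # made-up FKNR contents
    for name, value in decode_fknr(sample).items():
        print(f"{name}: {value:o}")
```

     The same field function serves for any of the bit ranges quoted
     in this article, for example the fork number in bits 3-14 of a
     CST3 entry.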



     5.0  PAGER Context (TRAPSK)

          Page faults trap through the user's UPT, by  placing  the  old
     flags  and  PC  for  the  process  in  locations  UPTPFL and UPTPFO
     respectively, and taking the new PC from location  UPTPFN.   UPTPFN
     will  usually contain the address PGRTRP, which is the beginning of
     the page fault code.  The location being referenced  and  therefore
     causing  the  page  fault  is stored in UPTPFW, also called TRAPS0.
     This contains the virtual address that page faulted in bits  13-35.
     Bit  0  of  this  word indicates if the location is in user or exec
     (monitor) address space.  If this bit is set,  the  address  is  in
     user  address space.  The PGRTRP code copies TRAPS0 into TRAPSW, in
     case of recursion.  This code will determine the nature of the page
     fault,  and  attempt  to  resolve  it.   UPTPFL and UPTPFO are also
     called TRAPFL and TRAPPC respectively.  The old  stack  pointer  is
     saved  in  location  TRAPAP.   The  new  stack,  TRAPSK,  is set up
     according to the context of the page  fault,  i.e.,  user  context,
     monitor  context,  or  recursive  page fault.  A page fault in user
     mode causes the stack to be set up with the runtime, return PC, and
     return PC flags in the first three locations of the stack:

                TRAPSK/         runtime
                TRAPSK+1/       return PC
                TRAPSK+2/       return PC flags


          Page faults from monitor context have  the  following  initial
     stack set-up:

                TRAPSK/         AC1
                TRAPSK+1/       AC2
                TRAPSK+2/       AC3
                TRAPSK+3/       AC4
                TRAPSK+4/       AC7
                TRAPSK+5/       AC16
                TRAPSK+6/       TRAPSW
                TRAPSK+7/       runtime
                TRAPSK+10/      PC
                TRAPSK+11/      PC flags

     Recursive page faults will cause the following set up in TRAPSK, at
     the time of the page fault:

                / AC1
                / AC2
                / AC3
                / AC4
                / AC7
                / AC16
                / TRAPSW
                / PC
                / PC flags
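
          The trap word saved in TRAPS0 (and copied to TRAPSW) can be
     decoded as in the following Python sketch.  It uses only the two
     fields described above, bit 0 as the user/exec flag and bits
     13-35 as the virtual address, together with the 512-word page
     size.  The function name and the sample value are invented for
     illustration, and the remaining bits of the word are ignored.

```python
ADDRESS_MASK = (1 << 23) - 1      # bits 13-35: the faulting virtual address
PAGE_SHIFT = 9                    # 512 words per page
OFFSET_MASK = 0o777


def decode_trap_word(traps0: int) -> dict:
    """Split the contents of TRAPS0 into the fields described in the text."""
    address = traps0 & ADDRESS_MASK
    return {
        "user_space": bool((traps0 >> 35) & 1),   # bit 0 set: user address space
        "address": address,
        "page": address >> PAGE_SHIFT,            # virtual page number
        "offset": address & OFFSET_MASK,          # word within the page
    }


if __name__ == "__main__":
    sample = (1 << 35) | 0o123456                 # made-up TRAPS0 contents
    info = decode_trap_word(sample)
    space = "user" if info["user_space"] else "exec"
    print(f"{space} page {info['page']:o}, offset {info['offset']:o}")
```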


          In examining a pager crash, it is usually a good idea to begin
     by  tracing  down the Monitor's table entries for the location that
     faulted.  This location is stored in location TRAPS0.  The identity
     of  the page causing the trap is stored in location TRPID, and will
     be in either of two forms:  page table number  in  left,  and  page
     number in right, or simply the page table number in the right.  The
     page table number is an SPT index, and the page number, if any,  is
     an  offset  into the page table pointed to by that SPT slot.  There
     are four Core  Status  Tables  (CST's)  indexed  by  physical  page
     number, that are used to keep track of each page in the machine.  A
     page fault crash will usually have bad data in either the SPT  slot
     indicated  in  TRPID,  or  one  of  the CST's for the physical page
     pointed to indirectly through that SPT  slot.   If  TRPID  contains
     PTN,,PN,  then  find location SPT+PTN.  This should have a physical
     page number in the right half.  Look at this physical page,  offset
     by  PN  in  TRPID  to  find the pointer to the page that caused the
     fault.  Shared and indirect pointers in this  location  will  point
     through  another  SPT  location,  but  private  pointers will point
     directly at the physical page that we are looking  for.   If  TRPID
     contains just PTN, then SPT+PTN will point directly at the physical
     page we are looking for.  Knowing the physical page number,  it  is
     now possible to examine the CST tables for that page.
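
          The first steps of this lookup can be sketched in Python as
     shown below.  The sketch follows the description above literally:
     it splits TRPID into its two possible forms and, for the PTN,,PN
     form, takes the page table's physical page number from the right
     half of the SPT slot.  The function names, the read_word accessor,
     and the addresses in the example are all invented for
     illustration;  following shared and indirect pointers is left
     out.

```python
PAGE_SIZE = 0o1000                # 512 words per page
RIGHT_HALF = 0o777777


def decode_trpid(trpid: int) -> dict:
    """Return the page table number (PTN) and page number (PN, or None)."""
    left, right = trpid >> 18, trpid & RIGHT_HALF
    if left:                      # PTN,,PN form
        return {"ptn": left, "pn": right}
    return {"ptn": right, "pn": None}   # PTN alone, in the right half


def pointer_address(trpid: int, spt_base: int, read_word) -> int:
    """Physical address of the word holding the pointer to the faulting page.

    read_word is a hypothetical accessor returning the 36-bit contents
    of a physical address in the dump.
    """
    ident = decode_trpid(trpid)
    spt_slot = spt_base + ident["ptn"]
    if ident["pn"] is None:
        return spt_slot           # the SPT slot itself points at the page
    page_table_page = read_word(spt_slot) & RIGHT_HALF
    return page_table_page * PAGE_SIZE + ident["pn"]


if __name__ == "__main__":
    memory = {0o5012: 0o345}      # made-up dump: SPT slot 12 -> page 345
    addr = pointer_address((0o12 << 18) | 0o7, 0o5000, memory.__getitem__)
    print(f"pointer word is at physical address {addr:o}")
```

     The word found at the resulting address is the page pointer to
     examine;  a private pointer found there gives the physical page
     number used to index the CST tables.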


     CST0      Used principally by the  pager  hardware,  this  location
               will  contain  the Process Use Register, mentioned in the
               FKCNO table above, and the age stamp.

     CST1      Contains the system lock count, and  the  backup  address
               for  the  page.   The  lock count indicates the number of
               system events that must complete before the page may be
               swapped out;  the system should never swap out a page
               with a non-zero lock count.  The backup address can be a
               disk or drum address for a page in memory.

     CST2      Contains the home map location of the  page,  and  should
               match the contents of TRPID.

     CST3      Is used by the software  to  create  lists  of  pages  in
               various  states  of  use.   Those pages available for use
               will be on the Replaceable Queue, and linked together  in
               a doubly linked list.  Those pages awaiting swapping will
               be on a swapping device  queue,  and  part  of  a  singly
               linked  list.   Pages in use will contain the fork number
               of the owner in bits 3-14, and the local disk address for
               PHYSIO for the page.

     CST5      Contains the list of short I/O  Request  Blocks  (IORB's)
               associated with the page.


          A few other significant locations for page faults are:


     RPLQ      Points to the beginning of the Replaceable Queue in CST3.

     NRPLQ     Contains the number of pages on the Replaceable Queue.

     SWPLST    Points to the beginning of the PHYSIO swap list, in CST3.

     NOF       Contains the number of OFN's in use in the SPT.



     6.0  PSI Context (PIPDB)

          Take note of the Monitor fork tables in the  JSYS  section  of
     this  document.   The  locations FKINT and FKINTB will be useful in
     determining the type and timing of PSI interrupts  pending  at  the
     time  of the crash.  When a process has a PSI interrupt pending, it
     is flagged in the FKINT entry for that fork, and the scheduler will
     take  note  of  this  event and set the PPC location in the PSB for
     that process to contain the address PIRQ.  This action takes  place
     at  location  SCHED5  in  the  scheduler.   The  next time that the
     process is ready to run, it will continue at location  PIRQ,  which
     will  set  up  the  PSI  stack,  PIPDB.   SCHED5 also moves the PSI
     request word from FKINT to PIMSK in the PSB.  Thus, it is  possible
     to check this location for the last PSI request that was scheduled.
     The old contents of PPC and PFL are stored in PIPC and PIFL by  the
     SCHED5  routine, so these will indicate the point where the process
     was interrupted.
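
          The request word found in FKINT or PIMSK can be decoded into a
     list of channel numbers.  The following Python sketch does this
     under one stated assumption that is not confirmed by this article:
     that bit n of the word (PDP-10 numbering, bit 0 leftmost)
     corresponds to PSI channel n.  The function name is hypothetical.

```python
WORD_BITS = 36  # PDP-10 word size


def pending_psi_channels(request_word):
    """List the bit positions set in a PSI request word.

    ASSUMPTION: bit n (bit 0 is the most significant bit) stands for
    PSI channel n.  Check this against the monitor sources before
    relying on it.
    """
    return [n for n in range(WORD_BITS)
            if request_word & (1 << (WORD_BITS - 1 - n))]


# Hypothetical request word with bits 0 and 35 set.
print(pending_psi_channels(0o400000000001))
```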



     7.0  Scheduler Context (SKDPDL)

          Take note of the Monitor Fork tables in the  JSYS  section  of
     this  document.   The  scheduler  is  usually invoked in one of two
     ways:  through a software interrupt initiated by the channel 3 PI
     routine, indicating that a set period of time has elapsed since the
     last scheduler cycle, or through the ENTSKD macro, which is used by
     a  running  process  that  is  about  to  dismiss.  In this way the
     scheduler is guaranteed to run at regular  intervals,  or  whenever
     the system is idle.  The primary entry point to the scheduler is
     SCHED0.  Control passes through this location whenever the running
     process dismisses, or whenever one of the two scheduler clock
     cycles elapses.  Briefly, the hardware traps on every clock tick
     through location TIMVIL in the EPT.  This location contains the
     instruction XPCW TIMINT.  As in the device interrupt code, this
     instruction causes the flags and PC to be placed in locations
     TIMINT and TIMINT+1, and the new flags and PC to be taken from
     TIMINT+2 and TIMINT+3; the new PC in this case is TIMIN0.  TIMIN0 determines
     whether or not it is time to run the scheduler, and  dismisses  the
     interrupt.   If  the  scheduler  is  to  be run, TIMIN0 initiates a
     software interrupt on channel 7, which causes a  trap  through  the
     EPT  location  KIEPT+56  to  PISC7R.   The  instruction executed in
     KIEPT+56 is an XPCW PISC7R, causing the old  PC  and  flags  to  be
     deposited  at  PISC7R,  and control to begin at PISC7+1.  The PISC7
     code sets up PPC and PFL to contain the  old  PC  and  flags,  from
     PISC7R,  and saves the process ac's at the time of the interrupt in
     a block of the  PSB  called  PAC.   Having  set  up  for  scheduler
     context,  the  PISC7  code  then  transfers  control  to the SCHED0
     routine.  Similarly, the ENTSKD macro does an XPCW ENSKR, causing a
     jump to the ENSKED routine that does the context switch.


*** Ed. note - The timer interrupt code is different for the 2020, but the
	effect is the same.  See APRSRV for the exact coding.


          Some other useful locations in scheduler context:

      o  GOLST     Points to the beginning of  the  GOLST  in  the  FKPT
         table.

      o  WTLST     Points to the Wait list in the FKPT table.

      o  TTILST    Points to the TTY input wait list in the FKPT table.

      o  FRZLST    Points to the list of frozen forks.

      o  WT2LST    Points to the list of forks waiting to be  unblocked.
         (UNBLK1)

      o  TRMLST    Points to the list of forks waiting for another  fork
         to terminate.

      o  SUMNR     Contains the number of reserved  pages.   (locked  in
         memory)

      o  BALSHC    Contains the number of pages reserved due  to  shared
         access.

      o  INSKED    Set to non-zero if in the scheduler.




     8.0  DTE Interrupt Context (DTESTK)

          DTE interrupts also dispatch through  locations  in  the  EPT,
     depending  upon which DTE is interrupting.  For each DTE that could
     exist on a system (4), there is an eight word block in the EPT used
     to  keep  up-to-date  information for that DTE.  Not all of the DTE
     blocks will necessarily be used, however they will all exist in the
     EPT.   These blocks begin at location DTEEBP.  The format of one of
     these blocks is described below.  The DTE  interrupt  executes  the
     third  word  in this block, which contains a XPCW DTEN0.  This will
     cause the old PC and flags to be stored  at  location  DTEN0,  and,
     since  DTEN0+3 contains ".+1", the system will begin processing the
     interrupt at location DTEN0+4.  This part of the routine  will  set
     up  the  DTE  stack, DTESTK, and save the PC, flags, and AC's.  The
     flags and PC are stored at DTETRA,  and  the  AC's  are  stored  at
     DTEACB.  DTEN0 will then use INTDTE to process the interrupt.  This
     code can be found in the DTESRV module of the monitor.

     The DTE control block:

        DTEEBP/ To -11 byte pointer
        DTETBP/ To -10 byte pointer
        DTEINT/ "XPCW DTEN0"            ;dispatch for DTE-0
              / reserved
        DTEEPW/ Examine Protection Word
        DTEERW/ Examine Relocation Word
        DTEDPW/ Deposit Protection Word
        DTEDRW/ Deposit Relocation Word

     Note that the labels above  apply  only  to  DTE-0,  and  that  the
     remaining DTE's must be offset by DTE-number X 8.
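
          The address arithmetic for the DTE control blocks can be
     sketched as follows.  This Python fragment is only an illustration.
     The word offsets are taken from the order of the block listing
     above, the function name is hypothetical, and the base value used
     in the example (140 octal) is an assumption; take the real value
     of DTEEBP from the monitor symbol table.

```python
DTE_BLOCK_WORDS = 8   # eight-word block per DTE
MAX_DTES = 4          # number of DTE's that could exist on a system

# Word offsets within one block, in the order listed in the text.
DTE_BLOCK_OFFSETS = {
    "DTEEBP": 0, "DTETBP": 1, "DTEINT": 2, "reserved": 3,
    "DTEEPW": 4, "DTEERW": 5, "DTEDPW": 6, "DTEDRW": 7,
}


def dte_word_address(dteebp, dte_number, label):
    """Address of the word `label` in the control block for `dte_number`."""
    if not 0 <= dte_number < MAX_DTES:
        raise ValueError("DTE number must be 0 through 3")
    return dteebp + dte_number * DTE_BLOCK_WORDS + DTE_BLOCK_OFFSETS[label]


# Hypothetical base address for DTEEBP; an assumption for this example.
base = 0o140
print(oct(dte_word_address(base, 1, "DTEINT")))
```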

          Some other useful locations in the EPT:

        DTEFLG/ Operation Complete Flag
        DTECFK/ Clock Interrupt Flag
        DTECKI/ Clock Interrupt Instruction
        DTET11/ To -11 argument
        DTEF11/ From -11 argument
        DTECMD/ Command Word
        DTESEQ/ DTE20 Operation Sequence Number
        DTEOPR/ Operation In Progress Flag
        DTECHR/ Last Typed Character
        DTETMD/ Monitor TTY Output Complete Flag
        DTEMTI/ Monitor TTY Input Flag
        DTESWR/ Console Switch Register

     These locations are found at offsets 444 through 457 (octal) in the
     EPT.
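
          Twelve names are listed for the twelve words from 444 through
     457 octal.  The Python sketch below builds a name-to-offset table
     on the assumption that the names are listed in address order; the
     article does not state this outright, so verify against the
     monitor symbols.  The table and function names are hypothetical.

```python
# Names in the order given in the text.
EPT_COMM_NAMES = [
    "DTEFLG", "DTECFK", "DTECKI", "DTET11", "DTEF11", "DTECMD",
    "DTESEQ", "DTEOPR", "DTECHR", "DTETMD", "DTEMTI", "DTESWR",
]
FIRST_OFFSET = 0o444  # first offset given in the text

# ASSUMPTION: listing order equals address order.
EPT_COMM_OFFSETS = {name: FIRST_OFFSET + i
                    for i, name in enumerate(EPT_COMM_NAMES)}


def comm_region(ept_words):
    """Map each name to its word, given EPT words indexed by offset."""
    return {name: ept_words[offset]
            for name, offset in EPT_COMM_OFFSETS.items()}


for name, offset in EPT_COMM_OFFSETS.items():
    print("%s/ EPT+%o" % (name, offset))
```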



     9.0  I/O Queueing (PHYPDL)

          All disk and tape I/O is initiated through the PHYSIO code, by
     calling  PHYSIO  with  a  pointer to an I/O Request Block (IORB) in
     ac1, and the addresses of the Channel Data  Block  (CDB)  and  Unit
     Data Block (UDB) in ac2 (CDB,,UDB).  PHYSIO validates the arguments


*** Ed. note - The arguments other than the IORB are not provided by PAGEM
	requests.  PHYSIO figures them out.

     passed to it, and then determines whether the IORB belongs  on  the
     Position  Wait Queue (PWQ) or the Transfer Wait Queue (TWQ).  These
     two queues are pointed to by offsets UDBPWQ and UDBTWQ in  the  UDB
     for the device.  Note that these are offsets into the UDB, which,
     like the CDB's, will be in resident free space.  During
     processing, PHYSIO will keep the following information in the ac's:

        P1/     address of the CDB
        P2/     address of the KDB (for tapes) or 0
        P3/     address of the UDB
        P4/     address of the IORB being processed
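
          The CDB,,UDB argument in ac2 is a pair of 18-bit halfwords.  A
     minimal Python sketch for splitting such a word when reading saved
     ac's from a dump follows; the function names and the example value
     are hypothetical.

```python
HALF_MASK = 0o777777  # 18 bits


def halves(word):
    """Split a 36-bit word into (left half, right half)."""
    return (word >> 18) & HALF_MASK, word & HALF_MASK


def physio_ac2(word):
    """Interpret ac2 as CDB,,UDB as described in the text."""
    cdb, udb = halves(word)
    return {"CDB": cdb, "UDB": udb}


# Hypothetical ac2 contents.
args = physio_ac2(0o123456_654321)
print("CDB/ %o   UDB/ %o" % (args["CDB"], args["UDB"]))
```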

     Since PHYSIO is called via the PUSHJ P, instruction,  the  previous
     PC  is not saved.  The P and Q ac's are stored on the stack via the


*** Ed note - The previous PC is easily found by looking at the previous
	stack.

     SAVEPQ macro.  PHYSIO does use a private  stack,  and  so  the  old
     stack  pointer is saved in PHYSVP.  Also, because PHYSIO does use a
     private stack, it is necessary for the process calling PHYSIO to be
     NOSKED.  Also take note of the fact that IORB's are associated with
     the physical pages of memory that are involved with the I/O through
     pointers  in  the CST5 table for those pages.  See the next section
     for more information in this area.



     10.0  Device Interrupt Context (PHYIPD)

          Device interrupts, in this context, refer  to  disk  and  tape
     interrupts,  those devices connected through the RH20's.  Each RH20
     channel has a "Channel Logout" area at the beginning of the EPT.  This
     logout  area  is  four words in length for each channel, the fourth
     word of which contains an instruction to execute on  an  interrupt.
     This  instruction causes the system to dispatch to code actually in
     the CDB for the channel.


*** Ed note - the controllers used on the 2020 (RH11, etc.) do not work
	exactly as the RH20.  Refer to PHYH11 etc. for code info.  The
	code at the interrupt makes adjustments and picks up somewhat of
	the common line at the CDBINT level, although it is very special
	cased.

          The previous PC and flags are saved in  the  area  immediately
     preceding  the CDB;  offset CDBINT (value -6) is the location where
     the flags and PC  are  stored.   When  the  interrupt  occurs,  the
     hardware executes the instruction in the channel logout area, which
     is "XPCW loc".  "Loc" is the address of the CDB for  this  channel,
     offset  by  CDBINT  (-6).   The XPCW instruction saves the flags at
     CDBINT(CDB), the PC at the next location, and gets  the  new  flags
     and  PC  from  the next two locations.  This area of the CDB, then,
     contains the following:

        CDBINT(CDB)/    old flags
            -5(CDB)/    old PC
            -4(CDB)/    new flags (0)
            -3(CDB)/    new PC ( ".+1")
            -2(CDB)/    MOVEM P1,CDBSVQ(CDB)    ; saved in CDB offset CDBSVQ
            -1(CDB)/    JSP P1,PHYINT           ; dispatch to interrupt code
        CDBSTS(CDB)/    status and configuration flags
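
          The effect of the XPCW on this block can be modeled in a few
     lines.  The following Python sketch is a simplified model, not a
     processor emulator: memory is a dictionary of words, and the CDB
     address, old flags, and old PC are made-up values.  Only the
     CDBINT offset and the four-word layout come from the text.

```python
def xpcw(memory, e, flags, pc):
    """Model XPCW E: save old flags and PC, return the new flags and PC."""
    memory[e] = flags                     # old flags stored at E
    memory[e + 1] = pc                    # old PC stored at E+1
    return memory[e + 2], memory[e + 3]   # new flags from E+2, new PC from E+3


CDBINT = -6       # offset given in the text
cdb = 0o1000      # hypothetical CDB address

memory = {
    cdb - 4: 0,         # new flags (0)
    cdb - 3: cdb - 2,   # new PC (".+1")
}

# Hypothetical old flags and PC at the time of the interrupt.
new_flags, new_pc = xpcw(memory, cdb + CDBINT, flags=0o010000, pc=0o345670)

# new_pc is the address of the MOVEM P1,CDBSVQ(CDB) word at -2(CDB).
print(oct(new_pc))
```

     After the call, memory[cdb-6] and memory[cdb-5] hold the old flags
     and PC, which is where to look for the interrupted PC in a dump.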

     The PHYINT code, then, resolves the interrupt, and returns  to  the
     old  PC  by JRSTing through offset CDBJEN in the CDB.  This part of
     the CDB contains the following:

        CDBJEN(CDB)/    BLT 17,17
                   /    DATAO RH,CDBRST
                   /    XJEN CDBINT(P1)

     The last of these locations causes the system to  resume  where  it
     was interrupted.  During processing of the interrupt, the following
     information may be found:

        P1/     address of the CDB
        P2/     address of the KDB or 0
        P3/     address of the UDB
        P4/     address of the IORB or argument code:

                (P4) < 0 - schedule a channel cycle
                (P4) = 0 - dismiss interrupt
                (P4) > 0 - complete current request (IORB address)


          When the system is attempting to perform  I/O  to  or  from  a
     specific page of physical memory, that page is locked into core, by
     incrementing the lock count in the CST1 location for that page.  If
     a  device  error  occurs during the transfer of data for that page,
     then the CST5 entry for that page will  have  either  a  short  I/O
     Request  Block  (IORB) or a pointer to a long (Mag Tape) IORB.  The
     short IORB is only one word in length and is used for disk transfer
     requests,  i.e.,  swapping.   In  either case, the first word of an
     IORB, called IRBSTS, contains flags that describe  the  success  or
     failure  of  the  transfer.   It  may  be  helpful  to  check these
     locations in the event of a PHYINT crash.

          The following offsets contain useful  information  for  PHYSIO
     crashes:

     In the UDB:

        UDBPS1/ cylinder number
        UDBPS2/ surface,, sector number
        UDBERC/ error retry count
        UDBERR/ status function for error retry

     In the CDB:

        CDBCNI/ status of channel when interrupt began.




     11.0  APR Interrupt Context (MEMPP)

          APR Interrupts, like Device interrupts, are  vectored  through
     the EPT, but in the case of the APR interrupts, the vector location
     is a part of the priority interrupt  scheme.   These  are  priority
     channel 3 interrupts, and dispatch through location KIEPT+45, which
     contains a XPCW PIAPRX.  This is the channel 3  interrupt  routine.
     This routine will attempt to resolve the interrupt, and in doing so
     will set up its own stack, MEMPP.  As in the  case  of  the  device
     interrupt, the XPCW PIAPRX will cause the flags and PC to be stored
     at locations PIAPRX and PIAPRX+1, and the processor will then take
     its new flags and PC from the following two locations, PIAPRX+2 and
     PIAPRX+3.  The new PC is PIAPR.  PIAPR
     actually dismisses the APR interrupt, or BUGHLT's.  The  old  stack
     pointer,  at  the  time of the interrupt, is stored in MEMAP.  Ac's
     0-10 are saved starting at location MEMPA.  One  unusual  aspect
     about  handling  APR  interrupts is that the PIAPR code changes the
     page fault trap vector, mentioned earlier, from PGRTRP  to  MEMPTP,
     in  UPTPFN,  to  handle  the  special  case  of a page fault in APR
     interrupt context.

*** Ed note - PI channel 3 is also used for some other features on the
	2020, and APRSRV should be referred to for those changes.