TOPS-20 TROUBLE-SHOOTING HANDBOOK
=================================
Release 4 Edition
February 1980
TOPS-20 Monitor Group
Marlboro Support Group
Software Services
TOPS-20 TROUBLE-SHOOTING HANDBOOK Page 2
INTRODUCTION
------------
This document is the TOPS-20 Trouble-Shooting Handbook. It is a
collection of materials designed to increase the effectiveness of the
Software Specialist in the field in coping with TOPS-20 problems.
Some of the common "disasters" to befall TOPS-20 sites are discussed,
along with debugging methods in general. Though the information
contained herein is probably not sufficient to make a Specialist into
a TOPS-20 "wizard", it should help ease the communication burden
between the Specialist in the field and his counterpart in Marlboro
and lead to quicker resolution of problems.
This document contains materials from many sources, and presents
some information not available anywhere else. Certain sections may be
a bit dated, but an effort has been made to remove at least some of
the old/wrong stuff along with including new articles.
There is a continuing need to update this document as part of the
SWSKIT materials, and Specialists are encouraged to give the Marlboro
Support Group feedback on these materials.
TABLE OF CONTENTS
1. INTRODUCTION
2. TABLE OF CONTENTS
3. POLICY STATEMENT
4. PRODUCING A GOOD SPR
5. USING SIRUS
6. MAPPING DIRECTORIES IN MDDT
7. RECOVERING FROM DIRECTORY ERRORS
8. MORE ABOUT DIRECTORY PROBLEMS
9. JSB AND PSB MAPPING
10. BREAKPOINTING MULTI-USER CODE
11. USING ADDRESS BREAK TO DEBUG THE MONITOR
12. RECOVERING FROM SYSTEM DISASTERS
13. LOOKING AT HUNG TAPES
14. A LOOK AT SOME OF THE DISK STUFF
15. NEW DISK FEATURES FOR FILDDT
16. TOPS-20 SCHEDULER TEST ROUTINES
17. KNOWN HARDWARE DEFICIENCIES LIST
18. KS10 CONSOLE INFORMATION
19. CRASH ANALYSIS
20. BUG'TYP MACRO CHANGES FOR VERSION 4 OF TOPS-20
21. MONITOR BUILDING HINTS
22. EXEC DEBUGGING
23. RECOVERING FROM A BAD EXEC
24. DEBUGGING THE GALAXY SYSTEM
25. DEBUGGING MOUNTR
26. DEBUGGING PA1050
27. COPYING FLOPPY DISKS
28. THE SWSKIT TOOLS PROGRAMS
LEGAL POLICY CONCERNING THE TOPS-20 SWSKIT
There is a great deal of confusion concerning the materials that
make up the SWSKIT tape and their legal standing. This memo is an
attempt to clear up some of that confusion.
The SWSKITs are made up of an assortment of materials intended to
increase the effectiveness of the software specialist. These
materials include program sources not normally distributed or sold for
a premium; internal and company confidential documentation, which may
be in part incomplete or even incorrect, but is supplied for its
information value on subsystems which may be insufficiently documented
through the usual channels; documentation for specialists specially
produced by the corporate support people; and utility programs
produced and maintained to some extent by corporate support. In
addition, the SWSKIT may contain special or pre-release versions of
supported software provided for the incremental value a specialist may
obtain from the software under controlled circumstances. In time,
utilities from the SWSKIT may evolve into supported products.
All of the SWSKIT materials are proprietary to DIGITAL, and were
never intended to be just given to the customer. Obviously, the
materials which are otherwise sold cannot be given away; and the
company confidential materials should not be. While it is expected
that the tools programs may wind up being used at customer sites,
neither are they gifts to the customer. An effort must be made to
protect DIGITAL's rights to these proprietary materials. For
instance, a PL90 contract retains rights to all materials provided to
the customer. Deleting a tool program after use at a customer site
shows an intent to retain those rights. Be aware that if a customer
incurs damages due to use of some program given to him by the
specialist, even though improperly used, then DIGITAL may be seen to
be at least in part responsible. This should be avoided.
In summary, the SWSKIT is a tool provided to increase the
effectiveness of the specialist, especially with regard to PL90 and
debugging activity, but the rights to all materials remain with
DIGITAL and the specialist should act accordingly.
THIS IS NOT A LEGAL DEPARTMENT DOCUMENT. CONSULT LEGAL IF YOU
HAVE ANY DEFINITE PROBLEMS REQUIRING RESOLUTION.
TOPS-20 TROUBLE-SHOOTING HANDBOOK Page 6
PRODUCING A GOOD SPR
PRODUCING A GOOD SPR
A software specialist is often asked to assist with the
submission of SPRs for a customer. It is always discouraging to
have problems getting an answer to an SPR for entirely
non-technical reasons. For that reason, below are some hints for
producing a "good" SPR which will help in getting the problem
solved more quickly.
1.0 THE SPR FORM
Much of the data on the SPR form seems unimportant, until it is
omitted. The line of product field is one example. Try to isolate
the problem to the correct component, since that will determine who
first receives the SPR. This will eliminate the time it takes for,
say, the COBOL maintainer to determine that the problem is not
really in COBOL, but in PA1050 or the monitor, and the time it
takes for the next maintainer to become familiar with the problem.
Something which crashes the system is always a monitor problem,
even if it is an EXEC command which causes the problem, or a short
BASIC program.
If you really have a problem, be sure to mark the "problem"
box, and don't use words like "we suggest you correct the
following situation...". If the people who handle the incoming
paperwork think they have a suggestion, it gets routed elsewhere,
and is never seen by the maintainers. A few problems have been
greatly delayed this way.
The priority boxes are not super-critical, but if you have a
problem which is holding up production, or crashing the system
several times a day, try to make a note of that somewhere in the
description of the problem. That should let the maintainer know
that a work-around may also be appropriate in the short term.
The phone number of the submitter could be important if the
problem proves not to be reproducible, or if it is so complex
that further clarification might be needed just to understand
the problem. Your number here as a software specialist provides
a more informal contact than direct maintainer-to-customer
confrontation, although the customer will be contacted directly
if that is most expedient.
The attachments--be sure to mark some of these boxes if you
send along supporting materials. Since these can get separated
from the form, this will help keep them from getting permanently
lost.
The "DO NOT PUBLISH" box is for security problems and ways to
crash the system. We double-check this on incoming handling, but
if the box is checked you can be sure that the SPR will not be
published unanswered.
Describe the problem as clearly as possible in the space
provided. Try to provide enough detail to easily reproduce the
problem. Concentrate on the description of the problem, and any
diagnosis you may have made. Attempting to declare a "cure" is
not always a good idea because the actual correction may be of an
entirely different nature for a number of reasons. However, if
you have something that works, the information could be of use.
Just don't count on that exact change being the actual fix. If
the problem is not reproducible from the description given,
chances are that something you left out is relevant to the
problem. Unless the problem directly concerns them, things like
logical names, mounted structures, and other features often
obscure the problem. For the purpose of the problem description,
a terminal listing of an occurrence is often highly desirable, and
it is sometimes a good idea to create a brand-new directory
without any fancy LOGIN.CMD setups or user groups and so on to
demonstrate the problem.
2.0 THE SUPPORTING MATERIALS
As above, the listing from a terminal session is often a very
good attachment. Try to include all the relevant information.
Again, sometimes things like logical names, file and directory
protections, user groups, and other job-state variables are
important and should be included. Inclusion of data such as
program version numbers and edit levels can be useful for products
with large numbers of edits. If you are complaining of monitor
problems, which patches you have installed could be useful
information. Terminal sessions should be as clear as possible.
It should be made obvious just what is going on or the maintainer
may just see a series of commands and think "So?". Concurrent or
after-the-fact commenting is one way to accomplish this.
Many times there is a program which exercises the bug.
Sometimes these programs are alright as they are, but often they
are giant COBOL monsters working on a multi-RP06 data base, and
very unwieldy for a maintainer to try to work with. If the
program can be reduced to a small subset, do so. Many monitor
problems turn out to be reproducible from a set of arguments
to a single JSYS. If it is a question of incorrect output from
some program, it is helpful to send along all the files needed to
reproduce the problem, and the files of incorrect output. In the
case of programs with multiple edits to field-image, this speeds
up the maintainer, since he does not have to manually apply those
edits to attempt to recreate your versions, and he can also check
the installation of the edits, if that is appropriate. And in
case the problem proves not to be easily reproducible, the bad
output can at least be examined for clues.
In the case of a monitor crash, the problem may have been
reduced to a program of less than one page. It is tempting to
type this on the front of the SPR and send it in that way. While
the maintainer can type in the program easily enough (if the copy
is both legible and correct), the submitter has been lax.
Sometimes, that short program will not cause a crash, even though
run thousands of times under varying conditions by the maintainer.
And even when it does cause the crash the first time, the
submitter has lengthened the turn-around by not sending the dump
from the crash along with the SPR. Sending the dump solves both
problems. If the problem is not reproducible with ease, the dump
is vital to further understanding. And having the dump to start
with speeds up the work of the maintainer, who now does not need to
schedule stand-alone time to try to exercise the bug and cause a crash
so he has a dump to look at.
When sending a dump, always send the unrun monitor along with
it. If you don't, you are just causing a delay in handling the
problem while the maintainer tries it against the standard monitors,
which involves finding the tapes that contain them and loading
them. If you are running an unpatched standard monitor and choose
not to send it, at least state which one it is somewhere on the
form. The unrun monitor is also useful for checking the existence
and correct installation of patches when that becomes an issue.
The current preferred tape format is 9-track, 1600 bpi, in
standard DUMPER format, not INTERCHANGE format, since file
information can be lost that way. Take the time to get a listing
of a directory of the tape and include it with the tape. This
speeds things up: if the directory makes it obvious that
something is missing, you get feedback sooner. The listing also
indicates that the tape will indeed be readable when received,
and partly eliminates the maintainer's usual first step of
getting a directory of the tape.
USING SIRUS
-----------
Did you know that you can dial into a Marlboro development system
and type out almost any patch that the Marlboro Support Group has made
to -10 or -20 software in the last three to four years? The program
which does this is called SIRUS, and with it you can:
1. Search through all the patches to a particular product, if
you know a problem exists but don't know what the patch is or
don't know if we've heard of the problem. If you find the
patch you want, you can then type it out.
2. Type out a particular patch to a particular product, if you
know what the edit number is.
3. Obtain the status of any SPR, including the entire answer if
it has been answered.
By using SIRUS, you can get patches whenever the system is up,
even if it's two A. M. and the Hotline is closed. You can print
patches in your local office without having to wait for a specialist
in Marlboro to mail you a copy. You can be sure that the patch you
have is correct. (Dictating patches over the Hotline is very prone to
errors.) Even if the problem you are experiencing cannot be found in
SIRUS, you can help us by saying so when you call. We then immediately
know that the problem you are having is a new one.
There have been several articles about SIRUS in previous Large
Buffers, but none have been oriented towards specialists in the field.
This one is!
To use SIRUS, dial into system 1026 in Marlboro, log in, and then
run it. In more detail:
1. Dial into system 1026. Any of the following numbers will
reach system 1026 in Marlboro. They are all 300 baud lines.
231-1171 (DTN)
231-1172 (DTN)
(617)481-5606
(617)481-5632
(617)481-5635
(617)481-5636
(617)481-5637
(617)481-5638
Once the machine notices you, type "SET HOST 26" to ensure
that you are connected to system 1026. If you get the
message "?Undefined Network Node", the machine is down (try
again later).
2. To log in, type "LOGIN 10,#". When the machine requests a
name, type one in. You will not need a password.
3. To run SIRUS, just type "R SIRUS". SIRUS takes several
seconds to initialize itself and then prompts you with
"PRODUCT [H]*". At this point, type either "10<CRLF>" or
"20<CRLF>" depending on whether the customer of concern is
running TOPS10 or TOPS20. SIRUS then prompts you with
"[H] *". You are now at SIRUS command level.
SIRUS has many commands, but only a few are of interest to the
field specialist. They are:
1. H -- for Help. This may be typed anytime SIRUS precedes its
prompt with "[H]".
2. EX -- for Exit. Use this to exit SIRUS. Then type K/N to
log out, and hang up.
3. PP -- for Peruse PCOs. PCO stands for Product Change Order
and essentially means a patch. This command is used to look
through patches for a particular product if you aren't sure
which patch you want.
4. GP -- for Get PCO. This is used to type out a particular
patch once you know which one you want.
5. GS -- for Get SPR. Use this to retrieve information on a
particular SPR.
6. NP -- for New Product. Use this command if you type the
wrong answer to "PRODUCT [H]*" as mentioned above, or use it
in association with the PP command as described below. SIRUS
will prompt you for a product again.
The three most useful of these commands are PP, GP, and GS.
3.0 PP Command
Use this command to peruse the patches for a particular product
-- e.g. LINK or 603 (monitor) or BATCON -- if you want to find a
particular patch you know exists, or if you want to know if the
support group has heard of and fixed some problem you are experiencing
with a product. After you type "PP<CRLF>" SIRUS will prompt for a
component. Here type the program you're interested in -- LINK, BATCON
or whatever. A response of LIST will type the programs SIRUS knows
about and then prompt you for a component again.
Once you type in the component, SIRUS prompts with "[H] PCO #:".
There are two reasonable responses to this. The first is ALL. (Type
NO to the subsequent question about a file.) This will give you a
short summary of all the patches available for this product, one line
per patch. This includes a PCO number, the SPR for which this patch
was written, the edit number corresponding to the patch (for the
TOPS10 monitor this is the MCO number), a keyword describing the bug,
the maintainer who wrote the patch, and the date it was made. The
other response you might type here is simply <CRLF>. In this case
SIRUS will type out the symptom of the newest PCO, and then prompt you
with "NEXT?". By continuing to type carriage returns, you can type
all the symptoms of all the patches for this product, from the newest
to the oldest. When you have found the patch you want (remember the
PCO number), type RETURN to get back to SIRUS command level.
If you did not find your symptom while perusing, and your product
exists on both TOPS10 and TOPS20, you should also search the PCOs for
the alternate operating system. To do this, type NP to SIRUS command
level, and then type in the other product number when SIRUS asks for
it. Then peruse PCOs for your product as you did before.
4.0 GP Command
This is used to print out a patch once you know the PCO number.
The PCO number is printed while you are perusing PCOs and is of the
form 10-product-nnn or 20-product-nnn. After typing GP to SIRUS
command level, SIRUS prompts for a PCO number. The leading "10-" or
"20-" is supplied by SIRUS, so your response should be of the form
"product-nnn".
In response, SIRUS types out information about the patch. The
two most useful data are labeled VLD and SAE. VLD stands for validity
and is the version of the software to which the patch applies. SAE is
Source After Edit and is the edit or MCO number of the patch. To get
the actual text of the patch, respond YES to SIRUS's question "Show
Write-up File?".
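As an illustration of the PCO number format just described, here is a
minimal Python sketch, not part of SIRUS, that expands a reply such as
"D60SPL-8" into the full 20-product-nnn form. The function name
canonical_pco is hypothetical, and the three-digit zero padding is an
assumption inferred from the example run below, where "20-D60SPL-8" is
retrieved as 20-D60SPL-008. The sketch also accepts the full form,
since the example run shows SIRUS tolerating it.

```python
import re

# Hypothetical pattern: an optional "10-" or "20-" prefix, a product name
# such as D60SPL or 603, and an edit number of one to three digits.
PCO_PATTERN = re.compile(r"^(?:(10|20)-)?([A-Z0-9]+)-(\d{1,3})$")


def canonical_pco(reply: str, system: str) -> str:
    """Expand a GP reply like 'D60SPL-8' to the form '20-D60SPL-008'.

    `system` is "10" or "20", the answer given to the PRODUCT prompt.
    """
    match = PCO_PATTERN.match(reply.strip().upper())
    if match is None:
        raise ValueError(f"not a PCO number: {reply!r}")
    prefix, product, number = match.groups()
    if prefix is not None and prefix != system:
        raise ValueError(f"{reply!r} is a TOPS-{prefix} PCO, not TOPS-{system}")
    # Assumed convention: edit numbers are shown zero-padded to three digits.
    return f"{system}-{product}-{int(number):03d}"


print(canonical_pco("D60SPL-8", "20"))     # 20-D60SPL-008
print(canonical_pco("20-D60SPL-8", "20"))  # 20-D60SPL-008
```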
5.0 GS Command
This is used to get the status of an SPR. SIRUS will prompt for
an SPR number, and then will provide you with info about the SPR you
specified. This includes the site that submitted the SPR, the
specialist responsible for the SPR, the date received, and the date
closed, if the SPR has been answered. If answered, it will also say
whether or not an auxiliary file was written for the SPR and what PCOs
(if any) were included. The aux file is an introductory paragraph
which is written for most SPR answers. For SPRs which do not require
patches, the aux file constitutes the entire answer. The aux file can
be typed by responding YES to "SHOW AUXILIARY FILE?". The PCOs can be
typed out with the GP command.
Finally, if SIRUS begins to give you error messages such as "File
not found", EX from SIRUS and mount a special disk pack with the
monitor command "MOUNT SIRS:". Then try again. This gives you access
to more PCOs and aux files than are normally available.
For more information, see the example run of SIRUS below, in
which user input is shown underlined, or the article on SIRUS
published in volume 409 of the Large Buffer. Note that SIRUS is for
use by DIGITAL personnel only. DO NOT give out instructions for its
use or the system 1026 phone numbers to customers.
.R SIRUS
- -----
SIRUS...3(3)
[WHEN '[H]' APPEARS YOU MAY TYPE 'HELP' FOR ASSISTANCE]
PRODUCT [H]* 20
--
[H] *PP
--
[H] COMPONENT TO PERUSE: D60SPL
------
[PCO LIMIT FOR 'D60SPL' IS 15]
[H] PCO #:<CR>
----
[20-D60SPL-015]
DATE: 09-JUL-79 BY: BENCE
VLD:
[SYMPTOM]
Jobs sent to the LPT queue from D60SPL are given a random
file name and are billed to OPERATOR.
NEXT?<CR>
----
[20-D60SPL-014]
DATE: 09-JUL-79 BY: WEISBACH
VLD:
[SYMPTOM]
If the spooler is pausing, typing a GO can result in an
illegal instruction.
NEXT? ALL
---
DO YOU WANT A FILE? NO
--
PCO 015 SPR 12355 (6,022) KEY= LNAME BENCE 09-JUL-79
PCO 014 SPR 12225 OUTOUT (6,020) KEY= PAUSE WEISBACH 09-JUL-79
PCO 013 SPR 11660 LODVFU 6013(6,014) KEY= VFU WEISBACH 09-JUL-79
PCO 012 SPR 13244 D60CRE 103 (6,032) KEY= CARD L.NEFF 06-JUL-79
PCO 011 SPR D60CR4 103 (6,015) KEY= CARDS L.NEFF 03-JUL-79
PCO 010 SPR REQUEU 103 (6,030) KEY= CTQMFQ L.NEFF 14-JUN-79
PCO 009 SPR 12588 INTCTC 1 (6,026) KEY= CONTROL C TEEGARDEN 17-MAY-79
PCO 008 SPR 12881 OUTE.6 103 (6,025) KEY= REQUEUE NEFF 17-APR-79
PCO 007 SPR 12139 103 (6,019) KEY= ILLEGAL WEISBACH 27-OCT-78
PCO 006 SPR 12005 (0) KEY= SIMULTANEO BENCE 22-SEP-78
PCO 005 SPR 11672 ENDJOB 103 (6,018) KEY= QUASAR BENCE 18-SEP-78
PCO 004 SPR 11841 D60STK 103 (6,016) KEY= BAD WEISBACH 23-AUG-78
PCO 003 SPR 11476 TTYOUT 103 (6,010) KEY= OVERWRITE WEISBACH 12-MAY-78
PCO 002 SPR 11431 OUTE.6 (6,007) KEY= INTERRUPTS WEISBACH 12-APR-78
PCO 001 SPR 11456 D60SPL (6,006) KEY= BLANK WEISBACH 03-APR-78
[H] PCO #: RETURN
------
[H] *GP
--
[H] PCO #: 20-D60SPL-8
[20-D60SPL-008 RETRIEVED]
PROG: NEFF
COMPONENT: D60SPL
SER/SPR:20-12881
KEYS: REQUEUE /
ROUTNS: OUTE.6 /
VLD: 103(2304)
SBE %103 (6,024)
SAE %103 (6,025)
CRIT: N
DOC: N
F/D: F
TEST FILE: : [ ]
P-IND: 10
SHOW WRITE-UP FILE? YES
---
[WRITE-UP FILE]
008 NEFF
[SYMPTOM]
If a job is requeued because of a communications failure, with
D60SPL reporting that the station has signed off, then, when the
station signs on again, the print file will be restarted from its
beginning, not from the last checkpoint.
[DIAGNOSIS]
When the error is detected, routine OUTE.6 calls IBACK to
backspace the file five pages. IBACK zeroes the page counter,
J$RNPP(J), and rewinds the file, in the belief that the forward
spacing code will update the page count as it skips to the correct
page. However, D60SPL discovers the error is not recoverable and it
requeues the job immediately. Since the page count is never updated,
DOREQ requeues the job to start at the beginning of the file.
[CURE]
Preserve the page at which to resume printing over the call to
IBACK. if the job is to be requeued immediately, restore J$RNPP(J) so
that the job will be requeued and checkpointed five pages back from
its current position.
[FILCOM]
File 1) DSK:D60SPL.MAC[4,1022] created: 1724 09-Apr-1979
File 2) DSK:D60SPL.MAC[4,417] created: 1625 10-Apr-1979
1)1 LPTEDT==6024 ;EDIT LEVEL
1) LPTWHO==1 ;WHO LAST PATCHED
****
2)1 LPTEDT==6025 ;EDIT LEVEL
2) LPTWHO==1 ;WHO LAST PATCHED
**************
1)4 ;*****End of Revision History*****
****
2)4 ;6025 If a job printing on a remote printer is interruped by
2) ; a communications failure, requeue to start five pages back
2) ; instead of at beginning of file. LLN, SPR # 20-12881,
2) ; 10-APR-79
2) ;*****End of Revision History*****
**************
1)179 PUSHJ P,IBACK ;BACKSPACE THE FILE
1) PUSHJ P,INTON ;[6007]TURN INTERRUPTS BACK ON
1) PUSHJ P,D60NRY ;PERFORM "NOT READY" DIALOG
1) JRST OUTE.7 ;ERROR IS UNRECOVERABLE
1) TELL OPR,[ASCIZ /![LPT... continueing!]
****
2)179 ;**;[6025] ADD SEVERAL LINES AT OUTE.6 + 13L. LLN, 10-APR-79
2) MOVE T1,J$RNPP(J) ;[6025] CALCULATE THE NEW
2) SUB T1,N ;[6025] DESTINATION PAGE
2) PUSH P,T1 ;[6025] AND SAVE IT
2) PUSHJ P,IBACK ;BACKSPACE THE FILE
2) PUSHJ P,INTON ;[6007]TURN INTERRUPTS BACK ON
2) PUSHJ P,D60NRY ;PERFORM "NOT READY" DIALOG
2) JRST [POP P,J$RNPP(J) ;[6025] RESTORE PAGE NO. FOR REQUEUE
2) JRST OUTE.7] ;[6025] ERROR IS UNRECOVERABLE
2) POP P,(P) ;[6025] THROW AWAY DESTINATION
2) ;[6025] PAGE - FORWARD SPACING
2) ;[6025] CODE WILL HANDLE IT
2) TELL OPR,[ASCIZ /![LPT... continueing!]
**************
[END OF WRITE-UP FILE]
[H] *EX
--
EXIT
.
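The diagnosis and cure in the write-up above can be restated as a small
model. The following Python sketch is not D60SPL code: the Job class and
function names are invented for illustration, pages_printed stands in for
J$RNPP(J), and the accumulator N is assumed to hold the five-page
backspace count that the write-up describes.

```python
from dataclasses import dataclass


@dataclass
class Job:
    pages_printed: int  # plays the role of J$RNPP(J)


def iback(job: Job) -> None:
    """Model of IBACK: zero the page count and rewind the file, relying on
    the forward-spacing code to count back up to the right page."""
    job.pages_printed = 0


def page_for_requeue(job: Job, backspace: int, with_edit_6025: bool) -> int:
    """Page count that DOREQ sees when the error proves unrecoverable."""
    destination = job.pages_printed - backspace  # MOVE T1,J$RNPP(J) / SUB T1,N
    iback(job)
    # D60NRY takes the non-skip return (unrecoverable), so the job is
    # requeued at once and the forward-spacing code never runs.
    if with_edit_6025:
        job.pages_printed = destination          # POP P,J$RNPP(J)
    return job.pages_printed


print(page_for_requeue(Job(42), 5, with_edit_6025=False))  # 0
print(page_for_requeue(Job(42), 5, with_edit_6025=True))   # 37
```

Like the patched code as listed, the sketch does not check for a current
page smaller than the backspace count.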
MAPPING DIRECTORIES IN MDDT
---------------------------
Release 3 of TOPS-20 can take advantage of the extended
addressing features of the model B processor. Some of its data has
been reorganized and moved into non-zero sections of the addressing
space. One of the things moved was directories. Directories are now
mapped into section 2, starting at the beginning of the section. Thus
the old procedure of reading a user's directory in MDDT is no longer
valid. This section describes how to map a directory correctly for
release 2 and for releases 3, 3A, and 4.
The procedure for release 2 was the following. You first have to
find out the structure number and directory number for the directory
to be mapped. You can use the TRANSL program to get the directory
number, or use the ^EPRINT command to list the directory information.
As an example, suppose you want to find the directory and structure
information for the directory SNARK:<CURDS>. You run TRANSL and
obtain the results:
@TRANSL SNARK:<CURDS>
SNARK:<CURDS> (IS) SNARK:[4,117]
The "programmer number" obtained is the directory number, in octal.
In this example, the directory number is 117. If the directory is in
bad shape, and you can't run TRANSL or use ^EPRINT, you will have to
find out the directory number by looking at the output from a DLUSER
or ULIST run, or from BUGCHK output.
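Because TRANSL reports the directory number as an octal "programmer
number", it is easy to misread. The Python sketch below assumes the
result line has exactly the shape shown in the example above; it pulls
out the second bracketed number and interprets it as octal. The function
name directory_number is hypothetical.

```python
import re


def directory_number(transl_line: str) -> int:
    """Return the directory number from a TRANSL result line.

    Assumes the line ends with a [project,programmer] pair, for example
    'SNARK:<CURDS> (IS) SNARK:[4,117]'. The programmer number is the
    directory number, written in octal.
    """
    match = re.search(r"\[(\d+),(\d+)\]", transl_line)
    if match is None:
        raise ValueError(f"no [project,programmer] pair in {transl_line!r}")
    return int(match.group(2), 8)


number = directory_number("SNARK:<CURDS> (IS) SNARK:[4,117]")
print(oct(number), number)  # 0o117 79
```

In MDDT you still type the number as 117, since DDT reads numbers in
octal by default.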
To find the structure number, you have to work harder. If the
structure is mounted as PS:, its structure number is always 0. For
structures mounted other than PS:, you do the following. You get into
MDDT, and look at the table STRTAB. This table contains all of the
addresses of the structure data blocks in the system. The first word
of each structure data block is the structure name in SIXBIT. So you
search the table for the desired structure. The offset into
the table STRTAB is then the structure number. For our example:
@ENABLE
$SDDT
DDT
JSYS 777$X
MDDT
$$6T
STRTAB/ ,8[ / PS
STRTAB+1/ M^I / REL3
STRTAB+2/ M_% / SNARK
In the example above, you see that PS: is the first structure,
followed by the structures REL3: and SNARK:. Since the offset into
STRTAB was 2 for SNARK:, the structure number you want is 2.
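The STRTAB search can be pictured with a small Python model. This is a
sketch under stated assumptions, not monitor code: STRTAB is modeled as a
list of structure data block addresses, memory as a dictionary from
address to 36-bit word, a zero entry is assumed to mark an unused slot,
and the addresses in the usage lines are invented. The SIXBIT encoding
itself (ASCII minus 40 octal, six 6-bit characters per word,
left-justified and blank-padded) is the standard PDP-10 one.

```python
def to_sixbit(name: str) -> int:
    """Pack up to six characters into a 36-bit SIXBIT word, left-justified."""
    name = name.upper()
    if len(name) > 6:
        raise ValueError("a SIXBIT word holds at most six characters")
    word = 0
    for ch in name.ljust(6):       # trailing blanks are SIXBIT code 0
        code = ord(ch) - 0o40      # SIXBIT code = ASCII code minus 40 octal
        if not 0 <= code <= 0o77:
            raise ValueError(f"{ch!r} has no SIXBIT code")
        word = (word << 6) | code
    return word


def from_sixbit(word: int) -> str:
    """Unpack a 36-bit SIXBIT word, as DDT's $6T typeout mode shows it."""
    chars = [chr(((word >> shift) & 0o77) + 0o40) for shift in range(30, -1, -6)]
    return "".join(chars).rstrip()


def structure_number(strtab, memory, name):
    """Return the STRTAB offset whose data block starts with `name`, or None.

    strtab: list of structure data block addresses (hypothetical model).
    memory: dict mapping an address to the 36-bit word stored there.
    """
    target = to_sixbit(name)
    for offset, sdb_address in enumerate(strtab):
        if sdb_address != 0 and memory.get(sdb_address) == target:
            return offset
    return None


# Invented addresses; only the order PS, REL3, SNARK follows the example.
strtab = [0o500100, 0o500300, 0o500500]
memory = {addr: to_sixbit(n) for addr, n in zip(strtab, ["PS", "REL3", "SNARK"])}
print(structure_number(strtab, memory, "SNARK"))  # 2
```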
Knowing the structure number and the directory number, you can
now map the directory and look at it. When the directory is mapped,
location DIRORA will point to the area in the monitor where it can be
found. This is currently the address 740000. To save typing, you can
use the symbol DA, which has the value 740000 (none of the examples
here uses this symbol however). To map the directory, you call the
routine MAPDIR which is in the module DIRECT. It takes two arguments.
The directory number goes in AC1, and the structure number goes in
AC2. For our example, the output looks like:
DIRORA[ 740000
740000/ ?
1! 117
2! 2
CALL MAPDIR$X
$$
740000[ 400300,,100
The skip return from MAPDIR means you have successfully mapped the
directory. You can now look at the whole directory by examining the
proper locations. The number of pages that are mapped by MAPDIR is
30, which is the length of a directory, so the whole thing is
available to look at. By examining or changing location 740000+N in
core, you are examining or changing location N of the directory. When
you are finished, you can just leave MDDT by jumping to MRETN or by
typing ^C.
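The address arithmetic is simple but easy to get wrong, because DDT
works in octal. The Python sketch below assumes that the page count of
30 is octal (DDT's default radix, as in the MSETMP example later in this
section) and that a TOPS-20 page is 512 (1000 octal) words. The constant
and function names are invented for illustration.

```python
PAGE_SIZE = 0o1000       # 512 words per page
DIRORA_BASE = 0o740000   # DIRORA after MAPDIR on a release 2 style monitor
DIR_PAGES = 0o30         # pages mapped by MAPDIR, assumed octal (24 decimal)


def core_address(directory_offset: int) -> int:
    """Monitor address that holds word `directory_offset` of the directory."""
    if not 0 <= directory_offset < DIR_PAGES * PAGE_SIZE:
        raise ValueError("offset lies outside the mapped directory")
    return DIRORA_BASE + directory_offset


print(oct(core_address(0o100)))                       # 0o740100
print(oct(core_address(DIR_PAGES * PAGE_SIZE - 1)))   # 0o767777
```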
In release 3, however, when you examine location DIRORA after
calling MAPDIR, it doesn't have to contain 740000. If it does, then
your machine cannot support extended addressing and the monitor is
running the same as release 2 did. In this case you can ignore the
rest of this section. If your machine does have extended addressing,
when you examine location DIRORA you will see the number 2,,0. This
address is now in section 2 of the monitor, and MDDT cannot read the
data there directly. If you look at the location 740000 after calling
MAPDIR, it will still be unreadable, since the directory is no longer
read in there. Those pages are now unused.
To be able to read the directory now, you have to tell the
monitor to map in the pages where you can see them with MDDT. The
first step is to examine the location DRMAP. This location is the
section pointer for section 2, where the directories are mapped. This
is a share-type pointer, which contains the OFN for the desired
directory in the right half. This number is one of the arguments for
the MSETMP routine. MSETMP takes the following arguments. AC1
contains the OFN in the left half, and the first page number to be
mapped in the right half. AC2 contains flag bits in the left half,
and the address where you want to map the pages in the right half.
AC3 contains the number of pages to be mapped. For mapping
directories, you can use 740000 as the address, and you want to map 30
pages. You also want to set flag bits so that the directory can be
changed. For the example, you do the following:
DRMAP[ 224000,,147
1! 147,,0
2! 140000,,740000
3! 30
CALL MSETMP$X
$
After the call to MSETMP, the directory is now mapped in 740000, and
you can proceed as you used to in release 2. When you are finished
with the directory, you should call MSETMP again to unmap the
directory. This is done by supplying the same arguments as before,
except that AC1 contains zero. As an example:
1! 0
2! 140000,,740000
3! 30
CALL MSETMP$X
$
Now you can simply ^C out of MDDT or jump to MRETN.
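The MSETMP arguments are all halfword pairs, which DDT writes as
left,,right. The Python sketch below shows how the values in the example
are assembled from DRMAP. The helper names are hypothetical, and the flag
value 140000 is simply copied from the example rather than decoded.

```python
HALF = 0o777777  # an 18-bit halfword mask


def halfwords(left: int, right: int) -> int:
    """Build the 36-bit word that DDT shows as left,,right."""
    if left & ~HALF or right & ~HALF:
        raise ValueError("each half must fit in 18 bits")
    return (left << 18) | right


def right_half(word: int) -> int:
    return word & HALF


def msetmp_args(drmap, first_page, flags, address, count, unmap=False):
    """Return (AC1, AC2, AC3) for a call to MSETMP.

    The OFN is taken from the right half of the DRMAP share pointer.
    To unmap, AC1 is zero and AC2 and AC3 are the same as for mapping.
    """
    ofn = right_half(drmap)
    ac1 = 0 if unmap else halfwords(ofn, first_page)
    return ac1, halfwords(flags, address), count


drmap = halfwords(0o224000, 0o147)          # DRMAP[ 224000,,147
mapping = msetmp_args(drmap, 0, 0o140000, 0o740000, 0o30)
unmapping = msetmp_args(drmap, 0, 0o140000, 0o740000, 0o30, unmap=True)
print([oct(ac) for ac in mapping])  # ['0o147000000', '0o140000740000', '0o30']
```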
For Release 4 of TOPS-20, the various flavors of DDT have been
trained to understand extended addresses, so the mapping contortions
used for 3 and 3A are once again unnecessary. On extended machines
one can reference section two directly as below:
DIRORA[ 2,,0
2,,0[ 400300,,100
When done, you can still just ^C out or jump to MRETN.
RECOVERING FROM DIRECTORY ERRORS
Sometimes after a monitor crash due to disk problems, some of the
directories on the system will contain errors. These errors cause
BUGCHKs such as DIRFDB, NAMBAD, DIRPG0, and DIRPG1. It is sometimes
possible to find the error in the directory by getting into MDDT,
mapping the directory, finding what is wrong, and fixing it. This
procedure is described in the SWSKIT. However, this is not always
easy, and may take a lot of time. It is therefore better in many
cases to simply delete the bad directory and recreate it. This is
easy to do for most directories. But special procedures are necessary
for the directories <SYSTEM> and <SUBSYS>. The rest of this memo will
describe the methods of recovering from bad directories, handling in
particular the difficult case of the <SYSTEM> directory.
You can first try the EXPUNGE command with the REBUILD
and PURGE subcommands. If the problem with the directory is very
simple, it may fix your problem. As an example, suppose the directory
PS:<SICK-DIRECTORY> is incorrect. You would type:
$EXPUNGE (DIRECTORY) PS:<SICK-DIRECTORY>,
$$REBUILD (SYMBOL TABLE)
$$PURGE (NOT COMPLETELY CREATED FILES)
$$
PS:<SICK-DIRECTORY> [NO PAGES FREED]
$
If this does not help the problem, you will have to delete the
directory and then recreate it. Before proceeding, you should make
sure that any files you can reference are copied to another directory,
or else are saved on tape. Now first try to delete the directory
normally, as follows:
$BUILD (USER) PS:<SICK-DIRECTORY>
[OLD]
$$KILL
[CONFIRM]
$$
$
If this is successful, then simply recreate the directory again,
and restore the user's files. You should recreate the directory with
the same directory number as it had before, so that DLUSER's data will
still be correct.
The procedure above will fail if either the directory is mapped
by another job, or if it is totally unusable. If it is mapped and
the directory belongs to an ordinary user, you can wait until it is no
longer in use, or you can take the system stand-alone so that no user
can reference it.
If the directory is totally unusable, you will then have to try
to delete it the hard way. Before proceeding, you should try to
delete and expunge all files in the directory. This will minimize the
number of lost pages that will result. There are two cases to
consider. If the directory is not a sub-directory, you type the
following:
$DELETE (FILE) PS:<ROOT-DIRECTORY>SICK-DIRECTORY.DIRECTORY,
$$DIRECTORY (AND "FORGET" FILE SPACE)
$$
<ROOT-DIRECTORY>SICK-DIRECTORY.DIRECTORY.1 [OK]
$
If the directory is a subdirectory, you modify the above command
by replacing "ROOT-DIRECTORY" by the name of the next higher
directory. Thus if the directory was PS:<ANOTHER.BAD-ONE>, you type:
$DELETE (FILE) PS:<ANOTHER>BAD-ONE.DIRECTORY,
$$DIRECTORY (AND "FORGET" FILE SPACE)
$$
<ANOTHER>BAD-ONE.DIRECTORY.1 [OK]
$
The above procedure tells the monitor to treat the directory file
like a normal file, and to delete it as such. This means that any
files in the directory will become "lost". The disk pages can be
recovered later with CHECKD. If the above works, you can simply
recreate the directory and restore the files.
The only reason the above command should fail is if the directory
is still mapped. For PS:<SUBSYS>, you can bring up the system
stand-alone so that no programs are run from it, and then delete it.
For PS:<SYSTEM>, even taking the system stand-alone will not help, for
it is always mapped by job 0. But there are two procedures you can
use which do work.
     The safest method can be used if the user's system has mountable
structures. If you have built another PS: structure, you can mount
the pack containing the bad directory under an alias (SICK: in the
example below); the directory will then not be mapped and can be
deleted. As an example:
$SMOUNT (FILE STRUCTURE) SICK:,
$$STRUCTURE-ID (IS) PS:
$$
WAITING FOR STRUCTURE SICK: TO BE PUT ON LINE...
STRUCTURE SICK: MOUNTED
$
$DELETE (FILES) SICK:<ROOT-DIRECTORY>SYSTEM.DIRECTORY,
$$DIRECTORY (AND "FORGET" FILE SPACE)
$$
SICK:<ROOT-DIRECTORY>SYSTEM.DIRECTORY.1 [OK]
$
Then you can build the new directory, restore the files to it,
and then use it again for your normal PS: pack. Be sure to build the
new directory with the same number. This is especially important for
the special system directories.
If you do not have another disk drive or another PS: disk, or if
you don't want to bother SMOUNTing the disk, you can fix the <SYSTEM>
area by using MDDT. The basic idea is to patch the monitor so that it
no longer thinks that the directory is in use. This is done as
follows:
$^EQUIT
INTERRUPT AT 17117
MX>/MDDT
CHKOFN/ JSP CX,.SAVE JRST RSKP
MRETN$G
$
Then you should have no problems deleting the directory.
Immediately after doing the delete, you should reload the system.
When the system restarts, you can read the monitor and the EXEC either
from the distribution magtape or from another directory where you had
kept copies. Then recreate the <SYSTEM> area, making sure to give it
the same directory number as it had before. Then you can restore the
files and let the users back on. Finally, you should run CHECKD
sometime to recover the lost pages.
MORE ABOUT DIRECTORY PROBLEMS
=============================
SOME HINTS FOR TRACING DIRECTORY PROBLEMS
     NOTE -- Use the methods documented in the Operator's
     Guide before resorting to the methods below.
1. There is a file on the SWSKIT called DIRTST.EXE which will
test for inconsistencies in the directory pointers.
@ENABLE
$RU DIRTST
This will tell you just about everything.
 2. Another program on the SWSKIT is DIRPNT, which prints out the
    contents of the chained FDBs, the entire directory, a single
    FDB, or the symbol table.
To run it:
@ENABLE
$RU DIRPNT
And answer the questions. This also may not work if the
headers are bad.
 3. If you get a BUGCHK:
    Go into the monitor with MDDT and set a breakpoint at the
    BUGCHK address, say, FDBBAD. Then perform the operation that
    causes the BUGCHK (a DIRECTORY command, say) and trace down
    the bug. The relevant listings are PROLOG and DIRECT. These
    give the directory format and useful symbols.
4. If the pointers are destroyed or confused you can map in the
directory as follows:
@ENA
$^EQUIT ; get into MINI-EXEC
MX>/ ; get into MDDT
; Map in the directory: put the directory number in AC1.
; Get the directory number from DLUSER or TRANSL. Format
; is [4,directory#]. Put the structure number in AC2.
; To find the structure number, look at the table
; STRTAB. STRTAB contains a list of pointers to the
; SDBs of the structures that are mounted. The structure
; number is the offset into STRTAB. To find out which
; structure has structure number 3, look at STRTAB+3,
; then examine the location it points to, which holds
; the SIXBIT structure name.
STRTAB/ 54321 ; str number 0
STRTAB+1/ 56776 ; str no 1
STRTAB+2/ 12345 ; str no 2
12345$6T/ FOO ; str no 2 is FOO:
1/ DIRECTORY NUMBER
2/ STR NUMBER
CALL MAPDIR$X
; Now you can look at the header pointers etc., and
; fix things up if you're lucky. Go back to the
; MINI-EXEC.
^P
MX>START
$
5. If you can't (or don't want to) recover the existing files
you can delete the directory and restore the files using a
DUMPER tape. This works for <SYSTEM> and all other
directories.
In order to delete a directory you must remove it from
<ROOT-DIRECTORY> (or next higher-level directory).
You can do this with the
following set of commands:
(first be sure nothing is mapped from this
directory)
@ENA
$DELETE<ROOT-DIRECTORY>DIRECTORYNAME.*.,
$$DIRECTORY
$$
Create a new directory with the same directory number. Using the same
number is important for the special system directories.
$^ECREATE <DIRECTORYNAME>
[New]
$$NUMBER nn
.
.
.
Now DUMPER the files back.
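The STRTAB lookup in step 4, and the $6T typeout it relies on, amount to decoding SIXBIT: six 6-bit characters per 36-bit word, each offset by octal 40 from ASCII. The following is a minimal Python sketch for working the lookup out on paper, not a monitor routine. STRTAB and memory are modelled as a Python list and dict; the addresses are the example's, the names PS and SICK for structures 0 and 1 are made up, and an entry of 0 is assumed to mean an unused slot.

```python
from typing import Optional


def sixbit_to_str(word: int) -> str:
    """Decode a 36-bit SIXBIT word; the leftmost character is in the top 6 bits."""
    chars = []
    for shift in range(30, -6, -6):        # 30, 24, 18, 12, 6, 0
        code = (word >> shift) & 0o77
        chars.append(chr(code + 0o40))     # SIXBIT 0 is ASCII space (octal 40)
    return "".join(chars).rstrip()


def str_to_sixbit(text: str) -> int:
    """Encode up to six characters as a left-justified SIXBIT word."""
    word = 0
    for ch in text.upper()[:6].ljust(6):
        code = ord(ch) - 0o40
        if not 0 <= code <= 0o77:
            raise ValueError(f"{ch!r} has no SIXBIT representation")
        word = (word << 6) | code
    return word


def structure_number(strtab: list, memory: dict, name: str) -> Optional[int]:
    """Return the STRTAB offset whose SDB word holds `name` in SIXBIT, else None.

    strtab: SDB addresses indexed by structure number (0 = unused slot, assumed).
    memory: address -> 36-bit word, standing in for what MDDT displays.
    """
    for number, sdb in enumerate(strtab):
        if sdb and sixbit_to_str(memory.get(sdb, 0)) == name.upper():
            return number
    return None


if __name__ == "__main__":
    strtab = [0o54321, 0o56776, 0o12345]
    memory = {0o54321: str_to_sixbit("PS"),
              0o56776: str_to_sixbit("SICK"),
              0o12345: str_to_sixbit("FOO")}
    print(structure_number(strtab, memory, "FOO"))  # 2, as in the example
```

The structure number found this way is what goes into AC2 before CALL MAPDIR$X.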
JSB AND PSB MAPPING
An Easy Way to Examine the PSB and JSB of Another Job
-----------------------------------------------------
     Occasionally you need to look in detail at the state of another
job on the system. A common reason for doing this is to find
the cause and cure of a "hung job" which cannot be logged out. To
find out what the job is doing you usually start by looking at the
JSYS stack in the PSB. But you cannot examine such data easily
because the fork data in the PSB and the job data in the JSB are not
in the monitor's address space until the fork is run. If you try to
look at the PSB or JSB using MDDT you will see the data for your own
fork. To look at the data for another fork you must do what the
monitor does, and that is to map it.
A procedure for doing the mapping of a PSB or JSB was given in
the release 3 and 3A SWSKITs. You first find the SPT index of the PSB
or JSB you want to map, then you call SETMPG or MSETMP to set up
pointers to the data, and then you can examine it. But there are
several problems in using that method, which are:
1. You have to find an empty set of pages in the monitor's
address space which can be used for mapping.
2. There is not enough room to map all of the PSB and JSB. So
if you want to examine many different things you have to do
the mapping many times.
3. The routines SETMPG and MSETMP do no validity checking of
their arguments. Thus if you feed them bad data the system
    will probably crash. So if you need to map things many times,
    chances are you will eventually make a mistake.
4. The addresses of the data are not correct. To look at PPC
for example, you can't just examine location PPC (which would
be for your own fork). You have to look in the page you are
using for mapping. So every reference has to be offset by
some constant.
5. When you are done looking at the fork, you can't simply leave
MDDT. You have to call SETMPG or MSETMP again to unmap the
data.
Since that documentation was written I have found a procedure
which is much easier. It eliminates almost all of the above problems.
The procedure is this:
1. Do a "GET" of the file the monitor was loaded from, usually
SYSTEM:MONITR.EXE.
2. Enter user mode DDT in the file you got, and then do a JSYS
777 to get into MDDT.
3. Find out the SPT indexes as before, and call MSETMP to map
the PSB or JSB to the USER address space, in the correct
place!!
4. Return from MDDT, and examine PSB and JSB locations directly,
and see the correct data in the right place.
5. When you are done, just ^C and do a RESET.
     The rest of this article shows step by step how the procedure
above is done, using an example. Assume that we wish to examine the
state of fork 105, which belongs to job 21. We then begin:
@ENABLE !Get a copy of the monitor
$GET PS:<SYSTEM>MONITR.EXE
$START 140 !Get into user DDT
DDT
JSYS 777$X !Enter MDDT
MDDT
!Following is an example of the procedure to map the JSB of a job:
FKJOB+105[ 25,,2035 !Get the SPT index of the JSB
!of fork 105
T1! 2035,,0 !Put SPT index in left half
T2! 540000,,JSBPGA !* Flags and where to map to
T3! JSLSTA'1000-JSBPGA'1000 !Number of pages to map
CALL MSETMP$X !Do the mapping
$
!Following is an example of the procedure to map the PSB of a fork:
FKPGS+105[ 2657,,2332 !Get the SPT index of the PSB
!of fork 105
T1! 2332,,PSBMAP-PSBPGA !Put SPT index in left half,
!and offset in right half
T2! 540000,,PSSPSA !* Flags and where to map to
T3! PSBMSZ !Number of pages to map
CALL MSETMP$X !Do the mapping
$
!Example of returning to user mode and looking at data from both
!the PSB and the JSB of the fork:
MRETN$G !Return to user mode
$
USRNAM[ 3 !Examine job's user name
USRNAM+1[ 422050,,546230 $T;DBELL
CTRLTT[ 777777,,777777 !Controlling terminal
FILBYT+MLJFN[ 4400,,334010 !Start of data block for JFN 1
PPC/ T1,,DISXE#+2 !Current PC of the fork
PAC+17/ -215,,UPDL+62 !Current stack pointer
UPDL/ CHKHO5# !First few stack locations
UPDL+1/ CAM CHKAE0#+12
UPDL+2/ CHKHO5#
UPDL+3/ CAM CHKAE0#+12
UPDL+4/ T1,,.COMND+1
UPDL+5/ -273,,UPDL+4
!Example of terminating the mapping we have done:
^C
$RESET !To finish, just quit and reset
$
The procedure as given above maps the JSB and PSB write-enabled.
So if you find something you want to change, you can simply deposit
the new value into the location. If you want the data to be
write-protected, then change the 540000 to 500000 in the two steps
marked with an asterisk.
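The three AC values loaded before each CALL MSETMP$X follow one pattern, sketched here in Python as a way to double-check what you are about to type. The SPT index 2035 is the one from the example; the JSBPGA and JSLSTA addresses in the test are hypothetical, so look up the real symbol values in your monitor. DDT's ' operator is integer division, which // models. The sketch only packs halfwords; it attaches no meaning to the flag bits beyond the 540000/500000 choice described above.

```python
HALF = 0o777777  # one 18-bit halfword


def halves(left: int, right: int) -> int:
    """Build the 36-bit word that DDT types as left,,right."""
    return ((left & HALF) << 18) | (right & HALF)


def page_count(start: int, end: int) -> int:
    """Model of DDT's end'1000-start'1000 (number of 512-word pages)."""
    return end // 0o1000 - start // 0o1000


def msetmp_args(spt_index: int, offset: int, target: int, npages: int,
                write: bool = True) -> tuple:
    """Return (T1, T2, T3) as set up before CALL MSETMP$X in the example.

    540000 maps write-enabled, 500000 write-protected, per the text above.
    """
    flags = 0o540000 if write else 0o500000
    return halves(spt_index, offset), halves(flags, target), npages
```

For the JSB case, the offset is 0; for the PSB case it is PSBMAP-PSBPGA, exactly as typed into T1 in the example.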
Warning: The procedure of mapping things into your user address
space has its limitations. Mapping the JSB and PSB works because the
user core used for mapping was previously empty. In general, you can
only map things into your user core if your core pages are either
nonexistent or are private. If you call MSETMP or SETMPG and map
something over a shared page, the old file page is unmapped without
the share counts being updated, which prevents your job from logging
out later. To get around this problem you can BLT your core image to
force all of the pages to be private.
BREAKPOINTING MULTI-USER CODE
HOW TO USE BREAKPOINTS IN CODE THAT MANY USERS EXECUTE
------------------------------------------------------
When inserting a breakpoint into the running monitor, you have to
be careful that no other users will execute the code containing the
breakpoint. If some other user hits the breakpoint, they will blow up
with an illegal instruction since MDDT will not be there to handle the
breakpoint. This normally limits the places you can set breakpoints,
since most of the monitor can be reached by any user. Even if you
run the system stand-alone, it is possible that the routine you are
debugging will be called by job 0. However, it is still possible to
do such debugging, even on a system which is not stand-alone, and this
document will describe how this is done.
The essential element of this technique is to put in the patch in
such a way that only your own fork can ever reach the breakpoint.
First you write a simple routine which will skip if it is not being
run by your particular fork. This can be done easily if you remember
that the location FORKX contains the currently running fork number.
An example of such a routine is the following:
@ENABLE
$SDDT
DDT
JSYS 777$X
MDDT
FORKX[ 23 ; check our fork number
FFF/ 0 NOTME: PUSH P,T1 ; save an AC
NOTME+1/ 0 MOVE T1,FORKX ; get currently running fork number
NOTME+2/ 0 CAIE T1,23 ; is it us=23?
NOTME+3/ 0 AOS -1(P) ; no, setup skip return
NOTME+4/ 0 POP P,T1 ; restore the saved AC
NOTME+5/ 0 POPJ P, ; and return to caller
NOTME+6/ 0 FFF: ; reset the position of FFF
The routine above simply saves AC T1, gets the currently running fork
number, compares it with your own fork number which you obtained by
looking at location FORKX, and skips if they differ.
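Purely as a model of the control flow (this is not anything you load into the monitor), the effect of NOTME plus the patch that follows can be written in Python: every fork passes through the patched location, but only the fork whose number you read from FORKX falls through to the breakpointed JFCL. Fork 23 is the example's value; the function names are hypothetical.

```python
def notme(current_fork: int, my_fork: int) -> bool:
    """Model of NOTME: True stands for the skip return (AOS -1(P))."""
    return current_fork != my_fork


def patched_site(current_fork: int, my_fork: int, hits: list) -> None:
    """One pass through the patched code for the given running fork.

    PUSHJ P,NOTME skips over the JFCL carrying the breakpoint for every
    fork except ours; the displaced original instruction then runs for all.
    """
    if not notme(current_fork, my_fork):
        hits.append(current_fork)  # breakpointed JFCL: reached only by our fork
    # ... the displaced instruction (e.g. JUMPGE B,BLKSCE) executes here for all
```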
Now assume that you want to set a breakpoint into the following
code, which is in the routine BLKSCN in the module DIRECT.
BLKSC2/ HLRZ C,BLKTAB(B)
BLKSC2+1/ CAME A,C
BLKSC2+2/ AOBJN B,BLKSC2
BLKSC2+3/ JUMPGE B,BLKSCE
BLKSC2+4/ HRRZ B,BLKTAB(B)
Assume you want the breakpoint at location BLKSC2+3. You do the
following:
BLKSC2+3/ JUMPGE B,BLKSCE FFF$< ; patch this location
FFF/ 0 PUSHJ P,NOTME ; call the NOTME routine
FFF+1/ 0 .$B JFCL$> ; me if it gets here, set breakpoint
FFF+2/ JUMPGE B,BLKSCE
FFF+3/ JUMPA A,BLKSC2+4
FFF+4/ JUMPA B,BLKSC2+5
BLKSC2+3/ JUMPA NOTME+6
Notice that the breakpoint has been set in the JFCL instruction
following the call to NOTME. Only your fork will execute it, so you
can now debug the section of code while other users are executing it
at the same time. Remember to remove the breakpoint when you are
done.
To run a particular program while having breakpoints set, you
must remember that the breakpoint has to be set by the same process
which you expect to hit it. So for example, typing ^EQUIT, setting a
breakpoint, returning to the EXEC and running your program will not
work. You must enter MDDT and set the breakpoints from within the
program you want to debug. As an example:
@ENABLE
$GET PROGRAM ; get the program to be used
$DDT ; enter DDT
DDT
JSYS 777$X ; and enter MDDT from there
MDDT
(PUT IN "NOTME" ROUTINE AND SET BREAKPOINTS HERE)
MRETN$G ; return to the context of the test program
$
$G ; start the test program
Using Address Break to Debug the Monitor
----------------------------------------
     Sometimes when examining a set of dumps, you will notice that the
crashes are caused by some location being destroyed. If you have no
idea where the destruction is done from, finding the problem could be
very difficult. One useful procedure in such cases is to use the
address break feature of the hardware to track down the problem
(except on 2020s, which lack the feature). The only problem is that
the use of address break is not obvious. This article describes how
to use address break in the TOPS-20 monitor.
In order to use address break, four things must be done. First,
the current routines the monitor uses to set address breaks for users
must be disabled. Secondly, your own address break must be set from
MDDT or EDDT. Thirdly, instructions which you want to execute
properly have to be modified so that they will not cause an unwanted
address break. Finally, breakpoints must be placed in the monitor so
that the state of the monitor can be examined when the address break
occurs. The following is a step by step example of doing this.
1. Load the monitor for debugging, and enter EDDT. The procedure
starting from BOOT is the following:
BOOT>/L ;Load monitor but don't start it
BOOT>/G140 ;Start EDDT
EDDT
DBUGSW/ 0 2 ;Set debugging mode
EDDTF/ 0 1 ;Keep EDDT once system starts
GOTSWM$B ;Install useful breakpoint
SYSGO1$G ;Start the monitor
[PS MOUNTED]
$1B>>GOTSWM 0$1B ;Remove breakpoint now
2. Disable the monitor's normal changing of the address break.
This is currently done at two places:
KISSAV+4/ DATAO UNPFG1+26 JFCL ;Disable instruction
SETBRK+12/ DATAO A JFCL ;Here too
3. Set your own address break at the desired location. Refer to
the Hardware Reference Manual for details. The instruction to
set an address break is:
DATAO APR,ADDR ;Note: APR = 0
where ADDR contains the following fields:
Bits Description
---- -----------
9 Break at given address on instruction fetches
10 Break at given address on reads
11 Break at given address on writes
12 0=exec address space, 1=user address space
13-35 Address to break on.
So now assume you want to catch a bug which is blasting
location CURDS. You want to break only for writes, and want
to use exec virtual space. Therefore you type the following:
FFF/ 0 100000000+CURDS ;Put data in convenient place
DATAO APR,FFF$X ;Set the address break
4. Now you want to disable address break for all instructions
which you expect to change the given location. Assume in this
example that only location DIDDLE should change location
CURDS. Then you do the following for a model B CPU:
FFF! IT: ;Define location to get old flags
IT+1! ;Old PC
IT+2! ;New flags
IT+3! IT+4 ;New PC
IT+4! EXCH IT ;Save AC and get old flags
IT+5! TLO 1000 ;Set address break inhibit bit
IT+6! EXCH IT ;Restore flags and AC
IT+7! JRST 5,IT ;Return to caller
IT+10! FFF: ;Redefine FFF
DIDDLE/ MOVEM A,CURDS FFF$< ;Insert patch
FFF/ 0 JRST 7,IT$> ;Call above routine
FFF+1/ 0 MOVEM A,CURDS ;Typed by DDT when finishing patch
FFF+2/ 0 JUMPA A,DIDDLE+1
FFF+3/ 0 JUMPA B,DIDDLE+2
DIDDLE/ MOVEM A,CURDS JUMPA IT+10
    The JRST 7,IT instruction saves the old flags and PC at IT and
    IT+1, and takes new flags and a new PC from IT+2 and IT+3. The
    routine there sets the address break inhibit bit in the saved
    flags, then does a JRST 5,IT, which restores the modified flags
    and PC and so returns to the caller. The next instruction then
    executes without causing an address break.
You have to insert the JRST 7,IT instruction at every
instruction you want to succeed.
For model A CPUs the procedure is similar, but a little easier:
FFF! IT: ;Define location to hold PC
IT+1! EXCH IT ;Get old PC and save AC
IT+2! TLO 1000 ;Set address break inhibit flag
IT+3! EXCH IT ;Restore PC and AC
IT+4! JRSTF @IT ;Return to caller
IT+5! FFF: ;Redefine FFF
DIDDLE/ MOVEM A,CURDS FFF$< ;Insert patch
FFF/ 0 JSR IT$> ;Call above routine
FFF+1/ 0 MOVEM A,CURDS ;Typed by DDT when finishing patch
FFF+2/ 0 JUMPA A,DIDDLE+1
FFF+3/ 0 JUMPA B,DIDDLE+2
DIDDLE/ MOVEM A,CURDS JUMPA IT+5
5. Now put the breakpoints into the monitor so that when an
address break occurs, you will get into EDDT. There are two
locations to patch, one for PI level and one for non-PI level.
You also have to patch a monitor bug in release 3 and 3A so
that the page fail dispatch code works properly.
ADRCMP$B ;Set breakpoint at non-PI routine
PFCD23$B ;Set breakpoint at PI routine
PIPTRP+1/ MOVE A,TRAPSW MOVE A,TRAPS0 ;And fix a bug
$P ;Now let the monitor proceed
6. When either of the above breakpoints is hit, the flags and PC
of the instruction which caused the address break will be in
locations TRAPFL and TRAPPC. If the address break was from
JSYS level (breakpoint was to ADRCMP and location INSKED is
zero) then an $P will proceed properly. If the address break
was from the scheduler or from PI level, doing $P will be
useless since the monitor will then BUGHLT because it doesn't
want to see an address break under these conditions. However,
this is ok if all you want to do is find the instruction
causing the trashing.
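To double-check the arithmetic in steps 3 and 4, here is a small Python sketch that builds the DATAO APR word from the bit table and models what TLO 1000 does to the saved flags. It only manipulates integers; the bit positions come from the table in step 3, in PDP-10 numbering (bit 0 is the leftmost of 36 bits), and the function names are made up for illustration.

```python
def bit(n: int) -> int:
    """Mask for PDP-10 bit n; bit 0 is the most significant of 36."""
    return 1 << (35 - n)


ADDR_FIELD = bit(13) * 2 - 1   # bits 13-35: the address to break on
BREAK_ON_FETCH = bit(9)
BREAK_ON_READ = bit(10)
BREAK_ON_WRITE = bit(11)
USER_SPACE = bit(12)           # 0 = exec address space, 1 = user space


def address_break_word(address: int, *, fetch: bool = False, read: bool = False,
                       write: bool = False, user: bool = False) -> int:
    """Compose the word whose address is given to DATAO APR (step 3)."""
    if address & ~ADDR_FIELD:
        raise ValueError("address does not fit in bits 13-35")
    word = address
    if fetch:
        word |= BREAK_ON_FETCH
    if read:
        word |= BREAK_ON_READ
    if write:
        word |= BREAK_ON_WRITE
    if user:
        word |= USER_SPACE
    return word


# TLO 1000 sets octal 1000 in the left half of the saved flags/PC word.
ADDRESS_BREAK_INHIBIT = 0o1000 << 18


def inhibit_next_break(flags_word: int) -> int:
    """What the IT routine does to the saved flags before returning."""
    return flags_word | ADDRESS_BREAK_INHIBIT
```

For the CURDS example, address_break_word(CURDS, write=True) yields 100000000+CURDS, the value deposited at FFF.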
If the location still gets trashed after trying to catch it this
way, either your procedure is wrong; you are trying this on a 2020
(which has no address break feature); the location is being changed
by some IO being done (RH20s, DTEs, etc); or else the machine is
having some hardware problems.
RECOVERING FROM SYSTEM DISASTERS
There are some common system disasters which in many cases can be
recovered from quickly and with a minimum of effort. The four we will
discuss in this article are:
1. Hung Terminals
2. Hung SETSPD
3. Trashed Disks
4. Hung Jobs
1.0 HUNG TERMINALS
     Hung terminals are usually the result of one of two problems.
Either the speed has been set incorrectly for that terminal type, or a
problem exists between the KL and the front end. If the problem is an
improper speed setting, simply resetting the speed will be sufficient.
If instead the problem is a synchronization problem between the KL and
the 11, the easiest way to recover is to reload the front end. This
can be done by depressing the halt switch on the operator's console of
the 11 and then placing it back in the enable state. After about
fifteen seconds, the message
                    [DECsystem-20 continued]
should be printed on the CTY. If this fails to free the terminal,
perhaps the problem is a hung job. See the discussion under that
heading.
2.0 HUNG SETSPD
     This is a fairly common problem, usually brought on by a hardware
problem. It is possible to bring the system up without running SETSPD
under job 0, log in, and then try running SETSPD under some other
operator job. If SETSPD then hangs, you can CONTROL/C out of the
program, edit 4-CONFIG.CMD to remove the commands suspected of hanging
SETSPD, and retry. In this way, it is possible to continue
timesharing while waiting for the problem to be resolved.
To bring the system up without running SETSPD automatically, one
need only install the following patch to the MONITOR using EDDT on
system start up.
BOOT>/l
BOOT>/g141
EDDT
EDDTF[ 0 -1
DBUGSW[ 0 2
GOTSWM$B
SYSGO1$G
[PS MOUNTED]
1B>>GOTSWM
RUNDD3+7/ PUSHJ P,RUNDII JFCL
0$1B
$P
%%No SETSPD
     The system will then come up as usual except that SYSJOB will not
run. After the problem with SETSPD has been resolved, SYSJOB can be
run by typing
COPY (FROM) <SYSTEM>SYSJOB.RUN (TO) <SYSTEM>SYSJOB.COMMANDS
This will cause all the commands in the SYSJOB.RUN file to be
executed by SYSJOB.
     There is a project under way to allow SETSPD to time itself out
and continue with the next command in 4-CONFIG.CMD. Look for it in the
Large Buffer or the 20 Dispatch.
3.0 TRASHED DISKS
     This is surely one of the biggest headaches facing the Specialist.
Trashed disks come in many forms and recovering from these requires a
good knowledge of the structure of the TOPS-20 file system.
If the structure cannot be mounted, it is because of one of the
following reasons:
1. Inconsistency in either of the HOM blocks
1. Word HOMNAM (1) of either HOM block not SIXBIT/HOM/
2. Word HOMCOD (176) of either HOM block not 707070
3. Word HOMHOM (5) of first HOM block not 1,,12
4. Word HOMHOM (5) of second HOM block not 12,,1
5. Word HOMFSN (173) of either HOM block not 20040,,47524
6. Word HOMFSN+1 (174) of either HOM block not 51520,,31055
7. Word HOMFSN+2 (175) of either HOM block not 20060,,20040
8. Right half of word HOMLUN (4) of either home block either
refers to a unit greater than the left half of word
HOMLUN or it refers to a UNIT already verified
9. Word HOMSNM (3) of either home block does not agree with
SIXBIT/STRUCTURE-NAME/
   10. No disk address for index block in word HOMRXB (10) of
       either HOM block
2. Inconsistencies in Root-Directory page 0
1. Directory number in Directory page 0 of Root-Directory
not 1
2. Directory block type (DRTYP) of Root-Directory page 0 not
400300
3. Relative Page number (DRRPN) of Root-Directory page 0 not
0
4. Top of symbol table (DRSTP) of Root-Directory page 0 out
of Directory bounds
5. Pointer to first free block (DRFFB) of Root-Directory
page 0 not in page 0 of the directory
6. Pointer to Directory Name String (DRNAM) not under start
of symbol table
7. Directory name pointer (DRNAM) not 0 and Name string
block length (NMLEN) not at least 2 words long
8. Directory name pointer (DRNAM) not 0 and directory name
block header (NMTYP) not 400001
9. Password block pointer not 0 and password string block
length (NMLEN) not at least 2 words long
10. Password block pointer not 0 and password string block
header (NMTYP) not 400001
11. Account string block pointer not 0 and Account string
block length (NMLEN) not at least 2 words long
12. Account string block pointer not 0 and Account string
block header (NMTYP) not 400001
3. Inconsistencies in Block types or free space in subsequent
pages of the directory.
    All blocks in the directory (including free space) begin
    with a block header which specifies type and length.
    Immediately following one block should be the header for the
    next block. If this scheme is corrupted, the mount will fail.
1. Header of a block not
1. (NAMTYP) 400001
2. (EXTTYP) 400002
3. (ACCTYP) 400003
4. (USRTYP) 400004
5. (FDBTYP) 400100
6. (DIRTYP) 400300
7. (FRETYP) 400500
8. (FBTTYP) 400600
9. (GRPTYP) 400700
2. Header of a block is NAMTYP and Block length not at least
2 words
3. Header of a block is EXTTYP and block length not at least
2 words
4. Header of a block is ACCTYP and block length not at least
3 words
5. Header of a block is USRTYP and block length not at least
3 words
6. Header of a block is FDBTYP and
1. Block length not at least 30 (.FBLN0) words long
2. Pointer to Author String (.FBAUT) not 0 and points to
a block outside of the directory or points to a block
that does not meet the tests for a user name string
as described above.
3. Pointer to Last Writer String (.FBLWR) not 0 and
points to a block outside of the directory or points
to a block that does not meet the tests for a user
name string block as described above.
        4. Pointer to Account String (.FBACT) is greater than
           zero and it points to a block outside of the
           directory or it points to a block that does not
           meet the tests for an account string block as
           described above.
5. Pointer to Name String (.FBNAM) is not 0 and it
points to a block outside of the directory or it
points to a block that does not meet the tests for a
Name String Block as described above.
6. Pointer to Extension String (.FBEXT) is not 0 and it
points to a block outside of the directory or it
points to a block that does not meet the tests for an
Extension String Block as described above.
7. Header of a block is DIRTYP and
1. Header is not on a page boundary
2. Relative page number (DRRPN) not the calculated page
number
3. Pointer to first free block (DRFFB) does not point to
a location within the current directory page
4. Directory number (DRNUM) not 1.
     8. Header of a block is FRETYP and block is not at least two
        words, or Pointer to next free block (FRNFB) is not zero
        and points to a location not on the same page as the
        current block
9. Last block did not end at DRFTP (address specified on
first page of directory)
4. BAT blocks inconsistent.
1. Either block does not contain SIXBIT/BAT/ in BATNAM
(offset 0 in block)
2. Either block does not contain 606060 in BATCOD (offset
176 in block)
3. Sector number of the BAT block (BATBLK) not the true
sector of block
     4. The BAT blocks do not compare exactly with each other
        through word 176 of the blocks
5. Checksum of the Root-directory Index Block does not agree
with the checksum calculated.
Checksums are calculated as follows:
CHKSUM = 0 ;
For I = 0 to 777
If XB(I) = 0 then
CHKSUM = CHKSUM + I
Else
CHKSUM = CHKSUM + XB(I) ;
    where XB is the address of the first word of the index block,
    so that XB(I) is word I of the block.
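The checksum loop translates directly into Python. This is a sketch for checking a dumped index block by hand, not the monitor's own routine. It assumes the block is supplied as 1000 (octal) 36-bit words and that the running sum wraps at 36 bits, as a PDP-10 ADD would; the pseudo-code above does not say how overflow is handled, so treat that detail as an assumption.

```python
WORD_MASK = (1 << 36) - 1   # one 36-bit word
XB_LENGTH = 0o1000          # words in an index block (I = 0 to 777)


def index_block_checksum(xb: list) -> int:
    """Sum the words of an index block, adding the index I for zero words."""
    if len(xb) != XB_LENGTH:
        raise ValueError(f"expected {XB_LENGTH:o} words, got {len(xb):o}")
    chksum = 0
    for i, word in enumerate(xb):
        chksum += i if word == 0 else word
        chksum &= WORD_MASK   # assumed 36-bit wraparound
    return chksum
```

Note that an entirely empty index block does not checksum to zero: every zero word contributes its own index.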
     As you can see, there are many things that could be wrong with a
structure that prevent it from being mounted. The consistency of the
structure can be checked quite easily using the new FILDDT commands
STRUCTURE and DISK (see 'NEW DISK FEATURES FOR FILDDT', also in this
SWSKIT).
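For the HOM block items in list 1, a Python sketch like the following can help when comparing words dumped with FILDDT by hand. It checks one block in isolation (the cross-unit part of check 8, "a unit already verified", is left out), uses the word offsets given in the list, and takes check 8 literally as "right half greater than left half". As an aside, if each halfword of the three expected HOMFSN words is read as a 16-bit PDP-11 word, low byte first, they spell TOPS-20 padded with blanks; the helper pdp11_text shows that decoding. All names other than the HOM word names are hypothetical.

```python
HOMNAM, HOMSNM, HOMLUN, HOMHOM, HOMRXB = 1, 3, 4, 5, 0o10
HOMFSN, HOMCOD = 0o173, 0o176
EXPECTED_FSN = (0o020040047524, 0o051520031055, 0o020060020040)


def sixbit(text: str) -> int:
    """Left-justified SIXBIT word for up to six characters."""
    word = 0
    for ch in text.upper()[:6].ljust(6):
        word = (word << 6) | ((ord(ch) - 0o40) & 0o77)
    return word


def left(word: int) -> int:
    return word >> 18


def right(word: int) -> int:
    return word & 0o777777


def pdp11_text(words) -> str:
    """Read each 18-bit half as a 16-bit PDP-11 word, low byte first."""
    out = []
    for w in words:
        for half in (left(w), right(w)):
            out.append(chr(half & 0o377))
            out.append(chr((half >> 8) & 0o377))
    return "".join(out)


def check_hom_block(block: list, second: bool, structure: str) -> list:
    """Return the names of the single-block checks from list 1 that fail."""
    failures = []
    if block[HOMNAM] != sixbit("HOM"):
        failures.append("HOMNAM")
    if block[HOMCOD] != 0o707070:
        failures.append("HOMCOD")
    expected_hom = ((0o12 << 18) | 1) if second else ((1 << 18) | 0o12)
    if block[HOMHOM] != expected_hom:
        failures.append("HOMHOM")
    if tuple(block[HOMFSN:HOMFSN + 3]) != EXPECTED_FSN:
        failures.append("HOMFSN")
    if right(block[HOMLUN]) > left(block[HOMLUN]):
        failures.append("HOMLUN")
    if block[HOMSNM] != sixbit(structure):
        failures.append("HOMSNM")
    if block[HOMRXB] == 0:
        failures.append("HOMRXB")
    return failures
```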
     For structures which are badly trashed, the only sane way of
recovering is to rebuild the structure using a catastrophe tape. For
simple inconsistencies such as a bad BAT block, CHECKD does the job
well. For more involved damage which cannot be recovered from a
backup tape (because of a forgetful system manager), the above
information can be of great help.
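The block-chaining rule in list 3 (each block is followed immediately by the next block's header, and the last block ends exactly at DRFTP) is easy to check mechanically once a directory has been dumped. The following Python sketch assumes each header word holds the block type in its left half and the length in its right half; that is how the type codes in list 3 read, but the list does not state it, so verify against PROLOG. The type codes and minimum lengths are copied from list 3, with .FBLN0 (30) taken as octal like the other numbers in this handbook.

```python
from typing import Optional

BLOCK_TYPES = {
    0o400001: "NAMTYP", 0o400002: "EXTTYP", 0o400003: "ACCTYP",
    0o400004: "USRTYP", 0o400100: "FDBTYP", 0o400300: "DIRTYP",
    0o400500: "FRETYP", 0o400600: "FBTTYP", 0o400700: "GRPTYP",
}
MIN_LENGTH = {"NAMTYP": 2, "EXTTYP": 2, "ACCTYP": 3, "USRTYP": 3,
              "FDBTYP": 0o30, "FRETYP": 2}


def walk_blocks(words: list, start: int, end: int) -> Optional[str]:
    """Follow block headers from start; return a problem description or None.

    Assumes header words are type,,length and that end (DRFTP) <= len(words).
    """
    addr = start
    while addr < end:
        header = words[addr]
        btype, length = header >> 18, header & 0o777777
        name = BLOCK_TYPES.get(btype)
        if name is None:
            return f"{addr:o}: unknown block type {btype:o}"
        if length < MIN_LENGTH.get(name, 1):  # default 1 also stops a zero-length loop
            return f"{addr:o}: {name} block only {length:o} words long"
        addr += length
    if addr != end:
        return f"last block ends at {addr:o}, not at {end:o}"
    return None
```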
4.0 HUNG JOBS
     There are a number of circumstances which cause a job to become
hung, usually waiting for some resource to free up, some share count
to become zero, etc. Sometimes these tests will never be satisfied,
the job has its PSI system turned off, and as a result the job becomes
hung. Freeing it up can be very tricky. The first thing to try is to
log the job out from some other terminal. If this doesn't succeed in
freeing the job, then the next best thing is to detach the job from
the terminal and allow it to sit there. It may be using negligible
amounts of CPU time and causing no adverse effects on the system.
Zapping the job may crash the system, which in most cases is not the
desirable approach.
The next time the system is reloaded, be sure to get a dump of the
system with the hung job and submit it as an SPR (see the SWSKIT
article about getting informative Dumps).
LOOKING AT HUNG TAPES
     A number of problems of the general classification "tape hang"
have been reported, and will probably always exist as long as we use
magtapes. Although there are apparently several variants of the
problem, there are some things which a suitably cautious specialist
can do when presented with a hung tape drive. Listed below are some
techniques which can be used in an attempt to investigate and perhaps
alleviate the problem. These things should, in general, be harmless
to the system, barring mis-typing in MDDT; however, they may not clear
the problem.
For release 4, there are several tables that are used in relation
to tape drives. Some of these tables are indexed by MT unit number,
some by MTA unit number. In general, it can be said that if a table
name begins with the characters MT, it will be indexed by MTA or
physical unit number, and if the table name begins with TL or TP, it
will be indexed by MT or logical unit number. The TL and TP tables
will usually have something to do with the tape labeling system. This
article concerns itself mainly with the more important tables relating
to MTAs (physical tape units).
     When working with the tape subsystem, some care should be taken.
For instance, it always helps if no one else is actively using the
tape drives while you attempt something like reloading the microcode
for a DX20.
1. Finding the Tape Drive
     Several parallel tables concern the ownership of a tape drive;
those of interest are DEVNAM, DEVCHR, and DEVUNT. DEVNAM+n holds the
device name in SIXBIT. DEVUNT+n holds a word whose left half is the
assigner's job number, -1 if the device is free, or -2 if it is being
controlled by the allocator; the right half contains the unit number.
Note that in release 4, with tape allocation turned on, the MTA
entries will always show job 0 as owning the drive, and the user's job
number will appear in the entry for the corresponding MT unit.
DEVCHR+n holds the device characteristics word. Knowing the device
name or the owning job, you can use DDT to find the table offset. See
the example below.
2. Grabbing the Drive
     Knowing the offsets into DEVUNT, you can free the device
assignment by putting -1 into the left half of the appropriate DEVUNT
entry. The drive can then be assigned with the normal EXEC ASSIGN
command. When dealing with the Release 4 allocator, you can place your
own job number here if necessary. The drive, however, will still not
be in a usable state. Note that the appropriate DEVUNT entry is the
one referring to the MT, not the MTA.
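The DEVUNT conventions above, together with the DVXSTN/MTAN layout used in the example later in this article (MTA entries start at DVXSTN, MT entries at DVXSTN+MTAN), can be summarized in a short Python sketch. It is an aid for reading and composing words before depositing them in MDDT, not monitor code, and the function names are made up for illustration.

```python
HALF = 0o777777


def signed_half(value: int) -> int:
    """Interpret an 18-bit halfword as two's complement (777777 -> -1)."""
    return value - (1 << 18) if value & (1 << 17) else value


def decode_devunt(word: int) -> tuple:
    """Split a DEVUNT entry into (owner, unit number)."""
    owner = signed_half(word >> 18)
    unit = word & HALF
    if owner == -1:
        return "free", unit
    if owner == -2:
        return "allocator", unit
    return f"job {owner:o}", unit


def free_devunt(word: int) -> int:
    """Put -1 (777777) in the left half, leaving the unit number alone."""
    return (HALF << 18) | (word & HALF)


def devxxx_offset(dvxstn: int, mtan: int, unit: int, logical: bool = False) -> int:
    """Offset into DEVNAM/DEVUNT/DEVCHR for MTAn (physical) or MTn (logical)."""
    return dvxstn + (mtan if logical else 0) + unit
```

With the example's DVXSTN=21 and MTAN=20, devxxx_offset(0o21, 0o20, 0, logical=True) gives 41, the DEVNAM slot that shows MT0.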
3. Clearing External Errors
Make sure that there is a tape of some sort mounted, and the
drive is placed on-line. Having a write-enable ring in the tape may
help in being sure the unit is functional if the hung condition is
cleared.
4. Checking the UDB
Next, reset the Unit Data Block (UDB) status word. The UDB can be
found using the MTCUTB table, which is indexed by MTA unit number: the
left half of each entry is the address of the channel data block
(CDB), and the right half is the address of the UDB. The status word
of the UDB should then be reset to its base state. Leave the right
half alone; it essentially contains the drive type. The left half
should have only bit 16 set, which indicates a tape-type device
(US.TAP). Record the old contents for later analysis.
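The base-state value described above can be computed from the old word
by keeping its right half and replacing the left half with US.TAP
(1B16, which is 2,,0). A minimal Python sketch, assuming the word is
held as a 36-bit integer; reset_udbsts is a hypothetical name, and the
MTA5 word from the example in step 6 is used purely as sample data.

```python
# Sketch: base-state value for the UDB status word of a hung tape drive.
HALF = 0o777777
US_TAP = 1 << (35 - 16)        # 1B16, i.e. 2,,0: tape-type device


def reset_udbsts(old: int) -> int:
    """Keep the right half (drive type); leave only US.TAP in the left half."""
    return US_TAP | (old & HALF)


if __name__ == "__main__":
    old = 0o000102_000157       # UDB word for MTA5 from the example in step 6
    new = reset_udbsts(old)
    print(f"old {old >> 18:o},,{old & HALF:o}  new {new >> 18:o},,{new & HALF:o}")
    # old 102,,157  new 2,,157
```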
5. Checking the Status
Next, examine table MTASTS, again indexed by MTA unit number. Record
the old contents, then clear the word to zero.
6. Example
@enaBLE (CAPABILITIES)
$sddt
DDT
mddt%$x
MDDT
dvxstn=21 !THIS WILL PROVIDE A HANDY INDEX
!INTO THE MTA OFFSETS IN THE
!DEVxxx TABLES.
!DEVNAM IS A SIXBIT DEVICE NAME
devnam+21/ HLRZM P2,FKBSPW+217(T1) $6t;MTA0
DEVNAM+22/ MTA1
DEVNAM+23/ MTA2
DEVNAM+24/ MTA3
DEVNAM+25/ MTA4
DEVNAM+26/ MTA5
...
...
...
DEVNAM+40/ MTA17
mtan=20 !ROOM FOR 20 (OCTAL) TAPE DRIVES HAS BEEN ALLOCATED
mtindx[ 777765,,5 !BUT ONLY 5 ACTUAL TAPE DRIVES ARE ON THIS SYSTEM
!THE MTs WILL APPEAR AFTER MTAs IN THE DEVxxx
!TABLES SO DVXSTN+MTAN WILL PROVIDE THE OFFSET
!TO THE MT ENTRIES
devnam+41/ HLRZM P1,@0 $6t;MT0
DEVNAM+42/ MT1
DEVNAM+43/ MT2
DEVNAM+44/ MT3
DEVNAM+45/ MT4
DEVNAM+46/ MT5
...
...
...
DEVNAM+60/ MT17
!DEVUNT IS PARALLEL TO DEVNAM AND PROVIDES
!THE OFFSETS INTO THE MTxxxx TABLES FOR MTAs
!AND OFFSETS INTO THE TLxxxx/TPxxxx TABLES
!FOR MTs
devunt+21[ 0 !MTA UNIT ZERO (MTA0: FROM DEVNAM ABOVE) ASSIGNED TO JOB 0
DEVUNT+22[ 1 !JOB 0,,MTA1:
DEVUNT+23[ 2 !JOB 0,,MTA2:
DEVUNT+24[ 3 !JOB 0,,MTA3:
DEVUNT+25[ 4 !JOB 0,,MTA4:
DEVUNT+26[ 5 !JOB 0,,MTA5:
DEVUNT+27[ 777777,,6 !UNASSIGNED,,MTA6:
...
...
...
DEVUNT+40[ 777777,,17 !UNASSIGNED,,MTA17:
!DV%PSD=400000 INDICATES A PSEUDO DEVICE
!THE FOLLOWING ENTRIES FOR MTs WILL INDICATE
!THE AVAILABILITY OF LOGICAL TAPE UNITS
devunt+41[ 32,,400000 !PSEUDO DEVICE MT0: IS ASSIGNED TO
!JOB 32 OCTAL (JOB 26 IN DECIMAL)
DEVUNT+42[ 777776,,400001 !CONTROLLED BY ALLOCATOR,,MT1:
DEVUNT+43[ 777776,,400002 ! " " " ,,MT2:
DEVUNT+44[ 777776,,400003 ! " " " ,,MT3:
...
...
...
DEVUNT+60[ 777776,,400017 ! " " " ,,MT17:
!TLABR0 (INDEXED BY MT NUMBER) WILL INDICATE
!WHICH PHYSICAL TAPE UNIT WILL BE USED WHEN
!REFERENCING AN MT. THIS IS INDICATED BY THE
!PHYSICAL MTA NUMBER IN BITS 2-8.
tlabr0[ 405000,,0 !BIT 0 INDICATES A VALID VOLUME IS MOUNTED ON MTA5
mtcutb+5[ 730437,,730625 !CDB,,UDB FOR MTA5 BEING USED BY JOB 26
!WHO KNOWS IT AS MT0 (SEE ABOVE)
730625[ 102,,157 !FIRST WORD OF UDB FOR MTA5
!US.WLK=1B11 >> WRITE LOCKED
!US.TAP=1B16 >> TAPE TYPE DEVICE
!.UTT70=17B35 >> TU70
mtasts+5[ 0 !THIS EXAMPLE INDICATES A TAPE DRIVE THAT PROBABLY
!HASN'T BEEN REFERENCED BY THE USER YET
mretn$g !TO RETURN TO SDDT FROM MDDT
<>
^Z !TO RETURN TO THE EXEC FROM SDDT
$
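Several words in the example above are decoded by eye: DEVNAM entries
shown with $6T, the TLABR0 word whose bits 2-8 give the MTA number,
and the CDB,,UDB pair in MTCUTB. The Python sketch below does the same
decoding, assuming 36-bit words held as integers with bit 0 as the
most significant bit; the function names are hypothetical. The MTA0
word 556441,,200000 is simply the SIXBIT encoding of "MTA0", which DDT
also happened to display as the instruction HLRZM P2,FKBSPW+217(T1).

```python
# Sketch: decoding words that the example above reads by eye.
HALF = 0o777777


def sixbit(word: int) -> str:
    """Decode six SIXBIT characters (code + 040 octal = ASCII), like DDT's $6T."""
    chars = [chr(((word >> shift) & 0o77) + 0o40) for shift in range(30, -6, -6)]
    return "".join(chars).rstrip()


def tlabr0_entry(word: int) -> tuple[bool, int]:
    """TLABR0 word -> (valid volume mounted (bit 0), physical MTA number (bits 2-8))."""
    valid = bool((word >> 35) & 1)
    mta = (word >> (35 - 8)) & 0o177
    return valid, mta


def mtcutb_entry(word: int) -> tuple[int, int]:
    """MTCUTB word -> (CDB address, UDB address)."""
    return (word >> 18) & HALF, word & HALF


if __name__ == "__main__":
    print(sixbit(0o556441_200000))                 # MTA0
    print(tlabr0_entry(0o405000_000000))           # (True, 5)
    cdb, udb = mtcutb_entry(0o730437_730625)
    print(f"CDB {cdb:o}  UDB {udb:o}")             # CDB 730437  UDB 730625
```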
If clearing MTASTS and UDBSTS for the drive does not clear the
problem, you will probably have to dig further to find some other,
more obscure, inconsistency in the MTA/MT tables. Refer to the monitor
tables (which, hopefully, have been included with the SWSKIT) under
MTA-STORAGE-AREA. As always, exercise extreme caution in MDDT: you can
accidentally trash some random location in the running monitor just by
hitting a carriage return at the wrong time.
One last note about the monitor tables: their description of the
DEVUNT table would lead one to believe that the right half contains -2
if the device is under control of the allocator. In fact, the -2
(777776) appears in the left half, as in the DEVUNT+42 entry of the
example above.
A LOOK AT SOME OF THE DISK STUFF
================================
This article is only a front end to the PHYPAR module, which is where
the information can be reliably obtained; PHYPAR should serve as the
ultimate reference for these problems.
Much of the system debugging you will have to deal with will involve
the DEC-20 hardware. There always seems to be a large gap between what
the diagnostics can tolerate and what the monitor can tolerate in the
way of malfunctioning hardware. The monitor will not always point you
to the real disk or magtape problem; instead, it may crash some
minutes after something has gone wrong elsewhere. Most of the hardware
problems that we have found really difficult to track down, and to
point the Field Service rep to, were problems with disk hardware. The
following information can help Field Service track down problems which
are not reported by the diagnostics. In most cases the Field Service
reps know what all the status bits mean, but have not been able to
find them in monitor crash dumps or in the running monitor.
CHNTAB:
CHNTAB is an ordered list of Channel Data Block
addresses starting with channel 0: the data block
address for RH20 0 is in the first word, and so on.
CDB:
CDB is the Channel Data Block. There is one CDB per
channel. The CDB contains channel-dependent
instructions and data, and pointers to the unit data
blocks (UDBs) in the case of RP04s, RP05s, and RP06s.
In the case of TU45s, the pointers are to the Kontroller
Data Blocks (for the TM02s), which in turn point to the
UDBs. The CDB also contains information about the
currently active unit. When the channel interrupts,
control passes (via a JSP) to CDBINT. The CDB address
is stored in AC1, P1, and the principal analysis
routine, PHYINT, is called.
NOTE: The CDBs are referenced in modules PHYSIO, PHYH2 (RH20
code), PHYM2 (TM02 code) and PHYP4 (RP04, 05, 06
code). The Channel Data Block is defined in the
module PHYPAR. The address that you get in CHNTAB is
really a pointer to word 0, which contains the status
bits for this controller (CDBSTS). Look in PHYPAR for
the table definition. Some words of interest are:
CDBaddress + CDBSTS: status and configuration information
CDBaddress + CDBUDB: 8-word table of UDB (or KDB) addresses
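The CHNTAB to CDB to UDB (or KDB) chain can be followed mechanically.
The Python sketch below uses a dictionary as a stand-in for examining
monitor memory; it assumes each table entry holds its address in the
right half, and the CHNTAB address and CDBUDB offset in the demo are
placeholders, so take the real values from your monitor's symbols and
PHYPAR. For a TU45 channel the result is a KDB, whose KDBUDB table
leads on to the UDB.

```python
# Sketch: CHNTAB -> CDB -> UDB/KDB lookup against a stand-in for monitor memory.
# CHNTAB's address and the CDBUDB offset must come from the monitor's symbols
# (CDBUDB is defined in PHYPAR); the numbers in the demo are placeholders.
from typing import Mapping

HALF = 0o777777


def unit_block(memory: Mapping[int, int], chntab: int, cdbudb: int,
               channel: int, unit: int) -> int:
    """Return the UDB (or, on a TM02 channel, KDB) address for CHANNEL/UNIT."""
    if not 0 <= unit < 8:
        raise ValueError("CDBUDB is an 8-word table")
    cdb = memory.get(chntab + channel, 0) & HALF      # assumes address in right half
    if cdb == 0:
        raise LookupError(f"no CDB for channel {channel}")
    block = memory.get(cdb + cdbudb + unit, 0) & HALF
    if block == 0:
        raise LookupError(f"no UDB/KDB for unit {unit} on channel {channel}")
    return block


if __name__ == "__main__":
    CHNTAB, CDBUDB = 0o1000, 0o10          # placeholders, not real values
    memory = {CHNTAB + 1: 0o730437,        # channel 1 -> CDB
              0o730437 + CDBUDB + 3: 0o730625}
    print(f"{unit_block(memory, CHNTAB, CDBUDB, channel=1, unit=3):o}")  # 730625
```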
The status bits which are also defined in PHYPAR are
listed here for your convenience:
CS.OFL==1B0 ; offline
CS.AC1==1B1 ; primary command active
CS.AC2==1B2 ; secondary command active
CS.ACT==CS.AC1!CS.AC2 ; any active
CS.MAI==1B3 ; channel is in maintenance mode
CS.MRQ==1B4 ; maintenance mode requested for unit
CS.ERC==1B5 ; error recovery in progress
CS.STK==1B6 ; channel supports command stacking
CS.ACL==1B7 ; alternate command list is current
BITs 30-32 ; PIA field
BITs 33-35 ; channel type field
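The CDBSTS bits listed above can be decoded with a short Python
sketch. It assumes MACRO-10 bit numbering (1Bn is bit n, bit 0
leftmost) and reads the PIA and channel type fields as bits 30-32 and
33-35; the sample word is made up for illustration, and the code does
not try to interpret channel type codes.

```python
# Sketch: decode a CDBSTS word using the bit list above (bit 0 = leftmost).
CS_FLAGS = ["CS.OFL", "CS.AC1", "CS.AC2", "CS.MAI",
            "CS.MRQ", "CS.ERC", "CS.STK", "CS.ACL"]   # bits 0-7, in order


def bit(n: int) -> int:
    """MACRO-10 1Bn: bit n of a 36-bit word."""
    return 1 << (35 - n)


def decode_cdbsts(word: int) -> dict:
    return {
        "flags": [name for n, name in enumerate(CS_FLAGS) if word & bit(n)],
        "active": bool(word & (bit(1) | bit(2))),   # CS.ACT = CS.AC1!CS.AC2
        "pia": (word >> 3) & 0o7,                   # bits 30-32
        "channel_type": word & 0o7,                 # bits 33-35
    }


if __name__ == "__main__":
    # Made-up value: primary command active, stacking supported, PIA 5, type 1.
    sample = bit(1) | bit(6) | (5 << 3) | 1
    print(decode_cdbsts(sample))
```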
KDB:
Kontroller Data Block (TM02 only) defined in PHYPAR
also. Referenced in PHYM2, PHYPAR, PHYSIO. Words of
interest are:
KDBADDR+KDBSTS: ; flags unit type
KDBADDR+KDBUDB: ; UDB table first word (1 word/UDB)
UDB:
Unit Data Block. There is one UDB per unit associated
with a CDB or KDB. The UDB contains information about
the current activity on the unit in question. The UDB
is defined in PHYPAR as well. Some words of interest
are noted below. Look in the listings for other
information.
UDBADDR + UDBSTS: ; status and configuration info (see below)
UDBADDR + UDBERR: ; error recovery status word
UDBADDR + UDBERP: ; error reporting work area if non 0
UDBADDR + UDBRED: ; reads - sectors if disk, frames if tape
UDBADDR + UDBWRT: ; writes - sectors if disk, frames if MTA
UDBADDR + UDBSRE: ; soft read errors
UDBADDR + UDBSWE: ; soft write errors
UDBADDR + UDBHRE: ; hard read errors
UDBADDR + UDBHWE: ; hard write errors
UDBADDR + UDBPS1: ; current cylinder if disk, cur file if MTA
UDBADDR + UDBPS2: ; current sector within cyl if disk, record
; in file if tape
UDBADDR + UDBSPE: ; soft positioning error
UDBADDR + UDBHPE: ; hard positioning error
; NOTE - there are several other UDB words
; including a device dependent portion
STATUS BITS IN UDBSTS OR FIRST WORD OF UDB:
US.OFS==1B0 ; off line or unsafe
US.CHB==1B1 ; check HOME blocks before any normal I/O
US.POS==1B2 ; positioning in progress
US.ACT==1B3 ; active
US.BAT==1B4 ; on if bad BAT blocks on this unit
US.BLK==1B5 ; lock bit for this unit's BAT blocks
US.PGM==1B6 ; dual port switch in (A or B)
US.MAI==1B7 ; unit is in maintenance mode
US.MRQ==1B8 ; maintenance mode requested on this unit
US.BOT==1B9 ; unit is at BOT
US.REW==1B10 ; unit is rewinding
US.WLK==1B11 ; unit is write locked
US.MAL==1B12 ; maintenance mode allowed on this unit
US.OIR==1B13 ; operator intervention required, set at
; interrupt level, checked periodically.
US.OMS==1B14 ; once a minute message to operator, used in
; conjunction with US.OIR.
US.PRQ==1B15 ; positioning required on this unit
US.TAP==1B16 ; device type tape
US.PSI==1B17 ; tape - online/offline/rewind done transition
BITS 32-35 CONTAIN THE UNIT TYPE CODE (FIELD NAME USTYP):
.UTRP4 = 1 ; RP04
.UTRS4 = 2 ; RS04 (drum)
.UTT16 = 3 ; TU16 (TU45)
.UTTM2 = 4 ; TM02 as a unit
.UTRP5 = 5 ; RP05
.UTRP6 = 6 ; RP06
.UTRP7 = 7 ; RP07
.UTRP8 = 10 ; RP08
.UTRM3 = 11 ; RM03
.UTTM3 = 12 ; TM03 AS A UNIT
.UTT77 = 13 ; TU77
.UTTM7 = 14 ; TM78
.UTT78 = 15 ; TU78
.UTDX2 = 16 ; DX20-A
.UTT70 = 17 ; TU70
.UTT71 = 20 ; TU71
.UTT72 = 21 ; TU72
.UTT73 = 22 ; TU7x
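A matching sketch for UDBSTS, again in Python with bit 0 leftmost. It
uses the flag and unit type lists above and checks itself against the
MTA5 word 102,,157 from the hung tape example (write locked, tape,
TU70). It takes USTYP to be bits 32-35 as stated above; since type
codes 20-22 (octal) do not fit in four bits, check the USTYP
definition in PHYPAR before relying on the mask.

```python
# Sketch: decode UDBSTS, taking USTYP to be bits 32-35 as stated above.
US_FLAGS = ["US.OFS", "US.CHB", "US.POS", "US.ACT", "US.BAT", "US.BLK",
            "US.PGM", "US.MAI", "US.MRQ", "US.BOT", "US.REW", "US.WLK",
            "US.MAL", "US.OIR", "US.OMS", "US.PRQ", "US.TAP", "US.PSI"]  # bits 0-17

UNIT_TYPES = {0o1: "RP04", 0o2: "RS04", 0o3: "TU16", 0o4: "TM02", 0o5: "RP05",
              0o6: "RP06", 0o7: "RP07", 0o10: "RP08", 0o11: "RM03", 0o12: "TM03",
              0o13: "TU77", 0o14: "TM78", 0o15: "TU78", 0o16: "DX20-A",
              0o17: "TU70", 0o20: "TU71", 0o21: "TU72", 0o22: "TU7x"}

USTYP_MASK = 0o17   # bits 32-35; widen if PHYPAR defines USTYP with more bits


def decode_udbsts(word: int) -> tuple[list[str], str]:
    flags = [name for n, name in enumerate(US_FLAGS) if word & (1 << (35 - n))]
    code = word & USTYP_MASK
    return flags, UNIT_TYPES.get(code, f"unknown type {code:o}")


if __name__ == "__main__":
    print(decode_udbsts(0o000102_000157))   # (['US.WLK', 'US.TAP'], 'TU70')
```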
THE PLACES WHERE THINGS ARE ON THE DISK ARE AS FOLLOWS:
BLOCK 0: ; PDP-11 bootstrap
BLOCK 1: ; primary HOME block
BLOCK 2: ; primary BAT block
BLOCKS 3-11: ; reserved
BLOCK 12: ; secondary HOME block
BLOCK 13: ; secondary BAT block
The disk locations of the blocks above are stored in the table HOME,
which is defined in STG. The BAT blocks are defined in PROLOG and the
HOME blocks are defined in DSKALC.
NEW DISK FEATURES FOR FILDDT
The FILDDT to be shipped with release 4 of TOPS-20 will have two new
commands in relation to disk file structure maintenance.
They are:
STRUCTURE (FOR PHYSICAL I/O IS) disk-structure
Examines the specified disk structure.
DRIVE (FOR PHYSICAL I/O IS ON CHANNEL) c (UNIT) u
Examines the specified disk unit.
These are privileged functions; you must have privileges enabled to
use them.
These two commands are nearly identical. Their difference is in the
way the structure is identified. To use the STRUCTURE command the
structure must be mounted. The STRUCTURE command is useful for
examining a multi-pack structure. The DRIVE command is useful for
examining the file system of a structure which cannot be mounted.
Channel and unit numbers can be found from the programs UNITS, DS,
SYSDPY, or OPR.
Addressing is in the same format as in other forms of DDT.
It is easier to understand exactly what the disk will look like in
FILDDT if you keep in mind that all sectors are packed contiguously
into the DDT address space, regardless of sector size, starting at DDT
address 0. For instance, on an RP06 there are four sectors per memory
page, or 200 (octal) words per sector. Therefore, sector zero of the
structure begins at FILDDT address 0 and ends at FILDDT address 177
(octal); sector 1 begins at address 200 and ends at 377. For Release
4, all DEC-supported disks contain 200 (octal) words per sector, so a
consistent mapping exists between sector number and FILDDT address.
Soon, TOPS-20 will support RP20s, which have 1000 (octal) words per
sector (one page per sector). Index block addresses and most monitor
disk addresses are in sectors, which is why it is important to be able
to translate between sector addresses and FILDDT addresses.
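The sector packing just described reduces to one multiplication and
its inverse. A minimal Python sketch, assuming 200 (octal) words per
sector unless told otherwise; the function names are hypothetical.

```python
# Sketch: FILDDT's packed sector layout (sector n starts at n * words-per-sector).
WORDS_PER_SECTOR = 0o200    # release 4 DEC disks; the RP20 uses 0o1000


def sector_to_address(sector: int, offset: int = 0,
                      words_per_sector: int = WORDS_PER_SECTOR) -> int:
    """FILDDT address of word OFFSET within SECTOR."""
    if not 0 <= offset < words_per_sector:
        raise ValueError("offset lies outside the sector")
    return sector * words_per_sector + offset


def address_to_sector(address: int,
                      words_per_sector: int = WORDS_PER_SECTOR) -> tuple[int, int]:
    """FILDDT address -> (sector, word offset within that sector)."""
    return divmod(address, words_per_sector)


if __name__ == "__main__":
    print(f"{sector_to_address(1):o}")               # 200: sector 1 (primary HOME)
    sector, offset = address_to_sector(0o377)
    print(sector, f"{offset:o}")                      # 1 177: last word of sector 1
    print(f"{sector_to_address(1, words_per_sector=0o1000):o}")   # 1000 on an RP20
```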
The FILDDT option ENABLE PATCHING is also available for use with the
DRIVE and STRUCTURE commands. With this option on, the user is able to
modify specific words on the structure. Another very convenient FILDDT
command to use in conjunction with the disk commands is LOAD (symbols
from) input-file-spec. Any file may be specified here, but a useful
one is SYSTEM:MONITR, since the monitor's symbol table contains HOME
block sector addresses, FDB offsets, etc. Once a file's symbols are
loaded, one may also define one's own symbols, which is useful for
remembering the addresses of data structures on the units. For
example, after finding the index block of a file, one could define a
symbol, FILIDX, at that address for easy reference later on.
When examining a multi-pack structure using the STRUCTURE command,
the first unit is addressed exactly as if it were the only unit in the
structure. FILDDT addresses of sectors on each following unit begin
immediately after the last address of the preceding unit. For
example, suppose we would like to examine the BAT blocks for the
second unit of a two-pack STR: on RP06 drives. An RP06 contains
304000. sectors per unit and 128. words per sector, so the first
FILDDT address for the second unit of an RP06 two-pack STR: is
304000.*128.=38912000., or 224340000 (octal).
FILDDT>STRUCTURE (FOR PHYSICAL I/O IS) PS:
[Looking at file structure PS:]
; starting address of second unit in structure
; plus sector address of BAT blocks (2)
; times number of words per sector gives
; FILDDT address of start of BAT blocks for
; that unit
224340000+2*200=224,,340400
224,,340400[ 424164,,0 $6T; BAT
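The same arithmetic extends to multi-pack structures: unit n of a
structure starts at n times the sectors per unit times the words per
sector. The Python sketch below, with hypothetical function names,
reproduces the BAT block address above and can be checked against the
table of magic numbers at the end of this article. Decimal sector
counts such as 304000. are written without the trailing point.

```python
# Sketch: starting FILDDT address of each unit in a multi-pack structure.
def unit_start(unit_index: int, sectors_per_unit: int,
               words_per_sector: int = 128) -> int:
    """Address where unit UNIT_INDEX (0 = first pack) begins in FILDDT space."""
    return unit_index * sectors_per_unit * words_per_sector


def ddt_format(address: int) -> str:
    """Show an address the way DDT prints it: left,,right in octal."""
    lh, rh = divmod(address, 1 << 18)
    return f"{lh:o},,{rh:o}" if lh else f"{rh:o}"


if __name__ == "__main__":
    bat = unit_start(1, 304000) + 2 * 0o200     # second RP06 unit, BAT block (sector 2)
    print(ddt_format(bat))                        # 224,,340400
    print(ddt_format(unit_start(2, 201420, 0o1000)))   # 1422,,630000 (RP20, 3rd unit)
```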
For another example, let's say we would like to find the start of the
ROOT-DIRECTORY symbol table.
@ENABLE (CAPABILITIES)
$FILDDT
FILDDT>LOAD (SYMBOLS FROM) SYSTEM:MONITR
[22722 symbols loaded from file]
FILDDT>STRUCTURE (FOR PHYSICAL I/O IS) PS:
[Looking at file structure PS:]
NWSEC=200 ; number of words per sector
HM1BLK=1 ; sector number of HOM block
HOMRXB=10 ; offset in HOM block for index
; block to root-directory
; sector number of HOM block
; times words per sector equals
; FILDDT address of start of HOM block
HM1BLK*NWSEC[ 505755,,0 $6T;HOM
HM1BLK*NWSEC+HOMRXB[ 10,,5740 ; plus offset to address of index block
; sector number of index block times
; number of words per sector gives
5740*NWSEC[ 10,,5744 ; FILDDT adr of root-dir index block
; NOTE: Bit 14 (DSKAB) specifies this
; address as a disk sector address.
; sector addresses are bits 15-35
RTDIDX: ; define symbol for index block
; sector number of first page of
; root-directory times number of words
; per sector gives the
5744*NWSEC[ 400300,,100 ; FILDDT adr of first page of ROOT-DIR
RTDIR0: ; define start of page 0 of ROOT-DIR
RTDIR0+3[ 30610 ; plus 3 for start of symbol table
; NOTE: adr is a 'directory address'
; offset 610 of directory page 30
RTDIDX+30[ 10,,6250 ; get sector adr of page 30 of ROOT-DIR
; sector adr of page 30 times words per
; sector gives FILDDT address of page
; 30 of ROOT-DIR.
6250*NWSEC+610[ 400400,,1 ; Add offset for symbol table start
RTDSYM:
^E
FILDDT>EXIT
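The walk above mixes three kinds of address: disk addresses (DSKAB in
bit 14, sector number in bits 15-35), directory addresses (page number
in the high part, offset within a 1000-octal-word page in the low
part), and FILDDT addresses (sector times NWSEC). A minimal Python
sketch of those conversions, using the values from the transcript; the
function names are hypothetical and NWSEC is valid only for drives
with 200 (octal) words per sector.

```python
# Sketch: the arithmetic behind the ROOT-DIRECTORY walk above.
DSKAB = 1 << (35 - 14)         # bit 14: word holds a disk sector address
SECTOR_MASK = (1 << 21) - 1    # bits 15-35: the sector number
NWSEC = 0o200                  # words per sector (not valid for an RP20)
DIR_PAGE = 0o1000              # words per directory page


def disk_sector(word: int) -> int:
    """Sector number from a disk address word such as 10,,5740."""
    if not word & DSKAB:
        raise ValueError(f"{word:o} does not have DSKAB set")
    return word & SECTOR_MASK


def split_dir_address(diradr: int) -> tuple[int, int]:
    """Directory address -> (directory page, offset within the page)."""
    return divmod(diradr, DIR_PAGE)


if __name__ == "__main__":
    index_sector = disk_sector(0o10_005740)         # HOM block word HOMRXB
    page0_sector = disk_sector(0o10_005744)         # RTDIDX+0: page 0 of ROOT-DIR
    page, offset = split_dir_address(0o30610)       # RTDIR0+3: symbol table start
    page_sector = disk_sector(0o10_006250)          # RTDIDX+30: page 30 of ROOT-DIR
    symtab = page_sector * NWSEC + offset           # same as 6250*NWSEC+610
    print(f"{index_sector:o} {page0_sector:o} {page:o} {offset:o} {symtab:o}")
    # 5740 5744 30 610 1452610
```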
Here are some magic numbers for all DEC supported drives.
DRIVE TYPE    SECTORS/UNIT    STARTING ADR    STARTING ADR
                              OF 2nd UNIT     OF 3rd UNIT
              (in decimal)    (in octal)      (in octal)
__________    ____________    ____________    ____________
RP04-RP05     152000.          112,,160000     224,,340000
RP06          304000.          224,,340000     450,,700000
RP07          502200.          365,,156000     752,,334000
RM03          121360.           73,,204000     166,,410000
RP20          201420.          611,,314000    1422,,630000
NOTE: RP20s will not be supported in Release 4. It is important to
remember that there are 1000 (octal) words per sector on an RP20.
As a result, to look at a sector of an RP20, one would multiply
the sector number by 1000 (octal) to find the FILDDT starting
address for that sector. For all other drive types there are
200 (octal) words per sector.
The above information is calculated from the parameters available in
STG.MAC.
REF: DDT41.MEM
TOPS-20 SCHEDULER TEST ROUTINES
-------------------------------
The following is a tabulation of (hopefully) all of the scheduler
tests used by the TOPS-20 monitor, time-frame approximately Release 3A.
This includes ARPA and DECNET tests. This is the data one finds in the
monitor table FKSTAT indexed by fork number for forks which have blocked
and left the GOLST (i.e. LH(FKPT) contains WTLST). The format of the
FKSTAT table words is TEST DATA,,TEST ROUTINE ADDRESS. The scheduler
test routines are called periodically to determine if a process can be
unblocked. This is indicated by a skip return from the scheduler test.
A nonskip return is taken if the process cannot yet be unblocked.
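Splitting an FKSTAT entry is simple halfword arithmetic; naming the
test needs the monitor's symbol table. In the Python sketch below the
symbols dictionary, mapping routine addresses to names, is a stand-in
for what MDDT shows when the right half is typed out symbolically, and
the address used for BATTST is invented.

```python
# Sketch: reading an FKSTAT entry (TEST DATA,,TEST ROUTINE ADDRESS).
HALF = 0o777777


def split_fkstat(word: int) -> tuple[int, int]:
    """FKSTAT word -> (test data, scheduler test routine address)."""
    return (word >> 18) & HALF, word & HALF


def describe_fkstat(word: int, symbols: dict[int, str]) -> str:
    """Name the test via SYMBOLS (address -> name) from your monitor's symbol table."""
    data, routine = split_fkstat(word)
    name = symbols.get(routine, f"{routine:o}")
    return f"{name} [{data:o},,{name}]"


if __name__ == "__main__":
    symbols = {0o123456: "BATTST"}        # hypothetical address, look up the real one
    print(describe_fkstat(0o000005_123456, symbols))   # BATTST [5,,BATTST]
```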
When examining the monitor because of a hung job or fork, the
FKSTAT table can often reveal the reason the fork is hung, and this
sometimes even allows corrective action to be taken.
The table below gives routine name, what you should expect to see
in the FKSTAT table, and the module in which the scheduler test is
defined, followed finally by a short description of what the particular
condition is which is being tested.
SCHEDULER TESTS
TEST CONTENTS OF T1 AT TIME OF SCHEDULER CALL DEFINED
---- ---------------------------------------- -------
BALTST [CONNECTION #,,BALTST] [NETWRK]
Wait for network bit allocation.
BATTST [UNIT #,,BATTST] [DSKALC]
Wait for US.BLK, the lock bit for the BAT blocks
on the unit, in the UDB to be zero.
BLOCKM [TIME,,BLOCKM] [SCHED]
Wait for TIME in BLOCKM format which is the low
order 17 bits of the desired future time to be
compared against a suitably masked TODCLK.
BLOCKT [TIME,,BLOCKT] [SCHED]
Wait for TIME in BLOCKT format which is a
value that is shifted left 10 bits and compared
against a suitably masked TODCLK, providing a
longer delay than BLOCKM, but less precision.
BLOCKW [TIME,,BLOCKW] [SCHED]
Wait for TIME in BLOCKW format (same as BLOCKM).
CDRBLK [UNIT NUMBER,,CDRBLK] [CDRSRV]
Wait for card-reader offline, or not waiting for
a card.
CHKLOK [ADDRESS,,CHKLOK] [NSPSRV]
Wait for NSP block lock at address to free.
COFTST [TIME,,COFTST] [MEXEC]
Wait for job in FKJOBN to be attached or time
in BLOCKT form to elapse.
DBWAIT [DTE #,,DBWAIT] [DTESRV]
Wait for the TO-10 doorbell from the given DTE.
DGLTST [0,,DGLTST] [DIAG]
Wait for DIAGLK lock to be free.
DGUIDL [UDB ADDRESS,,DGUIDL] [DIAG]
Wait for the unit to show as idle in the UDB.
DGUTST [UDB ADDRESS,,DGUTST] [DIAG]
Wait for the maintenance bit to set in the UDB.
DISET [ADDRESS,,DISET] [SCHED]
Wait for contents of ADDRESS to be zero.
DISGET [ADDRESS,,DISGET] [SCHED]
Wait for contents of ADDRESS to be positive.
DISGT [ADDRESS,,DISGT] [SCHED]
Wait for contents of ADDRESS to be greater than
zero.
DISLT [ADDRESS,,DISLT] [SCHED]
Wait for contents of address to be less than
zero.
DISNT [ADDRESS,,DISNT] [SCHED]
Wait for contents of ADDRESS to be non-zero.
DMPTST [COUNT,,DMPTST] [IO]
Wait for COUNT to be less than DMPCNT to indicate
dump mode buffers freed.
DSKRT [PAGE #,,DSKRT] [PAGEM]
Wait for CSTAGE for PAGE # to not be PSRIP,
meaning disk read completed.
DWRTST [PAGE #,,DWRTST] [PAGEM]
Wait for DRWBIT to clear in CST3(PAGE #),
meaning write completed.
ENQTST [FORK #,,ENQTST] [ENQ]
Wait for the lock on ENFKTB+FORK #.
FEBWT [ADDRESS OF FE UDB,,FEBWT] [FESRV]
Wait for EOF or input bytes available from FE.
Wake also on invalid assignment.
FEDOBE [ADDRESS OF FE UDB,,FEDOBE] [FESRV]
Wait for output buffer empty and all bytes are
acknowledged by the FE. Wake also if not a
valid assignment.
FEFULL [ADDRESS OF FE UDB,,FEFULL] [FESRV]
Wait for the current count of output bytes to be
less than the count of bytes in the interrupt
buffer. Wake also on invalid assignment.
FORCTM [SUPERIOR FORK INDEX,,FORCTM] [SCHED]
Identifiable wait forever, forced termination.
FRZWT [PREVIOUS TEST,,FRZWT] [FORK]
Identifiable wait forever, frozen fork.
HALTT [SUPERIOR FORK INDEX,,HALTT] [SCHED]
Identifiable wait forever for halted fork.
HIBERT [TIME,,HIBERT] [SCHED]
Wait for TIME in BLOCKT format.
HUPTST [<0:9>TIME<10:17>HOST #,,HUPTST] [NETWRK]
Wait for IMPHRT bit set for host or time out in
BLOCKW form.
IDVTST [0,,IDVTST] [IMPDV]
Wait for the lock on IDVLCK to free, lock it.
IMPBPT [0,,IMPBPT] [IMPDV]
Wait for IMPFLG nonzero, or IBPTIM timer to run
out, or IDVLCK lock free and output scan needed
for the IMP.
JB0TST [TIME,,JB0TST] [MEXEC]
Wait for JB0FLG set nonzero for explicit request
or time in BLOCKT form to elapse.
JRET [0,,JRET] [SCHED]
Wait forever, interruptible.
JSKP [0,,JSKP] [SCHED]
Unconditional skip used to schedule immediately.
JTQWT [0,,JTQWT] [SCHED]
Wait for JSYS trap queue.
LCKTSS [ADDRESS,,LCKTSS] [IO]
Wait for lock at ADDRESS to unlock, lock it.
LKDSPT [0,,LKDSPT] [STG]
Wait for room in LDTAB table of directories
currently locked.
LKDTST [INDEX INTO LDTAB,,LKDTST] [STG]
Wait for bit in LCKDBT to clear, indicating
directory unlocked.
LODWAT [ADDRESS OF STATUS WORD,,LODWAT] [LINEPR]
Wait for flag LP%LHC to set in the addressed
word, indicating loading has completed of the
VFU or RAM file.
LPTDIS [UNIT ADDRESS,,LPTDIS] [LINEPR]
Wait for an error condition on the addressed
unit, or for all buffers cleared and no bytes
still in the front-end, before finishing close
operation on the device.
MTARWT [IORB ADDRESS,,MTARWT] [MAGTAP]
Wait for IRBFA in the IORB to indicate that this
IORB is no longer active.
MTAWAT [UNIT #,,MTAWAT] [MAGTAP]
Wait for all outstanding IORBs for unit to be
finished.
MTDWT1 [UNIT #,,MTDWT1] [MAGTAP]
Wait for the count of outstanding requests on the
unit to go to one.
NCPLKT [0,,NCPLKT] [NETWRK]
Wait for lock NCPLCK to free, lock it.
NICTST [0,,NICTST] [PAGEM]
Wait for SUMNR less than or equal to MAXNR or
only one fork in BALSET.
NOTTST [<0:8>CONNECTION #<9:17>STATE,,NOTTST] [NETWRK]
Wait for connection to leave state.
NSPTST [0,,NSPTST] [NSPSRV]
Wait for KDPFLG nonzero, indicating KMC11 wants
service, or MSGQ nonzero, indicating messages to
process.
NVTNTT [<0:8>OPTION #,<9:17>LINE #,,NVTNTT] [TTNTDV]
Wait for completed NVT negotiation.
OFNLKT [OFN,,OFNLKT] [PAGEM]
Wait for OFN unlocked--SPTLKB zero in SPTH(OFN).
PIDWAT [FORK #,,PIDWAT] [IPCF]
Wait for bit for fork in PDFKTB to set.
SEBTST [0,,SEBTST] [SYSERR]
Wait for SECHKF to go nonzero before starting
Job 0 task to write queued SYSERR entries.
SEEALL [0,,SEEALL] [TTYSRV]
Wait for SNDALL to go to zero, indicating the
send-all buffer is available.
SPCTST [0,,SPCTST] [DTESRV]
Wait for a node.
SPMTST [0,,SPMTST] [PAGEM]
Wait for page in SPMTPG to be on SPMQ or the
time SPMTIM to expire.
SQLTST [0,,SQLTST] [IMPDV]
Wait for the special queues lock SQLCK and lock
it.
STRTST [SDB ADDRESS OF STRUCTURE,,STRTST] [MSTR]
Wait for the structure lock to be free.
STSWAT [ADDRESS OF STATUS WORD,,STSWAT] [CDRSRV]
Wait for flag CD%SHA to come on in the addressed
word, indicating that cardreader status has
arrived.
STSWAT [ADDRESS OF STATUS WORD,,STSWAT] [LINEPR]
Wait for flag LP%SHA to set in the addressed
word, indicating that printer status has
arrived.
SUSFKT [FORK #,,SUSFKT] [FORK]
Wait for fork to be on WTLST in either SUSWT
or FRZWT.
SWPRT [PAGE #,,SWPRT] [PAGEM]
Wait for CSTAGE for PAGE # to not be PSRIP,
meaning swap read completed.
SWPWTT [0,,SWPWTT] [PAGEM]
Wait for NRPLQ nonzero. Increment CGFLG each
time test is unsuccessful.
TCIPIT [FORK #,,TCIPIT] [TTYSRV]
Waits for no interrupts pending for FORK #.
TCITST [LINE #,,TCITST] [TTYSRV]
Wait for line inactive, no fork in input wait,
or input buffer non-empty.
TCOTST [LINE #,,TCOTST] [TTYSRV]
Wait for line inactive, or output buffer not
too full to add a character to it.
TRMTS1 [0,,TRMTS1] [FORK]
Identifiable wait forever for inferior fork termination.
TRMTST [FORK #,,TRMTST] [FORK]
Wait for FORK # to be on WTLST for either HALTT
or FORCTM.
TRP0CT [MINIMUM NRPLQ,,TRP0CT] [PAGEM]
Wait for NRPLQ to be above the stated minimum or
the normal minimum. Increment CGFLG each time
test is unsuccessful.
TSACT1 [LINE #,,TSACT1] [TTYSRV]
Wait until line inactive, becoming active, or
has a full length dynamic block assigned.
TSACT2 [LINE #,,TSACT2] [TTYSRV]
Wait for line available--inactive or fully
active.
TSACT3 [LINE #,,TSACT3] [TTYSRV]
Wait for line inactive--dynamic data unlocked.
TSTSAL [0,,TSTSAL] [TTYSRV]
Wait for SALCNT to go to zero, indicating the
send-all is finished for this buffer.
TTBUFW [NUMBER,,TTBUFW] [TTYSRV]
Wait for NUMBER of buffers.
TTIBET [LINE #,,TTIBET] [TTYSRV]
Wait for line inactive or input buffer empty.
TTOAV [LINE #,,TTOAV] [TTYSRV]
Wait for line inactive and output buffer not
empty.
TTOBET [LINE #,,TTOBET] [TTYSRV]
Wait for line inactive or output buffer empty.
UDITST [0,,UDITST] [PHYSIO]
Wait for at least two free IORBs on UIOLST.
UDWDON [IORB ADDRESS,,UDWDON] [PHYSIO]
Wait for IS.DON to set in IRBSTS for this IORB.
UPBGT [CONNECTION INDEX,,UPBGT] [IMPDV]
Wait for LTDF connection done flag to set, or
output buffers to appear.
USGWAT [0,,USGWAT] [JSYSA]
Wait for lock on queued USAGE blocks to free.
VVBWAT [UNIT #,,VVBWAT] [TAPE]
Wait for the MDA to reset TPVV handling EOV.
WATTST [<0:8>CONNECTION #<9:17>STATE,,WATTST] [NETWRK]
Wait for connection to be in state.
WTFKT [FORK #,,WTFKT] [FORK]
Wait for fork to be on WTLST.
WTSPTT [PAGE #,,WTSPTT] [SCHED]
Wait for share count on PAGE # to go to 1.
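Several of the tests above are simple predicates on a word, and some
pack two fields into the test data. The Python sketch below models
DISET, DISNT, DISGT and DISLT, with True standing for the skip return
that lets the fork be unblocked, and decodes the
<0:8>CONNECTION #<9:17>STATE data used by NOTTST and WATTST. It is an
illustration under stated assumptions (36-bit two's-complement words,
bit 0 leftmost), not the monitor's code; the same lh_field helper
would pull the host number out of HUPTST data as bits 10-17.

```python
# Sketch: a few of the tests above as Python predicates; True stands for the skip
# return that lets the scheduler unblock the fork. Words are 36-bit two's complement.
WORD = (1 << 36) - 1


def signed36(word: int) -> int:
    word &= WORD
    return word - (1 << 36) if word & (1 << 35) else word


TESTS = {
    "DISET": lambda contents: signed36(contents) == 0,
    "DISNT": lambda contents: signed36(contents) != 0,
    "DISGT": lambda contents: signed36(contents) > 0,
    "DISLT": lambda contents: signed36(contents) < 0,
}


def lh_field(data: int, first: int, last: int) -> int:
    """Bits FIRST..LAST of the 18-bit test data (bit 0 leftmost), e.g. <9:17>."""
    return (data >> (17 - last)) & ((1 << (last - first + 1)) - 1)


def connection_and_state(data: int) -> tuple[int, int]:
    """Test data of NOTTST/WATTST: <0:8>CONNECTION #<9:17>STATE."""
    return lh_field(data, 0, 8), lh_field(data, 9, 17)


if __name__ == "__main__":
    minus_one = 0o777777_777777
    print({name: test(minus_one) for name, test in TESTS.items()})
    print(connection_and_state((0o12 << 9) | 0o3))    # (10, 3)
```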
Known Hardware Deficiencies List
This is a collected list of known hardware characteristics which
show up from time to time as part of certain reported problems. This
says nothing about whether these characteristics are bugs or features,
or whether they will ever be fixed or changed, but merely attempts to
make them known internally.
1. DZ11 - Cannot set the speed to zero in the hardware, can only
turn off the receiver.
2. TM02 - Can generate bad parity, which it passes to memory,
causing system memory parity errors when the data is
referenced.
3. TM03 - A chip race condition has been known to occur in which a
function register has the wrong value because it has not settled.
This generates a device error which appears transient; i.e.,
typing a carriage return to DUMPER retries the read, which succeeds.
4. TM03 - ANSI ASCII was not included in the hardware format
modes.
5. TM03 - When using industry-compatible mode, reads whose length is
not a multiple of four bytes will produce strange results. The bytes
are counted, but the extra bytes are not written to memory, leaving
garbage.
6. DX20 - There is a race condition in which the DX20 generates an
interrupt request on channel 5 for some condition, but the code that
is manipulating the DX20 at the time handles the condition, so the
DX20 lowers its request. The KL, however, has latched the interrupt
and tries to process it, but no device responds, so it then tries
the 40+2n type of interrupt, which occasionally gives a PI5ERR.
7. VT100 - on a VT100 without the extended memory, one can confuse
the internal microprogram enough to have it clear sections of
the screen.
8. RH20 - perfectly willing to store bad parity data into memory.
9. DX20 - is unwilling to allow registers to be examined after it
has started I/O. Can cause register access errors if not
programmed in correct sequence.
10. LP20 - at least one of the printers fails to go off-line when
there is anything in the print line buffer, even if the drum
gate is opened.
11. KS-10 Front End - Rev. 3 exhibits problems with the KLINIK
line. If the link is in use, it is possible to lock out the
CTY. There are problems with the password check on subsequent
tries, and problems with line hang-up.
12. KS-10 Front End - Rev. 3 exhibits some problems with
powerfail restart. If the power returns in less than 3.5
seconds or so, the restart will hang. In addition, if Rev. 3
and Rev. 2 boards are mixed, there is no powerfail restart or
reload capability.
13. KS-10 Front End - there are more commands to the KS10> prompt
than often documented, and some typeins to the front end have
been known to hang the system, beyond even responding to ^\.
14. DX20/TU71 - the DX20 microcode does not set the 556 bpi density
correctly for TU71 (7-track) drives. This can be set
successfully from the maintenance panel.
15. TM03 - If an error occurs while rewinding, the monitor may be
left waiting for the rewind to complete, and the tape will be
unusable. The easiest way to clear this condition is to reset the
TM03, which the customer can most easily do by powering it down and
back up.
16. KS10 - During a forced reload, the halt status block is written
twice, first when halting and again when rebooting; the second write
wipes out any valuable data from the first. Once again, the 8080 is
responsible.
KS10 PROCESSOR CONSOLE INFORMATION
----------------------------------
CSL-COMMANDS CURRENTLY IMPLEMENTED (CSL V0.161)
^Z ;enter USER mode
^\ ;enter CONSOLE mode
MK XX ;Marks microcode location XX (sets bit 95)
UM XX ;Unmarks Microcode location XX
MB ;load only bootstrap of currently selected magtape
LA XX ;Load/set KS10 Memory Address
LI XX ;Load/set I/O address
LK XX ;Load/set 8080 address
LC XX ;Load/set CRAM address to be written/read
EM ;Examine KS10 Memory (last Memory location specified)
EM XX ;Examine KS10 Memory location XX
EN ;Examine Next (either from last EK, EM or EI)
EB ;Examine BUS and 8080 control registers
EI ;Examine I/O (last I/O address specified)
EI XX ;Examine I/O address XX
EK ;Examine 8080 location
EK XX ;Examine 8080 address XX
DM XX ;Deposit KS10 Memory last addressed, XX data
DN XX ;Deposit next (depending on last DK, DM or DI) XX data
DB XX ;Deposit BUS, XX data
DI XX ;Deposit I/O, XX data
DK XX ;Deposit 8080 location (only RAM locations stick)
MR ;MASTER RESET
CS ;CPU clock start
CH ;CPU clock halt
CP XX ;CPU clock pulse (XX=NR of pulses -- default 1 pulse)
SI ;Single Instruction
LF XX ;selects a set (0-7) of 12 bits of microcode (see note at end ****)
DF XX ;Deposit Field, write microcode bits according to last LF-command
EC ;Examine CRAM: current control register, no clocks, current location as address
EC XX ;Examine CRAM at address XX
DC XX ;Deposit CRAM, XX is at least 32 octal characters
EX XX ;EXecute KS10 instruction XX
ST XX ;STart KS10 at address XX
SM XX ;Start microcode at XX (SM 1 causes dump of HALT-status block !!)
;Default is 0 -- Start microcode
HA ;HALT KS10 (execute HALT-instruction -- causes microcode to
; write HSB and then to enter HALT-loop)
SH ;SHUTDOWN (deposit non-zero data in memory location 30)
; causing TOPS20 to shut down
CO ;Continue (causes microcode to leave HALT-loop)
PE X ;Parity Enable (0=disable, 7=enable all, 1=DRAM-par, 2=CRAM-par,
; 4=clock-par error stop)
CE X ;CACHE enable (0=OFF, 1=ON, <CR>=show current state)
TE X ;1 MSEC enable (0= OFF, 1=ON, <CR>=show current state)
TP X ;TRAPS enable (0=OFF, 1=ON (enables paging), <CR>=show current state)
LT ;Lamp Test, lights three lamps of front panel
RC ;Read CRAM direct, functions 0-17
; (no resets, no load diag adr, no CPU clock) (see note at end ****)
EJ ;Examine Jumps -- prints CRAM address signals (CURR, NXT, J, SUB)
TR XX ;TRACE - repeats CP and EJ commands till any character typed
;XX (if typed) is desired CRAM stop-address
PM ;Pulse Microcode (issue single CP and EJ)
ZM ;Zero KS10 MOS Memory (beware -- slow)
RP ;Repeat - repeats last command, or line of commands which it delimits
; Any character (except CNTRL-O) typed will stop repeat
;EXAMPLE: EM 0, EK 0, EC 0, RP will repeat execution of this line
BT ;Boot SYSTEM -- load CRAM from designated disk (see DS)
; via memory then load monitor boot from disk and start at 1000
BT 1 ;same as BT, but loads SMMON and starts at 20000
LB ;Load Bootstrap from designated disk (see DS)
LB 1 ;Load Bootstrap diagnostic monitor SMMON
DS ;Disk Select. Command prompts to specify
; UNIT NUMBER, RHBASE, and UNIBUS ADAPTER
; to load from when booting
MS ;Magtape Select. Command prompts to specify
; UNIT NUMBER, RH BASE, UNIBUS ADAPTER, SLAVE NUMBER, and DENSITY
; of magtape to boot from
MT ;Magtape Boot system from selected magtape
MT 1 ;BOOT diagnostic monitor SMMAG from magtape
PW ;clears KLINIK password, or sets it (6 characters max)
NOT IMPLEMENTED YET
***BC ;BOOT Check. PROM code which tests the basic 2020 system
; load path from the UNIBUS adaptor into the CRAM via memory.
CONTROL CHARACTERS
^U ;rub out current line
^O ;switch: first one stops CTY-output, second one resumes CTY-output
^S ;stops TTY-output and hangs the 8080 waiting for CONTROL-Q (see below)
^Q ;resumes TTY-output
^C ;stops whatever the 8080 is doing
RUB-OUT ;rub out previous character typed
NOTE: Several commands may be put on a single line, separated by commas.
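As a rough illustration of comma-separated command lines and of the PE argument (1 = DRAM parity, 2 = CRAM parity, 4 = clock-parity error stop, 7 = all), the following Python sketch splits a console line into (command, argument) pairs and decodes the PE mask. It assumes arguments are typed in octal. The function names are hypothetical, and this is not the 8080's actual command parser.

```python
# Hypothetical sketch: split a KS10 console line such as "PE 7, CE 1, TE 0"
# into commands, and decode the PE (parity enable) argument mask.

PE_BITS = {
    1: "DRAM parity",
    2: "CRAM parity",
    4: "clock-parity error stop",
}


def split_console_line(line):
    """Return a list of (command, argument-or-None) pairs.

    Several commands may share one line, separated by commas.
    Arguments are assumed to be octal numbers.
    """
    commands = []
    for part in line.split(","):
        fields = part.split()
        if not fields:
            continue  # tolerate empty pieces such as a trailing comma
        cmd = fields[0].upper()
        arg = int(fields[1], 8) if len(fields) > 1 else None
        commands.append((cmd, arg))
    return commands


def decode_pe(mask):
    """List the parity checks enabled by a PE argument (0-7)."""
    if not 0 <= mask <= 7:
        raise ValueError("PE argument must be 0-7")
    return [name for bit, name in PE_BITS.items() if mask & bit]


for cmd, arg in split_console_line("PE 7, CE 1, TE 0"):
    if cmd == "PE":
        print("PE enables:", decode_pe(arg))
```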
***** CRAM Bit Formats
LF-Command CRAM Bits          RC-Command CRAM Data
--------------------          --------------------
LF   CRAM bits                RC   Data
--   ---------                --   ------------------------------
 0   00-11                     0   CRAM bits 00-11
 1   12-23                     1   Next CRAM address
 2   24-35                     2   CRAM subroutine return address
 3   36-47                     3   current CRAM address
 4   48-59                     4   CRAM bits 12-23
 5   60-71                     5   CRAM bits 24-35 (Copy A)
 6   72-83                     6   CRAM bits 24-35 (Copy B)
 7   84-95                     7   0s
                              10   Parity bits A-F
                              11   KS10 bus bits 24-35
                              12   CRAM bits 36-47 (Copy A)
                              13   CRAM bits 36-47 (Copy B)
                              14   CRAM bits 48-59
                              15   CRAM bits 60-71
                              16   CRAM bits 72-83
                              17   CRAM bits 84-95
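The LF/DF pair can be modeled as ordinary field extraction and insertion on a 96-bit integer, where LF n selects CRAM bits 12n through 12n+11. The Python sketch below is illustrative only. It assumes DEC bit numbering (CRAM bit 00 is the most significant bit), and the function names are hypothetical.

```python
# Sketch: model the LF/DF field selection on a 96-bit CRAM word.
# Assumption: DEC bit numbering, so CRAM bit 00 is the most significant
# bit and bit 95 the least significant.

CRAM_BITS = 96
FIELD_BITS = 12


def lf_bit_range(lf):
    """Return the (first, last) CRAM bit numbers selected by LF 0-7."""
    if not 0 <= lf <= 7:
        raise ValueError("LF argument must be 0-7")
    first = lf * FIELD_BITS
    return first, first + FIELD_BITS - 1


def _shift(lf):
    # Distance from the least significant end of the word to the
    # highest-numbered (low-order) bit of the selected field.
    _, last = lf_bit_range(lf)
    return CRAM_BITS - 1 - last


def examine_field(cram_word, lf):
    """Return the 12-bit field that LF would select."""
    return (cram_word >> _shift(lf)) & 0o7777


def deposit_field(cram_word, lf, value):
    """Return cram_word with the LF-selected field replaced (like DF)."""
    if not 0 <= value <= 0o7777:
        raise ValueError("field value must fit in 12 bits")
    shift = _shift(lf)
    mask = 0o7777 << shift
    return (cram_word & ~mask) | (value << shift)
```

For example, `deposit_field(0, 0, 0o7777)` sets CRAM bits 00-11, which are the top 12 bits of the 96-bit integer.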
8080-CONSOLE-ERROR-CODES
------------------------
?BUS BUS polluted on power up
?BFO Input Buffer Overflow
?IL ILLEGAL Instruction
?UI Unknown Interrupt
?A/B A and B copies of CRAM bits did not match
?DNF Did Not Finish instruction
?BT device error or timeout during BOOT operation
?DNC Did Not Complete HALT
?PAR ERR reports a clock freeze due to a parity error,
and types out READ IO of 100, 303, 103
?MEM REFRSH ERR Memory Refresh Error (MEM BUSY stayed set too long,
because it didn't release data on a write to memory)
?CHK PROM checksum failed
?BC BOOT Check failed
?RUNNING attempted a command that may disrupt the running KS10
?NDA received No Data Acknowledge on memory request
?NXM referenced NoneXistent Memory location
?NBR Console was not granted BUS on a request
?RA command Requires Argument
?BN received Bad Number on input
?KA KEEP ALIVE failed
?FRC had a forced reload
?PWL Password Length error
?IA Illegal Argument (address out of range, etc.)
OTHER 8080 CONSOLE MESSAGES
---------------------------
BUS 0-35 message header for EB command
KS10> prompt message
CYC cycle type for DB command
SENT data sent to bus
RCVD data received on bus
HLTD message "HALTED/XXXXXX " where xxxxxx is data
BT SW message says BOOTING, using BOOT switch
OFF message, says this signal is off
ON message, says this signal is on
>>UBA? query for UNIBUS adapter
>>UNIT? query for unit to use
>>RHBASE? query for RH11 to use
>>DENS? query tape density
>>SLV? query tape slave
C CYC typed on DB-command if COM/ADR cycle blew
D CYC " " DATA cycle blew
8080-ERROR-Messages-during-BOOTING
----------------------------------
Disk:
On an error-condition, detected by the 8080, the
Fault-light will go on and a message of the form
?BT XXXYYY
will be printed on the CTY.
The following error-codes are only "rough" pointers; any of them can
be caused by any of the following problems:
Disk not a disk at all
Wrong unit selected (see DS-command)
Home blocks not readable or not there
Home blocks not set by SMFILE for 8080
8080 File-system garbage
XXX=001 Disk error encountered while trying to read HOME-blocks
XXX=002 Disk error encountered while trying to read the page of
pointers, which make up the "8080-File-System"
XXX=003 Disk error encountered while trying to read a page of
microcode
XXX=004 Disk error encountered while trying to read PRE-BOOT
YYY are the lower 8 bits of the 8080 address of the failing
"Channel Command List" operation. At this point it is normally a
good bet to do an "EI" to get the contents of the RH11 register
that has the error-bits set!
Magtape:
The following ERROR-messages can point to the following problem areas:
Selected device is not a magtape at all
Wrong unit selected (see MS-command)
Magtape is not bootable (no microcode, no PRE-BOOT)
XXX=001 Error trying to read microcode first page
XXX=003 Error trying to read additional pages of microcode
XXX=004 Error trying to read in PRE-BOOT program
YYY see above (disk-section)
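The two code tables above can be combined into a small decoder for the "?BT XXXYYY" message. This Python sketch assumes that XXX and YYY are each three digits and that YYY (the low 8 bits of the failing channel-command-list address) is printed in octal. The function and table names are hypothetical. The YYY value only locates the failing operation; the RH11 error bits still have to be read with EI.

```python
# Sketch: interpret a "?BT XXXYYY" boot error from the 8080 console.
# Assumptions: XXX and YYY are three digits each, and YYY (the low 8 bits
# of the failing channel-command-list address) is printed in octal.

DISK_CODES = {
    1: "disk error reading HOME blocks",
    2: "disk error reading the 8080 file-system pointer page",
    3: "disk error reading a page of microcode",
    4: "disk error reading PRE-BOOT",
}

TAPE_CODES = {
    1: "error reading first page of microcode",
    3: "error reading additional pages of microcode",
    4: "error reading PRE-BOOT",
}


def decode_bt(message, device="disk"):
    """Return (description, low 8 bits of the failing 8080 address)."""
    text = message.strip()
    if not text.startswith("?BT"):
        raise ValueError("not a ?BT message")
    digits = text[3:].strip()
    if len(digits) != 6 or not digits.isdigit():
        raise ValueError("expected ?BT XXXYYY")
    code = int(digits[:3])
    yyy = int(digits[3:], 8)  # raises ValueError on a non-octal digit
    table = DISK_CODES if device == "disk" else TAPE_CODES
    return table.get(code, "unknown code %03d" % code), yyy
```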
Error-messages-out-of-PRE-BOOT
PRE-BOOT is loaded from Disk or Magtape (see 8080 commands DS, MS,
BT, BT 1, MT, MT 1)
PRE-BOOT is written onto the disk using "SMFILE.EXE"; it is also written on
"standard" Diagnostic-tapes and onto the "MONITOR-INSTALLATION"-tapes.
PRE-BOOT is loaded by the 8080 into MEMORY-locations 1000 and up, and is
started at 1000. The ERROR-halts are:
1001 found "bad" core-transfer address
(page 1 is illegal - can't overload PRE-BOOT)
1003 No RH11 Base Address
1004 Magtape Skip failure
1002 all other failures
At ERROR-halt time the following MEMORY-Locations contain useful INFO:
Disk-Booting Magtape-Booting
------------ ---------------
100 "8080" disk-address Not used
101 Memory transfer address same
102 Index-pointer same
103 RPCS1-register MTCS1-register
104 RPCS2-register MTCS2-register
105 RPDS - register MTDS - register
106 RPER1-register MTER1-register
107 RPER2-register (RP06 only) Not used
110 RPER3-register Not used
111 UBA Page RAM loc 0 same
112 UBA-status register same
113 Version Nr. of PRE-BOOT same
Note: The Version Nr. of PRE-BOOT will be the same as the Version Nr.
of SMFILE. The "8080" disk-address is in the form " CYL SEC SURF ".
It is therefore possible to ask a customer with a PRE-BOOT failure
to type:
EM 77
EN,RP
...... then type any character after address 115 is displayed (to
stop the repeat), and tell us what he sees.
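When the customer reports the words read back, it helps to label them against the table above. The Python sketch below does that for disk or magtape booting and also names the error-halt location. It is illustrative only: the function name is hypothetical, and the values are assumed to be passed in as integers keyed by their octal address.

```python
# Sketch: label the PRE-BOOT error-halt information words (locations
# 100-113 octal) read back from the console, e.g. with EM 77 / EN,RP.

HALT_CODES = {
    0o1001: "bad core-transfer address (page 1 is illegal)",
    0o1002: "other failure",
    0o1003: "no RH11 base address",
    0o1004: "magtape skip failure",
}

DISK_NAMES = {
    0o100: "8080 disk address (CYL SEC SURF)",
    0o101: "memory transfer address",
    0o102: "index pointer",
    0o103: "RPCS1", 0o104: "RPCS2", 0o105: "RPDS",
    0o106: "RPER1", 0o107: "RPER2 (RP06 only)", 0o110: "RPER3",
    0o111: "UBA page RAM loc 0", 0o112: "UBA status",
    0o113: "PRE-BOOT version",
}

# Magtape booting uses the MT registers; None marks "not used".
TAPE_NAMES = dict(DISK_NAMES)
TAPE_NAMES.update({
    0o100: None, 0o103: "MTCS1", 0o104: "MTCS2", 0o105: "MTDS",
    0o106: "MTER1", 0o107: None, 0o110: None,
})


def label_halt(halt_loc, words, magtape=False):
    """Return a halt description plus (octal addr, name, octal value) rows.

    words maps addresses to the values read; unused locations are skipped.
    """
    names = TAPE_NAMES if magtape else DISK_NAMES
    rows = []
    for addr in sorted(words):
        name = names.get(addr)
        if name is not None:
            rows.append(("%o" % addr, name, "%o" % words[addr]))
    return HALT_CODES.get(halt_loc, "not a PRE-BOOT error halt"), rows
```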
8080-Communication-Area (KS10 Memory)
-------------------------------------
The 8080 maintains and services an in-core communication area.
Currently used are words 31 to 40. See PROKS.MAC for more info.
Word Nr. Meaning
---- --- -------
31 Keep Alive and Status word
32 KS-10 CTY input word (from 8080)
33 KS-10 CTY output word (to 8080)
34 KS-10 KLINIK user input word (from 8080)
35 KS-10 KLINIK user output word (to 8080)
36 BOOT RH-11 Base Address
37 BOOT Drive Number
40 Magtape Boot Format and Slave Number
Word 31 Keep Alive and Status word
---- --
Bit 4 Reload Request
Bit 5 Keep Alive active
Bit 6 KLINIK active
Bit 7 PARITY Error detect enabled
Bit 8 CRAM Parity Error detect enabled
Bit 9 DRAM Parity Error detect enabled
Bit 10 CACHE enabled
Bit 11 1 msec enabled
Bit 12 TRAPS enabled
Bit 20-27 Keep Alive counter field
Bit 32 BOOT SWITCH BOOT
Bit 33 POWER FAIL
BIT 34 Forced RELOAD
BIT 35 Keep Alive failed to change
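The bit numbers above use PDP-10 conventions, where bit 0 is the most significant of 36 bits and bit n has the value 2^(35-n). The following Python sketch decodes word 31 accordingly. The function names are hypothetical, and the sketch assumes the word is supplied as an integer such as one copied from an EM of location 31.

```python
# Sketch: decode the 8080 Keep Alive and Status word (KS10 location 31).
# PDP-10 bit numbering: bit 0 is the most significant of 36 bits, so
# bit n has the value 1 << (35 - n).

STATUS_FLAGS = {
    4: "reload request",
    5: "keep alive active",
    6: "KLINIK active",
    7: "parity error detect enabled",
    8: "CRAM parity error detect enabled",
    9: "DRAM parity error detect enabled",
    10: "cache enabled",
    11: "1 msec enabled",
    12: "traps enabled",
    32: "boot switch boot",
    33: "power fail",
    34: "forced reload",
    35: "keep alive failed to change",
}


def bit(n):
    """Value of PDP-10 bit n in a 36-bit word."""
    return 1 << (35 - n)


def decode_status(word):
    """Return (set flag names, keep-alive counter from bits 20-27)."""
    flags = [name for n, name in STATUS_FLAGS.items() if word & bit(n)]
    counter = (word >> (35 - 27)) & 0o377  # bit 27 is the field's low bit
    return flags, counter
```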
Word 32 KS-10 CTY input word (from 8080)
---- --
Bits 20-27 0 -- no action, 1 -- CTY character pending
Bits 28-35 CTY-character
Word 33 KS-10 CTY output word (to 8080)
---- --
Bits 20-27 0 -- no action, 1 -- CTY character pending
Bits 28-35 CTY-Character
Word 34 KS-10 KLINIK user input word (from 8080)
---- --
Bits 20-27 0 -- no action, 1 -- KLINIK character,
2 -- KLINIK active, 3 -- KLINIK carrier loss
Bits 28-35 KLINIK-Character
Word 35 KS-10 KLINIK user output word (to 8080)
---- --
Bits 20-27 0 -- no action, 1 -- KLINIK character, 2 -- Hangup request
Bits 28-35 KLINIK-Character
OUTPUT process KS10 ==> 8080
----------------------------
The KS10 loads the character and flag into word 33 and sets the 8080
interrupt. The 8080 examines word 33, gets the character, clears the
interrupt, sends the character to the hardware, clears word 33, and
sets a KS-10 interrupt.
INPUT process 8080 ==> KS10
---------------------------
The 8080 is interrupted with "TTY-char available", gets the character,
delivers it into the CTY input word (32) with flag(s), and sets a
KS-10 interrupt.
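The two processes can be walked through with a toy model. This Python sketch packs and unpacks the flag (bits 20-27) and character (bits 28-35) of words 32 and 33 and steps through the output and input sequences as described. It is a simulation for illustration only. The class and method names are hypothetical, and real interrupt handling on either processor is not modeled.

```python
# Hypothetical model of the CTY handshake through the 8080 communication
# area. Words 32 and 33 hold a flag in bits 20-27 (1 = character pending)
# and the character in bits 28-35, i.e. the low 16 bits of the word.

def make_word(flag, char):
    """Pack a flag (bits 20-27) and a character code (bits 28-35)."""
    return ((flag & 0o377) << 8) | (char & 0o377)


def split_word(word):
    """Unpack (flag, character code) from a communication word."""
    return (word >> 8) & 0o377, word & 0o377


class CommArea:
    """Toy model of words 32/33 and the two interrupt requests."""

    def __init__(self):
        self.mem = {0o32: 0, 0o33: 0}
        self.interrupt_8080 = False
        self.interrupt_ks10 = False
        self.sent_to_hardware = []

    def ks10_output(self, char):
        # KS10 side: load character and flag into 33, interrupt the 8080.
        self.mem[0o33] = make_word(1, ord(char))
        self.interrupt_8080 = True

    def service_output(self):
        # 8080 side: examine 33, take the character, clear the interrupt,
        # send the character on, clear 33, and interrupt the KS10.
        flag, code = split_word(self.mem[0o33])
        self.interrupt_8080 = False
        if flag == 1:
            self.sent_to_hardware.append(chr(code))
        self.mem[0o33] = 0
        self.interrupt_ks10 = True

    def tty_char_available(self, char):
        # 8080 side: deliver a typed character into the CTY input word (32)
        # with its flag, then interrupt the KS10.
        self.mem[0o32] = make_word(1, ord(char))
        self.interrupt_ks10 = True
```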
CRASH ANALYSIS
CRASH DUMPS
===========
Each time there is a BUGHLT, the system core image is automatically
dumped into PS:<SYSTEM>DUMP.EXE. If there is sufficient room on the
disk, SETSPD copies the data that was previously in DUMP.EXE into
DUMP.CPY after the system is reloaded. DUMP.CPY does not get deleted,
so you may find several generations of DUMP.CPY.
If auto-reload is not set, you can dump the crash by hand by typing
/D at the BOOT> prompt. To get into BOOT when reloading, bring the
system up from the switch register rather than pressing <ENABLE>
<DISK> on the console. See the Operators Guide for a discussion of
the meaning of the various switches on the DEC-20.
CRASH ANALYSIS
--------------
First, when analyzing software or software/hardware problems, be
sure you have the proper tools:
1. A SWSKIT on magtape
2. A full copy of the current release microfiche MONITOR and
EXEC.
3. A MONITOR CALLS REFERENCE MANUAL.
4. A SYSERR manual.
5. A listing of the SYSERR log, especially if hardware is
suspected.
6. The CTY output for BUGHLTs and BUGINFs or other problem
indications, or an accurate reproduction of this information.
7. Any other manuals you may need for reference such as the
proper version Installation Guide, Operators Guide, System
Managers Guide, etc.
8. A TOPS-20.BWR file.
You will need the SWSKIT and perhaps listings of the latest
versions of monitor modules in case the microfiche are not up to date.
FILDDT is on the customer's distribution tape.
Be sure you have analysed the SYSERR log. Be sure, also, that
you have looked up the BUGHLT and/or BUGCHKs in question in the
listings (microfiche) and have at least read the comments around them.
It is probably a good idea to trace how the BUG was reached. If you
happen to be without a GLOB (provided on microfiche) you can find the
BUGHLT tag of interest in the monitor as follows:
$GET <SYSTEM>MONITR.EXE
$ST 140
DDT
ILPP3? ; BUGHLT of interest followed by "?"
PAGEM G ; it is defined in PAGEM and is global
Some other useful bits of information: There is a GLOB listing
provided in the microfiche which contains a list of all the global
symbols in the monitor. Most of the symbols are defined in the module
STG.MAC. If you don't know a tag name but want to look at the storage
for DTEs, say, look through STG. STG also contains some small portions
of code, mostly to do with restart, start, auto reload, dispatches for
PI channels and a few scheduler tests. STG stands for storage. Note
that some things may be defined in PROLOG, and of course many others
are defined throughout the monitor. You may also want to get a listing
of MACSYM to be able to understand the macros you see while reading
the monitor listings; MONSYM is also useful at times. Be sure you
know whether and how PARAMS has been changed. See BUILD.MEM on the
distribution tapes for the currently distributed information on what
to do to change various system parameters in PARAM0.MAC. Be sure that
you know about any variables that the site may have changed in STG as
well.
EXAMINING THE MONITOR
---------------------
Debugging a complex, multi-process software system is largely a
matter of absorbing sufficient knowledge, experience and folklore
about the particular system with a considerable element of personal
preference, or 'taste' also involved. This document is a cursory
description of features built into the system to aid debugging, and
such folklore as can be described in written English.
There are four different versions of DDT that may be used to
examine the monitor. Each is used for a different purpose and has
special capabilities. The versions of DDT are:
1. UDDT (user DDT) used to examine or modify the MONITR.EXE
file.
2. MDDT (monitor DDT) used to examine or modify the running
monitor under timesharing.
3. EDDT (exec DDT) used to examine or modify the running monitor
from the CTY in a stand-alone mode.
4. FILDDT used to examine dumps.
All the DDT's are versions of TOPS-20 DDT documented in the
TOPS-20 DDT manual, and have all of the features described in the
manual. See also the document DDT41.MEM.
All four versions of DDT are used in the same way, as described
later; however, each version is started differently.
UDDT:
----
To use UDDT to modify the MONITR.EXE file on your system, you must
give the following EXEC commands:
@GET <SYSTEM>MONITR.EXE
@START 140 or on Release 4 systems, @DDT
This causes EDDT to start in user mode. This is the same DDT that is
used when examining any program. You may now look at or change any
part of the monitor. If you make changes to the monitor and want to
save it, you should get back to the EXEC by typing ^Z. Then you may
save the monitor.
You will probably have to be enabled in order to save the monitor
back in <SYSTEM>. This is the safest, best, and recommended method of
putting patches into the monitor.
MDDT:
----
A version of DDT which runs in monitor space is available. It
can examine and change the running monitor, and can breakpoint code
running as a process but not at PI or scheduler level. When patching
or breakpointing the swappable monitor, the normal write protection
must be defeated, either by setting DBUGSW to 2 on startup, or calling
SWPMWE. If you insert breakpoints with MDDT, remember monitor code is
reentrant and shared so that the breakpoint could be hit by any other
process in the system. In this event, the other process will most
likely crash since it will be executing a JSR to a page full of zeros.
To use MDDT you must have WHEEL or OPERATOR capabilities. You
first issue the EXEC command:
@ENABLE
$^EQUIT
; You are now in the mini-exec and receive a prompt
; of MX>. Now you give the "/" command:
MX>/
; You are now put into MDDT. To return to the EXEC
; you can issue a ^Z or a ^C which produces a
; message like "INTERRUPT AT 17372" and returns you
; to the mini-exec. If you type a ^P in MDDT you
; will get a message, "ABORT", and be returned to
; the mini-exec. If you once go into the mini-exec
; the CONTROL-P interrupt is enabled and typing this
; character will return you to the mini-exec. This
; is a good thing to use when debugging programs
; that do CONTROL-C trapping. From the mini-exec
; you may give either:
MX>S
; or
MX>E
; The S is filled out as START and the E as EXEC.
; both of these commands will return you to the
; EXEC. See the document EXEC-DEBUGGING.MEM for more
; about ^P and getting out of the EXEC to MX> and
; returning from MX> to either your copy of the EXEC
; or the system EXEC.
; You may also give the command:
MRETN$G
; From MDDT to return directly to the EXEC. While
; in MDDT you may examine any core location in the
; running monitor. You may also change any location
; in the resident monitor (done frequently by
; accident). If you wish to change any of the
; locations in the swappable monitor you must give
; the command:
CALL SWPMWE$X
; To write enable the monitor. After you have made
; your changes you must give the command:
CALL SWPMWP$X
; to write protect the monitor again.
MDDT may also be entered from process level via JSYS:
JSYS 777$X
or
MDDT%$X ; will enter MDDT from the context of the current process
If you wish to examine the system from the monitor context of an
inferior fork of the EXEC:
@ENA
$SDDT
DDT
JSYS 777$X
MDDT
To return to user context:
MRETN$G
Use SETMPG to map pages into this context. Page 677 has
traditionally been used for this, but any unused page may be used.
To make sure that the page is currently unused, type:
ADDRESS/ ? ; the question mark from DDT indicates that the
; page is nonexistent.
when the destination page has been found, set up AC2 as:
AC2/ ACCESS,,677000
If the page has its own SPT slot:
AC1/SPT INDEX
If the source page does not have its own SPT slot, it will belong to
either a file or process page table. It will be represented as an
index into this page table:
AC1/ SPT INDEX OF PAGE TABLE,,INDEX INTO PAGE TABLE
ACCESS = read and/or write access bits
Read/Write access = 140000 in LH
Therefore, to map a page, call with either:
AC1/SPT INDEX OF PAGE
AC2/140000,,677000
or
AC1/SPT INDEX OF PAGETABLE,,INDEX INTO PAGE TABLE
AC2/140000,,677000
and type:
CALL SETMPG$X
The page will then be mapped to page 677. In examining locations
677000-677777, you will be looking at the contents of the page.
If you desire to map another page into this slot, merely call SETMPG
again with arguments for the new page. You need not first un-map the
old page. However, when you are finished, page 677 should be
un-mapped in the following manner:
AC1/0
AC2/ACCESS,,677000
CALL SETMPG$X
WARNING:
Calling SETMPG incorrectly can crash the system. Be CAREFUL! Do not
use SETMPG on a time sharing system if a crash will cause bad
feelings.
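Because a slip in composing AC1 or AC2 is the usual way to crash the system with SETMPG, it can help to work out the halfword values before typing them. The Python sketch below builds LH,,RH words for the three cases described above: a page with its own SPT slot, a page reached through a page table, and the AC1=0 unmapping call. It only does the arithmetic. The function names are hypothetical, and the SPT indexes in the example are made up.

```python
# Sketch: compose the AC1/AC2 arguments for CALL SETMPG$X as full
# 36-bit words. LH,,RH means (LH << 18) | RH. Values are examples only.

READ_WRITE = 0o140000  # read/write access bits in the left half of AC2
MAP_PAGE = 0o677       # page traditionally used as the mapping window


def halfwords(lh, rh):
    """Build the 36-bit word LH,,RH."""
    if not (0 <= lh <= 0o777777 and 0 <= rh <= 0o777777):
        raise ValueError("halfwords must fit in 18 bits")
    return (lh << 18) | rh


def setmpg_args(spt_index=None, page_table=None, index=None,
                access=READ_WRITE, page=MAP_PAGE):
    """Return (AC1, AC2) for mapping a page.

    Give spt_index for a page with its own SPT slot, or page_table
    (SPT index of the page table) plus index (index into that table).
    Passing neither builds the AC1=0 call used to unmap the window.
    """
    if spt_index is not None:
        ac1 = spt_index
    elif page_table is not None:
        if index is None:
            raise ValueError("index into the page table is required")
        ac1 = halfwords(page_table, index)
    else:
        ac1 = 0
    ac2 = halfwords(access, page * 0o1000)
    return ac1, ac2


def fmt(word):
    """Format a word in LH,,RH octal form."""
    return "%o,,%o" % (word >> 18, word & 0o777777)
```

For instance, `fmt(setmpg_args(page_table=0o425, index=0o12)[0])` gives `425,,12`, which is what you would deposit in AC1.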
EDDT:
----
NOTE
Not to be confused with ^EEDDT command
to get into UDDT used with the command
processor. See separate document on
EXEC DEBUGGING for that.
To get into EDDT you must bring the system up using the
switch-register. See the DECSYSTEM-20 Operators Guide for a
discussion of switches. Go through the KLINIT dialog and when you get
the prompt BOOT>, respond with:
BOOT>/L
BOOT>/G141
The "/L" command causes the monitor to be loaded, but not started.
The "/G141" starts the monitor at location 141, which is a jump to
EDDT. You can use EDDT like UDDT under timesharing on the MONITR.EXE
file by giving the following commands:
$GET <SYSTEM>MONITR.EXE
$START 140
EDDT is linked into the monitor and is always there. You may also get
to EDDT from MDDT by issuing the following:
EDDT$G
from MDDT. This stops timesharing. To resume timesharing and/or get
back to MDDT, give the command:
MDDT$G ; back to MDDT
MRETN$G ; back to normal timesharing
Breakpoints may be inserted in the resident monitor with EDDT,
but not in the swappable monitor in general, because its pages may be
swapped out and be unavailable to EDDT. You can bring them in by
typing:
SKIP LOC$X ; where LOC is some address not in core
There are some locations in the monitor that are very useful when
using EDDT for debugging. They must be set before going on to start
the monitor.
They are:
EDDTF 1 keep EDDT in core when system comes up
0 delete DDT when system comes up (default)
DBUGSW 0 do not stop on BUGHLTs, crash and reload
1 stop on BUGHLTs (hit EDDT breakpoint)
2 write enable swappable monitor,
do not start up SYSJOB, and stop on
BUGHLTs. Also, it doesn't run CHECKD
automatically on startup.
DCHKSW 0 do not stop on BUGCHKs (default)
1 stop on BUGCHKs (hit EDDT breakpoint)
DINFSW 0 do not stop on BUGINFs (default)
1 stop on BUGINFs (hit EDDT breakpoint)
In addition, the symbol GOTSWM appears in the code just after the
swappable monitor is loaded. So, if you want to debug the swappable
part of the monitor, you must put a breakpoint at GOTSWM (to get the
swappable part in core) by typing:
GOTSWM$B
Then start the monitor by typing:
147$G
CALL SWPMLK$X
CALL SWPMLK locks the swappable monitor in core for debugging. You
must have more than 96K of core to give this command, since the
resident and swappable monitor together are larger than 96K. To start
up the monitor after you have gone into EDDT and set up your
breakpoints (remember that the last two breakpoints are used for
BUGHLT and BUGCHK), give the command:
147$G
or
SYSGO1$G
If you are in EDDT and DBUGSW is not 2 (that is, the monitor is write
protected), you can use the routines SWPMWE and SWPMWP to write enable
and write protect the monitor: type CALL SWPMWE$X in DDT.
FILDDT:
------
FILDDT is distributed on the customer software tape.
The following is a chewed-up FILDDT.HLP file.
GET(FILE) FILE-SPEC
Loads a file for DDT to examine. If you want to look at a monitor dump
saved in DUMP.CPY, you must type DUMP.CPY explicitly: FILDDT looks for
MUMBLE.EXE, not MUMBLE.CPY, so DUMP<ESC> will either tell you that
there is no such file or load DUMP.EXE. When looking at a dump and you
wish to load the symbols, you must first issue the LOAD command
followed by the GET command. Be sure that the file from which you get
the symbols is the same version as the dump. Be sure, also, that the
monitor that was dumped is the same monitor you use for symbols; that
is, don't use MONMED symbols with MONBCH, etc.
LOAD (SYMBOLS FROM) FILE SPEC
Reads specified file and builds internal symbol table. This must be
the first command to FILDDT before "GET" when looking at a dump. You
will most probably use <SYSTEM>MONITR.EXE which would have been the
monitor running at the time of the dump.
EXIT (FROM FILDDT)
Returns to command level. You then may type a save command if a load
command was just done to preload symbols. You will get a version of
FILDDT that has the symbols you just loaded in it so you no longer
need to "LOAD" symbols. You now have a monitor-specific FILDDT, which
was common practice for TOPS-10, but is not generally done for
TOPS-20.
HELP
Types something like this text.
ENABLE PATCHING
Allows writing on an existing file specified by a GET.
ENABLE DATA-FILE
Assumes file is raw binary (i.e. no ACs, and not an EXE file).
DDT FEATURES:
EP$U Sets monitor context for FILDDT mapping. EP is a symbol
which is equal to the page number of the EPT. (Rel 4)
<CTRL/E> Returns to FILDDT command level.
TRACKING DOWN UNMAPPED ADDRESSES:
The resident monitor may be looked at without any difficulties,
but the swappable monitor may not be in core at the time of the dump.
If the value of the symbol is in the swappable monitor, you must
sometimes go through the monitor map to find where the location really
is. The location MONCOR contains the number of pages of resident
monitor and the location SWPCP0 contains the first page of real core
for swapping. So if the value of the symbol is greater than the
contents of MONCOR times 1000 (octal), it is in the swappable monitor.
If the page of the swappable monitor you want to look at is in core, it
will probably not be at the location its address refers to, since the
dump is an image of physical core and no relocation of pages is done.
To find where a symbol really is in the dump, first type the symbol
followed by an "=". DDT will respond with the value of this symbol.
The value of the symbol can be divided into two fields of three octal
digits each. The high order three digits are the page number and the
low order three digits are the offset into the page.
If the value of the symbol is 324621 the high order three digits, 324,
are the page number and the low order three digits, 621, are the
offset into the page. To find the location of the page in question in
the dump you must look at the monitor map indexed by the page number.
For example:
MMAP+324/
would give you the monitor map word for page 324. This word contains
some protection bits for the page and the address of the page when the
dump was taken.
The page may have been in core, on the swapping area or on the disk at
the time of the dump.
If bits 14-17 in the monitor map word are non-zero the page
was on the swapping area or disk and is no longer available.
If bits 14-17 are zero then the page was in core, and the right half
of the word contains the page number in the dump of the page you are
looking for (the dump program overwrites the last several pages of
memory, the dump therefore does not contain these last pages.)
If the page was in core the new address of the symbol you are looking
for can be found by using the page number from the monitor map word
and appending the offset into the page to it. For example if MMAP+324
contains 104000,,256; then the new address of our symbol would be
256621.
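The lookup just described is plain octal arithmetic, sketched here in Python. In PDP-10 numbering, bits 14-17 of the map word are the low four bits of its left half. The sketch treats "at or beyond MONCOR pages" as swappable, which is one reading of the MONCOR rule above. Function names are hypothetical, and the inputs are assumed to be values you have already read out of the dump with FILDDT.

```python
# Sketch of the manual MMAP lookup described above. All numbers are octal.
# Bits 14-17 of a map word (PDP-10 numbering, bit 0 = MSB of 36) are the
# low four bits of the left half.

BITS_14_17 = 0o17 << 18


def split_address(value):
    """Split an 18-bit address into (page, offset): three octal digits each."""
    return value >> 9, value & 0o777


def is_swappable(value, moncor):
    """True if the address lies at or beyond the MONCOR resident pages."""
    return value >= moncor * 0o1000


def resolve(value, mmap_word):
    """Return the address of `value` within the dump, or None.

    None means bits 14-17 are non-zero: the page was on the swapping
    area or disk when the dump was taken and is not in the dump.
    """
    if mmap_word & BITS_14_17:
        return None
    _, offset = split_address(value)
    core_page = mmap_word & 0o777777  # right half: page number in the dump
    return (core_page << 9) | offset
```

With the example above, `resolve(0o324621, (0o104000 << 18) | 0o256)` yields `0o256621`.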
All addresses in the swappable monitor must be resolved in this manner.
In addition, addresses of 600000 and above are in the JSB or PSB (the PSB
is page 777) and must be resolved by finding the page containing the JSB
or PSB of the process that was running when the dump occurred. There
are some locations and tables in the monitor that make this easy:
NAME INDEX DESCRIPTION
FORKX none Number of the fork that was running at the time of
the dump, -1 if in the scheduler.
JOBNO In PSB Job number to which current fork belongs.
FKJOB Fork # Job number,,SPT index of JSB
JOBDIR Job # logged in directory number
JOBPT Job # controlling TTY number,,top fork number
FKSTAT Fork # test data,,address of fork wait routine
FKPGS Fork # SPT index of page table,,SPT index of PSB
SPT indexes are indexes into a share pointer table starting at SPT.
To find the PSB of fork 20, you first look at FKPGS+20. If this
location contains 425,,426, the word at SPT+426 is the pointer to the
PSB. This pointer can point to disk, swap area, or a page in the
dump. If bits 14-17 are zero it is a pointer to a page in the dump
and the right half of the SPT word is the page number of the PSB in
the dump.
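The PSB lookup can be sketched the same way. In this Python sketch the dump is modeled as a dict from address to 36-bit word. The FKPGS and SPT addresses are passed in because their real values come from the monitor's symbol table; the values used in any test are made up. A second helper translates a PSB address (page 777) into the dump address. Both function names are hypothetical.

```python
# Sketch of the PSB lookup for a fork in a dump. The dump is modeled as a
# dict from address to 36-bit word; FKPGS and SPT are parameters because
# their values come from the monitor's symbol table.

BITS_14_17 = 0o17 << 18


def psb_page_in_dump(dump, fkpgs, spt, fork):
    """Return the dump page holding the fork's PSB, or None if not in core."""
    entry = dump[fkpgs + fork]        # page-table SPT index,,PSB SPT index
    psb_spt_index = entry & 0o777777
    pointer = dump[spt + psb_spt_index]
    if pointer & BITS_14_17:
        return None                   # PSB was on disk or the swap area
    return pointer & 0o777777


def psb_address(psb_page, virtual):
    """Translate a PSB address (page 777) into an address in the dump."""
    if virtual >> 9 != 0o777:
        raise ValueError("not a PSB address")
    return (psb_page << 9) | (virtual & 0o777)
```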
When you look at a dump, you should first try to find why the dump
occurred by looking at the location BUGHLT. If BUGHLT is zero, you
should check the CTY log to find out why the dump was taken and for
information like the PC at the time of the dump and the status of the
PI system. If BUGHLT is non-zero, it is the address where the
BUGHLT was issued. You should look up the BUGHLT in BUGSTRINGS.TXT or
BUGS.MAC to find additional information about the BUGHLT. If at this
point you are not sure why the BUGHLT occurred, you will have to
look at the listings for more information. A copy of BUGSTRINGS.TXT
is in Appendix A of the Operators manual. You can find the location
of the call to the BUGHLT by typing the BUGHLT tag to DDT followed by
a "?". DDT will tell you which monitor module the BUGHLT is in, and you
can go to your microfiche and read all about the conditions
precipitating the BUGHLT.
Next, if necessary, look at FORKX. If it contains -1, the scheduler
was running; otherwise it is the number of the fork that was running
when the crash occurred. The registers are saved at BUGACS on a
BUGHLT, but if BUGACS+17 contains something,,BUGPDL+n, the
registers are invalid and you must go to the SYSERR buffer to get the
good registers. To find them, add the offset into the buffer for the
heading and ACs, SEBDAT+BG%ACS, to the right half of the SYSERR buffer
pointer, SEBQOU. This value points to a block of 16 (decimal) words
containing the user's ACs. You may have to chain down more than
one queued-up SYSERR entry to get to the BUGHLT block.
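The BUGACS check and the SYSERR address computation reduce to a range test and an addition. The Python sketch below shows both. The length of BUGPDL is not given in the text, so it is a hypothetical parameter, and any symbol values you plug in must come from the monitor being examined. Chaining down several queued SYSERR entries is not modeled.

```python
# Sketch of the BUGACS check and the SYSERR AC-block address computation.
# Symbol values (BUGPDL and its length, SEBDAT, BG%ACS) are hypothetical
# placeholders; take the real ones from the monitor being examined.

def bugacs_valid(bugacs_17, bugpdl, bugpdl_len):
    """BUGACS is unusable if AC 17 (right half) points into BUGPDL."""
    rh = bugacs_17 & 0o777777
    return not (bugpdl <= rh < bugpdl + bugpdl_len)


def syserr_ac_block(sebqou, sebdat, bg_acs):
    """Address of the 16-word AC block: RH(SEBQOU) + SEBDAT + BG%ACS."""
    return (sebqou & 0o777777) + sebdat + bg_acs
```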
NOTE
Do not forget to get a print out of the
SYSERR log which will give you and the
field service representative much of the
information you can get out of the dump.
The SYSERR output is much easier to
examine, however, clearly you cannot get
as much info as you can from a dump.
Some other locations in the PSB of interest are:
LOCATION DESCRIPTION
UAC User's ACs when he did his last JSYS.
PAC monitor's ACs
PPC processors PC
UPDL user's pushdown stack while in a JSYS
NSKED 0 = ok to run scheduler
>0 = cannot run scheduler
INTDF -1 = ok to receive software interrupts
>= 0 , cannot receive software interrupts
It may be useful to know the status of a fork when it is hung or you
are unsure of its status. This can be determined by looking at FKSTAT
indexed by the fork number. The right half of this location is the
address of a test routine and the left half is data to be tested. For
example if FKSTAT+12 contains 23,,FKWAT, then fork 12 is waiting for
fork 23 to complete. FKWAT is a routine that waits for another fork
to complete and its data (the left half of the word) is the number of
the fork it is waiting for. There are many different wait routines
and you will have to look at the code to see what individual ones are
waiting for. There is a memo on scheduler tests which details almost
all of the scheduler tests in the monitor.
You can easily determine all of the forks associated with a job
by giving the commands:
-1,,0$M
FKJOB<FKJOB+NFKS>N,,0$W
Where N is the job you are looking for. A fork structure can usually
be determined by looking at the FKSTAT of the forks and seeing which
forks are waiting on which forks. A FKSTAT of FKSKP indicates a fork
is inactive.
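The DDT commands above perform a masked word search: with the mask set to -1,,0, only left halves are compared, so every FKJOB entry whose left half is N matches. The following Python sketch performs the same search on tables modeled as lists indexed by fork number, then reads the FKWAT waits and FKSKP (inactive) forks out of FKSTAT. The FKWAT and FKSKP routine addresses are hypothetical parameters, and the function names are not monitor routines.

```python
# Sketch of the DDT masked word search "-1,,0$M" then
# "FKJOB<FKJOB+NFKS>N,,0$W", plus a FKSTAT scan for FKWAT waits.
# Tables are Python lists indexed by fork number; the FKWAT and FKSKP
# routine addresses are hypothetical parameters.

LEFT_MASK = 0o777777 << 18   # -1,,0: compare left halves only


def masked_search(table, target, mask=LEFT_MASK):
    """Indexes whose words match target under mask (like DDT $W)."""
    return [i for i, word in enumerate(table)
            if ((word ^ target) & mask) == 0]


def forks_of_job(fkjob, job):
    """Forks whose FKJOB entry has job number `job` in its left half."""
    return masked_search(fkjob, job << 18)


def wait_edges(fkstat, forks, fkwat):
    """(waiting fork, awaited fork) pairs from FKSTAT = fork,,FKWAT."""
    edges = []
    for f in forks:
        data, routine = fkstat[f] >> 18, fkstat[f] & 0o777777
        if routine == fkwat:
            edges.append((f, data))
    return edges


def inactive_forks(fkstat, forks, fkskp):
    """Forks whose FKSTAT routine is FKSKP (inactive)."""
    return [f for f in forks if (fkstat[f] & 0o777777) == fkskp]
```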
You should refer to STG.MAC for other fork and job tables and other
locations in the PSB and JSB of interest. All of the above locations
can be examined with MDDT or EDDT while the monitor is running. Of
course at these times you do not have to go through MMAP and the PSB
and JSB that are in core are your own.
There are two separate patch areas in the monitor (FFF and SWPF). FFF
is the resident patch area and SWPF is the swappable patch area. These
two symbols should be updated to point to the next free location in
the patch area when a patch is inserted. PAT.. is defined to be
equal to SWPF. By convention, all distributed patches are applied at
FFF. This reduces confusion, always works until the patch area is
exhausted, and leaves patches present in a dump for the cases where
that is important.
There are several general purpose routines that can be used to look at
the monitor while it is running. These routines should be used
with caution, since it is certainly possible to crash the monitor by
using them incorrectly. Two of the more general routines are MAPDIR,
for mapping a directory into core, and SETMPG, for mapping pages
(someone else's PSB or JSB) into core. You will have to look at the
listing for the exact use of these and other general routines. Be aware
of the precautions that should be taken when using them. You can find
the module they are located in by looking in the GLOB listing, which is
a cross-reference listing of all the global symbols in the monitor.
You get a GLOB listing in your microfiche.
BUGHLT, BUGCHK, BUGINF
------ ------ ------
The monitor contains a considerable number of internal redundancy
checks which generally serve to prevent unexpected hardware or
software failures from cascading into severely destructive reactions.
Also, by detecting failures early, they tend to expedite the
correction of errors.
There are two failure routines, BUGCHK and BUGHLT for lesser and
greater severity of failures. Calls to them with JSR are included in
code by use of a macro which records the locations and a text string
describing the failure. The general form is:
BUG (TYPE,NAME,<STRING>)
Where type is HLT or CHK, and string describes the cause.
For example,
BUG(HLT,SKDPFL,<PAGE FAULT FROM SCHEDULER CONTEXT>)
The strings are constructed during loading and are dumped into a file.
The BUGSTRINGS.TXT file provides an ordered listing of the bug
messages for operator or programmer use.
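To make the BUG macro format concrete, this Python sketch scans MACRO source text for BUG(TYPE,NAME,<STRING>) calls and prints a listing sorted by name. It only illustrates the idea of collecting the strings. It is not the monitor build procedure, and the real ordering and layout of BUGSTRINGS.TXT may differ. The regular expression assumes a three-letter type and a simple NAME made of letters, digits, ".", "%" or "$".

```python
# Sketch: collect BUG(TYPE,NAME,<STRING>) calls from MACRO source text and
# list them sorted by name. Illustration only; the real BUGSTRINGS.TXT
# ordering and layout may differ.
import re

BUG_CALL = re.compile(
    r"BUG\s*\(\s*([A-Z]{3})\s*,\s*([\w.%$]+)\s*,\s*<([^>]*)>\s*\)")


def bug_listing(source):
    """Return sorted (name, type, text) tuples for each BUG call found."""
    found = {(m.group(2), m.group(1), m.group(3).strip())
             for m in BUG_CALL.finditer(source)}
    return sorted(found)


def format_listing(entries):
    """One line per bug: name, BUGxxx type, message text."""
    return "\n".join("%-8s %s  %s" % (name, "BUG" + kind, text)
                     for name, kind, text in entries)


SOURCE = """
        BUG(HLT,SKDPFL,<PAGE FAULT FROM SCHEDULER CONTEXT>)
        MOVE 1,TODCLK
        CAML 1,CHKTIM
        BUG(HLT,J0NRUN,<JOB 0 NOT RUN FOR TOO LONG>)
"""

print(format_listing(bug_listing(SOURCE)))
```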
BUGCHK is used where the inconsistency detected is probably not fatal
to the system or to the job being run, or which can probably be
corrected automatically.
Typical is the sequence in MRETN in the SCHED module.
AOSGE INTDF
BUG(CHK,IDFOD2,<AT MRETN - INTDF OVERLY DECREMENTED>)
This BUGCHK is included strictly as a debugging aid. Detection of a
failure takes no corrective action. This situation usually results
from executing one or more excessive OKINT operations (not balanced by
a preceding NOINT). It is considered a problem because a NOINT
executed when INTDF has been overly decremented will not inhibit
interrupts and will not protect code changing sensitive data.
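To make the IDFOD2 failure concrete, here is a small Python model of the INTDF counter. It is only a sketch under stated assumptions, not monitor code: it assumes INTDF rests at -1, that NOINT increments it and OKINT decrements it, and that interrupts are inhibited only while INTDF is 0 or greater. The function mretn_bug_fires mimics AOSGE INTDF, which adds 1 and skips the BUG when the result is 0 or greater.

```python
"""Hypothetical model of the INTDF NOINT/OKINT counter (not monitor code)."""


class IntdfModel:
    def __init__(self) -> None:
        # Assumption: INTDF rests at -1, meaning interrupts are allowed.
        self.intdf = -1

    def noint(self) -> None:
        # Assumption: NOINT increments INTDF.
        self.intdf += 1

    def okint(self) -> None:
        # Assumption: OKINT decrements INTDF.
        self.intdf -= 1

    def inhibited(self) -> bool:
        # Interrupts are held off only while INTDF >= 0.
        return self.intdf >= 0


def mretn_bug_fires(intdf: int) -> bool:
    """AOSGE INTDF adds 1 and skips the BUG if the result is >= 0."""
    return intdf + 1 < 0


balanced = IntdfModel()
balanced.noint()
print(balanced.inhibited())           # True: a matched NOINT protects the code
balanced.okint()

broken = IntdfModel()
broken.okint()                        # one OKINT too many
print(mretn_bug_fires(broken.intdf))  # True: IDFOD2 would be reported
broken.noint()                        # INTDF climbs from -2 only to -1
print(broken.inhibited())             # False: this NOINT protects nothing
```

The last line is the danger the text describes: the NOINT runs, but the code after it is still interruptible.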
BUGHLT is used where the failure detected is likely to preclude
further proper operation of the system, or where file storage might be
jeopardized by attempted further operation. For example, the
following appears in the SCHED module:
MOVE 1,TODCLK ;CURRENT TIME
CAML 1,CHKTIM ;TIME AT WHICH JOB0 OVERDUE
BUG(HLT,J0NRUN,<JOB 0 NOT RUN FOR TOO LONG>)
This check accomplishes two things:
1. A function of JOB 0 is to periodically update the disk
version of bittables, file directories and other
files. Absence of this function would make the system
vulnerable to considerable loss of information on a
crash which loses core and swapping storage. JOB 0
protects itself against various types of malfunction;
this BUGHLT detects any failure resulting in a hangup.
2. It detects whether the entire system has become hung due
to failure of the swapping device or some such event, on
the basis that if JOB 0 isn't running, nobody's
running.
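The J0NRUN test is a watchdog. CAML skips the next instruction when the AC is less than the memory operand, so the BUGHLT is reached only once TODCLK has reached CHKTIM. The Python sketch below models that comparison. How CHKTIM is advanced, the allowance value, and the millisecond units are assumptions made for illustration; they are not taken from the monitor sources.

```python
"""Hypothetical watchdog model of the J0NRUN check (not monitor code)."""

# Assumed allowance and units; the real monitor values are not shown here.
ALLOWANCE_MS = 60_000


class Job0Watchdog:
    def __init__(self, now_ms: int) -> None:
        self.chktim = now_ms + ALLOWANCE_MS  # time at which JOB 0 is overdue

    def job0_ran(self, now_ms: int) -> None:
        # Assumption: each run of JOB 0 pushes the deadline forward.
        self.chktim = now_ms + ALLOWANCE_MS

    def scheduler_check(self, todclk_ms: int) -> str:
        # MOVE 1,TODCLK / CAML 1,CHKTIM: skip the BUG while TODCLK < CHKTIM.
        if todclk_ms < self.chktim:
            return "ok"
        return "BUGHLT J0NRUN"


dog = Job0Watchdog(now_ms=0)
print(dog.scheduler_check(30_000))   # ok
dog.job0_ran(now_ms=50_000)
print(dog.scheduler_check(100_000))  # ok (deadline is now 110000)
print(dog.scheduler_check(110_000))  # BUGHLT J0NRUN
```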
NOTE
For Release 4, the form of the BUGxxx
calls has been changed, and the new file
BUGS.MAC collects useful information on
each of the BUGxxx calls in one place.
It should be considered a required
debugging file.
DBUGSW:
A monitor cell, DBUGSW, controls the behavior of BUGHLT and BUGCHK
when they are called. DBUGSW is set according to whether the system
is attended by system programmers.
If C(DBUGSW)=0, the system is not attended by system programmers, so
all automatic crash handling is invoked. BUGCHK will return +1
immediately, appearing effectively as NOP. BUGHLT will, if called
from the scheduler or at PI level, invoke a total reload from the disk
and a restart of the system. BUGCHK and BUGINF output will appear on
the CTY and in the SYSERR log when JOB 0 gets around to it.
If the system continues to run or is restarted properly, the location
of the bug (saved over a reload) and its message will be reported on
the CTY.
If C(DBUGSW).NEQ.0, the system is attended, and one of the EDDT
breakpoints will be hit. This allows the programmer to look for the
bug and/or possibly correct the difficulty and proceed. There are two
defined non-zero settings of DBUGSW, 1 and 2, which have the following
distinction.
C(DBUGSW) = 1
Operation is the same as with 0 except for breakpoint
action. In particular the swappable monitor is write
protected and SYSJOB is started at startup as
described.
C(DBUGSW) = 2
Used for actual system debugging. The swappable
monitor is not write protected, so it may conveniently
be patched or breakpointed, and the SYSJOB operation
is not started, to save time.
BUGCHK and BUGHLT procedures are the same as for 1.
The following is a summary of DBUGSW settings:
                        0               1               2
MEANING                 Unattended      Attended        Debugging
BUGCHK action           NOP             Hit Breakpoint  Hit Breakpoint
BUGHLT action           Crash System    Hit Breakpoint  Hit Breakpoint
SWPMON write protect?   Yes             Yes             No
CHECKD on startup       Yes             Yes             No
Other console functions:
In addition to EDDT, several other entry points are defined as
absolute addresses. The machine may be started at these as
appropriate.
140 JRST EDDT ; go to EDDT
141 JRST SYSDDT ; reset and go to EDDT
142 JRST EDDT ; copy of EDDT address
143 JRST SYSLOD ; initialize file system
144 0
145 JRST SYSRST ; restart
146 JRST SYSGOX ; reload and start
147 JRST SYSGO1 ; start
The soft restart (address 145, EVRST) restarts all I/O devices, but
leaves the system tables intact. If it is successful, all jobs and all
(or all but one) processes will continue in their previous state
without interruption. This may be used if an I/O device has
malfunctioned and not recovered properly. The total restart
initializes core, swapping storage and all monitor tables.
A very limited set of control functions for debugging purposes has
been built into the scheduler. To invoke a function, set the
appropriate bit or bits in location 20 via MDDT. The word is scanned
from left to right (JFFO), and the first 1 bit found selects the
function.
BIT 0:
Causes scheduler to dismiss current process if any and stall
(execute a JRST .), with -1 in AC0. Useful to effect a clean
manual transfer to EDDT. System may be resumed at SCHED0.
BIT 1:
Causes the job specified by data switch bits 18-35 to be run
exclusively. Temporarily defeats the JOB 0 not run BUGHLT (J0NRUN).
BIT 2:
Forces running of JOB 0 backup function before halting the
system.
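The left-to-right scan of location 20 amounts to finding the leftmost 1 bit, which is what JFFO does. The Python sketch below assumes a 36-bit word with PDP-10 bit numbering (bit 0 is the most significant bit). The function names and the short descriptions in FUNCTIONS are illustrative only.

```python
"""Sketch of selecting a scheduler debug function from location 20."""

WORD_BITS = 36

FUNCTIONS = {
    0: "dismiss current process and stall (JRST .), -1 in AC0",
    1: "run job in data switches 18-35 exclusively",
    2: "run JOB 0 backup function, then halt",
}


def bit(n: int) -> int:
    """Mask for PDP-10 bit n, where bit 0 is the leftmost bit."""
    return 1 << (WORD_BITS - 1 - n)


def jffo(word: int):
    """Number of the leftmost 1 bit, or None for a zero word (no jump)."""
    word &= (1 << WORD_BITS) - 1
    if word == 0:
        return None
    return WORD_BITS - word.bit_length()


def select_function(loc20: int):
    n = jffo(loc20)
    return None if n is None else FUNCTIONS.get(n)


print(jffo(bit(1) | bit(2)))  # 1: bit 1 wins because it is further left
print(f"{bit(0):012o}")       # 400000000000
```

Because only the first 1 bit counts, setting several bits at once selects just the leftmost function.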
BUG'TYP MACRO CHANGES FOR VERSION 4 OF TOPS-20
Version 4 of TOPS-20 will include some changes in the BUG code
generation. The purpose of these changes is to generate a document
describing the TOPS-20 BUGCHKs, BUGHLTs, and BUGINFs that is more
descriptive than the previous BUGSTRINGS.TXT file.
The logistics of this change include moving the BUG definitions out of
the monitor source listings and into a central source file. This
source file will serve both as the definition file for the bugs and as
documentation for the BUGS. This file is called BUGS.MAC and will be
distributed to all sites on the distribution tape. These BUGS are
still referenced in the source module where the bug is invoked but
they are defined in BUGS.MAC.
This involves a modification to the old BUG macro and a new macro
called DEFBUG. The BUG macro appears in the source modules and the
DEFBUG macro appears in BUGS.MAC.
The format of the new BUG macro is as follows:
BUG (BUGNAM,<<x1,des1>,<x2,des2>...>)
This is placed in the monitor code where the BUG called BUGNAM is to
occur. This macro invokes a macro named 'BUGNAM', which generates an
XCT BUGNAM, where the contents of location BUGNAM is a JSR BUG'TYP.
Following location BUGNAM are the accumulators to be printed (one AC
per word), followed by SIXBIT/BUGNAM/. The accumulators to be printed
are defined with the DEFBUG macro; the locations specified in the BUG
macro are for documentation only.
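The words that follow location BUGNAM can be pictured with the Python sketch below. The SIXBIT packing (six characters, each stored as its ASCII code minus 40 octal in six bits, left-justified and padded with spaces) is the standard PDP-10 encoding; the symbolic rendering of the other words and the function names are illustrative assumptions. GIVTMR and T2 come from the BUGS.MAC excerpt later in this article.

```python
"""Illustrative layout of the words at location BUGNAM (not MACRO output)."""


def sixbit(name: str) -> int:
    """Pack up to six characters into a 36-bit SIXBIT word."""
    name = name.upper()
    if len(name) > 6:
        raise ValueError("SIXBIT holds at most six characters")
    word = 0
    for ch in name.ljust(6):          # pad on the right with spaces
        code = ord(ch) - 0o40         # SIXBIT code = ASCII code - 40 octal
        if not 0 <= code <= 0o77:
            raise ValueError(f"{ch!r} has no SIXBIT code")
        word = (word << 6) | code
    return word


def bug_block(name: str, bug_type: str, acs: list[str]) -> list[str]:
    """Symbolic contents of BUGNAM, BUGNAM+1, ... for one BUG."""
    words = [f"JSR BUG{bug_type}"]                     # executed by XCT BUGNAM
    words += [f"AC {ac} to be printed" for ac in acs]  # one AC per word
    words.append(f"SIXBIT/{name}/ = {sixbit(name):012o}")
    return words


for line in bug_block("GIVTMR", "INF", ["T2"]):
    print(line)
# JSR BUGINF
# AC T2 to be printed
# SIXBIT/GIVTMR/ = 475166645562
```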
Accompanying this BUG macro is a DEFBUG macro which is placed in the
file BUGS.MAC. This entry completely defines the BUG, including its
type (BUGHLT, BUGCHK, or BUGINF) and documentation.
The format of the DEFBUG macro is:
DEFBUG (TYP,TAG,MOD,WORD,STR,LOCS,HELP)
For a description of the arguments to this macro see the SWSKIT
article called BUGS.MEM.
In order to make listings (output from MACRO or CREF) more informative
than before, the BUG macro causes the short description to be
displayed in the listing where the BUG macro is called. Also, the
flavor of bug (INF, CHK, or HLT) and whether it is hardware or
software related will be displayed in the listing. Hence the OVRDTA
bug would appear in the listing as
BUG(OVRDTA)
;BUG Type: hardware-related BUGINF
;BUG description: PHYSIO - OVERDUE TRANSFER ABORTED
When fully documented, the BUGS.MAC file will be extremely useful
for specialists. It will describe, in one convenient place, what the
additional data printed on the console is, what caused the bug, and
what the site or specialist should do if that particular bug occurs.
Here is a section of the current BUG definition/documentation for the
BUG GIVTMR from BUGS.MAC:
DEFBUG(INF,GIVTMR,JSYSA,SOFT,<GIVOK TIMEOUT>,<<T2,FUNC>>,<
Cause: The access control job has not responded with a GIVOK within
the designated time period.
Action: If this consistently happens with the same function code, you
should see if the processing of the function can be made
faster.
If there is no obvious function code pattern, you may need to
increase the timeout period or rework the way in which the
access control program operates.
Data: FUNC - the GETOK function code
>)
INF specifies the bug is a BUGINF. GIVTMR is the name of the bug.
JSYSA is the module that the bug would occur in. SOFT specifies that
it is likely the bug is caused by a software bug. <GIVOK TIMEOUT> is
the bug string. <T2,FUNC> specifies the data that will be printed on
the operator's console. The initial spec called for the descriptor
FUNC to be included in the operator's message but at this time, this
descriptor is just for source documentation.
The blurbs following the initial line of the BUG definition attempt to
describe to the specialist, in a more detailed manner than the
description printed on the console, what it means when this bug occurs
and what should be done first in order to resolve the situation. In
this case the Action is to examine the processing of the GETOK
function given by the additional data FUNC, since that processing is
getting hung up.
Sometimes, the ACTION will state to call the hot line or to submit an
SPR. These descriptions will help the specialist be more informed
about the bugs which may occur at one of their sites and save them the
time of calling the hot line or searching through the source module
for an idea of the problem.
MONITOR BUILDING HINTS
======================
1. GENERAL
=======
Judging from the number of requests for help on this subject, the
chances are that you will be required to rebuild a monitor sometime
during your career as a Software Specialist. The reasons are quite
simple. There are customers who simply want functionality other than
that provided by stock monitors. There are also those who are
experiencing performance problems. We cannot forget the sales folks.
It is not unusual to have to rebuild a monitor in order to run a
benchmark. A very common example is increasing the OFN area. Another
quite common requirement is to increase the patch area (FFF). Doing
either of these and simply submitting a build control file will often
produce a bad monitor.
We will talk about PSECTS in relation to the Monitor's address space
but will make no attempt to define what they do. A good detailed
discussion on the Monitor's address space is on pages 2-62 to 2-73 in
the Release 4 Update Manual. Also there is a memo on the Monitor's
address space in the SWSKIT.
2. BACKGROUND
==========
In V3A, all of the Monitor was in the same address space, and space
was at a premium. As a result, some PSECTS were allowed to overlap. So
critical was the space requirement that attempts to increase the OFN
area or FFF usually resulted in the overlapping of PSECTS other than
the ones permitted. Therein lies the problem: the Monitor produced
from such a process would ordinarily be useless.
With the development of V4, the space requirement became more
critical. The Symbol Table became the object of concern: it required
a large number of pages, yet under normal conditions it is used only
infrequently. Hence the Engineering folks were of the opinion that it
should be completely eliminated. We objected; it would be a nightmare
to try to debug the monitor without symbols. It thus became our
project to somehow keep the Symbol Table while conforming to the
space restrictions. We decided to remove the Symbol Table and place
it in an alternate address space. It should be noted that this change
does not adversely affect system performance. With this change, the
build procedure and the monitor's address space were reorganized.
3. BUILD PROCEDURE
===============
Outlined below are some steps to guide you when rebuilding a monitor.
Bear in mind that this is a guide and might not account for all
unusual situations. This guide, however, coupled with your experience
and common sense, will most likely do the trick. PLEASE READ THIS
ENTIRE MEMO BEFORE ATTEMPTING TO REBUILD YOUR MONITOR. Also, please
read the build BEWARE file that is on the Installation tape.
NOTE: The customer's Distribution Tape will have all the files needed
to rebuild the monitor. All TOPS-20 modules will be in
TOPS20.REL (or T2020.REL, etc.). The control file is TOPS20.CTL
(or T2020.CTL, etc.). The link file will be NAME.CCL, where
"NAME" depends upon which monitor is being used (it could be
2020, ARPA, etc.). For the 2040/50, it is called LNKSCH.CCL. In
any case, the TOPS20.CTL file will have the name. The files you
will change will be one of the PARAM files and/or STG.MAC. It
should be noted that the special LINK.EXE and MACRO.EXE needed
to build V3A are not required under V4.
If you have the time, it is not a bad idea to use all the
standard files and build yourself a "vanilla" monitor. This
will test the procedure and files and reveal any problems
peculiar to the build itself. Once these are resolved, any
problems encountered when you are rebuilding your modified
monitor will be related to the change itself. The time for the
debugging phase can thus be reduced substantially.
STEP 1 Restore all files needed from <4-SOURCES>. This will
usually contain the monitor modules (TOPS20.REL file),
all needed source files, all build control, command
and log files.
STEP 2 Carefully make the source changes as needed.
STEP 3 Examine the TOPS20.CTL file. This file will usually
have logical name definitions and TAKE commands along
with other things. Also look at all referenced command
files.
STEP 4 Examine the corresponding log file. This will show
what the result of the original build procedure was.
It should therefore be used as a template against
which to judge the validity of the new Monitor. Pay
special attention to the section which shows the PSECT
layout at the end of the BUILD procedure. This shows
the start location, the end location and the amount of
free space between each PSECT. The file used by LINK
to set up the PSECTS is called LNKSCH.CCL. You should
look at this file to get an idea of what's happening.
STEP 5 Now edit the control and command files as necessary to
reflect your environment. This will mean, among other
things, changing or eliminating logical name
definitions. Do NOT change the order of the PSECTS in
the LNKSCH.CCL file. Also do not change the starting
value for any PSECT. The starting value is the value
given to the /SET: switch.
STEP 6 Submit the control file with the /TAG:SINGLE switch.
Ensure that the control file is correct and accurately
reflects the logical name definitions and the .CCL
file. This portion of the .CTL file also has the
commands necessary to compile the changed module.
STEP 7 When the job ends, examine your log file. Correct any
compilation or missing-file errors and go back to
STEP 6. Continue with STEP 8 only after all errors are
eliminated.
STEP 8 At this point you should have a MONITR.EXE. Now
examine the section in the log file which gives an
outline of the PSECTS. If any PSECTS overlap, a
message will say so. If there are no overlap
messages, go to STEP 11. NOTE: There are some
instances where PSECTs can overlap: the POSTCD and
SYVAR PSECTs are allowed to overlap any xxxVAR PSECT.
This does not gain very much storage (4 pages, to be
exact). If you follow the build procedure, overlapping
PSECTs are not allowed and therefore must be resolved.
You are once again advised NOT to reorganize the
monitor's address space.
STEP 9 Start with the first overlap. Figure out the number
of words by which the first PSECT overlaps the PSECT
that follows it. Now add this value to the start
location of the overlapped PSECT. The result may well
be a location within a page, i.e., an address of the
form 125300, where the page number is 125 and the
offset into the page is 300. The starting address of
many PSECTs is required to be on a page boundary,
i.e., an address of the form 126000. A good rule to
follow is: IF THE PSECT STARTED ON A PAGE BOUNDARY
BEFORE THE BUILD, THEN KEEP IT ON A PAGE BOUNDARY.
This means that you may be required to add an
additional value to round up to the next page. For
example, the 125300 value would be rounded to 126000
if the PSECT is required on a page boundary. The
PSECT sequence and starting values are in the
LNKSCH.CCL file. NOTE: the values are all given in
OCTAL, so add in OCTAL.
STEP 10 EDIT the LNKSCH.CCL file to reflect this new start
value for the overlapped PSECT. Go back to STEP 6.
Repeat these steps until there are no more error
messages. Note that changing the start location of the
overlapped PSECT can cause it to overlap the PSECT
that follows it, and the same procedure must be
followed to resolve any conflicts. Of course, you must
be careful to ensure that you do not outgrow the
monitor's address space. A total of the lengths of
all PSECTs will tell you if the Monitor is too large.
STEP 11 At this point you should have a good Monitor. Save it
in the proper directory. The final test is getting it
up and running.
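STEPS 9 and 10 boil down to octal arithmetic on the PSECT table. The Python sketch below applies that rule to a list of PSECTs in LNKSCH.CCL order. It moves each overlapped PSECT to the end of the one before it, keeps formerly page-aligned PSECTs on a page boundary (a page is 1000 octal words), and lets the fix cascade to the following PSECTs. The PSECT names and sizes in the example, and the 1000000 octal address-space limit, are hypothetical, and the permitted POSTCD and SYVAR overlaps are not modeled.

```python
"""Sketch of the STEP 9/10 overlap fix (illustrative, not a LINK tool)."""

PAGE = 0o1000  # 512 words per page


def round_up_to_page(addr: int) -> int:
    return (addr + PAGE - 1) // PAGE * PAGE


def resolve_overlaps(psects: list[tuple[str, int, int]],
                     limit: int = 0o1000000) -> list[tuple[str, int]]:
    """psects holds (name, start, length) in order; returns (name, new start)."""
    result = []
    prev_end = None
    for name, start, length in psects:
        new_start = start
        if prev_end is not None and prev_end > start:
            overlap = prev_end - start      # words of overlap
            new_start = start + overlap     # i.e. the previous PSECT's end
            if start % PAGE == 0:           # page-aligned before: keep it so
                new_start = round_up_to_page(new_start)
        result.append((name, new_start))
        prev_end = new_start + length
    if prev_end is not None and prev_end > limit:  # assumed limit
        raise ValueError(f"monitor too large: ends at {prev_end:o}")
    return result


layout = [("A", 0o100000, 0o25300), ("B", 0o120000, 0o2000), ("C", 0o127000, 0o1000)]
for name, start in resolve_overlaps(layout):
    print(f"{name}: new start {start:o}")
# A: new start 100000
# B: new start 126000
# C: new start 130000
```

Moving B to 126000 (the 125300 case from STEP 9) pushes its end to 130000, which is why C has to move as well.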
EXEC DEBUGGING
--------------
Now that most SWS have microfiche of the released EXEC and MONITOR, I
anticipate questions on looking at the EXEC and MONITOR. Here is a
cursory tutorial on investigating the internals of the EXEC (or
command processor, if you prefer). The examples are intended to be a
guide, and although the typein is correct, the response may not be
character perfect. You are advised to read the other chapters in this
document for more information on DDT and MONITOR snooping and
debugging.
LOOKING AT THE EXEC WITH DDT
============================
You can look at either the running system EXEC or your own copy of
the EXEC, using the DDT that is loaded with the EXEC.
I. TO LOOK AT THE RUNNING EXEC:
First you must have WHEEL privileges in order to use the ^EEDDT
command. The ^EEDDT command transfers control to the DDT loaded with
the EXEC, with symbols. Now you can do all the normal DDT functions.
To exit from DDT, type <ESC>G, echoed as $G. This starts your
program, which is the EXEC, so you are now at EXEC command
level.
@ENABLE
$^EEDDT
DDT
.
.
.
$G
$DIS
@
II. TO LOOK AT YOUR COPY OF AN EXEC (RUNNING UNDER THE SYSTEM EXEC):
Get your copy of the EXEC into your address space, transfer control to
it and start DDT as above. There are 3 ways to exit, depending on the
state you are in. If you are in DDT you can ^Z out to get back to the
system EXEC. If you are running your EXEC and want to exit to the
system EXEC, you can ^EQUIT (if you are enabled) or POP (if you are
not enabled). POP is preferable. Note: if you prefer to GET your EXEC
without starting it, in order to set breakpoints or put in patches
before running, see section "VI. PATCHING" below.
EXAMPLE EXITING FROM DDT:
@GET MYEXEC.EXE
@SET NO CONTROL-C-CAPABILITY
@START
@MONNAM.TXT, TOPS-20 MONITOR (VERSION#)
@ENA
$^EEDDT
DDT
.
.
.
CINITF/ -1 0 ; reset initialization flag so you can
; run this EXEC again after it is saved
.
^Z ; to exit and save, for example
@ ; now you are in the monitor's EXEC
; with your EXEC in your
; address space. You can save it, say.
@SAV MYEXEC.EXE.2
EXAMPLE, EXITING FROM YOUR RUNNING EXEC:
@GET MYEXEC.EXE
@START
@MONNAM.TXT,,TOPS-20 MONITOR(VERSION #)
@ENA
^EEDDT
DDT
.
.
$G ; running your EXEC
.
.
CINITF/ -1 0 ; clear initialization flag
$^EQUIT ; return to higher (system) EXEC
@ ; you are in system EXEC
@SAV NEWEXEC ; etc.
EXAMPLE, EXITING FROM YOUR RUNNING EXEC WITH POP:
@GET MYEXEC.EXE
@START
@MONNAM.TXT,,TOPS-20 MONITOR(VERSION#)
@
.
.
.
@POP ; return to higher (system) EXEC.
@ ; now you are in system EXEC.
; NOTE: you should set CINITF to 0
; if you want to save and run this
; EXEC later. You can do it by
; DDT after the POP or ^EEDDT before
; the POP.
III. GETTING OUT OF TROUBLE:
Since you could get into trouble with your EXEC and not be able to
get out of it (CTRL/C is trapped, you can't POP, or whatever), there
is a way to always exit to the MINI-EXEC. First you must issue ^EQUIT
to get into the MINI-EXEC. Then type "S" (start) to get back to the
system EXEC. Then get into your EXEC. If you now get into trouble you
can type ^P, which will get you back into the MINI-EXEC. Now you have
the chance to get back to the system EXEC with "S" (start).
EXAMPLE:
@ENA
$^EQUIT
INTERRUPT AT 15657
MX>S
$ ; you are now back in system EXEC.
$GET MYEXEC
$
$START
@MONNAM.TXT, TOPS-20 MONITOR (VERSION)
. ; lets say you can't do anything
. ; you are in your EXEC
. ; get out, get into MINI-EXEC
^P
INTERRUPT AT 12345
MX>S ; MINI-EXEC prompt followed by start.
$ ; you are now in the system EXEC.
IV. RUNNING YOUR EXEC AS A TOP LEVEL FORK:
Suppose that you want to run your EXEC as the top level EXEC,
that is, not running under the system EXEC. Get into the MINI-EXEC
and get your copy of the EXEC and run it as the top level EXEC.
EXAMPLE:
@ENA
$^EQUIT
INTERRUPT AT 23456
MX>R ; Reset else you will MERGE rather than just GET
MX>G <MYAREA>MYEXEC.EXE.2
MX>S
@ ; Now you are in your EXEC
.
.
. ; Lets say you want to get out
@^P ; Control-P to get to MINI-EXEC
ABORT
MX>R ; "RESET" resets your address space
MX>E ; You are requesting the system EXEC
@ ; You are in system EXEC
NOTE: If you had typed "S" rather than "E" above you would
have restarted your EXEC.
V. OTHER INFORMATION:
There is one error message you may see when trying to start DDT: "?"
means that you do not have sufficient privileges enabled.
When searching for symbols you may notice that the module name
DDT gives you is different from the name of the file assembled for
that part of the EXEC. For example, to open the symbol table for
EXECED you say CANDE$: to DDT.
The following is a correspondence list:
FILENAME.MAC     INTERNAL REFERENCE
===================================
EXECDE.MAC       XDEF
EXECGL.MAC       XGLOBS
EXECPR.MAC       PRIV
EXEC0.MAC        EXEC0
EXEC1.MAC        EXEC1
EXEC2.MAC        EXEC2
EXEC3.MAC        EXEC3
EXEC4.MAC        EXEC4
EXECED.MAC       CANDE
EXECCS.MAC       CSCAN
EXECSU.MAC       SUBRS
EXECMT.MAC       EXECMT
EXECQU.MAC       EXECQU
EXECSE.MAC       EXECSE
EXECP.MAC        EXECP
EXECVR.MAC       VER
EXECMI.MAC       MIC
The sources and .CTL file for assembling the EXEC are on the
SWSKIT.
If, upon trying to examine a location symbolically, you get "U",
meaning the symbol is undefined, you may have to reset the symbol
table pointers. Look in location 770001 for the address that contains
the symbol table pointer, then look at location 116 to find the real
symbol table pointer. Put the contents of 116 in the location pointed
to by 770001.
116/ 762600,,54463 ; real symbol table pointer
770001/ 776456 ; location of symbol table pointer
776456/ 743200,,23540 762600,,54463
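The fix is one level of indirection: 770001 tells you where the pointer lives, and 116 tells you what it should contain. The Python sketch below models memory as a dictionary, using the values from the example above. Treating the address in 770001 as its right half is an assumption, and the function names are hypothetical.

```python
"""Model of resetting DDT's symbol table pointer (illustration only)."""


def halfwords(left: int, right: int) -> int:
    """Build a 36-bit word from two 18-bit halves, like left,,right in DDT."""
    return (left << 18) | right


def repair_symbol_pointer(mem: dict[int, int]) -> int:
    """Copy the real pointer in 116 into the cell named by 770001."""
    cell = mem[0o770001] & 0o777777   # assumed: the address is in the right half
    mem[cell] = mem[0o116]
    return cell


mem = {
    0o116: halfwords(0o762600, 0o54463),     # real symbol table pointer
    0o770001: 0o776456,                      # where DDT keeps its pointer
    0o776456: halfwords(0o743200, 0o23540),  # stale pointer
}
cell = repair_symbol_pointer(mem)
print(f"{cell:o}/ {mem[cell] >> 18:o},,{mem[cell] & 0o777777:o}")
# 776456/ 762600,,54463
```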
VI. PATCHING
There is a patch command in DDT. The form is as follows:
$< ; patch before this instruction
$$< ; patch after this instruction
$> ; end this patch following this instruction
DDT will put the patch in the EXEC patch area, whose symbol is PAT..
Following the patch you typed in, DDT will insert the original
instruction, then JUMPA 1,LOC+1 and JUMPA 2,LOC+2, where LOC is the
location of the instruction you're patching. DDT then replaces LOC,
the original instruction, with a JUMPA XXXXX, where XXXXX is the patch
area where your patch now is. Then the patch area (PAT..) is
redefined to follow your last patch.
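The bookkeeping that $< and $> do can be sketched in Python with memory as a dictionary of symbolic instructions. The sketch follows the description above and the DING example; the function name and the address chosen for DING are hypothetical, and only the patch-before form is modeled. A JUMPA always jumps, so its AC field only marks which return is which.

```python
"""Sketch of DDT's patch-before ($< ... $>) bookkeeping (illustration only)."""


def ddt_patch_before(mem: dict[int, str], loc: int, pat: int,
                     new_code: list[str]) -> int:
    """Patch new_code in before LOC; return the new value of PAT.."""
    body = list(new_code) + [
        mem[loc],                 # the original instruction, now in the patch
        f"JUMPA 1,{loc + 1:o}",   # normal return to LOC+1
        f"JUMPA 2,{loc + 2:o}",   # reached only if the original skips
    ]
    for offset, instr in enumerate(body):
        mem[pat + offset] = instr
    mem[loc] = f"JUMPA {pat:o}"   # LOC now jumps to the patch
    return pat + len(body)        # PAT.. moves past this patch


DING = 0o1000                     # hypothetical address of DING
mem = {DING: "PUSH P,A", DING + 1: "PRINT Q3"}
new_pat = ddt_patch_before(mem, DING + 1, 0o12345, ["CALL MUMBLE", "CALL FRATZ"])
for addr in range(0o12345, new_pat):
    print(f"{addr:o}/ {mem[addr]}")
print(f"{DING + 1:o}/ {mem[DING + 1]}")
# 12345/ CALL MUMBLE
# 12346/ CALL FRATZ
# 12347/ PRINT Q3
# 12350/ JUMPA 1,1002
# 12351/ JUMPA 2,1003
# 1001/ JUMPA 12345
```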
EXAMPLE:
Get a copy of <SYSTEM>EXEC, insert calls to subroutine MUMBLE and
subroutine FRATZ before location DING+1. DING+1 contains PRINT Q3
originally and contains a JUMPA to the patch area after the patch.
The patch area will contain:
CALL MUMBLE
CALL FRATZ
PRINT Q3
JUMPA 1,DING+2
JUMPA 2,DING+3
USER TYPESCRIPT FOR THE ABOVE:
@ENABLE
$GET<SYSTEM>EXEC
$SAVE NUEXEC ; you must SAVE and GET in order to write
$GET NUEXEC ; enable the EXEC to use DDT not ^EEDDT.
$DDT
DDT
EXEC0$: ; open symbols for module where DING is
DING/ PUSH P,A ; first location in routine "DING"
DING+1/ PRINT Q3 $< ; begin patching before location DING+1
PAT../ 0 CALL MUMBLE ; DDT opens up PAT.. area, you add code
PAT..+1/CALL FRATZ ; continue to insert your patch
$> ; close the patch
PAT..+2/ PRINT Q3 ; the original instruction being replaced.
PAT..+3/ JUMPA 1,DING+2 ; DDT inserts this return.
PAT..+4/ JUMPA 2,DING+3 ; in case of a SKIP instruction.
DING+1/ JUMPA 12345 ; JUMPA to PAT.. replaces original LOC.
$G ; start your copy of EXEC etc.
Various methods may be used to write-enable the EXEC for
patching. You can use the GET, SAVE method above, or SET PAGE n
COPY-ON-WRITE, or the $W command in DDT to achieve the same results.
RECOVERING FROM A BAD EXEC
--------------------------
This procedure is simply a rehash of the procedure for recovering
from the case in which the EXEC refuses to log in. For more
information see the article "Looking at the EXEC with DDT".
If your system version of the EXEC blows up completely, you can
recover rather easily. You type a ^C on the CTY, and when the EXEC
blows up you will be dumped into the MINI-EXEC. Then you can use the
GET and START commands to read in a good version of the EXEC, either
from a copy on disk, or from the distribution magtapes.
If the problem with the EXEC is that it does not blow up, but it
still fails to let you log in, then you have a harder time. In this
case you have to bring the system up stand-alone, using the switches.
An example of what to do from the point where the BOOT program is
loaded follows:
BOOT>/L ; load in the monitor
BOOT>/G141 ; start up EDDT
EDDT
DBUGSW[ 0 2 ; set system as debugging
EDDTF[ 0 1 ; keep EDDT around
GOTSWM$B ; set a breakpoint after the swappable
; part of the monitor has been loaded
147$G ; start the system
GOTSWM$1B>> STEX+1/ HRROI T2,BOOTER+51 HRROI T2,FFF
FFF[ ""PS:<SYSTEM>OLD-EXEC.EXE"
FFF: ; change the name of the EXEC file
0$1B ; remove the GOTSWM breakpoint
$P ; proceed to bring up the system
^C ; and Control-C to get the new EXEC
If you had no old version of the EXEC around, then change the name to
some garbage, so that the monitor can't find any such program. This
will then dump you into the MINI-EXEC, and then you can read a good
EXEC in from magtape.
In release 3 of the monitor, there is a new JSYS which is very
useful for debugging new versions of the EXEC. The CRJOB JSYS allows
you to start up a new job with any program at all as its top level
fork. You can also start the job not logged in, so you can debug new
versions of the EXEC easily, with no possibility of ripping yourself
off. Of course, the ^EQUIT, GET from MINI-EXEC is still a valid
sequence for starting a new top-level fork.
Debugging the GALAXY System
1.0 INTRODUCTION
The GALAXY system presents a unique problem to the software specialist
who is trying to debug one of its components. Usually, any user mode
program can be debugged under TOPS-20 by running a copy of it, loaded
with DDT, taking appropriate care that nothing is done which will
affect any users of the system. For GALAXY, however, it is very
difficult to avoid affecting users of the system. For example, if you
are trying to debug BATCON, you will find that QUASAR will very
happily schedule batch jobs submitted by other users to be run by
your BATCON. If you are not careful, you can cause those batch jobs
to be lost, or at least slowed down, while you are debugging.
Debugging QUASAR or ORION would be even worse. Users would see PRINT,
SUBMIT, etc. commands hang when you hit a breakpoint in QUASAR.
Operators would be unable to control any system components if you were
breakpointed in ORION. On top of this, the monitor knows about
QUASAR, and you may lose the messages the monitor sends to it when
users close a spooled line printer file, or when a job logs out.
To solve these problems, the concept of a "private GALAXY system" has
been implemented by software engineering in version 4 of GALAXY. When
a private GALAXY system is operating, all of its components are
completely independent of the primary GALAXY system. QUASAR, the
queue maintainer, keeps queues that are separate from the system
queues and are failsofted to a different master queue file. This
QUASAR communicates only with other components in the same private
system. It is even possible to run several complete private GALAXY
systems, with the restrictions that:
1. All components in a private system must run under the same
user name.
2. Only one private system may be run by a given user.
3. Each private QUASAR must be connected to a different
directory.
4. Each private ORION must be connected to a different
directory.
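If you plan to run more than one private system, the four restrictions above are easy to check mechanically. The Python sketch below does so over a hypothetical list of running components. The Component record, the system labels, and the second user name in the usage note are assumptions for illustration, not part of GALAXY.

```python
"""Check the private GALAXY restrictions (illustration only)."""
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    system: str     # hypothetical label for one private GALAXY system
    program: str    # e.g. "QUASAR", "ORION", "BATCON"
    user: str       # user name the job is logged in under
    directory: str  # directory the job is connected to


def check_private_systems(components: list[Component]) -> list[str]:
    problems = []
    users: dict[str, set[str]] = {}
    for c in components:
        users.setdefault(c.system, set()).add(c.user)
    # 1. All components in a private system run under the same user name.
    for system, names in sorted(users.items()):
        if len(names) > 1:
            problems.append(f"{system}: mixed users {sorted(names)}")
    # 2. Only one private system per user.
    systems: dict[str, set[str]] = {}
    for system, names in users.items():
        for name in names:
            systems.setdefault(name, set()).add(system)
    for name, owned in sorted(systems.items()):
        if len(owned) > 1:
            problems.append(f"{name}: runs {len(owned)} private systems")
    # 3 and 4. Each private QUASAR, and each private ORION, needs its own directory.
    for program in ("QUASAR", "ORION"):
        owner: dict[str, str] = {}
        for c in components:
            if c.program != program:
                continue
            first = owner.setdefault(c.directory, c.system)
            if first != c.system:
                problems.append(f"{program}s of {first} and {c.system} share {c.directory}")
    return problems


DIR = "MISC:<HEMPHILL.GALAXY.DEBUG>"
good = [Component("A", p, "HEMPHILL", DIR) for p in ("QUASAR", "ORION", "OPR", "BATCON")]
print(check_private_systems(good))  # []
bad = good + [Component("B", "QUASAR", "SMITH", DIR)]
print(check_private_systems(bad))
# ['QUASARs of A and B share MISC:<HEMPHILL.GALAXY.DEBUG>']
```

QUASAR and ORION of the same private system may share a directory; the check only objects when two different systems put the same program in one directory.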
2.0 BUILDING A PRIVATE GALAXY SYSTEM
Since the changes necessary to create a private GALAXY system were
implemented in the version 4 source code, it is relatively simple to
build the system. The recommended procedure is as follows:
1. Create a directory for the private GALAXY system.
2. Restore the file EXEC-FOR-DEBUGGING-GALAXY.EXE from the
SWSKIT to this newly created directory.
3. Restore each of the following files from the "Subsys files
for TOPS20 V4" saveset on the TOPS-20 distribution tape to
this directory.
BATCON.EXE
CDRIVE.EXE
GLXLIB.EXE
LPTSPL.EXE
OPR.EXE
ORION.EXE
PLEASE.EXE
QMANGR.EXE
QUASAR.EXE
SPRINT.EXE
SPROUT.EXE
4. For each component in the above list except GLXLIB.EXE and
QMANGR.EXE, perform the following steps:
1. Give the EXEC command "GET xxxxxx.EXE"
2. Give the command "DEPOSIT 135 -1"
3. Give the command "SAVE xxxxxx"
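Step 4 is the same three commands for nine programs, so it can help to generate the list. The Python sketch below only prints the EXEC commands for you to type or paste; the function and constant names are hypothetical, and it simply repeats step 4 for each file in the list above.

```python
"""Generate the step 4 EXEC commands (illustration only)."""

COMPONENTS = ["BATCON", "CDRIVE", "GLXLIB", "LPTSPL", "OPR", "ORION",
              "PLEASE", "QMANGR", "QUASAR", "SPRINT", "SPROUT"]
NOT_PATCHED = {"GLXLIB", "QMANGR"}  # step 4 excludes these two


def step4_commands(components: list[str] = COMPONENTS) -> list[str]:
    commands = []
    for name in components:
        if name in NOT_PATCHED:
            continue
        commands += [f"GET {name}.EXE", "DEPOSIT 135 -1", f"SAVE {name}"]
    return commands


for command in step4_commands():
    print(command)
```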
3.0 EXAMPLE OF A PRIVATE GALAXY BUILD
It is not strictly necessary to restore all of the GALAXY components
for a one time only debugging session. To debug a component like
BATCON, you would need at a minimum:
1. Your own copy of BATCON
2. Your own copy of QUASAR for BATCON to speak to
3. Your own copy of ORION for BATCON and QUASAR to speak to
4. A copy of OPR to speak to ORION to control BATCON
5. An EXEC which knows about your QUASAR to make queue entries
The following is a log of an example build of a private GALAXY system:
TOPS-20 Command processor 4(560)
@ENABLE (CAPABILITIES)
$!
$! First connect to a debugging directory
$!
$CONNECT (TO DIRECTORY) MISC:<HEMPHILL.GALAXY.DEBUG>
$!
$! Now build and save debugging .EXE files
$!
$! QUASAR, the queue maintainer
$!
$GET (PROGRAM) SYS:QUASAR.EXE.55
$DEPOSIT (MEMORY LOCATION) 135 (CONTENTS) -1
[Shared]
$SAVE (ON FILE) QUASAR.EXE.1 !New file! (PAGES FROM)
QUASAR.EXE.1 Saved
$!
$! ORION, the message clearinghouse
$!
$GET (PROGRAM) SYS:ORION.EXE.53
$DEPOSIT (MEMORY LOCATION) 135 (CONTENTS) -1
[Shared]
$SAVE (ON FILE) ORION.EXE.1 !New file! (PAGES FROM)
ORION.EXE.1 Saved
$!
$! OPR, the operator interface
$!
$GET (PROGRAM) SYS:OPR.EXE.55
$DEPOSIT (MEMORY LOCATION) 135 (CONTENTS) -1
[Shared]
$SAVE (ON FILE) OPR.EXE.1 !New file! (PAGES FROM)
OPR.EXE.1 Saved
$!
$! BATCON, the batch controller
$!
$GET SYS:BATCON.EXE.39
$DEPOSIT (MEMORY LOCATION) 135 (CONTENTS) -1
[Shared]
$SAVE (ON FILE) BATCON.EXE.1 !New file! (PAGES FROM)
BATCON.EXE.1 Saved
$!
$! Now a directory of what we've got
$!
$VDIRECTORY (OF FILES) *.*.*
MISC:<HEMPHILL.GALAXY.DEBUG>
BATCON.EXE.1;P777700 16 8192(36) 13-Feb-80 22:00:37
EXEC-FOR-DEBUGGING-GALAXY.EXE.1;P777700
82 41984(36) 13-Feb-80 04:33:50
OPR.EXE.1;P777700 31 15872(36) 13-Feb-80 22:00:09
ORION.EXE.1;P777700 44 22528(36) 13-Feb-80 21:59:45
QUASAR.EXE.1;P777700 40 20480(36) 13-Feb-80 21:59:27
Total of 213 pages in 5 files
$
4.0 RUNNING THE PRIVATE GALAXY SYSTEM
Starting and running a private GALAXY system is similar to running
GALAXY in the usual manner. First QUASAR and ORION are started, then
the component you wish to debug. You will also need OPR to issue
operator commands and the modified EXEC to make queue entries. Since
you will need about five jobs, it is usually most convenient to run
each component as a separate subjob under PTYCON.
4.1 Starting QUASAR
QUASAR and ORION should be started before everything else. Nothing
harmful happens if you start them last, but all the other components
will simply wait for these two to start. A suggested procedure is:
1. Define a subjob "Q"
2. Connect to it
3. LOGIN a job under the same user name
4. CONNECT that job to the directory in which you did the
private GALAXY build
5. ENABLE
6. RUN QUASAR
4.2 Starting ORION
Starting ORION is as painless as starting QUASAR:
1. Define a subjob "O"
2. Connect to it
3. LOGIN a job under the same user name
4. CONNECT that job to the directory in which you did the
private GALAXY build
5. ENABLE
6. RUN ORION
4.3 Starting OPR
OPR starts up using the same formula as QUASAR and ORION:
1. Define a subjob "OPR"
2. Connect to it
3. LOGIN a job under the same user name
4. CONNECT that job to the directory in which you did the
private GALAXY build
5. ENABLE
6. RUN OPR
7. You may now type OPR commands to see if QUASAR and ORION
appear to be healthy.
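Sections 4.1 through 4.3 repeat one formula, varying only the subjob name and the program run. As a sketch, the Python below simply prints the lines you would type for each subjob. The user name, directory, and subjob numbers are taken from the example session in this handbook and are placeholders for your own. Noise words are omitted, and the password typed after the user name on the LOGIN line is not shown.

```python
# Print the per-subjob command sequence described in sections 4.1-4.3.
# USER and DEBUG_DIR are placeholders: substitute your own user name and
# the directory in which you did the private GALAXY build.
USER = "HEMPHILL"
DEBUG_DIR = "MISC:<HEMPHILL.GALAXY.DEBUG>"

# (PTYCON subjob number, subjob name, program), in the recommended order:
# QUASAR and ORION first, then OPR.
SUBJOBS = [(0, "Q", "QUASAR"), (1, "O", "ORION"), (2, "OPR", "OPR")]


def subjob_commands(number: int, name: str, program: str) -> list:
    """Lines typed to PTYCON, then to the subjob's EXEC."""
    return [
        f"PTYCON> DEFINE {number} {name}",
        f"PTYCON> CONNECT {name}",
        f"@LOGIN {USER}",            # password (not shown) follows the user name
        f"@CONNECT {DEBUG_DIR}",
        "@ENABLE",
        f"$RUN {program}",
    ]


for number, name, program in SUBJOBS:
    print("\n".join(subjob_commands(number, name, program)))
    print("^X")  # return to PTYCON command level before the next subjob
```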
4.4 Starting The Component To Be Debugged
If the component you wish to debug is QUASAR, ORION, or OPR, then you
have already started it. Breakpoints could have been set, and when
they were hit, the component could have been debugged without any
noticeable effect on other users of the system. If you wish to debug
PLEASE, BATCON, LPTSPL, CDRIVE, SPRINT, or SPROUT, do the following:
1. Define a subjob with an appropriate ID (e.g. B for BATCON)
2. Connect to it
3. LOGIN a job under the same user name
4. CONNECT that job to the directory in which you did the
private GALAXY build
5. ENABLE
6. GET the component
7. Enter DDT
8. Set breakpoints, then start the program
4.5 Starting The Modified EXEC
The file "EXEC-FOR-DEBUGGING-GALAXY.EXE" which has been supplied on
the SWSKIT has exactly two commands added to its repertoire. These
are "^ESET DEBUGGING-GALAXY" and "^ESET NO DEBUGGING-GALAXY". The
effect of these commands is to select which one of two PIDs (Process
IDs) to communicate with: the system QUASAR or the private QUASAR.
If "NO DEBUGGING-GALAXY" is set, then PRINT, SUBMIT, CANCEL, MODIFY,
and the INFORMATION commands will all cause communication with the
system QUASAR. If "DEBUGGING-GALAXY" is set for this EXEC, then the
commands listed will communicate with the private QUASAR run by that
user.
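The selection the two ^ESET commands make can be modelled in a few lines. This Python sketch is not the EXEC's code (the EXEC is written in MACRO). The class and method names are hypothetical, the string "system QUASAR" merely stands for QUASAR's special system PID, and starting in the NO DEBUGGING-GALAXY state is an assumption based on the example session.

```python
# Model of the queue-command routing added to EXEC-FOR-DEBUGGING-GALAXY.
# Class and method names are hypothetical; the real EXEC is MACRO code.
from typing import Optional

QUEUE_COMMANDS = {"PRINT", "SUBMIT", "CANCEL", "MODIFY", "INFORMATION"}


class DebugExec:
    def __init__(self, user: str):
        self.user = user
        self.debugging_galaxy = False   # assumed default: system QUASAR

    def eset(self, debugging_galaxy: bool) -> None:
        """^ESET DEBUGGING-GALAXY (True) or ^ESET NO DEBUGGING-GALAXY (False)."""
        self.debugging_galaxy = debugging_galaxy

    def quasar_for(self, command: str) -> Optional[str]:
        """Which QUASAR a command talks to; None if not a queue command."""
        if command.upper() not in QUEUE_COMMANDS:
            return None
        if self.debugging_galaxy:
            return f"[{self.user}]QUASAR"   # PID name of the private QUASAR
        return "system QUASAR"              # QUASAR's special system PID


exec_ = DebugExec("HEMPHILL")
exec_.eset(True)
print(exec_.quasar_for("SUBMIT"))   # [HEMPHILL]QUASAR
```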
1. Define a subjob "E"
2. Connect to it
3. LOGIN a job under the same user name
4. CONNECT that job to the directory in which you did the
private GALAXY build
5. RUN EXEC-FOR-DEBUGGING-GALAXY
6. ENABLE
7. ^ESET DEBUGGING-GALAXY
5.0 EXAMPLE DEBUGGING SESSION
The following is a log of a sample debugging session:
TOPS-20 Command processor 4(560)
@!
@! First run PTYCON, so we can control five jobs from one terminal
@!
@PTYCON.EXE.7
PTYCON> !
PTYCON> ! Now start up QUASAR as subjob Q
PTYCON> !
PTYCON> DEFINE (SUBJOB #) 0 (AS) Q
PTYCON> CONNECT (TO SUBJOB) Q
[CONNECTED TO SUBJOB Q(0)]
2102 Development System, TOPS-20 Monitor 4(3245)
@LOG HEMPHILL (PASSWORD)
Job 21 on TTY222 13-Feb-80 22:18:05
Structure PS: mounted
Structure MISC: mounted
@ENABLE (CAPABILITIES)
$!
$! Connect to directory where debugging .EXE files are
$!
$CONNECT (TO DIRECTORY) MISC:<HEMPHILL.GALAXY.DEBUG>
$!
$! Finally run the component
$!
$RUN (PROGRAM) QUASAR.EXE.1
% QUASAR GLXIPC Becoming [HEMPHILL]QUASAR (PID = 66000031)
% QUASAR GLXIPC Waiting for ORION to start
^X
PTYCON> !
PTYCON> ! Now start up ORION as subjob O
PTYCON> !
PTYCON> DEFINE (SUBJOB #) 1 (AS) O
PTYCON> CONNECT (TO SUBJOB) O
[CONNECTED TO SUBJOB O(1)]
2102 Development System, TOPS-20 Monitor 4(3245)
@LOG HEMPHILL (PASSWORD)
Job 22 on TTY223 13-Feb-80 22:19:25
Structure PS: mounted
Structure MISC: mounted
@ENABLE (CAPABILITIES)
$!
$! Connect to directory where debugging .EXE files are
$!
$CONNECT (TO DIRECTORY) MISC:<HEMPHILL.GALAXY.DEBUG>
$!
$! Finally run the component
$!
$RUN (PROGRAM) ORION.EXE.1
% ORION GLXIPC Alternate [HEMPHILL]QUASAR (PID = 66000031)
% ORION GLXIPC Becoming [HEMPHILL]ORION (PID = 70000032)
**** Q(0) 22:19:58 ****
% QUASAR GLXIPC Alternate [HEMPHILL]ORION (PID = 70000032)
**** O(1) 22:19:58 ****
^X
PTYCON> !
PTYCON> ! Now start up OPR as subjob OPR
PTYCON> !
PTYCON> DEFINE (SUBJOB #) 2 (AS) OPR
PTYCON> CONNECT (TO SUBJOB) OPR
[CONNECTED TO SUBJOB OPR(2)]
2102 Development System, TOPS-20 Monitor 4(3245)
@LOG HEMPHILL (PASSWORD)
Job 23 on TTY224 13-Feb-80 22:20:29
Structure PS: mounted
Structure MISC: mounted
@ENABLE (CAPABILITIES)
$!
$! Connect to directory where debugging .EXE files are
$!
$CONNECT (TO DIRECTORY) MISC:<HEMPHILL.GALAXY.DEBUG>
$!
$! Finally run the component
$!
$RUN (PROGRAM) OPR.EXE.1
% OPR GLXIPC Alternate [HEMPHILL]QUASAR (PID = 66000031)
% OPR GLXIPC Alternate [HEMPHILL]ORION (PID = 70000032)
OPR>
22:19:59 -- Network Node 1031 is Online --
22:19:59 -- Network Node 2137 is Online --
22:19:59 -- Network Node 4097 is Online --
22:19:59 -- Network Node DN20A is Online --
22:19:59 -- Network Node MILL20 is Online --
22:19:59 -- Network Node SYS880 is Online --
OPR>!
OPR>! Let's take a look at our brand new queues
OPR>!
OPR>SHOW QUEUES
OPR>
22:21:21 --The Queues are Empty--
OPR>SHOW STATUS PRINTER
OPR>
22:21:27 --There are no Devices Started--
OPR>^X
PTYCON> !
PTYCON> ! Now start up BATCON as subjob B
PTYCON> !
PTYCON> DEFINE (SUBJOB #) 3 (AS) B
PTYCON> CONNECT (TO SUBJOB) B
[CONNECTED TO SUBJOB B(3)]
2102 Development System, TOPS-20 Monitor 4(3245)
@LOG HEMPHILL (PASSWORD)
Job 24 on TTY225 13-Feb-80 22:21:49
Structure PS: mounted
Structure MISC: mounted
@ENABLE (CAPABILITIES)
$!
$! Connect to directory where debugging .EXE files are
$!
$CONNECT (TO DIRECTORY) MISC:<HEMPHILL.GALAXY.DEBUG>
$!
$! Finally run the component
$!
$RUN (PROGRAM) BATCON.EXE.1
% BATCON GLXIPC Alternate [HEMPHILL]QUASAR (PID = 66000031)
% BATCON GLXIPC Alternate [HEMPHILL]ORION (PID = 70000032)
^X
PTYCON> !
PTYCON> ! Now start up special EXEC as subjob E
PTYCON> !
PTYCON> DEFINE (SUBJOB #) 4 (AS) E
PTYCON> CONNECT (TO SUBJOB) E
[CONNECTED TO SUBJOB E(4)]
2102 Development System, TOPS-20 Monitor 4(3245)
@LOG HEMPHILL (PASSWORD)
Job 19 on TTY226 13-Feb-80 22:23:00
Structure PS: mounted
Structure MISC: mounted
@CONNECT (TO DIRECTORY) MISC:<HEMPHILL.GALAXY.DEBUG>
@!
@! Run the special EXEC, which is provided on the SWSKIT
@!
@RUN (PROGRAM) EXEC-FOR-DEBUGGING-GALAXY.EXE.1
TOPS-20 Command processor 4(560)-1
@ENABLE (CAPABILITIES)
$!
$! Make this EXEC switch from system queues to private queues
$!
$^ESET DEBUGGING-GALAXY
$!
$! Use ordinary EXEC commands to examine private queues
$!
$INFORMATION (ABOUT) OUTPUT-REQUESTS
[The Queues are Empty]
$INFORMATION (ABOUT) BATCH-REQUESTS
[The Queues are Empty]
$!
$! Now switch back to look at system queues
$!
$^ESET NO DEBUGGING-GALAXY
$INFORMATION (ABOUT) OUTPUT-REQUESTS
Printer Queue:
Job Name Req# Limit User
-------- ---- ----- ------------------------
* KLERR 6 1197 DEUFEL On Unit:0
Started at 22:05:47, printed 314 of 1197 pages
XXX 3 18 KAMANITZ /Dest:4097
MS-OUT 18 117 BRAITHWAITE /Unit:0
There are 3 Jobs in the Queue (1 in Progress)
$INFORMATION (ABOUT) BATCH-REQUESTS
Batch Queue:
Job Name Req# Run Time User
-------- ---- -------- ------------------------
* DUMP 16 02:00:00 OPERATOR In Stream:0
Job# 17 Running DUMPER Last Label: A Runtime 0:23:55
BATCH 2 00:05:00 BLIZARD /Proc:FOO
SOURCE 8 00:05:00 BLOUNT /After:14-Feb-80 0:00
SRCCOM 12 00:05:00 MURPHY /After:14-Feb-80 0:00
QJD4R 13 00:05:00 SROBINSON /After:19-Feb-80 0:00
QAR 10 00:05:00 BLOUNT /After:19-Feb-80 0:14
SAVE 1 00:05:00 FICHE /After:19-Feb-80 9:10
There are 7 Jobs in the Queue (1 in Progress)
$!
$! Now let's submit a batch job to our own BATCON
$!
$^ESET DEBUGGING-GALAXY
$!
$! Make a trivial batch control file
$!
$COPY (FROM) TTY: (TO) A.CTL.1 !New file!
TTY: => A.CTL.1
@SY A
^Z
$!
$! And submit the job
$!
$SUBMIT (BATCH JOB) A.CTL.1
[Job A Queued, Request-ID 1, Limit 0:05:00]
$!
$! Now examine private queues
$!
$INFORMATION (ABOUT) BATCH-REQUESTS
Batch Queue:
Job Name Req# Run Time User
-------- ---- -------- ------------------------
A 1 00:05:00 HEMPHILL
There is 1 Job in the Queue (None in Progress)
$!
$! Our job is in the batch queue, but no batch-streams have been started
$!
$^X
PTYCON> CONNECT (TO SUBJOB) OPR
[CONNECTED TO SUBJOB OPR(2)]
OPR>START (Object) BATCH-STREAM (Stream Number) 0
OPR>
22:25:40 Batch-Stream 0 --Startup Scheduled--
22:25:40 Batch-Stream 0 --Started--
OPR>
22:25:40 Batch-Stream 0 --Begin--
Job A Req #1 for HEMPHILL
OPR>
22:25:51 Batch-Stream 0 --End--
Job A Req #1 for HEMPHILL
OPR>
^X
PTYCON> !
PTYCON> ! Cleaning up is easy
PTYCON> !
PTYCON> KILL (SUBJOB) ALL
PTYCON> EXIT (FROM PTYCON)
@
6.0 TECHNICAL DETAILS
This section explains what happens differently when a component has
had location 135 (.JBOPS) poked to -1, and presents a few helpful
tidbits of information about debugging some of the programs.
Incidentally, .JBOPS is the word in the job data area (defined under
TOPS-10) that is reserved for a program's object time system (OTS).
GALAXY references this location by the symbol "DEBUGW".
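Location 135 is an octal address, as PDP-10 addresses conventionally are, and -1 in a 36-bit word is all ones. This Python sketch shows both. Treating any non-zero value as "debugging" follows what the QMANGR section below says about .JBOPS; GLXLIB's exact test is not described here, so that part is an assumption, and the helper names are illustrative.

```python
# Location 135 octal (.JBOPS, "DEBUGW" to GALAXY) and what "-1" means
# in a 36-bit word. Helper names are illustrative.
JBOPS = 0o135              # job data area word used as the debugging flag
WORD_MASK = (1 << 36) - 1  # PDP-10 words are 36 bits


def to_word(value: int) -> int:
    """Represent a Python int as a 36-bit two's complement word."""
    return value & WORD_MASK


def is_debugging(jbops_contents: int) -> bool:
    """Assumption: any non-zero .JBOPS value selects debugging behavior."""
    return to_word(jbops_contents) != 0


print(oct(JBOPS), JBOPS)   # 0o135 93
print(oct(to_word(-1)))    # 0o777777777777
```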
6.1 GLXLIB
GLXLIB is the GALAXY library. It consists of a code segment which
starts at address 400000 and a data segment at address 600000. Each
of the programs QUASAR, ORION, OPR, PLEASE, BATCON, LPTSPL, CDRIVE,
SPRINT, and SPROUT uses it. Part of the initialization code of each
of these programs maps in GLXLIB as a "high segment". This is in
effect an object time system for GALAXY, with many commonly used
routines. Most of the support for the private GALAXY system is in
this library, enough so that OPR, PLEASE, BATCON, LPTSPL, SPRINT and
SPROUT actually have no code which cares whether they are part of a
private GALAXY. The initialization code in each component looks in
three places to find GLXLIB.EXE: first on the structure and directory
that the component itself came from, second on DSK:, third on SYS:.
This search order is the same for both the system GALAXY and the
private one.
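The three-place search for GLXLIB.EXE can be written down as an ordered list of candidate file specifications. The Python below is a sketch: the function names and the "exists" callback are hypothetical stand-ins for whatever the initialization code actually does.

```python
# Sketch of the GLXLIB.EXE search order described in section 6.1.
# Function names and the exists() callback are assumptions for illustration.
from typing import Callable, Optional


def glxlib_candidates(component_dir: str) -> list:
    """Places searched for GLXLIB.EXE, in order.

    component_dir is the structure and directory the component came from,
    e.g. "MISC:<HEMPHILL.GALAXY.DEBUG>".
    """
    return [
        f"{component_dir}GLXLIB.EXE",  # 1. where the component itself came from
        "DSK:GLXLIB.EXE",              # 2. DSK: (normally the connected directory)
        "SYS:GLXLIB.EXE",              # 3. SYS:
    ]


def find_glxlib(component_dir: str, exists: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate for which exists(path) is true, or None."""
    for path in glxlib_candidates(component_dir):
        if exists(path):
            return path
    return None
```

Because the component's own directory is searched first, a GLXLIB.EXE placed in the debugging directory would be used in preference to the one on SYS:.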
The actual changes implemented for the private GALAXY are as
follows:
1. Ordinarily, a component which stopcodes will save a crash
file on disk. When debugging, however, the crash file is not
written. In either case, if DDT is loaded with the program,
the stopcode will invoke a jump to DDT.
2. When debugging, GALAXY components do not require the IPCF
packets they receive to be privileged.
3. Ordinarily, QUASAR and ORION get special system PIDs for IPCF
communications. When debugging, they get PIDs with names of
the form "[username]QUASAR" and "[username]ORION". All
GALAXY components will then look for these PID names. Even a
pseudo-GALAXY component, such as MOUNTR or IBMSPL, will be
able to find these PIDs if its location 135 has been poked to
-1, simply because it uses GLXLIB.
4. GALAXY components print messages like:
"% QUASAR GLXIPC Waiting for ORION to start"
only while debugging.
5. ORION and QUASAR print messages about PIDs they acquire,
like:
"% QUASAR GLXIPC Becoming [HEMPHILL]QUASAR (PID =
66000031)"
6. All components print messages about the special PIDs they
find for QUASAR and ORION, like:
"% ORION GLXIPC Alternate [HEMPHILL]QUASAR (PID =
66000031)"
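The PID names and messages above follow a fixed pattern. This Python sketch reproduces it using the strings from the example session. The function names are hypothetical, and typing the PID in octal is an assumption suggested by the digits shown (66000031 contains no 8 or 9).

```python
# Sketch of the debugging-mode PID names and GLXIPC messages (section 6.1).
# Message formats are copied from the example session; the functions are
# illustrative, not GLXLIB routines. PIDs are assumed to be typed in octal.
def private_pid_name(user: str, component: str) -> str:
    return f"[{user}]{component}"


def becoming_message(me: str, user: str, pid: int) -> str:
    """Printed by QUASAR and ORION when they acquire their private PID."""
    return f"% {me} GLXIPC Becoming {private_pid_name(user, me)} (PID = {pid:o})"


def alternate_message(me: str, user: str, target: str, pid: int) -> str:
    """Printed by a component that has found the private QUASAR or ORION."""
    return f"% {me} GLXIPC Alternate {private_pid_name(user, target)} (PID = {pid:o})"


print(becoming_message("QUASAR", "HEMPHILL", 0o66000031))
print(alternate_message("ORION", "HEMPHILL", "QUASAR", 0o66000031))
```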
6.2 QUASAR
1. QUASAR reads and writes private queues from its connected
directory. The full filespec is
"DSK:PRIVATE-MASTER-QUEUE-FILE.QUASAR"
2. QUASAR does absolutely no privilege checking. Anyone can
modify or kill any request in the queues (if they know how to
speak to this private QUASAR).
6.3 ORION
1. ORION will create a log file under the name of
"DSK:ORION-TEST.LOG" instead of
"PS:<SPOOL>ORION-SYSTEM-LOG.001", and does no renaming of any
old log files present.
2. ORION will not set up any NSP servers when debugging. It
therefore will not speak to remote nodes to run OPRs for
them. However, there are hooks for ORION to initialize
"SRV:128" instead of the usual "SRV:47" when debugging.
6.4 QMANGR
QMANGR has also been modified to look for a private QUASAR's PID if
the low segment has a non-zero entry in .JBOPS.
6.5 CDRIVE
CDRIVE can be awkward to debug, since it can have many inferior forks,
all executing the same code. For this reason, each fork automatically
loads SDDT into its address space and jumps to it when it starts up.
After setting any breakpoints or otherwise modifying this fork's code,
you type "GO<ESC>G" to resume the fork. While debugging, if the fork
terminates (crashes), CDRIVE will not go through its normal purging of
the crashed fork, so that its status can be examined.
7.0 EXAMINING GALAXY CRASH FILES
All GALAXY components use the stopcode facility supplied by GLXLIB.
This facility dumps the ACs, program error codes, associated error
messages, program version numbers, and the last nine locations of the
stack onto the controlling terminal of the program executing the
stopcode. In addition, a crash file is created with a name of the
form PS:<SPOOL>program-stopcode-CRASH.EXE. This .EXE file contains
the entire core image of the program which has crashed, and is
extremely useful in determining the cause of the crash. In
particular, there is a block of data referred to as the "crash block"
which usually contains the information most pertinent to the debugger.
This information can be read with either DDT or FILDDT. Its contents
are tabulated as follows:
Location  Data
--------  ----
.SPC      PC of stopcode
.SCODE    SIXBIT name of stopcode
.SERR     Last TOPS-20 error code
.SACS     Contents of the sixteen accumulators
.SPTBL    Base address of page table used by GLXMEM
.SPRGM    Name of program in SIXBIT
.SPVER    Program version number
.SPLIB    GLXLIB version number
.LGERR    Last GALAXY error code
.LGEPC    PC of last GALAXY error return
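.SCODE and .SPRGM hold SIXBIT text: six 6-bit characters packed into one 36-bit word, each character being its ASCII code minus 40 octal. The Python sketch below shows the decoding and how a crash file name of the documented form is assembled from the two names. The helper names, the stopcode "XYZ" in the usage line, and the assumption that trailing blanks are dropped from the names in the file name are all hypothetical.

```python
# Decode SIXBIT words such as .SCODE and .SPRGM from a GALAXY crash block,
# and build the crash file name. Helper names are illustrative.
def sixbit_to_str(word: int) -> str:
    """Unpack six 6-bit characters, leftmost (bits 0-5) first."""
    chars = []
    for shift in range(30, -1, -6):       # shifts 30, 24, 18, 12, 6, 0
        code = (word >> shift) & 0o77
        chars.append(chr(code + 0o40))     # SIXBIT 0 is ASCII space (40 octal)
    return "".join(chars).rstrip()         # assumption: trailing blanks dropped


def str_to_sixbit(text: str) -> int:
    """Pack up to six characters (ASCII 40-137 octal) into a SIXBIT word."""
    word = 0
    for ch in text.upper().ljust(6)[:6]:
        code = ord(ch) - 0o40
        if not 0 <= code <= 0o77:
            raise ValueError(f"{ch!r} has no SIXBIT representation")
        word = (word << 6) | code
    return word


def crash_file_name(program_word: int, stopcode_word: int) -> str:
    """PS:<SPOOL>program-stopcode-CRASH.EXE, from .SPRGM and .SCODE."""
    program = sixbit_to_str(program_word)
    stopcode = sixbit_to_str(stopcode_word)
    return f"PS:<SPOOL>{program}-{stopcode}-CRASH.EXE"


print(oct(str_to_sixbit("QUASAR")))   # 0o616541634162
print(crash_file_name(str_to_sixbit("QUASAR"), str_to_sixbit("XYZ")))
```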
DEBUGGING MOUNTR
1.0 INTRODUCTION
This write-up was prepared to assist developers and maintainers
in understanding and debugging the TOPS-20 tape and structure mounting
program, MOUNTR. It is assumed that the reader has a working
knowledge of TOPS-20 assembler language coding and the set of TOPS-20
monitor calls.
2.0 SOURCES OF INFORMATION
This document will serve primarily as a guide to debugging MOUNTR
crashes. Much of the information needed to understand the data bases
and the operation of MOUNTR resides within the first 20 or 30 pages of
the MOUNTR code itself. Just make a listing and start reading.
3.0 MOUNTR CRASHES
When MOUNTR crashes, it saves its core image in the file
PS:<SPOOL>MOUNTR-CRASH.EXE
All crashes are initiated by a CALL STOP instruction. A CALL STOP may
result from a logic inconsistency, or from MOUNTR receiving a software
interrupt on a panic channel. The STOP routine gathers some important
data and saves it in core. It then types a message giving the filespec
in which it is saving the core image, and issues an SSAVE JSYS to save
the image. After restoring the ACs from the time of the crash, MOUNTR
halts.
To begin debugging a MOUNTR crash, follow these steps:
1. GET PS:<SPOOL>MOUNTR-CRASH.EXE
2. Get into DDT and type STOP1$G. This will load DDT's ACs with
MOUNTR's ACs at the time of the crash and exit to the EXEC.
Give the DDT command to the EXEC again to get back into DDT.
3. Look at P (AC 17). If it contains PDL1+something, there has
been a stack trap, and the routine STOPP was called as a
result. The location BADP contains the contents of P at the
time of the trap.
4. If P contains PDL+something, type TAB to look at the top of
the stack. This will contain one plus the address of the
CALL STOP instruction. Type TAB and ^H to display the
CALL STOP instruction that invoked the crash. If MOUNTR died
as a result of a panic channel interrupt, LPC1 will contain
one plus the address of the instruction that caused the
interrupt.
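Steps 3 and 4 amount to two small calculations: deciding which stack area the address in P falls in, and subtracting one from the word on top of the stack. The Python sketch below shows both. The PDL and PDL1 addresses and sizes are hypothetical placeholders (take the real values from MOUNTR's symbol table), and it assumes a section-0 PUSHJ, which stores flags,,return-PC so that the right half is one more than the address of the CALL.

```python
# Interpreting MOUNTR's push-down pointer and stack after a crash.
# PDL/PDL1 addresses and sizes below are hypothetical placeholders.
RH_MASK = 0o777777  # right half of a 36-bit word


def right_half(word: int) -> int:
    return word & RH_MASK


def classify_p(p: int, pdl: int, pdl_size: int, pdl1: int, pdl1_size: int) -> str:
    """Report which stack area the address in P (AC 17) falls in."""
    addr = right_half(p)
    if pdl1 <= addr < pdl1 + pdl1_size:
        return "trap"      # stack trap: STOPP was called, BADP holds old P
    if pdl <= addr < pdl + pdl_size:
        return "normal"    # top of stack holds 1 + address of CALL STOP
    return "unknown"


def call_stop_address(top_of_stack: int) -> int:
    """Address of the CALL STOP that invoked the crash (section-0 PUSHJ)."""
    return right_half(top_of_stack) - 1


# Hypothetical layout: PDL at 1000 (200 words), PDL1 at 1200 (40 words).
p = (0o777730 << 18) | 0o1010                      # -count,,address in AC 17
print(classify_p(p, 0o1000, 0o200, 0o1200, 0o40))  # normal
print(oct(call_stop_address((0o4000 << 18) | 0o12346)))  # 0o12345
```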
The following locations and data structures are helpful in locating
the cause of difficulties in MOUNTR:
NAME      FUNCTION
----      --------
CRSHAC    Contains the ACs at the time the STOP routine was called.
LPC1      For crashes caused by panic channel interrupts, LPC1 contains
          one plus the address of the instruction that caused the crash.
MRPDB     PDB (packet descriptor block) for the last IPCF message
          received by MOUNTR.
RBUF      Last IPCF message received by MOUNTR (particularly useful if
          SSSDAT+1 contains MRCVIH, indicating that MOUNTR crashed while
          processing an incoming IPCF message).
SSSDAT    When MOUNTR crashes, SSSDAT+1 contains the address of the
          routine that was invoked by MOUNTR's scheduler. Starting here
          and using the stack, you can trace the execution of MOUNTR's
          code that led to the crash.
DEBUGGING PA1050
In order to debug the compatibility package, you must have a copy of
the file PAT.EXE. PA1050 is just the system name for PAT. If there is
no copy of PAT.EXE, build one from the source program PAT.MAC (load
it, start it, and save it as PAT.EXE, as in the build steps later in
this section), thereby creating a sharable save file called PAT.EXE.
To debug the compatibility package, the following steps are required:
$RESET
$GET ISAM ;Where ISAM may be any program you choose
$MERGE PAT ;PAT is the source name for PA1050
$DDT
PAT$: MOVBF$b ;You set your breakpoints here
DEBUG$G
$G ;You must type $G twice because of the double symbol table
NOTE
Some error messages you receive from
PA1050 may not be the true error
message. To have the correct error
message printed, use an ERJMP or ERCAL
after the JSYS that fails. For more
information on ERJMP and ERCAL, refer
to the Monitor Calls Reference Manual.
In order to build the compatibility package the following steps are
required.
$LOAD /CREF PAT.MAC
$START
$SAVE PAT
$GET PAT
$DDT
MAKEPF$G
Output file: PA1050.EXE
$
UDDT
40000,,0$X
^Z
$I MEM
Starting the program after loading moves it from its load location to
its running location in high core. The symbol table is also moved,
and its pointer adjusted. A sharable save file of pages 700-777 must
be made for debugging. It is created when you type MAKEPF$G and then
type 40000,,0$X in UDDT. When you then type I MEM (INFORMATION
MEMORY-USAGE), you should have PA1050.EXE in pages 700-730.
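The page numbers here are octal, and a TOPS-20 page is 1000 octal (512) words, so page 700 begins at address 700000. A small Python sketch of the arithmetic follows; the function names are illustrative only.

```python
# Page/address arithmetic behind "pages 700-777" and "PA1050.EXE in 700-730".
# A TOPS-20 page is 1000 octal (512) words, so page = address >> 9.
WORDS_PER_PAGE = 0o1000


def page_of(address: int) -> int:
    return address >> 9


def page_range_addresses(first_page: int, last_page: int) -> tuple:
    """First and last word address covered by an inclusive page range."""
    return first_page * WORDS_PER_PAGE, (last_page + 1) * WORDS_PER_PAGE - 1


def page_count(first_page: int, last_page: int) -> int:
    return last_page - first_page + 1


first_addr, last_addr = page_range_addresses(0o700, 0o777)
print(oct(first_addr), oct(last_addr), page_count(0o700, 0o777))  # 0o700000 0o777777 64
print(page_count(0o700, 0o730))                                   # 25
```

So the debugging save file spans 64 pages, of which PA1050 itself occupies the first 25.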
COPYING FLOPPY DISKS
====================
This is a description of the front end program COP (quick floppy
copy). This program should be used to create backup copies of the
distributed set of floppies.
CAUTIONARY NOTES ABOUT FLOPPY DISKS:
1) Only IBM floppies should be used. Other floppies may
destroy the DX11 drives.
2) Floppies have a finite life while mounted in the
drive. The heads do not float, and the floppies turn
continuously. This causes the magnetic surface to be
worn away. Minimum floppy life is roughly 200
hours.
3) Floppies which are dropped, badly shocked, or used as
frisbees will lose their sector headers, and will be
good for nothing.
4) Never put a floppy which you suspect is bent into the
drive -- it may damage the drive.
5) COP is discussed also in the Front End File System
Specification manual in Volume 14 of the TOPS-20
Software Notebooks, section 3.2.
COP COMMANDS:
The basic COP command string is of the form:
COP> <destination device>/<switch>=<source device>
To enter COP, type a Control-backslash to get to the
Parser, then MCR COP to start up COP. The floppies
should have already been mounted with MCR MOUNT, and
should then be dismounted with MCR DMOUNT after the
copy.
COP SWITCHES:
/HE Help, types a list of switches
/RD Read Device, check for errors
/CP Copy (default action)
/VF Verify copy (default when copy in effect)
/ZE Zero the device
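COP's command line has a simple shape. To make the grammar concrete, here is a Python sketch of a parser for it. It is not COP's code: it accepts only the full destination/switch=source form and the five switches listed above, and the dictionary it returns is a hypothetical representation.

```python
# Parse a COP command string of the form
#   <destination device>/<switch>=<source device>
# Illustrative only; not how COP itself parses its input.
import re

SWITCHES = {"HE": "help", "RD": "read device", "CP": "copy",
            "VF": "verify copy", "ZE": "zero the device"}

COMMAND = re.compile(
    r"(?P<dest>[A-Z]+[0-9]*:)(?:/(?P<switch>[A-Z]{2}))?=(?P<src>[A-Z]+[0-9]*:)"
)


def parse_cop(command: str) -> dict:
    m = COMMAND.fullmatch(command.strip().upper())
    if m is None:
        raise ValueError(f"not of the form dest/switch=source: {command!r}")
    switch = m["switch"] or "CP"   # copy is the default action
    if switch not in SWITCHES:
        raise ValueError(f"unknown switch /{switch}")
    return {"destination": m["dest"], "switch": switch, "source": m["src"]}


print(parse_cop("DX1:=DX0:"))
```

Note that the destination comes first, so "DX1:=DX0:" copies the floppy in DX0: onto the one in DX1:.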
COP EXAMPLE:
The following sequence of commands will succeed in
copying the contents of the floppy in DX0: (the left
hand drive) onto the floppy in DX1:, and verifying the
operation.
^\
PAR>MCR MOU
MOU>DX0:
Mount completed
MOU>DX1:
Mount completed
MOU>^Z
^\
PAR>MCR COP
COP>DX1:=DX0:
COP>^Z
^\
PAR>MCR DMO
DMO>DX0:
Dismount Complete
DMO>DX1:
Dismount Complete
DMO>^Z
The copy takes about two minutes, and the verify about the same.
Take care to specify the correct source and destination
devices.
CAUTIONARY NOTE--
If you use COP for many generations of copies, you
will build up ghost bad blocks until RSX declares the
floppy useless. This is because in each generation
the bad block file of the old floppy is copied onto
the new one (which has its bad blocks in different
physical locations). A way around this is to use PIP
for any non-boot copies once every several generations.
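The ghost bad block effect can be seen in a simplified model. The Python sketch below assumes that a COP copy carries the source floppy's bad block list over and that the new floppy's own bad blocks are also marked, while a PIP (file-by-file) copy leaves only the destination's real bad blocks. The block numbers are invented and the function names are hypothetical.

```python
# Simplified model of "ghost" bad blocks accumulating over COP generations.
# Assumptions: COP copies the source's bad block list and the destination's
# own bad blocks are also marked; PIP leaves only the real ones.
def recorded_after_cop(source_recorded: set, dest_physical: set) -> set:
    """Bad blocks listed on the new floppy after an image copy."""
    return source_recorded | dest_physical


def recorded_after_pip(dest_physical: set) -> set:
    """Bad blocks listed on the new floppy after a file-by-file copy."""
    return set(dest_physical)


generations = [{10}, {200}, {37, 411}, {95}]  # physical bad blocks per floppy
recorded = set(generations[0])
for physical in generations[1:]:
    recorded = recorded_after_cop(recorded, physical)
print(sorted(recorded))  # [10, 37, 95, 200, 411]
```

The last floppy really has one bad block, but after three COP generations five are recorded; a PIP copy at that point would record only {95}.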
THE SWSKIT TOOLS PROGRAMS
=========================
Included on the SWSKIT are a number of utility programs, as summarized
below. These tools have been found to have at least some usefulness
in the past in a debugging environment. Most of these programs
require the user to have WHEEL or OPERATOR privileges to work, but
most are also of the "show and tell but don't touch" category, so they
are in general "safe" to run.
We have cleaned up some of the old ones a bit, added a few new ones,
and checked them all out to the extent that they will all run. There
should even be some documentation, at least a HELP file, with each
program.
While we do not actively "support" these programs, we are quite
willing to accept complaints and suggestions and submissions from the
field.
These are the "standard" tools; the Marlboro Support Group is
generally familiar with their operation and quirks, and in providing
support to the field may request that one or more of the programs be
used at a customer site to diagnose or assist in correcting a problem.
This is generally more effective than random poking about in DDT, or
trying to learn the peculiarities of whatever the customer may have
available.
And now, the current collection:
PROGRAM DESCRIPTION
CHANS This program will produce system
configuration and status information on
tapes and disks.
DIRPNT This program will list the contents of
the blocks in a disk directory.
DIRTST This program will check the format of
directory files and list any invalid data
in them.
DS This program will provide software
diagnostic help concerning the disk file
system.
DSKERR This program will provide a convenient
listing of the hard and soft disk errors
that have occurred.
DX20PC This program will trace the microcode PC
in the DX20.
EXEC-FOR-DEBUGGING-GALAXY This EXEC contains commands to
facilitate debugging a private GALAXY
system.
FILADR This program will display the disk
addresses a file is using, or the
addresses which are marked in the BAT
block.
JSTRAP This program will produce information in
a log on any JSYS, including the PC and
arguments used.
MONRD This program will allow you to easily
examine the running monitor.
READ This program performs the same action as
the CHECK FILE command to DS; it
read-checks files for disk errors.
REV This program will allow you to easily
alter, edit, delete, obtain information,
etc. on files.
RSTRSH This program will detect bug-induced
changes in the resident monitor in a
dump file.
TYPVF7 This program is useful for typing out
the contents of a VFU file in a readable
form.
UNITS This program will produce status
information on disk drives.