Dealing with Network Problems
If you have a problem with your network, there are two
actions that you can take. (1) You can do everything
possible to recover from the error and get the network back
on the air as fast as possible. (2) You can collect all the
information necessary for problem analysis. Often a
combination of these two approaches is a good solution. The
next section lists some actions that you can take in
specific failure situations, along with some of the
questions you should be asking. The last section lists the
information you should have available for your software
specialist, or obtain with the help of your software
specialist, before writing a network SPR.
RECOVERY FROM FAILURE
A communications network contains many components which
can fail. Quick recovery may be as easy as restarting a
program, or turning off a line. The key is isolation of the
problem. The following paragraphs outline a number of
situations and what can be done in each to circumvent the
problem and recover quickly.
o Application fails
1. Is the server process available? On a TOPS-20
network, the server must be running before the
source task.
2. If the connection is to another node, is the DN20 on
line? Check with the @INFORMATION DECNET command.
3. Are there sufficient resources? An application
program often fails because the necessary resources
are not available. Simply restarting the program
will often solve the problem.
o NETCON fails
1. Are you sure? Some NETCON commands can take a long
time to respond if the system is heavily loaded or
the network is full. Also, some commands will never
respond; for example, a command whose executor is a
node that does not exist. Try to get NETCON to do
something, such as a SHOW EXECUTOR command (this
should be fairly fast).
2. At TOPS-20 level, issue a SYSTAT OPERATOR ALL
command. What is NETCON's apparent state? Is it
there at all?
3. If NETCON appears to be running, and there is still
no response, restart NETCON according to the DECnet
manual. (Use the ESPEAK command.)
Note that reloading NETCON should not affect user
processes, since NETCON is not a direct part of the
network software.
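
The "are you sure?" check in item 1 amounts to giving a quick
command a deadline before declaring NETCON hung. The sketch below
illustrates that idea only; it does not talk to NETCON. The
function name `probe`, the deadline value, and the callable that
stands in for a fast command such as SHOW EXECUTOR are all
assumptions made for illustration.

```python
import concurrent.futures
import time


def probe(command, timeout_s):
    """Run `command` (a callable standing in for a quick command)
    and report whether it answered before the deadline.

    Hypothetical helper: a slow answer and no answer look the same
    until the deadline passes, which is why a known-fast command
    makes the better probe.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(command)
    try:
        future.result(timeout=timeout_s)
        return "responded"
    except concurrent.futures.TimeoutError:
        return "no response within deadline"
    finally:
        # Do not block on a command that may never finish.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    print(probe(lambda: None, timeout_s=1.0))
    print(probe(lambda: time.sleep(0.3), timeout_s=0.05))
```

On a heavily loaded system the deadline would need to be generous;
a short deadline mainly shows that the probe has not answered yet,
not that NETCON is gone.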
o Front-end crash
1. Again, are you sure? Does TOPS-20 think that the
front-end is there (use the INFO DECNET command)?
2. Issue the "SHOW QUEUE NCP" command. If the
front-end has gone down, NETCON should automatically
reload it; the queue should contain a reload in
process.
3. If the command file used to start the system
automatically has disabled auto-reload, you may have
to reload the front-end manually. You can use the
TAKE command, specifying the command file.
Note that reloading the front-end will disconnect any
user processes that were connected to another node.
o Remote gone
1. The remote system may not really be gone. In
Version 2 of DECnet, if the link is not used (idle)
for some large period of time, the link may appear
to be disconnected. However, it may still be there.
The reason for this is that Version 2 of DECnet does
not send periodic "are you there" messages to its
remotes.
2. To clear this situation, go to whichever system
thinks the line is still there, and turn the line
off and then on again. This should cure the
problem.
3. If the problem does not go away, the front-end may
require reloading.
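
The idle-link behavior described in item 1 can be pictured with a
small model. This is not DECnet code: the class `LinkMonitor`, its
fields, and the 60-second limit in the usage are hypothetical. It
only shows why, without periodic "are you there" messages, a quiet
link and a dead link look alike from one end.

```python
from dataclasses import dataclass


@dataclass
class LinkMonitor:
    """Hypothetical idle-link model; not part of DECnet."""
    idle_limit: float        # seconds of silence before the link looks down
    keepalive: bool = False  # True if periodic "are you there" probes are sent
    last_heard: float = 0.0  # time the last message arrived from the remote

    def message_received(self, now: float) -> None:
        self.last_heard = now

    def state(self, now: float, remote_alive: bool) -> str:
        # With probes enabled, a live remote answers each probe,
        # so silence never accumulates while the remote is up.
        if self.keepalive and remote_alive:
            self.last_heard = now
        if now - self.last_heard > self.idle_limit:
            return "appears disconnected"
        return "up"


if __name__ == "__main__":
    quiet = LinkMonitor(idle_limit=60)
    quiet.message_received(0)
    # The remote is alive but idle; without probes it still looks gone.
    print(quiet.state(now=120, remote_alive=True))

    probed = LinkMonitor(idle_limit=60, keepalive=True)
    print(probed.state(now=120, remote_alive=True))
```

In the model, turning the line off and on corresponds to resetting
the monitor so that both ends agree on the link state again.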
o Modem and line problems
1. If the error rate is high (you can determine this
with the SHOW COUNTS command), or if the line fails
completely, it helps to have a backup modem you can
install, or some kind of switching system.
2. Set the line OFF, install the new modem, and set the
line ON again.
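
Judging whether an error rate is "high" comes down to simple
arithmetic on two counts. The sketch below assumes the figures
have been read by hand from SHOW COUNTS output; the parameter
names `blocks_in_error` and `blocks_total` and the 1 percent
threshold are illustrative assumptions, not actual counter names
or a recommended limit.

```python
def error_rate(blocks_in_error: int, blocks_total: int) -> float:
    """Fraction of blocks that were in error (0.0 if nothing was sent)."""
    if blocks_total <= 0:
        return 0.0
    return blocks_in_error / blocks_total


def line_needs_attention(blocks_in_error: int, blocks_total: int,
                         threshold: float = 0.01) -> bool:
    """True when the error rate exceeds the (assumed) threshold."""
    return error_rate(blocks_in_error, blocks_total) > threshold


if __name__ == "__main__":
    # Hypothetical counts: 5 bad blocks out of 1000, then 50 out of 1000.
    print(line_needs_attention(5, 1000))
    print(line_needs_attention(50, 1000))
```

Comparing the rate before and after swapping the modem is one way
to confirm that the modem, rather than the line, was at fault.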
COLLECTING INFORMATION
If you choose to call the hotline or prepare an
SPR to solve a problem with a DECnet network, you
should gather the following materials at a minimum.
More information, or the use of a datascope, could be
required.
1. A complete description of the configuration.
For each node in the network: its system type, node
name, node number, communication interfaces, line
speeds, modem types, and anything else you can think
of.
2. A complete description of the problem.
Include a description of the apparent problem
itself, the time of day, and anything that happened
in the systems that seemed out of the ordinary,
whether or not in the network. If you have just
upgraded or changed your configuration, describe the
changes you made. Include console sheets and
listings of any possible offending programs.
3. The SYSERR file (ERROR.SYS), or SYSERR listing
(DECsystem-2040/50/60 only) and the console sheets.
The DECsystem-2020 does not record SYSERR entries
for DECnet-20 V2.0. For other systems include the
output from the log device.
4. All documents associated with the generation or
definition of the network.
For KL-based DECsystem-20's this includes the NETGEN
saved image file (NETGEN-CONFIG.IMAGE), the log from
the network build (BUILD-ALL.LOG), all STB and MAP
files, and CETAB.MAC (i.e. the entire contents of
the "build" directory). For all TOPS-20 systems
also include the 4-CONFIG.CMD file and any command
files given to the NCP subset of OPR at system
start-up time.
If you also have non-TOPS-20 systems in your
network, include all documents generated during the
NETGEN process.
5. Finally, include any dumps that are available.
A KL or KS system requires the MONITR.EXE file for
symbol definition, and the front-end dump requires
all STB and MAP files for symbol definition.
Remember, if your network is a heterogeneous
network, the problem may not be easily solved by the
information from one node alone. Include all
information you can obtain.
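
The files named in items 3 through 5 can be treated as a checklist.
The sketch below assumes copies of the materials for a KL-based
system have been collected into a single directory on a machine
that can run Python; that arrangement, the function name
`missing_materials`, and the directory name in the usage are
assumptions. The file names themselves are the ones listed above.

```python
from pathlib import Path

# File names and patterns taken from the list above.
REQUIRED = [
    "ERROR.SYS",            # SYSERR file
    "NETGEN-CONFIG.IMAGE",  # NETGEN saved image file
    "BUILD-ALL.LOG",        # log from the network build
    "*.STB",                # symbol table files
    "*.MAP",                # map files
    "CETAB.MAC",
    "4-CONFIG.CMD",
    "MONITR.EXE",           # needed for symbol definition in dumps
]


def missing_materials(directory):
    """Return the required names or patterns with no match in `directory`."""
    root = Path(directory)
    return [pattern for pattern in REQUIRED if not any(root.glob(pattern))]


if __name__ == "__main__":
    # "spr-materials" is a hypothetical directory name.
    for item in missing_materials("spr-materials"):
        print("still needed:", item)
```

An empty result means every listed file is present; it does not
cover console sheets, configuration descriptions, or dumps, which
still have to be gathered by hand.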