

               Dealing with Network Problems

     If you have a problem with your network there  are  two
actions  that  you  can  take.  (1) You  can  do  everything
possible to recover from the error to get the  network  back
on the air as fast as possible.  (2) You can collect all the
information  necessary  for  problem  analysis.    Often   a
combination of these two approaches is a good solution.  The
next section  lists  some  actions  that  you  can  take  in
specific   failure   situations,  along  with  some  of  the
questions you should be asking.  The last section lists  the
information  you  should  have  available  for your software
specialist,  or  obtain  with  the  help  of  your  software
specialist, before writing a network SPR.


RECOVERY FROM FAILURE

     A communications network contains many components which
can  fail.   Quick  recovery  may be as easy as restarting a
program, or turning off a line.  The key is isolation of the
problem.   The  following  paragraphs  outline  a  number of
situations and what can be done in each  to  circumvent  the
problem and recover quickly.


  o Application fails

    1.  Is the  server  process  available?   On  a  TOPS-20
        network,  the  server  must  be  running  before the
        source task.

    2.  If the connection is to another node, is the DN20 on
        line?  Check with the @INFORMATION DECNET command.

    3.  Are  there  sufficient  resources?   An  application
        program often fails because the resources  it  needs
        are not available.  In that case, simply  restarting
        the program will often solve the problem.


  o NETCON fails

    1.  Are you sure?  Some NETCON commands can take a  long
        time  to  respond if the system is heavily loaded or
        the network is full.  Also, some commands will never
        respond;  for example, a command whose executor is a
        node that does not exist.  Try to get NETCON  to  do
        something,  such  as  a  SHOW EXECUTOR command (this
        should be fairly fast).

    2.  At  TOPS-20  level,  issue  a  SYSTAT  OPERATOR  ALL
        command.   What  is  NETCON's apparent state?  Is it
        there at all?

    3.  If NETCON appears to be running, and there is  still
        no  response, restart NETCON according to the DECnet
        manual.  (Use the ESPEAK command.)

    Note  that  reloading  NETCON  should  not  affect  user
    processes,  since  NETCON  is  not  a direct part of the
    network software.


  o Front-end crash

    1.  Again, are you sure?  Does TOPS-20  think  that  the
        front-end is there (use the INFO DECNET command)?

    2.  Issue  the  "SHOW  QUEUE  NCP"  command.    If   the
        front-end has gone down, NETCON should automatically
        reload it;  the queue should  contain  a  reload  in
        process.

    3.  If  the  command  file  used  to  start  the  system
        automatically has disabled auto-reload, you may have
        to reload the front-end manually.  You can  use  the
        TAKE command, specifying the command file.

    Note that reloading the front-end  will  disconnect  any
    user processes that were connected to another node.


  o Remote gone

    1.  The remote  system  may  not  really  be  gone.   In
        Version  2 of DECnet, if the link is not used (idle)
        for a long period of time, the link  may  appear  to
        be  disconnected  even though it may still be there.
        The reason is that Version 2 of DECnet does not send
        periodic "are you there" messages to its remotes.

    2.  To clear this condition, go to whichever system
        thinks the line is still there and turn the line off
        and then on again.

    3.  If the problem does not go away, the  front-end  may
        require reloading.
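
    The effect described in item 1 can be illustrated with a
    small model.  This is a sketch only: the idle limit, the
    time units, and the function names are assumptions made
    for illustration, not a description of how DECnet
    actually tracks its links.  It shows that, with no
    periodic "are you there" messages, a healthy but idle
    remote looks the same as one that has gone away.

```python
from typing import Optional, Sequence


def last_heard(data_times: Sequence[float], now: float,
               keepalive_interval: Optional[float] = None) -> Optional[float]:
    """Return the time the remote was last heard from, or None if never.

    data_times holds the arrival times of user data from the remote.
    If keepalive_interval is given, a healthy remote is assumed (for this
    model only) to send an "are you there" message at every multiple of
    that interval, starting at time 0.
    """
    candidates = [t for t in data_times if t <= now]
    if keepalive_interval:
        # Most recent keepalive tick at or before `now`.
        candidates.append((now // keepalive_interval) * keepalive_interval)
    return max(candidates) if candidates else None


def appears_connected(data_times: Sequence[float], now: float,
                      idle_limit: float,
                      keepalive_interval: Optional[float] = None) -> bool:
    """True if the remote was heard from within idle_limit (hypothetical)."""
    heard = last_heard(data_times, now, keepalive_interval)
    return heard is not None and now - heard <= idle_limit


# The remote is healthy but sent its last user data at time 20.
traffic = [0, 10, 20]
without_keepalive = appears_connected(traffic, now=500, idle_limit=120)
with_keepalive = appears_connected(traffic, now=500, idle_limit=120,
                                   keepalive_interval=60)
print(without_keepalive, with_keepalive)
```

    In this model the link without keepalives appears
    disconnected although nothing has failed, which is why
    turning the line off and on is worth trying before
    reloading the front-end.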


  o Modem and line problems

    1.  If the error rate is high (you  can  determine  this
        with  the SHOW COUNTS command), or if the line fails
        completely, it helps to have a backup modem you  can
        install, or some kind of switching system.

    2.  Set the line OFF, install the new modem, and set the
        line ON again.
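
    The error rate mentioned in item 1 can be estimated from
    two samples of the line counters taken some time apart.
    The sketch below is illustrative: the counter names and
    the threshold are assumptions, and the actual layout of
    the SHOW COUNTS output is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class LineCounts:
    """One sample of a line's counters (field names are hypothetical)."""
    blocks_sent: int
    blocks_received: int
    errors: int


def error_rate(before: LineCounts, after: LineCounts) -> float:
    """Errors per block transferred between two counter samples.

    If no blocks moved at all, return 1.0 when errors were still counted
    (the line is effectively failing) and 0.0 otherwise.
    """
    blocks = ((after.blocks_sent - before.blocks_sent)
              + (after.blocks_received - before.blocks_received))
    errors = after.errors - before.errors
    if blocks <= 0:
        return 1.0 if errors > 0 else 0.0
    return errors / blocks


def should_swap_modem(rate: float, threshold: float = 0.01) -> bool:
    """True if the rate exceeds a site-chosen threshold (assumed value)."""
    return rate > threshold


first = LineCounts(blocks_sent=1000, blocks_received=1000, errors=2)
second = LineCounts(blocks_sent=1500, blocks_received=1500, errors=52)
rate = error_rate(first, second)
print(rate, should_swap_modem(rate))
```

    Comparing two samples rather than reading one total keeps
    old errors from masking or exaggerating the current state
    of the line.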


COLLECTING INFORMATION

     If you choose to call the hotline or prepare an SPR  to
solve a problem with a DECnet network, you should gather, at
a minimum, the following materials.  More  information,  or
the use of a datascope, could be required.


    1.  A complete description of the configuration.

        For each node in the network:  its system type, node
        name,  node  number,  communication interfaces, line
        speeds, modem types, and anything else you can think
        of.

    2.  A complete description of the problem.

        Include  a  description  of  the  apparent   problem
        itself, the time of day, and anything that  happened
        in the systems that  seemed  out  of  the  ordinary,
        whether  or  not  in  the network.  If you have just
        upgraded or changed your configuration, describe the
        changes   you  made.   Include  console  sheets  and
        listings of any possible offending programs.

    3.  The  SYSERR  file  (ERROR.SYS),  or  SYSERR  listing
        (DECsystem-2040/50/60  only) and the console sheets.
        The DECsystem-2020 does not  record  SYSERR  entries
        for  DECnet-20  V2.0.  For other systems include the
        output from the log device.

    4.  All documents  associated  with  the  generation  or
        definition of the network.

        For KL-based DECsystem-20's this includes the NETGEN
        saved image file (NETGEN-CONFIG.IMAGE), the log from
        the network build (BUILD-ALL.LOG), all STB  and  MAP
        files,  and  CETAB.MAC (i.e.  the entire contents of
        the "build" directory).   For  all  TOPS-20  systems
        also  include  the 4-CONFIG.CMD file and any command
        files given to the  NCP  subset  of  OPR  at  system
        start-up time.

        If  you  also  have  non-TOPS-20  systems  in   your
        network,  include all documents generated during the
        NETGEN process.

    5.  Finally, include any dumps that are available.

        A KL or KS system requires the MONITR.EXE  file  for
        symbol  definition  and  the front-end dump requires
        all STB and MAP files for symbol definition.
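
    The files named in items 3 through 5 can be checked off
    mechanically before the SPR is written.  The sketch below
    assumes the materials have been copied into a single
    directory on a machine where Python is available; that
    arrangement, the grouping labels, and the function name
    are illustrative assumptions.  Only the file names come
    from the list above.

```python
import fnmatch
from pathlib import Path

# File names are taken from the list above; group labels are illustrative.
# ERROR.SYS applies to DECsystem-2040/50/60 only.
REQUIRED_FILES = {
    "error log": ["ERROR.SYS"],
    "network generation": ["NETGEN-CONFIG.IMAGE", "BUILD-ALL.LOG",
                           "CETAB.MAC"],
    "system start-up": ["4-CONFIG.CMD"],
    "dump symbols": ["MONITR.EXE"],
}

# At least one file matching each pattern is expected.
REQUIRED_PATTERNS = {
    "network generation": ["*.STB", "*.MAP"],
}


def missing_materials(directory):
    """Return (group, name-or-pattern) pairs not found in `directory`."""
    names = {p.name.upper() for p in Path(directory).iterdir() if p.is_file()}
    missing = []
    for group, files in REQUIRED_FILES.items():
        for name in files:
            if name not in names:
                missing.append((group, name))
    for group, patterns in REQUIRED_PATTERNS.items():
        for pattern in patterns:
            if not any(fnmatch.fnmatchcase(n, pattern) for n in names):
                missing.append((group, pattern))
    return missing
```

    A call such as missing_materials("spr-materials") returns
    an empty list when every named file is present.  The
    check covers file names only; the configuration and
    problem descriptions in items 1 and 2 still have to be
    written by hand.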


     Remember, if your network is a  heterogeneous  network,
the problem may not be easily solved by the information from
one node alone.  Include all information you can obtain.