@@ran_quest_help
Part of Instruct consists of a series of sixty questions. The questions
pertain to the system event files (ERROR.SYS and ERRLOG.SYS), the Spear
Library dialogs, and the Spear Library reports.
The Random Question feature is primarily a Course Administrator's tool.
It allows the Course Administrator to randomly select a few questions
that will help determine a student's progress. If the student is able
to answer 8 out of 10 random questions correctly, then chances are he
(or she) understands how to use the Spear Library. If not, then perhaps
a little more study time is needed.
Students can also use the Random Question feature as a self evaluation
tool. To do so, enter a random number in the range of 1 to 50. Instruct
will dispatch to a corresponding random question. Answer the question to
the best of your knowledge. Instruct will evaluate your answer and print
an appropriate message.
At that point you can type: RANDOM and select another random question.
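The range check and the 8-out-of-10 guideline described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Instruct's actual logic; the names QUESTION_RANGE, is_valid_question_number, and shows_understanding are hypothetical.

```python
# Illustrative sketch only; Instruct's real logic is not shown in this file.
QUESTION_RANGE = range(1, 51)   # the help text gives the range as 1 to 50
PASS_MARK = 8                   # the "8 out of 10" guideline from the help text
SAMPLE_SIZE = 10


def is_valid_question_number(text: str) -> bool:
    """Return True if the typed text is a whole number from 1 to 50."""
    text = text.strip()
    return text.isascii() and text.isdigit() and int(text) in QUESTION_RANGE


def shows_understanding(results: list) -> bool:
    """Apply the 8-of-10 guideline to a list of ten True/False results."""
    if len(results) != SAMPLE_SIZE:
        raise ValueError("expected exactly 10 results")
    return sum(1 for correct in results if correct) >= PASS_MARK
```

A number outside the range would lead to the Random Question response error message rather than to a question.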
@@quest_help
You are participating in a teaching dialog informally referred to as the
"Rhetorical Approach to Learning". The approach involves a statement
about a subject, in this case the Spear Library. You are to determine
whether the statement is True or False.
If your answer is correct you will receive a short message and then go
on to the next statement. If your answer is incorrect, then the correct
answer will be explained and the statement will be repeated. If you are
not sure whether the statement is True or False, you can press the
RETURN key and the correct answer will be explained.
In addition to the True, False, and RETURN key response, you can type
NEXT if you want to skip to the next statement. You can also press the
BACKSPACE key if you want to return to the menu.
@@ans_help
You have just answered a question either correctly or incorrectly. You
now have three choices. You can:
1. Press the RETURN key. If you answered the question correctly you
will continue on to the next sequential question. If, however, you
answered the question incorrectly, then the question will be
repeated.
2. Type NEXT to continue on to the next sequential question regardless
of whether you answered the last question correctly or not.
3. Press the BACKSPACE key to repeat the last question regardless of
whether your answer was correct or not.
Your response please:
@@no_help
There really isn't any way that we can help you at this point. Press the
BACKSPACE key and try reading the text again. If it still doesn't make
sense, then contact:
The Spear Team
MRO1-1 / M2
Sorry
@@text_help
Instruct is frame oriented. That is, it displays one frame or block of
information at a time. After you have read the frame you can:
1. Press the RETURN key to proceed to the next frame of information.
2. Press the BACKSPACE key to review the previous frame of information.
3. Type MENU if you want to go back to the subject menu.
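The three frame responses above can be pictured as a small dispatch function. The Python sketch below is hypothetical: it assumes RETURN arrives as an empty response and BACKSPACE as a single backspace character, and it simply stays on the first or last frame at either end, whereas Instruct itself displays a transition notice there.

```python
from typing import Union

RETURN_KEY = ""        # assumption: RETURN arrives as an empty response
BACKSPACE_KEY = "\b"   # assumption: BACKSPACE arrives as one \b character


def handle_frame_response(frame: int, last_frame: int,
                          response: str) -> Union[int, str]:
    """Return the next frame number, or "MENU" to go back to the subject menu."""
    if response == RETURN_KEY:
        return min(frame + 1, last_frame)   # next frame, clamped at the end
    if response == BACKSPACE_KEY:
        return max(frame - 1, 1)            # previous frame, clamped at the start
    if response.strip().upper() == "MENU":
        return "MENU"
    raise ValueError(f"unrecognized response: {response!r}")
```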
@@menu_help
Instruct is organized around a hierarchy of subject menus. The menus
allow you to use Instruct as a reference tool. The top item on the
menu (item 0) introduces the subjects and explains their relationship.
The remaining items are subjects. You can select any item on the menu
by typing the number that corresponds to the item. You can also press
the RETURN key to automatically proceed to the first subject on the
menu. If you want to go back to the previous menu in the hierarchy you
can type MENU.
@@fwd_trans_help
Instruct is organized around a hierarchy of subject menus. You can use
the RETURN key feature to sequence through the subjects listed on the
menu. Each time you move from one subject to another you will be notified.
At this point you can choose to go on by pressing the RETURN key, or you
can choose to go back to the menu and select a different subject by typing
MENU.
@@rev_trans_help
Instruct is designed in such a way that you can go forward and backward
through the subject matter. Each time you move from one subject to
another you will be notified. In this case you were notified that you
were about to back into the previous subject on the menu. At this point
you can:
1. Type MENU to go back to the subject menu.
2. Press the RETURN key to go back to where you came from.
3. Press the BACKSPACE key, or type /REVERSE to continue backing up.
However, if the subject that you are backing into required multiple
frames of text to explain, then you will back into the last frame.
4. Type BEGIN to back up to the first frame of the subject that you
are backing into.
@@ran_quest_res_error_msg
This is the Random Question response error message. The number that you
entered is not within the range of 1 to 50.
@@text_res_error_msg
This is the text response error message. Instruct displays one page of
text at a time. After you have read the text you can:
1. Press the RETURN key to go on to the next page.
2. Press the BACKSPACE key or type /R to go back to the previous page.
3. Type MENU to go back to the menu and select another subject.
4. Type /B to return to the Spear prompt. If you are using a student
ID, and if you specify that ID at the Instruct prompt, you will
return to the page that you were at when you typed /B.
5. Type anything else and you will get this message.
@@menu_res_error_msg
This is the menu response error message. Instruct uses a hierarchy of
menus. The menus allow you to use Instruct as a quick reference tool.
At a menu you can:
1. Type the number on the menu that corresponds to the subject that you
are interested in.
2. Type MENU to go back to the previous (higher level) menu.
3. Type /B to return to the Spear prompt. If you are using a student
ID, and if you specify that ID at the Instruct prompt, you will
return to the page that you were at when you typed /B.
4. Press the BACKSPACE key or type /R. You will get a message stating
that you are about to back into the Introduction to the menu.
5. Type anything else and you will get this message.
@@fwd_trans_res_error_msg
This is the forward response error message. You can sequence through
Instruct by pressing the RETURN key. If you do so, you will sequence
through an Introduction, followed by a menu, followed by the first
subject, followed by the second subject, etc. You will be notified each
time you move from one subject to another. At that point you can:
1. Press the RETURN key to continue sequencing through Instruct.
2. Press the BACKSPACE key or type /R to repeat the last page of text.
3. Type MENU to go back to the menu and select another subject.
4. Type /B to return to the Spear prompt. If you are using a student
ID, and if you specify that ID at the Instruct prompt, you will
return to the point that you were at when you typed /B.
5. Type anything else and you will get this message.
@@rev_trans_res_error_msg
This is the reverse-transition prompt/response error message. You are
sequencing through Instruct in a reverse direction. You were notified
that you are about to move in a reverse direction from one
subject to another. You can:
1. Press the RETURN key to begin sequencing in a forward direction.
2. Press the BACKSPACE key to continue going in a reverse direction.
3. Type BEGIN to go to the beginning of the subject.
4. Type MENU to go back to the menu and select another subject.
5. Type /B to return to the Spear prompt.
6. Type anything else and you will get this message.
@@ans_res_error_msg
This is a response error message. Your response does not match the list
of acceptable responses. For further information press the RETURN key,
then type: ? or HELP.
@@farewell
Instruct bids you farewell.
Type /Break to return to Spear.
@@course_admin
Spear Course Administrator and Student Guide
Course Description
The Instruct course consists of four main modules:
1. Fault Isolation Techniques - This module describes the nature of
intermittent faults and discusses some of the most common methods
used to isolate intermittent system and subsystem failures.
2. System Event File Organization and Content - This module describes
the overall organization and content of TOPS-10, TOPS-20, and
VAX/VMS system event files.
3. Spear Library Functions - This module explains how to use each
of the Spear maintenance functions: Retrieve, Summarize,
and Compute.
4. Guaranteed Uptime Program/NOTIFY - This module describes the GUP
service which ensures the highest level of reliability for your
system. This module also explains how to use NOTIFY to calculate
statistics and to log information related to system uptime.
@@course_admin_a
Each module consists of an introduction and a menu of subordinate
subjects. When appropriate, the subordinate subjects are further
broken down into introductions and menus. Thus, Instruct can be
used as both a tutorial and a reference tool.
If you want to use Instruct as a tutorial (i.e., sequence through
the course much as you would read a book) you can do so using the
RETURN key. You will proceed to the module introduction, then the
menu, then the first subject on the menu, followed by the next
subject, etc.
If you want to use Instruct as a reference tool, then instead of
pressing the RETURN key at the menu, select the subject number that
interests you. You will proceed directly to that subject. If, after
investigating the subject you want to return to the menu, type MENU.
@@course_map
Course Map
        ______________________________________
        |  Guaranteed Uptime Program/NOTIFY  |
        ______________________________________
                         ^                      |--Applications
                         |                      |--Summarize
        _________________________________       |--Compute
        |    Using the Spear Library    |-------|--Retrieve
        _________________________________       |--Klerr
                         ^
                         |
        _________________________________
        |      System Event Files       |
        _________________________________
                         ^
                         |
        ______________________________________    ___________________________
        | Course Administrator/Student Guide |-->|      Troubleshooting      |
        ______________________________________    ___________________________
@@course_map_a
The course map suggests a sequence to follow to learn about
Spear. This sequence reflects the following factors:
Spear processes the system event file and generates a number
of reports which are useful in supporting the system.
Spear allows the user to produce the following reports:
Summary of the system faults by device and time.
System reliability and uptime reports.
Dump of event log entries in multiple formats.
Spear also allows the user to maintain the event file, and
includes its own instruction package for its use.
@@feedback
Feedback is an important part of any system design. Technically,
feedback is defined as a representative sample of the output used
to control or correct the process.
The process, in this case, is The Spear Library. The output is the
ability of the Spear Library to help you evaluate system performance
and solve service related system problems. If you have any ideas or
suggestions for improving the usefulness of The Spear Library, please
contact:
Digital Equipment Corporation
The SPEAR Team MRO1-1 / M2
200 Forest Street
Marlboro, Mass. 01752
Thank you;
The Spear Team
@@random_question
The Random Question feature allows you to enter a random number in the
range of 1 to 50. Instruct will respond by presenting you with a random
question based on the course content.
This feature can be used by anyone who has a few minutes, and who would
like to pick up a few tidbits about the use of The Spear Library. The
feature can also be used by The Course Administrator as a tool to spot
check student progress.
After being informed that you have correctly answered a question, you
may select another random question by typing "RANDOM".
Type <return> if you wish to enter the random question mode.
@@spear_man
Using The Spear Manual
You can use The Spear Manual as a learning aid, a user's guide, or a
reference tool.
As a Learning Aid: Chapters 1, 2, and 3 provide an overview of the
Spear Library. They also provide background information required to
understand and use the Spear library.
As a User's Guide: Chapters 4 and 5 provide step-by-step procedures for
using the Spear functions: Retrieve, Summarize, and Compute.
The chapters explain, in detail, the command syntax and the response
parameters associated with each function.
As a Reference Tool: Chapter 6 and the appendices provide reference
material such as system event file formats, event record descriptions,
and examples of the report formats. This chapter and the appendices are
for reference only. They are not meant to be read from beginning to end.
@@R.T.cou_ovr_a
STOP - You are moving in a reverse direction through the menu. You are
about to back into the Course Administrator/Student Guide.
@@1.M.
Troubleshooting
Topic menu:
1. Attitude vs. Approach
2. The Formal Approach
3. The Systematic Approach
4. The Variable Approach
@@1.1.
Attitude vs. Approach
First and foremost, your success as a problem solver depends more on
your attitude than it does on your approach. Quite simply, if you
believe that you can (solve a particular problem), then you probably
will; if you believe that you can't, then you probably won't.
The only thing that a problem has going for it is your attitude.
Therefore, with the right attitude, you can solve almost any problem.
It's just a matter of time. Never give up and you'll never lose.
@@1.1.A.
Approach
The way you approach the solution to a problem will also, to a large
extent, determine your success as a problem solver. The more logical
and systematic your approach, the more successful you're likely to be.
Next on the menu are a couple of systematic problem solving approaches
that I think you will find to be both interesting and quite effective.
@@R.T.1.1.A.
STOP - You are moving in a reverse direction through the menu. You are
about to back into the Attitude vs. Approach section of the course.
Your response please:
@@1.2.
The Formal Approach
The Formal Approach consists of seven steps:
1. RESEARCH and DEFINE the problem (what is, or is not, happening)
2. VENTURE a testable educated guess (as to the cause of the problem)
3. SETUP a practical experiment (to test the educated guess)
4. PREDICT the result (before you conduct the experiment)
5. CONDUCT the experiment (keep an accurate set of notes)
6. EVALUATE the result (compare the actual and predicted results)
7. REFINE the definition and REPEAT the process (beginning with step 2)
@@1.2.A.
Step 1 - RESEARCH and DEFINE the problem - If you're not familiar with
the system, begin your research at the Branch office. Look over the
records for the last couple of weeks. Try to get an idea of the size
and the application of the system. Also, find out when the system was
last serviced, by whom, and why.
When you first arrive on site take five or ten minutes to talk with the
customer, the operator, or anyone else that may be able to explain the
problem. Here's a partial list of the type of questions that you should
ask:
How serious is the problem?
How long has it been going on?
Has the system ever had a problem like this before?
How has the system been performing lately?
Have there been any recent hardware or software changes?
@@1.2.B.
You can define the problem at the same time that you are doing the
research. Ask yourself three questions:
1. What is happening that shouldn't?
2. What is not happening that should?
3. What are the surrounding conditions?
The first two questions will help you identify the main error symptom.
The third question will help you identify the context or circumstances
that surround the symptom. That's important, because it's practically
impossible to solve a problem out of context.
@@1.2.C.
Once again, the questions to ask yourself when defining a problem are:
What is happening that shouldn't?
What is not happening that should?
What are the surrounding conditions?
The definition should be as complete as possible. It should also state,
in clear and concise terms, the major symptom and the conditions or
circumstances that surround that symptom. One more thing, and this is
important, you should write the definition down, at least in note form.
For example:
Def - 4 days/2020/256K/cache/TOPS-20(4.1)/UBANXM/freq:12-14 hrs.
@@1.2.D.
Or more formally:
During the last four days, the system, a 2020 with 256K and cache,
running TOPS-20 (4.1) has crashed about every 12 or 14 hours with
a UBANXM Bug Halt.
Note that the definition states only one main error symptom, UBANXM Bug
Halt. The rest of the information describes the conditions that surround
the error symptom (i.e., the context of the problem).
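One way to picture the note form is as a record with one main symptom and a set of surrounding conditions. The Python sketch below is purely illustrative; the class and field names are hypothetical and are not part of Spear or Instruct.

```python
from dataclasses import dataclass


@dataclass
class ProblemDefinition:
    """One main error symptom plus the conditions that surround it."""
    duration: str    # how long the problem has been going on
    system: str      # processor model
    memory: str
    options: str     # for example, cache
    software: str    # operating system and version
    symptom: str     # the ONE main error symptom
    frequency: str   # how often the symptom recurs

    def note_form(self) -> str:
        """Render the definition in the slash-separated note form."""
        fields = [self.duration, self.system, self.memory, self.options,
                  self.software, self.symptom, f"freq:{self.frequency}"]
        return "Def - " + "/".join(fields)


definition = ProblemDefinition("4 days", "2020", "256K", "cache",
                               "TOPS-20(4.1)", "UBANXM", "12-14 hrs.")
print(definition.note_form())
```

Keeping the symptom in a field of its own makes it harder to fold a second, unrelated symptom into the same definition.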
@@1.2.E.
Sometimes, however, a system will exhibit multiple error symptoms. In
such a case, each error symptom (including the surrounding conditions)
should be stated separately. This is important because, when you first
start working, you have no way of knowing, for sure, whether or not the
system actually has multiple problems.
Therefore, assume the worst case. If a system exhibits multiple error
symptoms treat each symptom separately. That way you will eliminate the
possibility of multiple errors compounding the problem solving process.
Also, if you separate multiple error symptoms, then you can investigate
the most obvious symptom first, which is sound troubleshooting practice.
@@1.2.F.
Review - The key points discussed so far are:
1. Talk to anyone who may know something about the problem.
2. DEFINE the problem. Find out exactly:
What is happening that shouldn't?
What is not happening that should?
What are the surrounding conditions?
3. Remember to get all the conditions and circumstances. It's next to
impossible to solve a problem out of context.
4. Write down the definition, at least in note form. Be clear, concise,
and as complete as possible.
5. Treat each error symptom as if it were a separate problem.
6. Attempt to solve the most obvious problem first.
@@1.2.G.
Step 2 - VENTURE a testable educated guess (TEG) as to the cause of the
problem. The truth of the matter is, when you first start out to solve
a problem, you can't know (for sure) what the cause is. Therefore, you
really don't have much of a choice; you have to begin with a guess.
Fortunately, if the guess is testable, it does not have to be accurate.
In fact, your first few guesses probably won't be accurate. But, if you
use this approach and your guesses are testable, then they will quickly
become accurate. In other words, they will either:
a) lead you directly to the cause of the problem, or
b) they will lead you to the realization that you could use some help.
Either way, you win.
@@1.2.H.
Here are a couple of testable educated guesses (TEGs) to go along with the
problem that was identified and defined earlier:
Def - 4 days/2020/256K/cache/TOPS-20(4.1)/UBANXM/freq:12-14 hrs.
TEG #1. A low voltage condition exists at one of the UBAs.
TEG #2. One of the Unibus cables is improperly seated.
@@1.2.I.
REMEMBER
TEGs don't have to be earth shattering.
But they do have to be testable.
@@1.2.J.
Step 3 - SETUP an experiment that will prove, or disprove, your TEG.
The experiment should be carefully thought out. You should make every
effort to ensure that it is a true, and accurate test of your guess.
Take your time. Make sure that your experiment is not inadvertently
testing something other than your TEG.
Here's why. If your experiment turns out to test something other than
your TEG, and you don't realize it, then you are liable to misinterpret
the result. Consequently, you may find yourself tripping down the Old
Garden Path.
@@1.2.K.
The Old Garden Path, by the way, is an expression that refers to a
troubleshooting tangent, a lesson in pure frustration. The path or
tangent leads you away from the real cause of the problem, contributes
very little useful information, and consumes lots of valuable time
and effort.
So, give yourself a break. Don't take a chance on a trip down the garden
path. Instead, use the time to carefully think out your experiment.
@@1.2.L.
The experiment doesn't have to be complex or elaborate. Let's go back to
the problem definition and TEGs that we used earlier, and see if we can
devise a couple of simple experiments that will prove, or disprove, the
TEGs.
Def - 4 days/2020/256K/cache/TOPS-20(4.1)/UBANXM/freq:12-14 hrs.
TEG #1. A low voltage condition exists at one of the UBAs.
Exp #1. Use a DVM to test the voltage at each UBA.
TEG #2. One of the Unibus cables is improperly seated.
Exp #2. Clean and reseat each cable in the Unibus.
@@1.2.M.
Review - The key points discussed so far are:
1. Research and Define the problem (in writing). Find out exactly:
a) What is happening that shouldn't?
b) What is not happening that should?
c) What are the surrounding conditions?
2. Treat each error symptom as if it were a separate problem. Then,
select the most obvious problem and work on it.
3. Venture a testable educated guess (TEG) as to what might be causing
the problem.
4. Setup an experiment that will either prove, or disprove your guess.
Take your time. Make sure the experiment is a valid test. If it's
not, you may waste a lot of time chasing a tangent.
If you've been following this course, right around now you should be
getting some idea of how effective a problem solving approach such as
this can be. Essentially, it is a systematic process of elimination.
Properly used it will isolate and ultimately eliminate virtually any
problem a system can develop. It's just a matter of time.
@@1.2.N.
Step 4 - PREDICT the result of the experiment before you conduct it.
The purpose of this step is to double check the validity of your
experiment. The prediction should be based on the assumption that:
1. Your TEG or guess is absolutely correct.
2. Your experiment is a true and valid test of your TEG.
Both of these assumptions will be verified later in Step 6.
@@1.2.O.
As trivial as this step may seem, it should never be skipped. Nor should
you ever leave it up to "maybe" type thinking:
Maybe...this will happen (or) Maybe...that will happen
When it comes to your experiment and the predicted result, "maybe" type
thinking leads to: "Gee.. that's interesting; wonder what it means" type
curiosity. And that, my friend, will lead you right down the Old Garden Path.
Therefore, if you decide to use this problem solving approach, keep in
mind that your prediction should be explicitly stated and well thought
out. Don't get tricked into going off on a wild turkey chase.
@@1.2.P.
Getting back to our example, let's add a couple of predictions:
Def - 4 days/2020/256K/cache/TOPS-20(4.1)/UBANXM/freq:12-14 hrs.
TEG #1. A low voltage condition exists at one of the UBAs.
Exp #1. Use a DVM to test the voltage at each UBA.
Pre #1. The voltage at one of the UBAs will be out of tolerance.
TEG #2. One of the Unibus cables is improperly seated.
Exp #2. Clean and reseat each cable in the Unibus.
Pre #2. One of the cables will be loose or dirty.
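The same notes can be kept as simple records, so that no TEG is tested before its experiment and prediction are written down. The Python sketch below is an illustration only; the Teg class and its is_ready check are hypothetical names, not part of Spear.

```python
from dataclasses import dataclass


@dataclass
class Teg:
    """One testable educated guess with its experiment and prediction."""
    guess: str
    experiment: str
    prediction: str

    def is_ready(self) -> bool:
        # Ready to conduct only when all three parts are written down.
        return all(part.strip() for part in
                   (self.guess, self.experiment, self.prediction))


tegs = [
    Teg("A low voltage condition exists at one of the UBAs.",
        "Use a DVM to test the voltage at each UBA.",
        "The voltage at one of the UBAs will be out of tolerance."),
    Teg("One of the Unibus cables is improperly seated.",
        "Clean and reseat each cable in the Unibus.",
        "One of the cables will be loose or dirty."),
]

for number, teg in enumerate(tegs, start=1):
    print(f"TEG #{number}. {teg.guess}")
    print(f"Exp #{number}. {teg.experiment}")
    print(f"Pre #{number}. {teg.prediction}")
```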
@@1.2.Q.
Well, that's it for the hard part. The last three steps are relatively
simple and straight-forward. But before we go on, let's quickly review
the main points:
1. Research and Define the problem. Find out:
a) What is happening that shouldn't?
b) What is not happening that should?
c) What are the surrounding conditions?
2. Treat each error symptom as if it were a separate problem.
3. Venture a testable educated guess (TEG) as to the cause of the
problem.
4. Setup an experiment that will either prove, or disprove your TEG.
5. Predict the result of the experiment in advance. Assume that your
TEG is correct and your experiment is valid. Be explicit. State (in
writing) exactly what you expect to happen.
6. Avoid "maybe" type thinking. It's liable to get you into trouble.
@@1.2.R.
Step 5 - CONDUCT the experiment - This is the most exciting step in the
formal problem solving process. Here's all you have to do. Either:
1. Check the voltage at each UBA.
2. Clean and reseat each UBA cable.
Unfortunately, this is where a lot of people fall down. They're
overwhelmed by the task. So they tend to put it off. After all, checking
the voltage at each UBA, or cleaning and reseating each UBA cable is
not a five minute job.
@@1.2.S.
But, if you properly set-up the experiment, then half the job is done.
Now if it's going to take a while to conduct the experiment, set a time
limit. Don't rush, but try to estimate how long it will take. You might
be surprised to find that, once you are set up, it only takes a few minutes
to check the voltage at each UBA. So if you have five UBAs to check, you
could easily be done in ten minutes. That's not so bad.
@@1.2.T.
When it comes to cleaning and reseating cables, however, that can be a
large undertaking. At two or three minutes per connector, that could
require forty-five minutes or an hour to complete. At this point you
might want to revise your experiment: you might decide to clean and
reseat a third of the UBA cables, and see if that corrects the problem.
There is a trade-off involved here. You must consider: the seriousness
of the problem, the frequency of recurrence, and the amount of time
and effort necessary to prove (or disprove) your TEG. The decision is
subjective, and entirely up to you. The rule of thumb here is: Do what
you think is right.
@@1.2.U.
Step 6 - EVALUATE the result - After conducting the experiment, compare
the predicted result with the actual result. If they match, then you have
accomplished one of two things.
1. You have either identified the cause of the problem, or
2. You have gathered some new, fairly reliable, information that
you can use to refine the problem definition.
If the predicted result and the actual result do not match, however,
then there is a conflict. Either the experiment tested something other
than your TEG, or your understanding of the experiment (the prediction)
was incorrect. In either case you should STOP IMMEDIATELY.
@@1.2.V.
You must figure out which was in error: the experiment, or the predicted
result. If, after some thought, you decide that the predicted result was
in error, that's ok. It means that the experiment was, in fact, a valid
test of your TEG. And, therefore, the result can be used with confidence
to refine the definition of the problem.
If, however, you discover that the experiment was in error; that is, the
experiment was not a valid test of your TEG, then be very careful. You
should reconsider the entire situation and either revise the experiment
in such a way that it is a valid test of your hypothesis, or scrap the
whole thing and start over again.
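The decision described in Step 6 can be summarized as a small function. This Python sketch is only an illustration; the names Verdict and evaluate are hypothetical, and the experiment_was_valid argument stands for the technician's own judgement after stopping to think.

```python
from enum import Enum


class Verdict(Enum):
    MATCH = "cause identified, or new information for the definition"
    PREDICTION_IN_ERROR = "experiment was valid; refine the definition"
    EXPERIMENT_IN_ERROR = "revise the experiment or start over"


def evaluate(predicted: str, actual: str, experiment_was_valid: bool) -> Verdict:
    """Compare the predicted and actual results as described in Step 6.

    experiment_was_valid only matters when the two results do not match.
    """
    if predicted == actual:
        return Verdict.MATCH
    if experiment_was_valid:
        return Verdict.PREDICTION_IN_ERROR
    return Verdict.EXPERIMENT_IN_ERROR
```

The function cannot decide whether the experiment was valid; that judgement still has to be made by the person doing the troubleshooting.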
@@1.2.W.
Now you can see the importance of predicting the result of an experiment
before you conduct it. If you are unable to determine whether or not the
experiment was, in fact, a valid test of your TEG, then you're liable to
"assume" that it was. And that kind of an assumption may lead you right
down the Old Garden Path.
The point here is: if you know what you expect to happen, then you are
much more likely to recognize cases where the experiment is not testing
what you think it is.
@@1.2.X.
CHANGES - If an experiment requires that you change the system in any
way (swap a cable, perform an adjustment, exchange a module, etc.) be
sure that you can restore the system to its original state should you
need to. One foolproof way of doing that is to keep notes. Notes don't
forget.
Surely, in the past at least, some very successful technicians didn't
keep any notes at all. But that doesn't mean that they shouldn't have;
it only means that they didn't. And that's too bad, because that means
that they were using part of their brain muscle to recall facts; therefore,
less of their brain muscle was available to think about solving
the problem. Besides, once you get used to it, thinking is much more fun
than recalling facts. Don't you think?
@@1.2.Y.
Either you disagree, or you're not thinking.
@@1.2.Z.
So do I.
@@1.2.A1.
Back to CHANGES. If you change the system and the change doesn't correct
the problem, then you should restore the system to its original state
as soon as possible. If you don't, then you should realize that you are
running the risk of introducing new problems into the system and thus,
compounding the situation.
@@1.2.B1.
NEW SYMPTOMS - Finally, if you restore the system to its original state
and find that the symptoms have changed, STOP. Don't go on until you are
satisfied that you know the reason WHY the symptoms changed.
Remember
Symptoms Change For A Reason
@@1.2.C1.
Step 7 - REFINE the definition and REPEAT the process beginning with
Step 2 - Venture a TEG. This is the last step. Append the TEG, the
experiment, and the result of the experiment to the problem definition.
Even if the experiment disproved the TEG, append the information to
the definition. At least you know one thing that is not causing the
problem.
Then, once again, ask yourself:
1. What is happening that shouldn't?
2. What is not happening that should?
3. What are the surrounding conditions?
@@1.2.D1.
Take your time appending the new information to the problem definition.
Follow the same guidelines that you followed when you first constructed
the definition; be clear, be concise, and be as accurate and complete as
possible. That's one of the keys to using this problem solving approach
successfully.
Finally, close the loop. In other words, venture a new TEG, setup a new
experiment to test the TEG, predict the result, conduct the experiment,
evaluate the result, refine the definition, and continue to close the
loop. Eventually, if you use this approach, one of two things will
happen. Either:
1. you will identify and ultimately eliminate the problem, or
2. you will flat run out of TEGs, time, or both and end up calling for
support. But even a call for support is a TEG of sorts, because
you won't know until the end, whether or not you needed support from
the beginning.
@@1.2.E1.
One last word before we go on to the final summary. Earlier, we talked
about Attitude vs. Approach. During the discussion the statement was
made "never give up and you will never lose". The statement does not
mean never call support.
Support is a tool. It's there to help you do your job more efficiently.
Don't be afraid to use it. But, please be prepared to describe the exact
problem, what you've done, why, and what the results were. It will save
a lot of time, and you will get much better service.
@@1.2.F1.
Never give up really means, never let a problem go without finding out
what the cause was, and how the cause was finally isolated. Even if you
have to leave a problem (i.e., let someone else take over) always follow
up. Get back to the individual that solved the problem and find out what
the cause was and how he or she arrived at that conclusion.
That way, in your mind, no problem will go unsolved. And that's where
the solution takes place, in the mind. So never give up, never let a
problem go unsolved, and you'll never lose. It's as simple as that.
@@1.2.G1.
Final Summary:
1. RESEARCH and DEFINE the problem. Find out exactly:
a) What is happening that shouldn't?
b) What is not happening that should?
c) What are the surrounding conditions?
2. VENTURE a testable educated guess (TEG) as to the cause.
3. SETUP a practical experiment that will prove, or disprove, your TEG.
4. PREDICT the result before conducting the experiment. Know what you
expect to happen. Don't leave it up to "maybe" type thinking.
5. CONDUCT the experiment (keep an accurate set of notes). If you change
the system, restore it to its original state before you go on.
6. EVALUATE the result (predicted vs. actual). If the symptoms changed,
they changed for a reason. Find out why before you go on.
7. REFINE the definition and REPEAT the process (beginning with step 2)
Tighten the loop. It's simply a matter of time and TEGs.
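The seven steps form a loop, which can be sketched roughly as follows. This Python sketch is a simplification under a stated assumption: every experiment is taken to be a valid test of its TEG, so a result that differs from the prediction counts as the TEG being ruled out. The function name, its arguments, and the sample results are all hypothetical.

```python
def troubleshoot(definition, tegs, run_experiment):
    """Work through the TEGs until one is confirmed or the list runs out.

    definition     -- list of note lines describing the problem (step 1)
    tegs           -- list of (guess, experiment, prediction) tuples (steps 2-4)
    run_experiment -- callable that conducts an experiment and returns
                      the actual result (step 5)
    """
    for guess, experiment, prediction in tegs:
        actual = run_experiment(experiment)
        confirmed = (actual == prediction)   # step 6, assuming a valid test
        # Step 7: append what was learned, even if the TEG was disproved.
        status = "confirmed" if confirmed else "ruled out"
        definition.append(f"{guess} ({experiment}): {status}")
        if confirmed:
            return guess
    return None   # out of TEGs: time to call for support


# Made-up results, for illustration only.
results = {"check voltage": "in tolerance",
           "reseat cables": "loose cable found"}
notes = ["Def - 4 days/2020/256K/cache/TOPS-20(4.1)/UBANXM/freq:12-14 hrs."]
cause = troubleshoot(
    notes,
    [("low voltage at a UBA", "check voltage", "out of tolerance"),
     ("Unibus cable improperly seated", "reseat cables", "loose cable found")],
    results.get,
)
print(cause)
```

Each pass through the loop adds a line to the notes, so the definition keeps a record of what has already been ruled out.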
@@F.T.1.3.
That concludes the explanation of the Formal Troubleshooting Approach.
Next on the menu is the Systematic Approach.
@@R.T.1.2.G1
STOP - You are moving in a reverse direction through the menu. You are
about to back into the Formal Troubleshooting Approach.
Your response please:
@@1.3.
Systematic Substitution
Some old school hard-line purist technicians may not agree, but under
certain circumstances systematic substitution (of spare parts) is a
perfectly valid troubleshooting approach.
For example, let's assume that you are at home, working
in your cellar. Furthermore, let's assume that you are using a circular
saw to cut a 2 x 8 piece of oak planking.
@@1.3.A.
Suddenly, the saw binds, the lights dim and then they go out. From the
symptoms (the lights are out) and the conditions (at the time of failure
the saw was operating under a heavy load) you might logically conclude
that a fuse had blown.
Now, let's assume that you light a match and find your way to the fuse
box. Upon opening the box you discover a package of spare fuses and a
wiring diagram of the house. You decide to light another match. This
time you discover that the fuse box contains six 15amp fuses - two rows
of three fuses. At the same time, however, you also realize that the
match does not provide enough light for you to determine which fuse is
blown.
@@1.3.B.
At this point you have, roughly, six options:
1. You can stall for time hoping that the problem will disappear. This,
however, is not a very practical solution because: problems don't
just happen, they are caused; and, although some problems may go
away temporarily, very rarely do they just disappear. Therefore, the
best approach is to identify and eliminate the cause of the problem.
So much for the wishful thinking approach.
@@1.3.C.
2. You can call an electrician, but that could be very expensive.
@@1.3.D.
3. You can go get a flashlight (if you can find one that works) and use
it to identify the blown fuse.
@@1.3.E.
4. You can light a couple more matches, study the wiring diagram and
attempt to figure out which fuse is blown. But let's say that your
ability to read an electrical wiring diagram is a bit rusty. So this
could be quite time consuming, and the results are not certain.
@@1.3.F.
5. You can use the spares and randomly substitute fuses until the
lights come on (i.e., the symptoms go away). This is a very risky
approach, however, because you could lose track of which fuses you
did and did not substitute. Thus, you could accidentally overlook the
blown fuse and conclude that something else was causing the problem.
@@1.3.G.
6. You can use the spares and systematically substitute each fuse until
the lights come on. You might choose to begin with the upper left
most fuse and substitute left-to-right top-to-bottom. If, in fact,
the problem is being caused by a blown fuse, then sooner or later the
lights will come back on.
@@1.3.H.
Now let's say that, after careful consideration of all six options, you
reject wishful thinking and random substitution because they are both
risky and impractical. Next, you dismiss the possibility of calling an
electrician because it seems unnecessary and it could be very expensive.
Finally, you eliminate using the wiring diagram to figure out exactly
which fuse is blown. The idea is feasible and even tempting, but under
the circumstances it's just too time consuming. Remember, you want to
get the saw back on line so you can finish cutting that piece of wood.
@@1.3.I.
That leaves you with two options: either go get a working flashlight,
or try the systematic substitution approach. If you opt for the
flashlight, that's a trip upstairs, a few minutes to locate a flashlight,
a trip back down to the cellar, and 30 seconds to replace the blown fuse.
Total time expended: approximately five minutes. (That does not include
the time required to return the flashlight so that you can find it the
next time you need it.)
But suppose that instead of opting for the flashlight, you opted for the
systematic substitution approach and, on the fourth try, you locate the
blown fuse. Total time expended (at 30 seconds per fuse): 2 minutes with
no trips involved. Not bad.
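The arithmetic behind that comparison can be sketched in a few lines of
Python. The 30 seconds per fuse, the six fuses, and the five minute
flashlight trip come from the story; the function name is hypothetical
and is used only for illustration.

```python
SECONDS_PER_FUSE = 30              # from the story: 30 seconds per fuse
FUSE_COUNT = 6                     # two rows of three fuses
FLASHLIGHT_TRIP_SECONDS = 5 * 60   # approximate total for the flashlight option


def substitution_seconds(blown_position):
    """Time spent before the lights come back on.

    blown_position is 1-based and follows the left-to-right,
    top-to-bottom order: 1 is the upper left fuse, 6 the lower right.
    """
    if not 1 <= blown_position <= FUSE_COUNT:
        raise ValueError("position must be between 1 and %d" % FUSE_COUNT)
    return blown_position * SECONDS_PER_FUSE


if __name__ == "__main__":
    for position in range(1, FUSE_COUNT + 1):
        print(position, substitution_seconds(position), "seconds")
    print("flashlight:", FLASHLIGHT_TRIP_SECONDS, "seconds")
```

Even in the worst case, where the blown fuse is the sixth one tried,
the systematic approach takes three minutes and still beats the
flashlight.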
@@1.3.J.
One last supposition: suppose that instead of working in your cellar,
you had just arrived on site. Instead of a dead power line you're faced
with a failing subsystem. Instead of six fuses in a box, the subsystem
consists of four modules, a cable and a power supply. Finally, instead
of a box of spare fuses and a wiring diagram you have a spares kit, a
scope, and a set of prints. But the diagnostics that you need are not
on site. The same six options apply:
1. You can stall for time wishing the problem will go away.
2. You can call for support.
3. You can go back to the office and get the diagnostics that you need.
4. You can study the print set and try to figure out what's wrong.
5. You can randomly substitute the spares and hope to solve the problem.
6. You can systematically substitute the spares and quickly identify the
cause. There are, however, some things that you should be aware of:
@@1.3.K.
A. You must approach the substitution process systematically. If you
don't, you'll become confused and end up resorting to the random
method of substitution. The random method is so prone to error that
it's just not worth it.
B. If there are more than a few modules involved, keep notes. You may
not always need them, but when you do you'll find that they're worth
their weight in gold.
C. If you substitute a module and the problem doesn't go away, replace
the original module immediately. If you don't, you'll run the risk
of introducing new problems into the system. Spares tend to have a
higher failure rate than modules that have been in use for awhile.
D. If you substitute a module and the symptoms change, STOP. Replace the
original module. If the original symptoms return, then chances are
you have come upon a bad spare. Try it one more time. If the results
are the same, tag the spare right away. If you don't, you're likely
to forget, and reliable spares are a must.
@@1.3.L.
E. If you substitute a module and the symptoms change, and they remain
changed even after you replace the original, STOP. Chances are you
inadvertently changed something and didn't realize it. Retrace every
step. Symptoms change for a reason. Find the reason. Don't run the
risk of compounding the problem.
F. If you substitute a module and it seems to solve the problem, don't
stop. Confirm the fix. Return the original module. The symptoms
should reappear. If they don't, then you can't be sure that you found
the problem. If they do, then you can be pretty sure that you got it.
But, don't stop yet. Run the diagnostics one more time. Make sure
that no new problems have crept into the system. Finally, hang around
a few minutes and make sure that the equipment comes back on line OK.
G. Back to the case where the spare seemed to correct the problem, but
when you replaced the original module to confirm the fix, everything
seemed to work fine. In this case you may, or may not have identified
the cause of the problem. You don't know. So, leave the spare in the
system, tag the suspect module as potentially intermittent, and save
your notes. Such situations call for a different type of confirmation
technique.
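Rules C through G amount to a small decision table, which can be
sketched in Python as follows. This is only an illustration: the
symptom labels ("same", "changed", "gone", "original", "none") and the
function name are hypothetical, and the advice strings paraphrase the
rules rather than replace them.

```python
# Key: (symptoms with the spare installed,
#       symptoms after the original module is put back, or None if
#       the original has not been put back yet)
ADVICE = {
    ("same", None): "C: put the original module back immediately",
    ("changed", None): "D: stop and put the original module back",
    ("changed", "original"): "D: suspect a bad spare; retry once, then tag it",
    ("changed", "changed"): "E: stop and retrace every step",
    ("gone", None): "F: put the original module back to confirm the fix",
    ("gone", "original"): "F: fix confirmed; run the diagnostics once more",
    ("gone", "none"): "G: leave the spare in and tag the original as suspect",
}


def advise(after_swap, after_restore=None):
    """Look up the rule that applies to the observed symptoms."""
    try:
        return ADVICE[(after_swap, after_restore)]
    except KeyError:
        raise ValueError("no rule covers this combination") from None


if __name__ == "__main__":
    print(advise("gone"))
    print(advise("gone", "none"))
```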
@@1.3.M.
The technique is called the subjective time window. To use it, you must
establish a period of time during which you will monitor the problem.
Usually a week is adequate, if the problem was solid. If the problem was
intermittent, however, then you must determine the average time between
failures, triple it (at least), and use that as the period of time
during which you will monitor the problem.
If the problem does not recur during the time window that you set up,
then you can assume that you solved it. Tag the suspect module as
intermittent, return it for repair, file your notes, and close out
the paper work. If, however, the problem does recur, then you're all
set: replace the original module, update your notes, and pick up
where you left off. That's all there is to it. As some of the old
school hard-liners would say: Hey, at least you know what it's not,
and that's worth something.
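As a worked example, the length of the time window can be computed as
shown in this Python sketch. It assumes that the failure rate is
expressed as the typical number of days between failures; the function
and constant names are hypothetical.

```python
SOLID_WINDOW_DAYS = 7     # "usually a week is adequate" for a solid problem
INTERMITTENT_FACTOR = 3   # "triple it (at least)"


def monitoring_window_days(days_between_failures=None):
    """Return the minimum number of days to monitor the system.

    Pass None for a solid problem. For an intermittent problem, pass
    the observed number of days between failures (an assumption about
    how the failure rate is expressed).
    """
    if days_between_failures is None:
        return SOLID_WINDOW_DAYS
    if days_between_failures <= 0:
        raise ValueError("days_between_failures must be positive")
    return INTERMITTENT_FACTOR * days_between_failures


if __name__ == "__main__":
    print(monitoring_window_days())    # solid problem
    print(monitoring_window_days(4))   # fails about every four days
```

For a problem that showed up about every four days, the window is at
least twelve days.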
@@F.T.1.4.
That concludes the explanation of the Systematic Substitution Approach.
Next on the menu is the Variable Approach.
@@R.T.1.3.M.
STOP - You are moving in a reverse direction through the menu. You are
about to back into the Systematic Substitution Approach.
Your response please:
@@1.4.
The Variable Approach
This short story was once told by a senior field service engineer to
illustrate a VERY important point about using the variable approach
to isolate the cause of an intermittent failure. The story is about a
telephone conversation he had while working for another company (not
to be mentioned).
@@1.4.A.
"At first, the diagnostic only failed every hour or two. So I performed
all the standard checks and adjustments. The problem got a little worse,
but it still wasn't solid. So then, I decided to vary the voltage and
clock margins awhile. That helped some. I pulled out a marginal module
and the symptoms changed so I knew I was getting closer.
Then I thought, maybe the problem had something to do with temperature,
so I blocked the fans for a few minutes. I just wanted to see if varying
the temperature would have any effect. Finally, I tapped around with the
back of my screw driver awhile. That really helped. I found a couple of
vibrational modules. But now I seem to have a new problem - I can't even
load the diagnostics. What do you think is wrong?"
@@1.4.B.
At that point, the senior engineer would bellow: "Now that's what I call
a dumb question - Obviously the guy beat the poor thing to death."
The story served its purpose. Clearly, it illustrates the problem with
indiscriminately using the variable approach. That is, if you're not
really careful, you're likely to cause more problems than you solve.
@@1.4.C.
The reason is: systems frequently operate in a controlled environment
for long periods of time. As a result the environmental operating range
of the system narrows. Normally, this is not a problem. As long as the
environment remains relatively stable the system will run indefinitely.
Keep in mind that if an intermittent problem is, in fact, being caused
by an environmentally sensitive component, then just a slight variation
in voltage, temperature or clock speed should be enough to aggravate it.
The rule of thumb is: BE CAREFUL.
@@1.4.D.
After all, if you had an intermittent problem, would you want a doctor
to double your heart rate in an effort to determine whether or not the
problem had something to do with your circulatory system? Probably not,
because a lot of working parts could get damaged in the process.
Keep that in mind next time you use the variable approach to isolate an
elusive intermittent system problem. It will, in the long run, save you
a lot of unnecessary grief and irritation. And don't forget the rule of
thumb:
BE CAREFUL
@@ts_end
Well, that concludes the Troubleshooting section of the course. We hope
that you found it useful. Also, if you have any comments or know of any
other troubleshooting approaches that you think should be added to this
section please get in touch with us. We're listed under FEEDBACK on the
main course menu.
Thank You
@@2.0.
System Event Files
(Overview)
Most operating systems maintain a system event file. The event file is
used to record information about certain events that happen within the
system (e.g., system reloads, configuration changes, hardware and
software detected errors, etc.).
The classification and type of information that is recorded in a system
event file is unique to the operating system maintaining the event file.
For example:
TOPS-10 supports approximately 55 event categories.
TOPS-20 supports approximately 25 event categories.
VAX/VMS supports approximately 20 event categories.
@@2.0.A.
The event categories are listed on the back of the Spear Reference card.
File Structures - There is nothing special about the file structure
associated with a system event file.
a. If the event file is maintained by a TOPS-10 operating system,
then it conforms to the standard TOPS-10 file structure. For
further information about the TOPS-10 file structure refer to
The TOPS-10 Software NoteBook 17 (Monitor Table Descriptions).
b. If the event file is maintained by a TOPS-20 operating system,
then it conforms to the standard TOPS-20 file structure. For
further information about the TOPS-20 file structure refer to
The TOPS-20 Software NoteBook 16 (Monitor Table Descriptions).
c. If the event file is maintained by a VAX/VMS operating system,
then it conforms to the standard VAX/VMS file structure. For
further information about the VAX/VMS file structure refer to
The VAX/VMS Software Support Notebook.
@@R.T.2.0.A.
STOP - You are moving in a reverse direction through the menu. You are
about to back into the System Event File Overview.
@@2.M.
System Event Files
Topic Menu:
1. Overview
2. TOPS-10 System Event Files
3. TOPS-20 System Event Files
4. VAX/VMS System Event Files
5. DEFINE.LIS
@@define_lst
DEFINE.LIS is a text file that describes the hardware and/or software
status that is saved for each entry type in both the TOPS-10 and the
TOPS-20 system event file. DEFINE.LIS is normally stored in the system
documentation area. To obtain a copy of the file type:
PRINT <DOC>DEFINE.LIS<cr>
If DEFINE.LIS is not in the system documentation area you can get a
copy from the Spear distribution tape. There are two procedures; one
for TOPS-10, the other for TOPS-20.
@@pri_define_tops_10
TOPS-10 procedure to copy DEFINE.LIS from the Spear tape to your area.
Assign a magtape (xxx), mount the Spear tape, run BACKUP, and type:
/TAPE MTxxx:<cr>
/REWIND<cr>
/INTERCHANGE<cr>
/FILES<cr>
/SUPERSEDE ALWAYS<cr>
/SKIP 1<cr>
Note: BACKUP will print "DONE" and reprompt. Type:
/RESTORE DEFINE.LIS = DEFINE.LIS<cr>
Note: BACKUP will print the following message and reprompt. Type:
!
"DEFINE LST"
"DONE"
/UNLOAD<cr>
/EXIT<cr>
Note: Remove and return the Spear distribution tape. Then type:
PRINT DEFINE.LIS<cr>
@@pri_define_tops_20
TOPS-20 procedure to copy DEFINE.LIS from the Spear tape to your area.
Assign a magtape (xxx), mount the Spear tape, run DUMPER, and type:
DUMPER> TAPE MTxxx:<cr>
DUMPER> REWIND<cr>
DUMPER> INTERCHANGE<cr>
DUMPER> FILES<cr>
DUMPER> SUPERSEDE ALWAYS<cr>
DUMPER> SKIP 1<cr>
Note: DUMPER will print two information messages and reprompt. Type:
DUMPER> RESTORE PS:<*>DEFINE.LIS PS:<your directory><cr>
Note: DUMPER will print the following message and reprompt. Type:
% RESTORING FILES TO PS:<your directory>
PS:<*>DEFINE.LIS => DEFINE.LIS [OK]
DUMPER> UNLOAD<cr>
Note: Remove and return the Spear distribution tape. Then type:
PRINT DEFINE.LIS<cr>
@@tops_10_ef
TOPS-10 System Event Files
This section of Instruct consists of a series of questions that pertain
to the TOPS-10 System Event File (ERROR.SYS). Before you attempt to
answer the questions you should review Chapter 2 of the Spear Manual.
Don't forget, you can use the /BREAK feature and return via your ID.
@@tops_10_ef_a
Press the RETURN key when you are ready.
@@tops_10_ef_q1
TOPS-10 System Event Files - Q1 of 10
True or False - Many of the questions that pertain to the TOPS-10
system event file also pertain to the TOPS-20 system event file.
@@tops_10_ef_q1_at
That's correct.
In fact, the questions are practically identical. In many cases so are
the answers. Therefore, if you have already answered the questions as
they pertain to the TOPS-20 system event file, then you can probably
afford to skip this section of Instruct. Of course, on the other hand,
you may want to answer the questions anyway. If that's the case, then
don't be confused by the redundancy.
@@tops_10_ef_q1_af
The statement is TRUE. The TOPS-10 system event file (ERROR.SYS) and the
TOPS-20 system event file (ERROR.SYS) are very similar. Therefore, it
stands to reason that many of the questions that pertain to one event
file will also pertain to the other event file.
@@tops_10_ef_q2
TOPS-10 System Event Files - Q2 of 10
True or False - The TOPS-10 System Event File is called ERROR.SYS.
@@tops_10_ef_q2_at
That's correct.
Both the TOPS-10 and the TOPS-20 system event file are called ERROR.SYS.
The VAX/VMS system event file is called ERRLOG.SYS.
@@tops_10_ef_q2_af
The statement is TRUE. The idea of a system event file (ERROR.SYS) was
first implemented in the early 1970's for TOPS-10. Initially, the file
was used only to record main memory, channel, and disk errors. The idea
proved to be a good one and new entries were added to the file until now
ERROR.SYS is the main source of information for solving intermittent
system failures.
In the mid 1970's the idea of a system event file along with the file
name ERROR.SYS was carried over to TOPS-20. Thus, both the TOPS-10 and
the TOPS-20 system event file are called ERROR.SYS.
@@tops_10_ef_q3
TOPS-10 System Event Files - Q3 of 10
True or False - Prior to the Spear library, TOPS-10 used a program
called SYSERR to record entries in the system event files.
@@tops_10_ef_q3_at
The statement is FALSE. Neither SYSERR nor the Spear library has
anything to do with the recording of entries in the system event file.
That is strictly a function of the operating system.
Both SYSERR and the Spear library are designed to process the contents
of the system event file. SYSERR was a report generator. Basically, it
allowed the user to select and translate specific entries in the event
file. The SPEAR library (SYSERR's replacement) is more sophisticated. In
addition to translating event file entries it also attempts to localize
the cause of intermittent disk and tape subsystem failures. Note, however,
that neither SYSERR nor Spear has anything to do with recording the
system event file.
@@tops_10_ef_q3_af
That is correct.
Both SYSERR and its replacement, the SPEAR library, are designed to
process the contents of the system event file. They have nothing to do
with recording the entries. That is strictly a function of the operating
system.
@@tops_10_ef_q4
TOPS-10 System Event Files - Q4 of 10
True or False - All hardware detected failures are recorded in the
system event file.
@@tops_10_ef_q4_at
The statement is FALSE. Only failures that require operating system
intervention are recorded in the system event file. Failures that do
not require operating system intervention are not recorded in the event
file. For example, some subsystems attempt error recovery locally. In
most cases, if the recovery is successful then the operating system is not
notified. Thus, those kinds of errors are normally not recorded in the
system event file.
@@tops_10_ef_q4_af
That's correct.
Only errors that require operating system intervention are recorded in
the system event file.
@@tops_10_ef_q5
TOPS-10 System Event Files - Q5 of 10
True or False - Every record in a TOPS-10 system event file consists of
a header section and a body section.
@@tops_10_ef_q5_at
That's correct.
Furthermore, the header and body section of each entry type is described
in a file called DEFINE.LIS. To obtain a copy of DEFINE.LIS, refer to
Appendix A on the Event File Menu.
@@tops_10_ef_q5_af
The statement is TRUE. Each entry in the TOPS-10 system event file
consists of a header section and a body section. The header identifies
the entry type (i.e., event code), the date and time that the entry was
recorded, the processor serial number, the length of the header section
and the length of the body section. Currently, the header section is set
at four words; the body section varies in size depending on the type of
entry.
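The header fields listed above can be pictured with the following
Python sketch. It is only a model: the field names are hypothetical,
and the actual word and bit layout is the one given in DEFINE.LIS, not
the one shown here.

```python
from dataclasses import dataclass

HEADER_WORDS = 4   # the header section is currently set at four words


@dataclass
class EventHeader:
    """Model of the information carried in an entry header."""
    event_code: int         # identifies the entry type
    recorded_at: str        # date and time the entry was recorded
    processor_serial: int   # processor serial number
    header_length: int      # length of the header section, in words
    body_length: int        # length of the body section, in words

    def record_length(self):
        """Total size of the entry, header plus body, in words."""
        return self.header_length + self.body_length


if __name__ == "__main__":
    # All values below are made up for illustration.
    header = EventHeader(1, "1-Jan-80 12:00:00", 1234, HEADER_WORDS, 20)
    print(header.record_length(), "words")
```

Because the body length varies by entry type, a reader needs both
lengths from the header to find where the next entry starts.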
@@tops_10_ef_q6
TOPS-10 System Event Files - Q6 of 10
True or False - Each record in a TOPS-10 system event file represents
one complete system event.
@@tops_10_ef_q6_at
The statement is true with one exception, KLERR. KLERR entries are built
by the console front-end whenever the KL10 crashes. When the system is
restarted the entry is transferred via the DTE to KL main memory and then
recorded in the system event file.
Because the buffer area set aside for communications between the console
and KL main memory is significantly smaller than a typical KLERR entry,
the entry is divided into segments. Each segment is given a unique sequence
number and recorded as a separate record in the event file. Technically,
therefore, the statement is FALSE.
@@tops_10_ef_q6_af
That's correct.
The KLERR entry consists of multiple records. Each record has a separate
sequence number. When a KLERR entry is translated, however, only the
first sequence number is used to identify the entry. The other sequence
numbers are masked-out to avoid confusion.
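The way one KLERR entry becomes several records can be sketched in
Python as follows. The function names are hypothetical, and the sketch
assumes that the sequence numbers are consecutive, which the text does
not state.

```python
def split_entry(words, buffer_words, first_sequence):
    """Split one long entry into buffer-sized records.

    Returns a list of (sequence_number, segment) pairs. Consecutive
    sequence numbers are an assumption made for this sketch.
    """
    if buffer_words <= 0:
        raise ValueError("buffer_words must be positive")
    records = []
    for offset in range(0, len(words), buffer_words):
        sequence = first_sequence + offset // buffer_words
        records.append((sequence, words[offset:offset + buffer_words]))
    return records


def entry_id(records):
    """Only the first sequence number identifies the translated entry."""
    return records[0][0]


if __name__ == "__main__":
    # A 10 word entry sent through a 4 word buffer becomes 3 records.
    records = split_entry(list(range(10)), 4, 100)
    for sequence, segment in records:
        print(sequence, segment)
    print("entry is identified as", entry_id(records))
```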
@@tops_10_ef_q7
TOPS-10 System Event Files - Q7 of 10
True or False - The synchronization word is used to recover from hard
read errors that occur while reading the system event file.
@@tops_10_ef_q7_at
That's correct.
Whenever Spear uses the synchronization word to recover from a hard
read error it will print the message "Bad header found - RESYNCing".
@@tops_10_ef_q7_af
The statement is TRUE. The first word in each system event file data
block is a synchronization pointer. The pointer points to the starting
location of the next record in the file. Thus, if a hard read error
occurs while reading a record, Spear skips to the next data block, reads
the sync word, finds the starting location of the next record, and
continues reading the file.
The idea of adding a synchronization word to each data block in a system
event file was incorporated in the mid 1970's. Prior to that time, if a
hard read error occurred while reading the event file, the remaining records
in the file were lost. Now only the records affected by the read error are
lost.
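The recovery can be illustrated with a simplified Python model. This
is a sketch under stated assumptions, not the way Spear is actually
coded: each block is a list of words whose first word is the sync
pointer, None stands for a block that cannot be read, each record
begins with its own length in words, a sync pointer of 0 means that no
record starts in the block, and a length of 0 marks the end of the
data. None of these conventions come from the event file format.

```python
def read_records(blocks):
    """Return every record that can be recovered from the blocks."""
    records = []
    partial = []     # words of the record being assembled
    needed = 0       # words still missing from that record
    resync = True    # True when the sync word must be used
    for block in blocks:
        if block is None:            # simulated hard read error
            partial, needed, resync = [], 0, True
            continue
        if resync:
            pos = block[0]           # sync word: where the next record starts
            if pos == 0:             # assumed: no record starts in this block
                continue
            resync = False
        else:
            pos = 1                  # continue right after the sync word
        while pos < len(block):
            if needed == 0:
                length = block[pos]
                if length == 0:      # assumed end-of-data marker
                    return records
                partial, needed = [], length
            take = min(needed, len(block) - pos)
            partial.extend(block[pos:pos + take])
            needed -= take
            pos += take
            if needed == 0:
                records.append(partial)
    return records


if __name__ == "__main__":
    good = [[1, 3, 1, 1, 4, 2], [3, 2, 2, 3, 3, 3], [1, 2, 4, 0, 0, 0]]
    print(read_records(good))
    damaged = [good[0], None, good[2]]   # the middle block is unreadable
    print(read_records(damaged))
```

In the damaged case only the two records that touch the unreadable
block are lost. The record that starts in the following block is found
through that block's sync word.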
@@tops_10_ef_q8
TOPS-10 System Event Files - Q8 of 10
True or False - When the TOPS-10 operating system detects a device error
the following occurs:
1. Normal operation is suspended and applicable hardware and/or software
status is captured (at error) and saved in the Unit Data Block (UDB).
2. If applicable, an error recovery algorithm is applied.
3. Regardless of whether the recovery algorithm is successful or not,
the applicable hardware and/or software status is captured again
(at end) and appended to the UDB.
4. The error status stored in the UDB is formatted, assigned a sequence
number, and appended to the system event file.
5. If the system was able to recover from the error, normal operation
continues. If, however, the system was unable to recover from
the error, then the job affected by the error is notified and it
handles the error.
@@tops_10_ef_q8_at
That's correct.
The action outlined in the question is typical of the way TOPS-10
handles most device errors. Non-device errors (e.g., CPU errors) and
errors that affect the operating system itself are also handled in a
similar manner. If, however, there is no recovery algorithm or if the
recovery algorithm is unsuccessful, then those errors may result in a
user job or system crash.
@@tops_10_ef_q8_af
The statement is TRUE. Most TOPS-10 device errors are handled this way.
@@tops_10_ef_q9
TOPS-10 System Event Files - Q9 of 10
True or False - The exact content and format of each TOPS-10 event
record is described in the Spear Manual.
@@tops_10_ef_q9_at
The statement is FALSE. The Spear Manual does describe the report
formats generated by Retrieve, but it does not describe the content
and format of the actual event records.
@@tops_10_ef_q9_af
That's correct.
The event records are described in a file called DEFINE.LIS.
@@tops_10_ef_q10
TOPS-10 System Event Files - Q10 of 10
True or False - The fifth word in a 011 type record is used to save the
results of the DATAI performed at the time of the failure.
Note: Refer to DEFINE.LIS. If you do not have a copy of DEFINE.LIS and you
want one, refer to Appendix A on the Event File Menu.
@@tops_10_ef_q10_at
The statement is FALSE. Open DEFINE.LIS to the 011 entry. It starts
some place around line number 00450. The line numbers are listed at the
left of the page.
Now skipping over the word, byte, and bit definitions, go down the
center, or word number column, until you get to word number 5. To the
left you will see that word number 5 is defined as CONI_INITIAL. To the
right you will see that CONI_INITIAL is described as "controller status
at error".
Now find word 16. You will see that it is defined as "RH_DATA_BAR_ERR",
and described as: DATAI from RH10 block address register at error time.
@@tops_10_ef_q10_af
That's correct.
Word 5 is used to save the CONI status word. The DATAI status is saved
in word 16. Anytime you want to know exactly what hardware and software
status is saved in an entry type you can consult DEFINE.LIS.
Now, if you haven't already done so, take a few minutes to look over the
contents of the file. The introduction explains the overall organization
and format of an event file record. Following the introduction, each of
the event types is described in detail.
When you are finished, take a few more minutes and compare the reports
listed in the Spear Manual with the corresponding record descriptions
listed in DEFINE.LIS. As a result, you will have a better understanding
of the system event file and the reports that are generated from it.
@@tops_10_ef_lq
That's it. There are only 10 questions about TOPS-10 System Event Files.
Press the RETURN key to return to the System Event File Menu.
@@tops_20_ef
TOPS-20 System Event Files
This section of Instruct consists of a series of questions that pertain
to the TOPS-20 System Event File (ERROR.SYS). Before you attempt to
answer the questions you should review Chapter 2 of the Spear Manual.
Don't forget, you can use the /BREAK feature and return via your ID.
@@tops_20_ef_a
Press the RETURN key when you are ready.
@@tops_20_ef_q1
TOPS-20 System Event Files - Q1 of 10
True or False - Many of the questions that pertain to the TOPS-10
system event file also pertain to the TOPS-20 system event file.
@@tops_20_ef_q1_at
That's correct.
In fact, the questions are practically identical. In many cases so are
the answers. Therefore, if you have already answered the questions as
they pertain to the TOPS-10 system event file, then you can probably
afford to skip this section of Instruct. Of course, on the other hand,
you may want to answer the questions anyway. If that's the case, then
don't be confused by the redundancy.
@@tops_20_ef_q1_af
The statement is TRUE. The TOPS-10 system event file (ERROR.SYS) and the
TOPS-20 system event file (ERROR.SYS) are very similar. Therefore, it
stands to reason that many of the questions that pertain to one event
file will also pertain to the other event file.
@@tops_20_ef_q2
TOPS-20 System Event Files - Q2 of 10
True or False - The TOPS-20 System Event File is called ERROR.SYS.
@@tops_20_ef_q2_at
That's correct.
Both the TOPS-10 and the TOPS-20 system event file are called ERROR.SYS.
The VAX/VMS system event file is called ERRLOG.SYS.
@@tops_20_ef_q2_af
The statement is TRUE. The idea of a system event file (ERROR.SYS) was
first implemented in the early 1970's for TOPS-10. Initially, the file
was used only to record main memory, channel, and disk errors. The idea
proved to be a good one and new entries were added to the file until now
ERROR.SYS is the main source of information for solving intermittent
system failures.
In the mid 1970's the idea of a system event file along with the file
name ERROR.SYS was carried over to TOPS-20. Thus, both the TOPS-10 and
the TOPS-20 system event file are called ERROR.SYS.
@@tops_20_ef_q3
TOPS-20 System Event Files - Q3 of 10
True or False - Prior to the Spear library, TOPS-20 used a program
called SYSERR to record entries in the system event files.
@@tops_20_ef_q3_at
The statement is FALSE. Neither SYSERR nor the Spear library has
anything to do with the recording of entries in the system event file.
That is strictly a function of the operating system.
Both SYSERR and the Spear library are designed to process the contents
of the system event file. SYSERR was a report generator. Basically, it
allowed the user to select and translate specific entries in the event
file. The SPEAR library (SYSERR's replacement) is more sophisticated. In
addition to translating event file entries it also attempts to localize
the cause of intermittent disk and tape subsystem failures. Note, however,
that neither SYSERR nor Spear has anything to do with recording the
system event file.
@@tops_20_ef_q3_af
That is correct.
Both SYSERR and its replacement, the SPEAR library, are designed to
process the contents of the system event file. They have nothing to do
with recording the entries. That is strictly a function of the operating
system.
@@tops_20_ef_q4
TOPS-20 System Event Files - Q4 of 10
True or False - All hardware detected failures are recorded in the
system event file.
@@tops_20_ef_q4_at
The statement is FALSE. Only failures that require operating system
intervention are recorded in the system event file. Failures that do
not require operating system intervention are not recorded in the event
file. For example, some subsystems attempt error recovery locally. In
most cases, if the recovery is successful then the operating system is not
notified. Thus, those kinds of errors are normally not recorded in the
system event file.
@@tops_20_ef_q4_af
That's correct.
Only errors that require operating system intervention are recorded in
the system event file.
@@tops_20_ef_q5
TOPS-20 System Event Files - Q5 of 10
True or False - Every record in a TOPS-20 system event file consists of
a header section and a body section.
@@tops_20_ef_q5_at
That's correct.
Furthermore, the header and body section of each entry type is described
in a file called DEFINE.LIS. To obtain a copy of DEFINE.LIS, refer to
Appendix A on the Event File Menu.
@@tops_20_ef_q5_af
The statement is TRUE. Each entry in the TOPS-20 system event file
consists of a header section and a body section. The header identifies
the entry type (i.e., event code), the date and time that the entry was
recorded, the processor serial number, the length of the header section
and the length of the body section. Currently, the header section is set
at four words; the body section varies in size depending on the type of
entry.
@@tops_20_ef_q6
TOPS-20 System Event Files - Q6 of 10
True or False - Each record in a TOPS-20 system event file represents
one complete system event.
@@tops_20_ef_q6_at
The statement is true with one exception, KLERR. KLERR entries are built
by the console front-end whenever the KL10 crashes. When the system is
restarted the entry is transferred via the DTE to KL main memory and then
recorded in the system event file.
Because the buffer area set aside for communications between the console
and KL main memory is significantly smaller than a typical KLERR entry,
the entry is divided into segments. Each segment is given a unique sequence
number and recorded as a separate record in the event file. Technically,
therefore, the statement is FALSE.
@@tops_20_ef_q6_af
That's correct.
The KLERR entry consists of multiple records. Each record has a separate
sequence number. When a KLERR entry is translated, however, only the
first sequence number is used to identify the entry. The other sequence
numbers are masked-out to avoid confusion.
@@tops_20_ef_q7
TOPS-20 System Event Files - Q7 of 10
True or False - The synchronization word is used to recover from hard
read errors that occur while reading the system event file.
@@tops_20_ef_q7_at
That's correct.
Whenever Spear uses the synchronization word to recover from a hard
read error it will print the message "Bad header found - RESYNCing".
@@tops_20_ef_q7_af
The statement is TRUE. The first word in each system event file data
block is a synchronization pointer. The pointer points to the starting
location of the next record in the file. Thus, if a hard read error
occurs while reading a record, Spear skips to the next data block, reads
the sync word, finds the starting location of the next record, and
continues reading the file.
The idea of adding a synchronization word to each data block in a system
event file was incorporated in the mid 1970's. Prior to that time, if a
hard read error occurred while reading the event the remaining records
in the file were lost. Now only the record effected by the read error is
lost.
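The recovery described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the block size,
the record header (a single length word), and the use of 0 to mean "no
record starts in this block" are all hypothetical and are not taken
from the actual ERROR.SYS layout. The names build_file, resync, and
read_records are invented for the sketch.

```python
class HardReadError(Exception):
    """Raised by this toy reader when a word cannot be read."""


def build_file(records, block_size):
    """Pack records into words; word 0 of every block is a sync pointer."""
    words, sync = [], {}
    for rec in records:
        body = [len(rec)] + list(rec)          # header word = payload length
        for k, word in enumerate(body):
            if len(words) % block_size == 0:
                words.append(0)                # reserve the block's sync word
            if k == 0:
                block = len(words) // block_size
                sync.setdefault(block, len(words) % block_size)
            words.append(word)
    for block, offset in sync.items():
        words[block * block_size] = offset     # 0 = no record starts here
    return words


def resync(words, block_size, failed_index):
    """Use the next block's sync word to find the next record start."""
    start = (failed_index // block_size + 1) * block_size
    for base in range(start, len(words), block_size):
        if words[base]:
            return base + words[base]
    return len(words)                          # nothing left to read


def read_records(words, block_size, bad=()):
    """Return every record that can be read; 'bad' simulates hard errors."""
    bad = set(bad)

    def read(i):
        if i in bad:
            raise HardReadError(i)
        return words[i]

    records, pos = [], 0
    while pos < len(words):
        try:
            if pos % block_size == 0:
                pos += 1                       # step over the sync word
                continue
            length, payload = read(pos), []
            pos += 1
            while len(payload) < length:
                if pos % block_size != 0:      # sync words are not payload
                    payload.append(read(pos))
                pos += 1
            records.append(payload)
        except HardReadError as err:
            pos = resync(words, block_size, err.args[0])
    return records
```

With four records packed into 4-word blocks, an unreadable word inside
the second record costs only that record; the reader picks up again at
the record start named by the next non-zero sync word.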
@@tops_20_ef_q8
TOPS-20 System Event Files - Q8 of 10
True or False - When the TOPS-20 operating system detects a device error
the following occurs:
1. Normal operation is suspended and applicable hardware and/or software
status is captured (at error) and saved in a buffer.
2. If applicable, an error recovery algorithm is applied.
3. Regardless of whether the recovery algorithm is successful or not,
the applicable hardware and/or software status is captured again
(at end) and appended to the buffer.
4. The contents of the buffer are formatted, assigned a sequence number,
and appended to the system event file.
5. If the system was able to recover from the error, normal operation
continues. If, however, the system was unable to recover from
the error, then the job affected by the error is notified and it
handles the error.
@@tops_20_ef_q8_at
That's correct.
The action outlined in the question is typical of the way TOPS-20
handles most device errors. Non-device errors (e.g., CPU errors) and
errors that affect the operating system itself are also handled in a
similar manner. If, however, there is no recovery algorithm or if the
recovery algorithm is unsuccessful, then those errors may result in a
user job or system crash.
@@tops_20_ef_q8_af
The statement is TRUE. Most TOPS-20 device errors are handled this way.
@@tops_20_ef_q9
TOPS-20 System Event Files - Q9 of 10
True or False - The exact content and format of each TOPS-20 event
record is described in the Spear Manual.
@@tops_20_ef_q9_at
The statement is FALSE. The Spear Manual does describe the report
formats generated by Retrieve. But it does not describe the content
and format of the actual event records.
@@tops_20_ef_q9_af
That's correct.
The event records are described in a file called DEFINE.LIS. To obtain a
copy of DEFINE.LIS refer to Appendix A on the Event File Menu.
P.S. You will need a copy of DEFINE.LIS to answer the next question.
@@tops_20_ef_q10
TOPS-20 System Event Files - Q10 of 10
True or False - The thirty second word in a 111 type record is used to
save the first channel control word.
Note: Refer to DEFINE.LIS. If you do not have a copy of DEFINE.LIS and you
want one, refer to Appendix A on the Event File Menu.
@@tops_20_ef_q10_at
The statement is FALSE. Open the DEFINE.LIS to the 111 entry. It starts
some place around line number 01320. The line numbers are listed at the
left of the page.
Now skipping over the word, byte, and bit definitions, go down the
center, or word number column, until you get to word number 32. To the
left you will see that word number 32 is defined as RETRY_CNT. To the
right you will see that the RETRY_CNT is saved in bits 18 through 35 of
the word and it is described as "final retry error count".
Now find word number 28. You will see that it is defined as CCW1, it
consists of 36 bits, and it is described as "first chan control word".
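If you want to pull a field such as RETRY_CNT out of a raw 36-bit word
yourself, the following Python sketch shows one way to do it. It
assumes the usual PDP-10 convention in which bit 0 is the leftmost
(most significant) bit, so bits 18 through 35 are the right half of the
word. The function name and the sample word are made up for the
illustration.

```python
WORD_BITS = 36  # width of a PDP-10 word


def field(word, first_bit, last_bit):
    """Extract bits first_bit..last_bit, counting bit 0 as the leftmost bit."""
    width = last_bit - first_bit + 1
    shift = WORD_BITS - 1 - last_bit           # distance from the right edge
    return (word >> shift) & ((1 << width) - 1)


# Hypothetical word: left half 000123 (octal), right half 000007 (octal).
sample = 0o000123_000007
retry_cnt = field(sample, 18, 35)              # right half of the word
```

Here retry_cnt is 7, and field(sample, 0, 17) returns the left half,
123 octal.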
@@tops_20_ef_q10_af
That's correct.
Word 32 is used to save the error retry count. The first channel control
word is saved in word 28. Anytime you want to know exactly what hardware
and software status is saved in an entry type you can consult DEFINE.LIS.
Now, if you haven't already done so, take a few minutes to look over the
contents of the file. The introduction explains the overall organization
and format of an event file record. Following the introduction, each of
the event types is described in detail.
When you are finished, take a few more minutes and compare the reports
listed in the Spear Manual with the corresponding record descriptions
listed in DEFINE.LIS. As a result, you will have a better understanding
of the system event file and the reports that are generated from it.
@@tops_20_ef_lq
That's it. There are only 10 questions about TOPS-20 System Event Files.
Press the RETURN key to return to the System Event File Menu.
@@vax_vms_ef
You do not need to understand file structures to use System Event Files
to isolate system failures. However, in order to be effective you should
understand something about their format and content. Chapter 5 of the
VAX11 Spear Manual describes the overall format and content. Appendix B
of the VAX11 Spear Manual describes, in detail, the content of each
record type that you will find in the system event file.
This section of Instruct consists of a series of general and specific
questions about the VAX/VMS System Event File (ERRLOG.SYS). Before you
attempt to answer the questions you should review Chapter 5 and Appendix
B in the Spear Manual. (Don't forget, you can use the /BREAK feature and
return via your student ID.)
@@
@@vax_vms_ef_a
Press the RETURN key when you are ready.
@@
@@vax_vms_ef_q1
Q1 of 10 (VAX/VMS System Event Files)
True or False - Several of the questions that pertain to the VAX/VMS
system event file also pertain to the TOPS-20 system event file.
@@
@@vax_vms_ef_q1_at
That's correct.
@@
@@vax_vms_ef_q1_af
The statement is TRUE. The VAX/VMS system event file (ERRLOG.SYS) and the
TOPS-20 system event file (ERROR.SYS) are very similar in concept. Therefore,
it stands to reason that many of the questions that pertain to one event
file will also pertain to the other event file.
@@
@@vax_vms_ef_q2
Q2 of 10 (VAX/VMS System Event Files)
True or False - In addition to the Spear library, VAX/VMS uses a program
called SYE to record entries in the system event files.
@@
@@vax_vms_ef_q2_at
VAX/VMS Q2
The statement is FALSE. Neither SYE nor the Spear library has anything
to do with the recording of entries in the system event file. That is
strictly a function of the operating system.
Both SYE and the Spear library are designed to process the contents
of the system event file. SYE is a report generator. Basically, it
allows the user to select and translate specific entries in the event
file. The Spear library is more sophisticated. In addition to
translating event file entries, it also attempts to localize the cause
of intermittent disk and tape subsystem failures. Note, however, that
neither SYE nor Spear has anything to do with recording the system
event file.
@@
@@vax_vms_ef_q2_af
VAX/VMS Q2
That is correct.
Both SYE and the Spear library are designed to process the contents of
the system event file. They have nothing to do with recording the
entries. That is strictly a function of the operating system.
@@
@@vax_vms_ef_q3
Q3 of 10 (VAX/VMS System Event Files)
True or False - The VAX/VMS System Event File is called ERRLOG.SYS.
@@
@@vax_vms_ef_q3_at
VAX/VMS Q3
That's correct.
The VAX/VMS system event file is called ERRLOG.SYS.
Both the TOPS-10 and the TOPS-20 system event file are called ERROR.SYS.
@@
@@vax_vms_ef_q3_af
VAX/VMS Q3
The statement is TRUE. The idea of a system event file (ERROR.SYS) was
first implemented in the early 1970's for TOPS-10. Initially, the file
was used only to record main memory, channel, and disk errors. The idea
proved to be a good one and new entries were added to the file until now
ERROR.SYS is the main source of information for solving intermittent
system failures.
In the mid 1970's the idea of a system event file along with the file
name ERROR.SYS was carried over to TOPS-20. Thus, both the TOPS-10 and
the TOPS-20 system event file are called ERROR.SYS.
@@
@@vax_vms_ef_q4
Q4 of 10 (VAX/VMS System Event Files)
True or False -
More than one process may do read access on the error file at the same
time.
@@
@@vax_vms_ef_q4_at
VAX/VMS Q4
That's correct.
More than one process may read the file at the same time.
@@
@@vax_vms_ef_q4_af
VAX/VMS Q4
The statement is TRUE.
The problem arises when the operating system tries to write to the file and
finds some other process reading the file. In this case, the operating system
creates a new file.
@@
@@vax_vms_ef_q5
Q5 of 10 (VAX/VMS System Event Files)
True or False -
All I/O device errors are logged under the device error record format
regardless of the type of device.
@@
@@vax_vms_ef_q5_at
VAX/VMS Q5
That's correct.
CPU and memory errors are recorded in different record formats, but
I/O device errors are not.
@@
@@vax_vms_ef_q5_af
VAX/VMS Q5
The statement is TRUE.
Only TOPS-10 and TOPS-20 use different record formats for different types
of I/O devices.
@@
@@vax_vms_ef_q6
Q6 of 10 (VAX/VMS System Event Files)
True or False -
Only device errors and other hardware detected errors are recorded in the
VMS system error file.
@@
@@vax_vms_ef_q6_at
VAX/VMS Q6
The statement is FALSE.
Many other types of information are also recorded in the error file
such as volume mounts and dismounts. Software detected errors are also
recorded in this file as well as text messages from the operator.
@@
@@vax_vms_ef_q6_af
VAX/VMS Q6
That's correct.
There are many other sources of the information found in the error file.
@@
@@vax_vms_ef_q7
Q7 of 10 (VAX/VMS System Event Files)
True or False -
The format of the device error entry is the same regardless of the type
of VAX CPU used in the system.
@@
@@vax_vms_ef_q7_at
VAX/VMS Q7
That's correct.
Only the CPU specific entries are different.
@@
@@vax_vms_ef_q7_af
VAX/VMS Q7
The statement is TRUE.
Only the CPU specific entries are different.
@@
@@vax_vms_ef_q8
Q8 of 10 (VAX/VMS System Event Files)
True or False -
If the operating system must create a new version of the error file,
ERRLOG.SYS, it renames the current version to ERRLOG.OLD and then
creates the new file.
@@
@@vax_vms_ef_q8_at
VAX/VMS Q8
The statement is FALSE.
The operating system will create a new file using the same name and the next
higher version number.
@@
@@vax_vms_ef_q8_af
VAX/VMS Q8
That's correct.
The convention of renaming the error file to ERRLOG.OLD has nothing to do
with the operating system.
@@
@@vax_vms_ef_q9
Q9 of 10 (VAX/VMS System Event Files)
True or False -
The media identification is not included as part of the information recorded
in a device error.
@@
@@vax_vms_ef_q9_at
VAX/VMS Q9
That's correct.
The media information is recorded in the system event file when the media is
mounted or dismounted.
@@
@@vax_vms_ef_q9_af
VAX/VMS Q9
The statement is TRUE.
The media information is recorded in the system event file when the media is
mounted or dismounted.
@@
@@vax_vms_ef_q10
Q10 of 10 (VAX/VMS System Event Files)
True or False -
Some device error records in the event file may have no apparent indication
of any error occurring.
@@
@@vax_vms_ef_q10_at
VAX/VMS Q10
That's correct.
Media off line is a good example. In this case the "on-line" bit would
be off, indicating the error.
@@
@@vax_vms_ef_q10_af
VAX/VMS Q10
The statement is TRUE.
Media off line is a good example. In this case the "on-line" bit would
be off, indicating the error.
@@
@@vax_vms_ef_lq
That's it. There are only 10 questions about VAX/VMS System Event Files.
Press the RETURN key to get back to the System Event File Menu.
@@3.0.
Spear Library
Introduction
Spear is an on-line maintenance software library that runs under three
operating systems: TOPS-10, TOPS-20, and VAX/VMS. Currently, the library
contains three functions: Summarize, Retrieve, and Compute.
The first two functions, Summarize and Retrieve, are designed
to help you sort and evaluate 32- and 36-bit system event files. The
third function, Compute, calculates system availability. Its purpose is
to help you prepare crash and up time reports and determine overall
system performance.
@@3.0.A.
Each Spear Library function supports a dialog style user interface. The
dialog prompts for information and waits for a response. If the prompt
accepts a default, the default will be (parenthetically) included as
part of the prompt.
@@R.T.3.0.
STOP - You are moving in a reverse direction through the menu. You are
about to back into the Spear Library Introduction.
Your response please:
@@R.T.3.1.0.B.
STOP - You are moving in a reverse direction through the menu. You are
about to back into the Introduction.
@@R.T.3.1.1.F.
STOP - You are moving in a reverse direction through the menu. You are
about to back into
@@M.
Spear Course Menu
1. Course Administrator/Student Guide
2. Troubleshooting
3. System Event Files
4. Using The Spear Library
5. Guaranteed Uptime Program/NOTIFY
6. Feedback
7. Random Questions
8. Dialog Changes
@@R.T.3.1.2.M.
STOP - You are moving in a reverse direction thru the menu. You are
about to back into the Menu.
Your response please:
@@3.2.0.
The Big Picture                            Input File
                                                :
Retrieve accepts event                ..........:.........
files and packet files.               :                  :
                                 Event File         Packet File
The Selected information         .....:.....             :
in an event file can be:         :         :             :
Included in, or Excluded      Include   Exclude       Packet
from, the output file.           :.........:          Numbers
                                      :                  :
One or more Packets can         Selection and            :
be selected from a Packet       Time Criteria            :
file.                                 :..................:
                                                :
                                           Output Mode
                                           .....:.....
Retrieve can translate the                 :         :
selected entries or it can               ASCII    Binary
save the selected entries                  :.........:
in a binary history file.                       :
                                           Output File
@@R.T.3.2.0.
STOP - You are moving in a reverse direction thru the menu. You are
about to back into the Retrieve Overview.
@@3.2.M.
Spear Library - Retrieve
Topic menu:
1. Overview
2. Retrieve Dialog
3. Retrieve Questions & Answers
@@3.2.1.
The basic Retrieve dialog consists of eight selection prompts and
one confirmation prompt.
RETRIEVE mode
-------------
Event or packet file (default):
Selection to be (INCLUDED):
Selection type (ALL):
Time from (EARLIEST):
Time to (LATEST):
Output mode (ASCII):
Report format (SHORT):
Output to ([DSK]:RETRIE.RPT):
Type [cr] to confirm (/GO):
@@3.2.1.A.
The first selection prompt:
Event or packet file (default):
allows you to specify the name of the input file. The default response
(SYS:ERROR.SYS for TOPS-10, SERR:ERROR.SYS for TOPS-20, and
SYS$ERRORLOG:ERRLOG.SYS for VAX/VMS) is enclosed in parentheses and can
be selected by pressing the RETURN key.
Retrieve accepts two types of files: standard system event files (such
as those generated by TOPS-10, TOPS-20, or VAX/VMS systems), and Packet
files.
If you specify a system event file Retrieve will continue with the basic
dialog. If you specify a Packet file, however, Retrieve will switch to
the Packet selection dialog. Since the Packet dialog is short (1 prompt)
it will be explained next. Then we will continue with the basic dialog.
This prompt also supports standard Help and question mark (?) responses.
@@RETRIEVE INPUT
Selection Criteria  ___________.             .___ Short Report
                      .________!_________.   !___ Full Report
Event File     ___.   ! Event Retrieval  !   !___ Raw Data Report
                  !___!   Translation    !___!
Packet File    ___!   !  and/or Storage  !   !
                      !__________________!   !___ Device History
Merge File (binary) ___________!                  Files (binary)
Retrieve can be used to generate reports, or it can be used to establish
and maintain device history files. If you choose to generate a report,
you can select one of three formats: Short, Full, or Octal (Hexadecimal
on VAX/VMS systems). If you choose to generate a device history file you
will be asked if you want to merge it with an existing (history) file.
@@3.2.1.B.
If you specify a Packet file at the input file prompt, Retrieve will
prompt you for the packet numbers that you want to select.
Event or packet file (SERR:ERROR.SYS): DSK:A1225.PAK<CR>
Packet numbers:
Each numbered packet contains a list of sequence numbers. The sequence
numbers identify the individual records that were used by Analyze as
evidence to support the theories listed in the corresponding Analyze
Report file. There is one packet for each theory listed in the report.
You can use Retrieve to translate (or save in a separate binary file)
the records listed in the packet files. Typically, you would translate
a packet if you wanted to examine the records that were used as evidence
to support a particular theory. You would save the records if you were
building or maintaining a history file for a particular device or a
specific type of error.
This prompt also supports standard Help and Question mark (?) responses.
@@3.2.1.C.
If you specify multiple packet numbers, each number should be separated
by a comma. You should realize, however, that if you specify more than
one packet number the records listed in the packets will be grouped and
translated (or saved) according to sequence numbers. In other words, the
records will not be grouped according to packet number.
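The grouping rule can be pictured with a short Python sketch. The
packet contents below are invented, and the assumption that a sequence
number listed in more than one packet is processed only once is ours;
the course text does not say how duplicates are handled.

```python
def sequence_numbers_for(packets, packet_numbers):
    """Combine the selected packets into one list ordered by sequence number."""
    selected = set()
    for number in packet_numbers:
        selected.update(packets[number])       # packet boundaries are dropped
    return sorted(selected)


# Hypothetical packet file: packet number -> record sequence numbers.
packets = {3: [40, 12], 7: [25, 12], 14: [31]}
order = sequence_numbers_for(packets, [3, 7, 14])
```

The result is a single run of sequence numbers (12, 25, 31, 40) with no
trace of which packet each record came from.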
After prompting for packet numbers, Retrieve will skip the "Time from"
and "Time to" prompts and pickup the basic dialog at the "Output mode"
prompt. From that point on, there is no difference between the Event
File dialog and the Packet File dialog.
Event or packet file (SERR:ERROR.SYS): DSK:A1225.PAK<cr>
Packet numbers: 3,7,14<cr>
Output mode (ASCII):
Report format (SHORT):
Output to ([DSK]:RETRIE.RPT):
Type [cr] to confirm (/GO):
@@3.2.1.D.
Back to the basic Retrieve dialog. The second selection prompt:
Event or packet file (default):
Selection to be (INCLUDED):?
INCLUDED
EXCLUDED
allows you to specify whether the selected entries will be included in,
or excluded from, the output file. Included is the normal response. If,
however, you specify Excluded, then ALL the entries in the input file
(except those that you select later in the dialog) will be extracted and
translated or saved in the output file.
The Exclude feature is used to purge entries from a system event file
before the file is translated or saved. For example, suppose
a communications node developed a problem that caused the event file to
fill up with Network entries; since you know what caused the problem you
might want to remove the entries before you process or save the file.
Note: the original (or input) file will not be altered in any way.
This prompt also supports standard Help and Question mark (?) responses.
@@RETRIEVE TYPE
The following example illustrates the difference between Include
and Exclude.
                 Include(event type C)        Exclude(event type C)
Time:                  From      To                 From      To
                        :        :                    :        :
Input file:      CABBACBCCAABCABBCAACCBCA     CABBACBCCAABCABBCAACCBCA
Output file:            CC   C   C            CABBACB  AAB ABB AACCBCA
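The same example can be worked through in Python. This is a sketch, not
Retrieve's actual logic: each letter stands for one entry, its position
in the string stands in for its time stamp, and the From and To
positions (7 and 16, counting from 0) are inferred from the output
shown in the example.

```python
def select(entries, event_type, time_from, time_to, mode="INCLUDED"):
    """Return the output file, with a blank where an entry was left out."""
    output = []
    for position, entry in enumerate(entries):
        in_window = time_from <= position <= time_to
        matched = entry == event_type and in_window
        keep = matched if mode == "INCLUDED" else not matched
        output.append(entry if keep else " ")
    return "".join(output)


entries = "CABBACBCCAABCABBCAACCBCA"
included = select(entries, "C", 7, 16, "INCLUDED")
excluded = select(entries, "C", 7, 16, "EXCLUDED")
```

Include keeps only the type C entries inside the time window. Exclude
keeps everything else, including the type C entries that fall outside
the window.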
@@3.2.1.E.
The third selection prompt asks you to choose from two separate lists.
Selection type (ALL):
Type one or more of the following from the first group:
ERROR
STATISTICS
DIAGNOSTICS
CONFIGURATION
OTHER
If you choose more than one of these types, separate each with
a comma.
Or, type one of the following from the second group:
the RETURN key, or ALL
SEQUENCE
CODE
@@3.2.1.EA.
ERROR - indicates that you want to select entries that contain
actual failure data. If you select ERROR you can also specify
the particular error types for which you are looking in relation
to the specific device.
STATISTICS - indicates that you want to select statistic
entries.
DIAGNOSTICS - indicates that you want to select entries
created by a diagnostic.
CONFIGURATION - indicates that you want to select configuration
entries.
OTHER - indicates that you want to select entries that do not fit
into the other types.
These responses will be explained later, after the frames relating to
SEQUENCE and CODE.
@@3.2.1.EB.
ALL (or the RETURN key) - indicates that you want to select
all the entries in the file. (This is the default). You can
further qualify the selection at the Time prompts.
SEQUENCE - indicates that you want to select entries according
to sequence numbers. This response will be explained next.
CODE - indicates that you want to select entries based on the
event codes assigned each type of entry by the operating
system. This response will be explained after the SEQUENCE
response.
@@3.2.1.EC.
When you specify SEQUENCE in response to the "Selection type"
prompt, Retrieve will prompt you for the sequence numbers that
you want to select.
Selection type (ALL): SEQUENCE<cr>
Sequence numbers: 22,24,35-67,12<cr>
You can select as many sequence numbers as you want. Individual
sequence numbers must be separated by commas. A group of sequence
numbers is specified by entering the first and last sequence numbers
in the group, separated by a dash (-). For example, 35-67 indicates
that you want to select sequence numbers 35 through 67.
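The list syntax is easy to model. The following Python sketch expands a
response such as 22,24,35-67,12 into individual sequence numbers. The
function name is made up, and returning the numbers in ascending order
is an assumption made for readability.

```python
def parse_number_list(response):
    """Expand a comma-separated list with dash ranges into sorted numbers."""
    selected = set()
    for item in response.split(","):
        item = item.strip()
        if "-" in item:
            first, last = (int(part) for part in item.split("-", 1))
            if first > last:
                raise ValueError(f"range runs backwards: {item}")
            selected.update(range(first, last + 1))   # range is inclusive
        else:
            selected.add(int(item))
    return sorted(selected)


numbers = parse_number_list("22,24,35-67,12")
```

The result holds 36 sequence numbers: 12, 22, 24, and the 33 numbers
from 35 through 67.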
@@3.2.1.ED.
If you specify CODE in response to the "Selection type" prompt,
Retrieve will prompt you for the event codes that you want to
select.
Selection type (ALL): CODE<cr>
Event codes: 133,161-163<cr>
You can select as many event codes as you want. Each event code
must be separated by a comma. You can also select groups of
event codes. The first and last event codes in the group must be
separated by a dash (-). For example, 161-163 indicates that you
want to select event codes 161 through 163.
This prompt also supports standard Help and Question mark (?)
responses.
@@3.2.1.EE.
If you specify ERROR, STATISTICS, DIAGNOSTICS, or OTHER, or a
combination of these, in response to the "Selection type" prompt,
Retrieve will enter the "Error class" dialog.
Selection type (ALL): ERROR
Category(ALL):
ALL
MAINFRAME
DISK
TAPE
CI
NI
UNITRECORD
NETWORK
OPERATING-SYSTEM
COMM
PACKID
REELID
HELP
@@3.2.1.EF.
ALL (or the RETURN key) - indicates that you want to select
all errors. (This is the default).
MAINFRAME - indicates that you want to select errors occurring in
specific mainframe components.
DISK - indicates that you want to select errors occurring on disk
units. After selecting DISK, you can specify ALL, the specific
disks by name (DPA3, RPB7), or a disk type (RP06, RM05).
TAPE - indicates that you want to select errors occurring on tape
units. After selecting TAPE, you can specify ALL, or specify the
tape names or types in question.
CI - indicates that you want to select CI-related errors. After
selecting CI, you can specify ALL, or the specific component of
interest.
NI - indicates that you want to select NI-related errors.
@@3.2.1.EG.
UDA - indicates that you want to select UDA-related errors.
After selecting UDA, you can specify ALL, or the specific
component of interest.
UNITRECORD - indicates that you want to select errors occurring
on unit-record devices such as card readers and line printers.
After selecting UNITRECORD, you can specify ALL, or type the
specific device names or types in question.
OPERATING-SYSTEM - indicates that you want to select operating
system codes. After selecting OPERATING-SYSTEM, you can specify
ALL, or type the name of a specific STOPCODE or BUG type.
COMM - indicates that you want to select errors occurring on
communication devices.
@@3.2.1.EH.
PACKID - indicates that you want to select specific disk packs.
After typing PACKID, you can type ALL, or type the specific pack
identifiers.
REELID - indicates that you want to select specific tape reels.
After typing REELID, you can type ALL, or the specific tape
identifiers.
HELP - indicates that you want to get detailed information
on the above categories.
All categories except for COMM and NI prompt further for specific
device types. Type ? at the subprompt level to get a list of
acceptable responses.
If you choose the DISK drive, TAPE drive, or CI controller subprompt,
Retrieve then prompts you further for an error type. Type ? at the
subprompt level to get a list of acceptable responses.
@@3.2.1.EI.
RETRIEVE keeps prompting you for categories until you either type
FINISHED, or press the RETURN key.
Next Category (FINISHED):
Type one of the following:
The RETURN key, or FINISHED to take the default,
or,
another category.
@@3.2.1.V.
Back to the basic dialog. The fourth selection prompt:
Time from (EARLIEST):
allows you to specify the time at which you want the selection process
to begin. The default response (EARLIEST) is enclosed in parentheses
and can be selected by pressing the RETURN key. You can also specify
real and relative time.
The prompt also supports standard Help and question mark (?) responses.
@@3.2.1.W.
The fifth selection prompt:
Time to (LATEST):
allows you to specify the time at which you want the selection process
to end. The default response (LATEST) is enclosed in parentheses
and can be selected by pressing the RETURN key. Again, you can also
specify real and relative time.
The prompt also supports standard Help and question mark (?) responses.
@@3.2.1.X.
The sixth selection prompt allows you to specify the type of file that
you want Retrieve to generate.
Output mode (ASCII): ?
ASCII - indicates that you want the selected entries extracted
and translated in a report.
BINARY - indicates that you want the selected entries extracted
and saved in a binary file.
We will discuss the ASCII response first.
This prompt also supports standard Help and question mark (?) responses.
@@3.2.1.Y.
If you specify ASCII in response to the "Output mode" prompt, Retrieve
will prompt you for the type of format that you want.
Output mode (ASCII):
Report format (SHORT): ?
SHORT - indicates that you want a brief translation of each
selected entry.
FULL - indicates that you want a detailed translation of each
selected entry.
OCTAL - indicates that you want an octal translation of each
selected entry. Normally, octal translations are used
to debug errors in Spear or the software routines that
record the entries.
The prompt also supports standard Help and question mark (?) responses.
@@3.2.1.Z.
If you specify BINARY in response to the "Output mode" prompt, Retrieve
will ask you if you want to merge the selected entries with an existing
binary file.
Output mode (ASCII): BINARY<cr>
Merge with (NONE):
Normally, merging is done only if you are maintaining a device history
file. For example, suppose the processor was experiencing a highly
intermittent failure. Let's say that on the average, the failure occurred
once a week. Given that situation, you might need several weeks or
even a month's worth of error information to isolate the cause of the
problem.
Since an event file can get quite large over a period of several weeks
or a month, you might consider establishing a history file to keep track
of the failure. The merge feature is designed to help you do this. It
allows you to combine the currently selected entries with previously
selected entries and merge them in the output file.
@@3.2.1.A1.
The Merge prompt also supports standard Help and question mark (?)
responses.
@@3.2.1.B1.
The eighth and last selection prompt:
Output to ([DSK]:RETRIE.RPT):
allows you to specify the name of the output file. The default file
name is DSK:RETRIE.RPT for TOPS-10/TOPS-20, and RETRIE.RPT for VAX/VMS
(if you are generating a report). The default becomes DSK:RETRIE.SYS
for TOPS-10/TOPS-20, and RETRIE.SYS for VAX/VMS (if you are building or
maintaining a binary history file).
You can override the entire default by specifying a new file name. You
can also override any field in the default response by specifying only
the field that you want to override.
For example, if you were to type:
Report to (DSK:SUMMAR.SYS): CPU<cr>
the output file specification would become DSK:CPU.SYS.
The prompt also supports standard Help and question mark (?) responses.
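The field-by-field override can be sketched in Python. The sketch
handles only the simple device:name.extension form used in the example
above. Directories, version numbers, and VAX/VMS logical names are left
out, and the function names are invented for the illustration.

```python
def split_spec(spec):
    """Split 'DEV:NAME.EXT' into (device, name, extension); any may be empty."""
    device, colon, rest = spec.rpartition(":")
    name, _, extension = rest.partition(".")
    return (device if colon else "", name, extension)


def apply_default(response, default):
    """Fill every field missing from the response with the default's field."""
    fields = [mine or theirs
              for mine, theirs in zip(split_spec(response), split_spec(default))]
    device, name, extension = fields
    spec = f"{name}.{extension}"
    return f"{device}:{spec}" if device else spec


result = apply_default("CPU", "DSK:SUMMAR.SYS")
```

Typing only CPU replaces the name field and leaves the device and
extension alone, giving DSK:CPU.SYS. Typing .LOG would replace only the
extension in the same way.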
@@3.2.1.C1.
Finally, the confirmation prompt:
Type <cr> to confirm (/GO):
provides an opportunity for you to review and change any responses
entered up to that point. If you want to review the response list
type /SHOW. If you are satisfied with the response list press the
RETURN key or type /GO.
If you want to change a response, press the backspace key until you
arrive at the corresponding prompt, make the change, and then type /GO.
@@3.2.1.D1.
That concludes the explanation of the Retrieve dialog. Next on the menu
is a set of questions about the Retrieve dialog.
@@RETRIEVE CODES
Generally speaking, the TOPS-10, TOPS-20 and VAX/VMS operating systems
handle errors in a similar manner. That is, when an error occurs they
snapshot pertinent hardware and software status (at error). Then, if
applicable, an error retry algorithm is applied. Next, regardless of
whether or not the retry algorithm was successful, a second snapshot
is taken (at end). Finally, the captured status is put into a record,
assigned a code, and appended to the system event file.
The operating systems differ, however, in the way that they snapshot
the status, implement the retry algorithms, and assign codes to the
error or event record.
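The common sequence can be summarized as a short Python sketch. It is a
model of the steps listed above and not operating system code. The
callables read_status and retry, the record layout, and the list used
as the event file are all hypothetical.

```python
def handle_error(read_status, retry, event_file, code):
    """Snapshot, retry, snapshot again, then log one coded record."""
    at_error = read_status()                   # first snapshot (at error)
    recovered = retry() if retry is not None else False
    at_end = read_status()                     # second snapshot (at end)
    event_file.append({
        "code": code,
        "at_error": at_error,
        "at_end": at_end,
        "recovered": recovered,
    })
    return recovered


# Hypothetical device that reports an error, then recovers on retry.
statuses = iter(["ERR", "OK"])
event_file = []
ok = handle_error(lambda: next(statuses), lambda: True, event_file, code=111)
```

Note that the record is written whether or not the retry succeeds; the
second snapshot simply shows how the attempt ended.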
@@3.M.
Spear Library
Topic Menu:
1. Introduction
2. Retrieve
3. Compute
4. Summarize
5. Applications
6. Klerr
@@R.T.3.2.1.D1.
STOP - You are moving in a reverse direction thru the menu. You are
about to back into the Retrieve Dialog explanation.
@@ret_dia_q1
Retrieve Dialog - Q1 of 10
True or False - Retrieve can be used to translate and/or save the
records listed in the packets that are generated by Analyze?
@@ret_dia_q1_at
That's Correct.
This feature allows you to translate and/or save the individual records
that were used as evidence to support specific theories.
@@ret_dia_q1_af
The statement is TRUE. Retrieve can translate the Packets generated by
Analyze.
Remember, there is a packet associated with each theory listed in the
Analyze report. The Packet contains pointers that identify the records
that were used as evidence to support the theory.
Thus, anytime you question the validity of a theory and want to examine
the evidence yourself, you can do so by specifying the Packet file as
input to Retrieve. When Retrieve prompts for the packet number, enter
the number that corresponds to the theory that you are investigating and
then, specify the desired output mode (Short, Full, or Octal).
@@ret_dia_q2
Retrieve Dialog - Q2 of 10
True or False - Retrieve can be used to generate and maintain device
history files?
@@ret_dia_q2_at
That's correct.
You can use Retrieve to build and maintain history files for:
a) entire subsystems (disks, tapes, networks, etc.),
b) logical devices (DP220, MT300, CPU0, etc.),
c) physical option types (RP06s, TU45s etc.) or,
d) disk and tape storage media (Pack or Reel IDs).
@@ret_dia_q2_af
The statement is TRUE. Retrieve can be used to build and maintain device
history files. The procedure is relatively simple. Here's what to do:
First, select the device via the "Error class" prompt. Next, specify the
time frame. Then, when Retrieve prompts for Output mode, specify BINARY.
Retrieve will ask you if you want to merge the selected entries with an
existing binary history file. If you are building a new history file
press the RETURN key or type: NONE. If, however, a history file already
exists
for the selected device and you just want to combine the entries, then
specify the name of the history file in response to the "Merge" prompt.
Finally, Retrieve will prompt for the output file name. Again, if you
are building a new history file, then specify a unique file name. If,
however, you are updating an existing history file, then specify the
name of the history file you are updating. In most cases it would be
the same file that you specified in response to the "Merge with" prompt.
@@ret_dia_q3
Retrieve Dialog - Q3 of 10
True or False - If, in response to the "Type <cr> to confirm (/GO):"
prompt, you type "/DISPLAY" - Retrieve will display the current list
of responses?
@@ret_dia_q3_at
The statement is FALSE. The switch is called "/SHOW". If after making
your selections, you type "/SHOW", Retrieve will display each prompt
and the corresponding response as illustrated in the following example.
Type [cr] to confirm (/GO): /SHOW
RETRIEVE mode
-------------
Event or packet file: SYSTEM:ERROR.SYS
Output to: DSK:RETRIE.TXT
Merge with: NONE
Time from: EARLIEST
Time to: LATEST
Selection to be: INCLUDED
Output mode: ASCII
Report format: SHORT
Selection type: ERROR
Error class: DISK, TAPE,
Disk drives: DP120, DP230,
Tape drives: MT300,
Type [cr] to confirm (/GO):
@@ret_dia_q3_af
That's correct.
It's the "/SHOW" switch that will cause the current list of responses
to be displayed.
If, after reviewing the list, you decide that you want to change a
response, you can press the BACKSPACE key (or type /REVERSE) until you
get back to the response that you want to change. At that point you can
add to the response, or you can type /CLEAR and enter a new response.
@@ret_dia_q4
Retrieve Dialog - Q4 of 10
True or False - Retrieve can be used to select entries that pertain
to specific Disk Packs or Magtape Reel ID's?
@@ret_dia_q4_at
That's correct.
Pack and Reel ID's were added to the selection criteria so that you could
use the EXCLUDE mode to remove entries from the event file that pertain
to known bad media. Thus, you can clean up the file a bit, resubmit it
to Analyze, and see if media problems were covering up other more subtle
hardware problems.
@@ret_dia_q4_af
The statement is TRUE. If you type "?" at the "Error class" prompt you
will see that PACKID and REELID are among the selection criteria
available.
@@ret_dia_q5
Retrieve Dialog - Q5 of 10
True or False - If you specify a file name in response to the
"Merge with (NONE):" prompt, Retrieve will automatically append
the selected entries to that file?
@@ret_dia_q5_at
The statement is FALSE. Retrieve will NOT automatically append the
selected entries to the file that you specify in response to the
"Merge with" prompt.
Instead what happens is: the selected entries and the entries in the
"merge file" are combined and written out to the file that you specify
in response to the "Output to" prompt.
Selected Entries Merge with "file name"
| |
|_____________________|
|
Output "file name"
If at the "Output to" prompt, however, you specify the same file name
that you specified at the "Merge with" prompt, then the entries will be
combined and written in that file. Incidentally, that is the recommended
method for maintaining device history files.
@@ret_dia_q5_af
That's correct.
Retrieve will NOT change the "merge" file in any way unless you direct
it to do so by specifying the same file name at the "Output to" prompt.
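The merge behavior can be sketched in a few lines of Python. This is
only an illustration, not Retrieve's code: the in-memory dictionary
stands in for the file system, the function and file names are
hypothetical, and the order in which entries are combined (existing
history first) is an assumption.

```python
from typing import Dict, List, Optional


def retrieve(selected: List[str],
             files: Dict[str, List[str]],
             output_to: str,
             merge_with: Optional[str] = None) -> None:
    """Combine selected entries with the merge file and write the result."""
    # Copy the merge file's entries so the merge file itself is not touched.
    merged = list(files[merge_with]) if merge_with else []
    # Assumption: existing history entries come first, new selections after.
    files[output_to] = merged + list(selected)


# Hypothetical history file with two earlier entries.
files = {"HIST.DAT": ["entry-1", "entry-2"]}

# Different output name: HIST.DAT is left unchanged.
retrieve(["entry-3"], files, output_to="NEW.DAT", merge_with="HIST.DAT")
print(files["HIST.DAT"], files["NEW.DAT"])

# Same name at both prompts: the history file is updated.
retrieve(["entry-4"], files, output_to="HIST.DAT", merge_with="HIST.DAT")
print(files["HIST.DAT"])
```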
@@ret_dia_q6
Retrieve Dialog - Q6 of 10
True or False - Sequence numbers are used to identify the relative
position of the records in a system event file?
@@ret_dia_q6_at
That is correct.
Record sequence numbers are included as part of the header in all Short,
Full, and Octal reports translated by Retrieve. The sequence number is
the simplest way to refer to a specific record.
As long as the order of the records in the file is not disturbed, the
sequence numbers will remain valid. Thus, if you request a Short ASCII
translation of several records and then decide that you want a Full
translation of one or two of those records, you can do so by specifying
the sequence numbers to Retrieve.
@@ret_dia_q6_af
The statement is TRUE. Sequence numbers reflect each record's relative
position in a file.
Remember, sequence numbers are dynamically assigned to each record as a
file is read. For example, if a file contains 623 records, then the
first record in the file will be assigned sequence number 1, the second
record will be assigned sequence number 2, etc. Finally, the last record
in the file will be assigned sequence number 623.
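The numbering scheme can be illustrated with a short Python sketch. It
is not Retrieve's implementation: the record contents and the function
names are hypothetical, and the only thing it models is that numbers
are assigned from 1 in file order as the file is read.

```python
from typing import Iterable, Iterator, List, Tuple


def number_records(records: Iterable[bytes]) -> Iterator[Tuple[int, bytes]]:
    """Yield (sequence_number, record) pairs, counting from 1 in file order."""
    # start=1 mirrors "the first record is 1, the second is 2, ...".
    return enumerate(records, start=1)


def pick_by_sequence(records: Iterable[bytes],
                     wanted: Iterable[int]) -> List[Tuple[int, bytes]]:
    """Return only the records whose sequence numbers appear in `wanted`."""
    wanted_set = set(wanted)
    return [(seq, rec) for seq, rec in number_records(records)
            if seq in wanted_set]


# Hypothetical stand-in for an event file that contains 623 records.
sample = [f"record-{i}".encode() for i in range(1, 624)]
print(pick_by_sequence(sample, [1, 2, 623]))
```

Because the numbers are recomputed on every read, they stay valid only
while the order of the records in the file is unchanged.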
@@ret_dia_q7
Retrieve Dialog - Q7 of 10
True or False - Retrieve can be used to select entries based on the
event codes assigned to the entries by the operating system.
@@ret_dia_q7_at
That's correct.
If, for example, you wanted to select all KS10 Halt Status Block entries
you could: reference the Spear Manual or look on the back of the Spear
Reference Card to get the code number, specify "CODE" at the "Selection
type" prompt, and, when Retrieve prompted for "Event code:" enter 033
for TOPS-10 or 133 for TOPS-20.
@@ret_dia_q7_af
The statement is TRUE.
If you type "?" in response to the "Selection type" prompt, you will see
"CODE" listed as one of the acceptable responses. If you select "Code"
Retrieve will prompt you for the "Event codes".
The event types and the corresponding event codes are listed on the back
panel of the Spear Reference Card. In addition, the detailed information
contained in each entry type is described in the Spear Manual.
@@ret_dia_q8
Retrieve Dialog - Q8 of 10
True or False - Typing /C in response to the "Next Category (FINISHED):"
prompt will clear all entries selected up to that point.
@@ret_dia_q8_at
That's correct.
Keep in mind, however, that in addition to clearing selected entries,
the /Clear switch will also reset the prompt response to the default.
In other words, suppose you type /SHOW before starting Retrieve. Then,
let's say that you decide that you don't want the selected magtape
entries after all, so you press the BACKSPACE key until you get back to
the "Error class" prompt. At that point you specify "Tape", Retrieve
prompts for tape drives and you type /CLEAR.
You might think that you are no longer selecting any Magtape entries.
But that is not the case. Instead, what you did was clear the selected
list and thus reinstate the default (ALL).
@@ret_dia_q8_af
The statement is TRUE. The /CLEAR switch provides a mechanism for
changing selected entry types.
For example, suppose you had just selected some event codes for
translation and you're about to press the RETURN key to start Retrieve
but, before doing so, you typed "/SHOW" just to double check yourself.
Now, suppose you discover that, for some reason, you entered the wrong
list of event codes. Here's what to do:
1. Press the BACKSPACE key until you get back to the "Selection type"
prompt.
2. Then, in response to the "Selection type" prompt specify "CODE".
3. When Retrieve prompts for the Event codes, type "/CLEAR" to clear the
existing list of event codes and then enter the correct list.
4. Finally, type "/SHOW" as a last check and then, if everything is OK
type "/GO" to start Retrieve.
@@ret_dia_q9
Retrieve Dialog - Q9 of 10
True or False - Entries can be retrieved by logical names (e.g., CPU0)
as well as by physical names (e.g., RP06)?
@@ret_dia_q9_at
Technically, the statement is FALSE. Retrieve can recognize some, but
not all, logical and physical names.
@@ret_dia_q9_af
That is correct.
Retrieve recognizes some, but not all, physical and logical names. Just
as a double check before running, Retrieve will list all selected names
that it considers to be logical. Thus, if you made a typing error or
entered a physical name that it does NOT recognize, you'll know because
Retrieve will list it as a logical name.
@@ret_dia_q10
Retrieve Dialog - Q10 of 10
True or False - Retrieve can be used to extract entries based on
STOPCODES or BUGxxx code names?
@@ret_dia_q10_at
That's correct.
The "Mainframe Error and Crash Summary" section of the Analyze report
breaks down STOPCODES (for TOPS-10) and BUGxxx (for TOPS-20 and VAX/VMS)
by: type, name, and number of occurrences.
Thus, given the Analyze report, you can then use Retrieve to translate
or save the STOPCODE or BUGxxx entries for further investigation. This
feature is particularly helpful when it comes to saving and investigating
very intermittent system crashes.
@@ret_dia_q10_af
The statement is TRUE. Retrieve can be used to extract entries based
on STOPCODES and BUGxxx code names.
If you specify "CODE", Retrieve will prompt for "Event codes".
At that point you can enter the names of one or more STOPCODES or
BUGxxx that you want retrieved.
For example, if you typed:
Selection type (ALL): CODE<cr>
Event codes: DX2FUS,P2RAE<cr>
Retrieve will translate (or save) all entries that are related to
either of the Event codes (DX2FUS and P2RAE).
@@3.2.1.1.
That's it. There are only ten questions about the Retrieve dialog. If
you have gotten this far, then chances are you have a pretty good idea
of how to use Retrieve. Therefore, it is with great honor, that Instruct
pronounces you a "Retrieve-Dialog Subject Matter Expert".
@@3.3.0.
Compute calculates the following system performance factors:
System Availability (SA) - System Availability is the percentage of time
that the system was available for use. (It includes Standalone time.)
User Availability (UA) - User Availability is the percentage of time
that the system was available for use by the user community.
System Effectiveness (SE) - System Effectiveness is the probability,
expressed as a percentage, that the system remained available for a
given period of time (t).
The remainder of this introduction briefly explains the formulas used
by Compute to calculate these factors. For a more detailed explanation
of the formulas refer to the Spear Manual.
@@3.3.0.A.
The following formula is used to calculate System Availability (SA):
SA = (1.0) - CDT/(TDT + TRT)
where:
CDT = Chargeable Down Time
TDT = Total Down Time
TRT = Total Run Time
Remember - System Availability is the percentage of time that the system
was available for use. (It includes Standalone time.)
@@3.3.0.B.
The following formula is used to calculate User Availability (UA):
UA = (1.0) - CDT/(CDT + TRT)
where:
CDT = Chargeable Down Time
TRT = Total Run Time
Remember - User Availability is the percentage of time that the system
was available for use by the user community.
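As a worked illustration, the SA and UA formulas can be written as two
small Python functions. This is a sketch rather than Compute's code: the
function names and the sample figures (in hours) are hypothetical, and
the results are fractions that must be multiplied by 100 to obtain the
percentages shown in the reports.

```python
def system_availability(cdt: float, tdt: float, trt: float) -> float:
    """SA = 1.0 - CDT / (TDT + TRT), returned as a fraction of 1."""
    return 1.0 - cdt / (tdt + trt)


def user_availability(cdt: float, trt: float) -> float:
    """UA = 1.0 - CDT / (CDT + TRT), returned as a fraction of 1."""
    return 1.0 - cdt / (cdt + trt)


# Hypothetical week: 160 hours of run time, 8 hours of total down time,
# of which 2 hours are chargeable.
cdt, tdt, trt = 2.0, 8.0, 160.0
print(f"SA = {100 * system_availability(cdt, tdt, trt):.3f} %")
print(f"UA = {100 * user_availability(cdt, trt):.3f} %")
```

Both functions divide by a sum of times, so a period with no recorded
run time or down time would need to be handled separately.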
@@3.3.0.C.
The following formula is used to calculate System Effectiveness (SE):
SE = (SA) * (e** (-t/MTBF))
where:
SA = System Availability
e = the Napierian or natural base of logarithms (2.71828+)
t = an arbitrary period of time for which the SE factor is calculated.
Typically, Compute calculates the SE factor for four time periods:
6 minutes, 30 minutes, 1 hour, and 4 hours.
MTBF = The mean, or average, time between failures (chargeable Downtimes).
e** means "e" raised to the power of (-t/MTBF).
Remember - System Effectiveness (SE) is the probability, expressed as a
percentage, that the system remained available for a given period of
time (t).
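The SE formula can be tried out with a short Python sketch. It is an
illustration only, not Compute's code: the function name is
hypothetical, t and MTBF are assumed to be expressed in hours, and the
SA and MTBF values used below are made-up inputs.

```python
import math


def system_effectiveness(sa: float, t_hours: float, mtbf_hours: float) -> float:
    """SE = SA * e**(-t / MTBF), returned as a fraction of 1."""
    return sa * math.exp(-t_hours / mtbf_hours)


# The four time periods that Compute typically reports, in hours.
PERIODS_HOURS = {
    "6 minutes": 0.1,
    "30 minutes": 0.5,
    "1 hour": 1.0,
    "4 hours": 4.0,
}

# Hypothetical inputs: full availability and a 24 hour MTBF.
sa, mtbf = 1.0, 24.0
for label, t in PERIODS_HOURS.items():
    print(f"{label:>10}: {100 * system_effectiveness(sa, t, mtbf):.3f} %")
```

Because the exponent is -t/MTBF, the SE factor always falls as the
time period t grows, and it can never exceed SA.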
@@R.T.3.3.0.
STOP - You are moving in a reverse direction through the menu. You are
about to back into the Introduction to Compute.
@@3.3.M.
Spear Library - Compute
Topic menu:
1. Overview
2. Compute Dialog
3. Questions & Answers
@@3.3.1.
Compute Dialog - The Compute dialog consists of seven selection prompts
and one confirmation prompt.
COMPUTE mode
------------
Event file (default):
Report period (LAST-WEEK):
Time from (EARLIEST):
Time to (LATEST):
Report type (SINGLE-REPORT):
Availability Report to ([DSK]:COMPUT.RPT):
Reload report to ([DSK]:RELOAD.RPT):
Type <cr> to confirm (/GO):
@@3.3.1.A.
The first selection prompt:
Event file (default):
allows you to specify the name of the file that contains the system
performance entries that you want Compute to use in its calculations.
The default response (SYS:AVAIL.SYS for TOPS-10, SERR:ERROR.SYS for
TOPS-20, and SYS$SYSDISK:[SYSERR]:ERRLOG.SYS for VAX/VMS) is enclosed
in parentheses, and can be selected by pressing the RETURN key. You can
override the entire default response by specifying a new file name, or
you can override any field in the default response by specifying
only the field that you want to override.
For example, if you were to type:
Event file (SERR:ERROR.SYS): .LWK<cr>
the input file specification would become SERR:ERROR.LWK
The prompt also supports standard Help and question mark (?) responses.
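The field override rule can be sketched in Python as follows. This is a
simplified model, not the Spear parser: it handles only the
dev:name.ext form, ignores directories and version numbers, and the
function names are hypothetical.

```python
from typing import Tuple


def split_spec(spec: str) -> Tuple[str, str, str]:
    """Split 'dev:name.ext' into (device, name, extension).

    Fields that are absent come back as empty strings.
    """
    device, _, rest = spec.rpartition(":")
    name, _, ext = rest.partition(".")
    return device, name, ext


def apply_override(default: str, response: str) -> str:
    """Replace only the fields of `default` that `response` supplies."""
    d_dev, d_name, d_ext = split_spec(default)
    r_dev, r_name, r_ext = split_spec(response)
    device = r_dev or d_dev
    name = r_name or d_name
    ext = r_ext or d_ext
    return f"{device}:{name}.{ext}"


print(apply_override("SERR:ERROR.SYS", ".LWK"))
print(apply_override("DSK:COMPUT.RPT", "FS:"))
```

An empty response leaves every field at its default, which corresponds
to pressing the RETURN key at the prompt.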
@@COMPUTE INPUT
.--------------. .___ Summary Report
| Calculate | |
System Event File ___| System |___|___ Availability Report
(or AVAIL.Ann) | Availability | |
|______________| |___ Reload Report
TOPS-10, TOPS-20, and VAX/VMS record entries that are used by Compute
to calculate overall system performance. Under TOPS-10 the entries are
recorded in a file called AVAIL.SYS. Under TOPS-20 and VAX/VMS the
entries are recorded in ERROR.SYS and ERRLOG.SYS respectively.
@@3.3.1.B.
The second selection prompt
Report period (LAST-WEEK):
allows you to specify the time period for which you want system
performance calculated. Compute is designed to calculate system
performance for the previous week. That is, from a week ago Sunday at
00:00:01 to last Sunday at 00:00:01. Thus, by running Compute weekly
you can monitor overall system performance and note any trends in
availability or effectiveness.
You can also direct Compute to calculate system performance for this
week or any other period of time. If you specify THIS-WEEK, then
Compute calculates system performance from last Sunday at 00:00:01 to
the present. If you specify OTHER Compute will prompt for the specific
time period.
The prompt also supports standard Help and question mark (?) responses.
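The LAST-WEEK and THIS-WEEK periods can be worked out with a short
Python sketch. It is an illustration under stated assumptions, not
Compute's code: the function names are hypothetical, and "last Sunday"
is taken to mean the most recent Sunday at 00:00:01 that is not later
than the current time.

```python
from datetime import datetime, timedelta
from typing import Tuple


def last_sunday(now: datetime) -> datetime:
    """Return the most recent Sunday at 00:00:01 that is not after `now`."""
    # weekday() is Monday=0 ... Sunday=6, so this is the days since Sunday.
    days_back = (now.weekday() + 1) % 7
    start = (now - timedelta(days=days_back)).replace(
        hour=0, minute=0, second=1, microsecond=0)
    if start > now:
        # `now` is a Sunday earlier than 00:00:01; step back one week.
        start -= timedelta(days=7)
    return start


def report_window(period: str, now: datetime) -> Tuple[datetime, datetime]:
    """Return (time_from, time_to) for LAST-WEEK or THIS-WEEK."""
    sunday = last_sunday(now)
    if period == "LAST-WEEK":
        return sunday - timedelta(days=7), sunday
    if period == "THIS-WEEK":
        return sunday, now
    # OTHER is resolved by the "Time from" and "Time to" prompts instead.
    raise ValueError(f"no fixed window for report period {period!r}")


now = datetime(1981, 6, 17, 12, 0, 0)  # a Wednesday
print(report_window("LAST-WEEK", now))
print(report_window("THIS-WEEK", now))
```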
@@3.3.1.C.
The third and fourth selection prompts
Time from (EARLIEST):
Time to (LATEST):
are displayed only if you specify OTHER in response to the Report Period
Prompt. The time prompts allow you to specify the specific time period
for which you want system performance calculated. You can specify the
default times (Earliest and Latest respectively), or you can specify
either real or relative time.
Both of these prompts also support standard Help and question mark (?)
responses.
@@3.3.1.D.
The fifth selection prompt
Report type (SINGLE-REPORT):
is also displayed only if you specify OTHER in response to the Report
Period Prompt. The Report Type prompt allows you to specify the type of
report that you want. You can specify the default, SINGLE-REPORT, in
which case Compute will generate a single report that reflects system
performance for the selected time period.
You can also specify MULTIPLE-REPORTS, in which case Compute will
generate (in addition to the single report) a set of weekly reports that
reflect system performance for the selected time period.
The prompt also supports standard Help and question mark (?) responses.
@@3.3.1.E.
The sixth selection prompt
Availability Report to ([DSK]:COMPUT.RPT):
allows you to specify the destination of the 132 column Availability
Report. The default destination (DSK:COMPUT.RPT for TOPS-10/TOPS-20,
and COMPUT.RPT for VAX/VMS) is enclosed in parentheses and can be
selected by pressing the RETURN key. Compute automatically outputs
a 72 column Summary Report to your terminal.
You can replace the entire default destination by specifying a new file
name, or you can replace any field in the default by specifying only the
field that you want to override. For example, if you were to type:
Availability Report to (DSK:COMPUT.RPT): FS:
the output file specification would become FS:COMPUT.RPT
The prompt also supports standard Help and question mark (?) responses.
@@compute output
Compute generates two reports: a 72 column Summary Report, and a 132
column Availability Report. The Summary Report is automatically output
to your terminal. At this prompt Compute is waiting for you to specify
a destination for the Availability Report. You can:
1. Press the RETURN key to select the default file specification:
DSK:COMPUT.RPT.
2. Enter a unique file specification (e.g., DSK:WK21.RPT). The file
specification format is: dev:<user>filename.filetype.version.
If you specified multiple reports, then Compute will generate a set of
weekly reports in addition to COMPUT.RPT. The reports will be named
Cmmdd.RPT, where mmdd corresponds to the month and day of each week.
@@3.3.1.F.
The last selection prompt
Reload report to ([DSK]:RELOAD.RPT):
allows you to specify the destination of the Reload Log Report.
The Reload Report uses 132 columns and lists the system name, the
operating system version, the number of times the system was reloaded,
and the operator's response to the question "Why Reload?"
You can select the default response (DSK:RELOAD.RPT for TOPS-10/TOPS-20,
and RELOAD.RPT for VAX/VMS) by pressing the RETURN key.
You can replace the entire default destination by specifying a new file
name, or you can replace any field in the default response by specifying
only the field that you want to replace. For example, if you typed:
Reload report to (DSK:RELOAD.RPT): .LWK<cr>
the output file specification would become DSK:RELOAD.LWK
The prompt also supports standard Help and question mark (?) responses.
@@3.3.1.G.
The confirmation prompt:
Type <cr> to confirm (/GO):
provides an opportunity for you to review and change any responses
entered up to that point. If you want to review the response list
type /SHOW. If you are satisfied with the response list press the
RETURN key or type /GO.
If you want to change a response, press the BACKSPACE key until you
arrive at the corresponding prompt, make the change, and then type /GO.
@@com_dia_q1
Compute Dialog Q1 of 5
True or False - The formulas used by Compute to calculate: System
Availability (SA), User Availability (UA), and System Effectiveness (SE)
are described in the Spear Manual?
@@com_dia_q1_at
That's correct.
The formulas:
SA = (1.0) - CDT/(TDT + TRT)
UA = (1.0) - CDT/(CDT + TRT)
SE = (SA) * (e** (-t/MTBF))
are also briefly explained in the Introduction section of this module.
@@com_dia_q1_af
The statement is TRUE. The formulas used by Compute to calculate
system availability, user availability and system effectiveness are
described in the Spear Manual.
You should become familiar with those formulas before you attempt to
interpret the reports generated by Compute.
@@com_dia_q2
Compute Dialog Q2 of 5
True or False - The entries used by Compute to calculate system
performance are recorded in the system event file: ERROR.SYS for
TOPS-10 and TOPS-20, and ERRLOG.SYS for VAX/VMS?
@@com_dia_q2_at
The statement is FALSE. Under TOPS-20 and VAX/VMS the entries are
recorded in the system event files. However, under TOPS-10 the entries
are recorded in a file called AVAIL.SYS.
@@com_dia_q2_af
That's correct.
TOPS-10 records the entries in a file called AVAIL.SYS. As a general
rule, most TOPS-10 sites rename the AVAIL.SYS file to AVAIL.Ann weekly.
(Where nn is a number in the range of 01 to 99.) Typically, the first
AVAIL.SYS file becomes AVAIL.A01, the second AVAIL.A02, etc.
Thus, if the latest AVAIL.Ann file was AVAIL.A25, and you wanted Compute
to calculate system performance for the last four weeks, then you would
specify AVAIL.A22 as the input file.
@@com_dia_q3
Compute Dialog Q3 of 5
True or False - Compute generates two types of reports: a 72 column
Summary Report that highlights overall system performance, and a 132
column Full Report that provides more detail?
@@com_dia_q3_at
That's correct.
The Summary report is automatically displayed on your terminal. It will
provide a picture of overall performance. The Full report backs up the
Summary report with specific details.
Note: The Full report requires 132 columns and is generally not suited
for display on most terminals.
@@com_dia_q3_af
The statement is TRUE. Compute generates two types of reports: a Summary
Report that highlights overall system performance, and a Full Report
that details system availability and effectiveness.
The Summary report is automatically output to your terminal when you run
Compute. The following example illustrates a typical Summary report:
Compute Summary Report From: 7-Jun-81 01:00 To: 14-Jun-81 01:00
period length (HRS): 168.000
SYSTEM Availability % : 100.000
USER Availability % : 100.000
Effectiveness Six minutes Thirty minutes One Hour Four Hours
factor 99.584 97.938 95.918 84.648
Report file name: DSK:COMPUT.RPT
Note: The Effectiveness Factor is the probability that a six minute, a
thirty minute, a one hour and a four hour job will run to completion.
@@com_dia_q4
Compute Dialog Q4 of 5
True or False - Compute uses the operator's response to the question:
"Why Reload" to determine User Availability?
@@com_dia_q4_at
The statement is FALSE. The operator's response to the question: "Why
Reload" is used by Compute to distinguish between Chargeable Downtime
and Non-chargeable Downtime.
@@com_dia_q4_af
That's correct.
The operator's response to the question: "Why Reload" is used to
distinguish between Chargeable Downtime and Non-chargeable Downtime. The
following operator responses constitute:
Chargeable Downtime - STOPCD, BUGHLT, HALT, PARITY, HARDWARE, NXM, HUNG,
LOOP, and CM (Corrective Maintenance).
Non-chargeable Downtime - PM (Preventive Maintenance), OPERATOR, POWER,
STATIC, NEW, SCHEDULED, STANDALONE, and OTHER.
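The two lists above can be turned into a small Python lookup. This is
a sketch of the classification only, not Compute's code: the function
names and the sample reload data are hypothetical, and rejecting an
unrecognized response with an error is an assumption.

```python
from typing import Iterable, Tuple

CHARGEABLE = frozenset({
    "STOPCD", "BUGHLT", "HALT", "PARITY", "HARDWARE",
    "NXM", "HUNG", "LOOP", "CM",
})
NON_CHARGEABLE = frozenset({
    "PM", "OPERATOR", "POWER", "STATIC",
    "NEW", "SCHEDULED", "STANDALONE", "OTHER",
})


def is_chargeable(why_reload: str) -> bool:
    """Classify an operator's "Why Reload" response."""
    answer = why_reload.strip().upper()
    if answer in CHARGEABLE:
        return True
    if answer in NON_CHARGEABLE:
        return False
    # Assumption: responses outside both lists are treated as errors here.
    raise ValueError(f"unrecognized reload response: {why_reload!r}")


def chargeable_downtime(reloads: Iterable[Tuple[str, float]]) -> float:
    """Sum the down time of the reloads whose response is chargeable."""
    return sum(down for why, down in reloads if is_chargeable(why))


# Hypothetical reloads: (operator response, down time).
reloads = [("OTHER", 0.414), ("PARITY", 1.5), ("PM", 2.0), ("BUGHLT", 0.25)]
print(chargeable_downtime(reloads))
```

The total computed this way corresponds to the CDT term used in the
System Availability and User Availability formulas.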
@@com_dia_q5
Compute Dialog Q5 of 5
True or False - In addition to the Summary Report and the Full Report,
Compute also generates a Reload Report called COMPUT.RLD?
@@com_dia_q5_at
The statement is true, only in that Compute generates a Reload Report.
The report is actually called RELOAD.RPT not COMPUT.RLD.
@@com_dia_q5_af
That's correct.
The name of the report is: RELOAD.RPT. The following example illustrates
the type of information it contains.
SYSTEM 2116 THE BIG ORANGE, TOPS-20 MONITOR 4(3530)
Built on: 28-May-81 11:41:11 Version: 400,,3530
Loaded on: 10-Jun-81 20:20:45 Crashed on: 14-Jun-81 07:00:16
Reloaded on: 14-Jun-81 07:25:08 Why reload: OTHER
Run time: 6.004 Down time: 0.414
SYSTEM 2116 THE BIG ORANGE, TOPS-20 MONITOR 4(3530)
Built on: 28-May-81 11:41:11 Version: 400,,3530
Loaded on: 14-Jun-81 07:25:10 Crashed on: 15-Jun-81 08:38:20
Reloaded on: 15-Jun-81 08:38:20 Why reload: OTHER
Run time: 25.219 Down time: 0.000
The Reload Report and the Full Report are intended to help you complete
system Crash and Uptime reports.
@@3.3.1.1.
That's the last question about the Compute dialog. Press the RETURN key
to return to the menu.
@@3.3.2.
The Compute Report questions were not ready in time for this Field Test
Version of Spear. Press the BACKSPACE key or type MENU to return to the
Compute menu.
@@com_rpt_q1
Compute Report - Q1 of 5
True or False -
@@com_rpt_q1_at
That's correct.
@@com_rpt_q1_af
The statement is TRUE.
@@com_rpt_q2
Compute Report - Q2 of 5
True or False -
@@com_rpt_q2_at
The statement is FALSE.
@@com_rpt_q2_af
That's correct.
@@com_rpt_q3
Compute Report - Q3 of 5
True or False -
@@com_rpt_q3_at
The statement is FALSE.
@@com_rpt_q3_af
That's correct.
@@com_rpt_q4
Compute Report - Q4 of 5
True or False -
@@com_rpt_q4_at
That's correct.
@@com_rpt_q4_af
The statement is TRUE.
@@com_rpt_q5
Compute Report - Q5 of 5
True or False -
@@com_rpt_q5_at
The statement is FALSE.
@@com_rpt_q5_af
That's correct.
@@3.3.2.1.
That's it. There are only five questions about the Compute Report.
Press the RETURN key to return to the Compute menu.
@@3.4.0.
Summarize Overview - Summarize is designed to read and summarize the
contents of system event files.
The purpose of this Instruct module is to ensure that you understand
the dialog and the report associated with the Summarize function. The
module consists of two parts. Part one briefly explains the Summarize
dialog and then asks some questions to ensure that there are no
misunderstandings.
Part two of this module briefly explains the format and organization of
the Summarize Report. (You will be asked to generate or obtain a typical
Summarize Report.) The remainder of the module consists of a series of
questions about the report. Again, the purpose of the questions is to
ensure that there are no misunderstandings about the general format and
content of the report.
Objective - Upon completion of this module you should have no difficulty
using the Summarize dialog or understanding the format, organization and
content of a typical Summarize report.
@@R.T.3.4.0.
STOP - You are moving in a reverse direction through the menu. You are
about to back into the Summarize Overview.
@@3.4.M.
Spear Library - Summarize
Topic menu:
1. Overview
2. Summarize Dialog
3. Summarize Dialog Questions & Answers
4. Summarize Report
5. Summarize Report Questions & Answers
@@3.4.1.
Summarize Dialog - The Summarize dialog consists of six selection prompts
and one confirmation prompt.
SUMMARIZE mode
--------------
Event file (default):
Category (ALL):
Time from (EARLIEST):
Time to (LATEST):
Show Error Distribution(YES):
Report to ([DSK]:SUMMAR.RPT):
Type <cr> to confirm (/GO):
@@3.4.1.A.
The first selection prompt:
Event file (default):
allows you to specify the name of the system event file that you want
summarized. The default response (SYS:ERROR.SYS for TOPS-10,
SERR:ERROR.SYS for TOPS-20, and SYS$ERRORLOG:ERRLOG.SYS for VAX/VMS)
is enclosed in parentheses and can be selected by pressing the RETURN key.
You can override the entire default response by specifying a new file
name. You can also override any field in the default response by
specifying only the field that you want to override.
For example, if you were to type:
Event file (SERR:ERROR.SYS): .LWK<cr>
the input file specification would become SERR:ERROR.LWK
The prompt also supports standard Help and question mark (?) responses.
@@3.4.1.AA.
After you have specified the source of input, SUMMARIZE prompts you
for the category.
Category(ALL):
ALL
MAINFRAME
DISK
TAPE
CI
NI
UNITRECORD
NETWORK
OPERATING-SYSTEM
COMM
PACKID
REELID
HELP
@@3.4.1.AB.
ALL (or the RETURN key) - indicates that you want to select
all errors. (This is the default).
MAINFRAME - indicates that you want to select errors occurring in
specific mainframe components.
DISK - indicates that you want to select errors occurring on disk
units. After selecting DISK, you can specify ALL, or the specific
disks by name (DPA3, RPB7) or by disk type (RP06, RM05).
TAPE - indicates that you want to select errors occurring on tape
units. After selecting TAPE, you can specify ALL, or specify the
tape names or types in question.
CI - indicates that you want to select CI-related errors. After
selecting CI, you can specify ALL, or the specific component of
interest.
NI - indicates that you want to select NI-related errors.
@@3.4.1.AC.
UDA - indicates that you want to select UDA-related errors.
After selecting UDA, you can specify ALL, or the specific
component of interest.
UNITRECORD - indicates that you want to select errors occurring
on unit-record devices such as card readers and line printers.
After selecting UNITRECORD, you can specify ALL, or type the
specific device names or types in question.
OPERATING-SYSTEM - indicates that you want to select operating
system codes. After selecting OPERATING-SYSTEM, you can specify
ALL, or type the name of a specific STOPCODE or BUG type.
COMM - indicates that you want to select errors occurring on
communication devices.
@@3.4.1.AD.
PACKID - indicates that you want to select specific disk packs.
After typing PACKID, you can type ALL, or type the specific pack
identifiers.
REELID - indicates that you want to select specific tape reels.
After typing REELID, you can type ALL, or the specific tape
identifiers.
HELP - indicates that you want to get detailed information
on the above categories.
All categories except for COMM and NI prompt further for specific
device types. Type ? at the subprompt level to get a list of
acceptable responses.
@@3.4.1.AE.
SUMMARIZE keeps prompting you for categories until you either type
FINISHED, or press the RETURN key.
Next Category (FINISHED):
Type one of the following:
The RETURN key, or FINISHED to take the default,
or,
another category.
@@3.4.1.B.
The third selection prompt:
Time from (EARLIEST):
allows you to specify the time at which to begin summarizing the system
event file. The default response (EARLIEST) is enclosed in parentheses
and can be selected by pressing the RETURN key. You can also specify
real and relative time.
The prompt also supports standard Help and question mark (?) responses.
@@3.4.1.C.
The fourth selection prompt:
Time to (LATEST):
allows you to specify the time at which to end summarizing the system
event file. The default response (LATEST) is enclosed in parentheses
and can be selected by pressing the RETURN key. Again, you can also
specify real and relative time.
The prompt also supports standard Help and question mark (?) responses.
@@3.4.1.DA.
The fifth selection prompt:
Show Error Distribution (YES):
allows you to specify whether or not you want to receive error
distribution tables. The default response (YES) is enclosed in
parentheses and can be selected by pressing the RETURN key.
If you type NO, you will suppress the error distribution tables
from the report.
@@3.4.1.D.
The sixth selection prompt:
Report to ([DSK]:SUMMAR.RPT):
allows you to specify the name of the output or Report file. The default
response (DSK:SUMMAR.RPT for TOPS-10/TOPS-20, and SUMMAR.RPT for VAX/VMS)
is enclosed in parentheses and can be selected by pressing the RETURN key.
You can override the entire default response by specifying a new file name.
You can also override any field in the default response by specifying only
the field that you want to override.
For example, if you were to type:
Report to (DSK:SUMMAR.RPT): FS:<cr>
the output file specification would become FS:SUMMAR.RPT
The prompt also supports standard Help and question mark (?) responses.
@@3.4.1.E.
Finally, the confirmation prompt:
Type <cr> to confirm (/GO):
provides an opportunity for you to review and change any responses
entered up to that point. If you want to review the response list
type /SHOW. If you are satisfied with the response list press the
RETURN key or type /GO.
If you want to change a response, press the backspace key until you
arrive at the corresponding prompt, make the change, and then type /GO.
@@3.4.1.F.
That concludes the explanation of the Summarize dialog. Next on the
menu is a set of questions about the Summarize dialog.
@@sum_dia_q1
Summarize Dialog - Q1 of 7
True or False - If you do NOT want to change any of the Summarize
default responses, you can type /GO at the Event file prompt?
@@sum_dia_q1_at
That is correct.
All Spear Library functions begin by setting the response list to the
default values. You can change the responses or type /GO at any time.
The function will use the responses that you have specified up to
that point and default the rest. If you make no changes the default
response list is used.
@@sum_dia_q1_af
The statement is TRUE. When you first enter a Spear library dialog, the
response list is set to the default values. Thus, if you type /GO at the
Event file prompt Summarize will begin execution using the defaults. The
result will be a report that summarizes the contents of the entire event
file.
@@sum_dia_q2
Summarize Dialog - Q2 of 7
True or False - If you type HELP in response to any Summarize prompt,
a ONE page message explaining the prompt and the acceptable response
to that prompt will be displayed?
@@sum_dia_q2_at
That is correct.
All Spear Library prompts support the HELP and (?) commands. The Help
messages are limited to one page, and the prompt is repeated immediately
following the message. Typing (?) will result in a list of acceptable
responses without explanation.
@@sum_dia_q2_af
The statement is TRUE. You can type HELP<cr> any time you are not sure
how you should respond to a particular prompt. You will receive a one
page HELP message that explains the prompt and the acceptable responses
to that prompt.
@@sum_dia_q3
Summarize Dialog - Q3 of 7
True or False - Summarize will accept and summarize the contents of any
binary event file, including a binary event file generated by Retrieve?
@@sum_dia_q3_at
That is correct.
Summarize will accept (as input) any file that conforms to the standard
binary event file format. Currently, that includes event files generated
by: TOPS-10, TOPS-20, VAX/VMS, or Retrieve.
There is one restriction, however: the event file must have been
generated by the same type of system that you are using to summarize
the file. In other words, the TOPS-10 version of Spear can NOT be used
to process event files generated by TOPS-20 etc.
@@sum_dia_q3_af
The statement is TRUE. Retrieve does not change the file format when it
generates a binary (or History) file. Therefore, since Summarize is
designed to handle standard binary event files, it will accept binary
event files generated by Retrieve.
@@sum_dia_q4
Summarize Dialog - Q4 of 7
True or False - In order to take the default response at a Summarize
prompt you must press the ESCAPE key before pressing the RETURN key?
@@sum_dia_q4_at
The statement is FALSE. You don't have to press ESCAPE/RETURN to take
the default response. You need only press the RETURN key.
Originally, the purpose of the ESCAPE key was to display the default
response. However, as a result of feedback during product Field Test,
the prompts were changed. They now display the default responses in
parentheses. Thus, the original purpose of the ESCAPE was nullified.
@@sum_dia_q4_af
That is correct.
Since the default response is enclosed in parentheses, there is no need
to use the ESCAPE key.
@@sum_dia_q5
Summarize Dialog - Q5 of 7
True or False - Summarize will accept and summarize Packet Files
generated by Analyze?
@@sum_dia_q5_at
The statement is FALSE. A Packet file is not a standard binary event
file. It is a special file produced by Analyze that contains pointers
that identify the records that were used as evidence to support the
theories listed in the corresponding Analyze Report file.
@@sum_dia_q5_af
That is correct.
Summarize only accepts standard binary event files. Since a Packet file
is not a standard binary event file, Summarize will not accept it.
@@sum_dia_q6
Summarize Dialog - Q6 of 7
True or False - If you want to change the name of the report file from
DSK:SUMMAR.RPT to DSK:TEST.RPT, you need only type TEST at the Report
prompt?
@@sum_dia_q6_at
That is correct.
You can substitute fields at any Spear file specification prompt. For
example, if you wanted the report to go to FS: and you wanted to call
it SUMMAR.LWK, you could type:
Report to(DSK:SUMMAR.RPT): FS:.LWK<cr>
@@sum_dia_q6_af
The statement is TRUE. All Spear Library file-name prompts accept field
substitution. You can substitute the output device, the file name, the
file extension, or any combination thereof.
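The field substitution described above can be sketched in Python. This
is only an illustration, not the Spear implementation: the function name
substitute_fields is hypothetical, and the sketch assumes a simple
device:name.ext specification with no directory field.

```python
import re

# Hypothetical helper: splits "DEV:NAME.EXT" into its three optional fields.
_SPEC = re.compile(r"^(?:(?P<dev>[^:.]+):)?(?P<name>[^:.]*)(?:\.(?P<ext>[^:.]*))?$")

def substitute_fields(default_spec: str, typed: str) -> str:
    """Merge the fields typed at a prompt into the default file spec."""
    default = _SPEC.match(default_spec)
    given = _SPEC.match(typed)
    if default is None or given is None:
        raise ValueError("unrecognized file specification")
    # Any field the user left out falls back to the default's field.
    dev = given.group("dev") or default.group("dev")
    name = given.group("name") or default.group("name")
    ext = given.group("ext") or default.group("ext")
    return f"{dev}:{name}.{ext}"

print(substitute_fields("DSK:SUMMAR.RPT", "FS:.LWK"))
print(substitute_fields("DSK:SUMMAR.RPT", "TEST"))
```

Under these assumptions the first call yields FS:SUMMAR.LWK and the
second yields DSK:TEST.RPT, matching the two examples in the text.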
@@sum_dia_q7
Summarize Dialog - Q7 of 7
True or False - Both the "Time from" and the "Time to" prompt accept
real and relative time?
@@sum_dia_q7_at
That is correct.
All Spear Library "Time" prompts accept both real and relative time
specifications.
@@sum_dia_q7_af
The statement is TRUE. All Spear Library "Time" prompts accept both real
and relative time specifications. The format for real time is:
dd-mmm-yy hh:mm:ss where dd is the numerical day, mmm is the first
three letters of the month, yy is the last two digits of the year,
and hh:mm:ss represent the hour, minute, and second respectively.
The format for relative time is:
-dd where dd represents some number of past days. The time defaults
to 00:00:01.
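The two time formats above can be sketched as a small Python parser.
This is an illustration only. The function name parse_time is
hypothetical, two-digit years are assumed to mean 19yy, and a relative
time of -dd is assumed to mean 00:00:01 on the day dd days before "now".

```python
from datetime import datetime, timedelta

MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

def parse_time(text: str, now: datetime) -> datetime:
    """Parse a real (dd-mmm-yy hh:mm:ss) or relative (-dd) time."""
    text = text.strip()
    if text.startswith("-"):
        # Relative time: go back dd days; the time defaults to 00:00:01.
        day = now - timedelta(days=int(text[1:]))
        return day.replace(hour=0, minute=0, second=1, microsecond=0)
    # Real time: split the date and time fields by hand.
    date_part, time_part = text.split()
    dd, mmm, yy = date_part.split("-")
    hh, mi, ss = (int(field) for field in time_part.split(":"))
    month = MONTHS.index(mmm.upper()) + 1
    return datetime(1900 + int(yy), month, int(dd), hh, mi, ss)  # assumes 19yy
```

For example, with "now" set to 10-Mar-85 14:00:00, the response -2
would resolve to 8-Mar-85 00:00:01 under these assumptions.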
@@3.4.1.1.
That's it. If you have gotten this far, then chances are you have a good
handle on the Summarize dialog. Next on the menu is a brief explanation
of the Summarize Report format.
@@3.4.2.
Summarize Report - The Summarize Report consists of four major sections:
1. A File Environment and Entry Occurrence Count section.
2. A Monitor Detected Error and Reload section.
3. A Front-end, Channel and Device Summary section.
4. A Channel and Device Breakdown section.
This part of Instruct involves a series of questions. The questions are
designed to ensure that you understand the format and general content of
a typical Summarize Report.
@@3.4.2.A.
Before proceeding further, you should have a copy of a Summarize Report.
You can type /BREAK and generate one using the Spear Library, or you can
use the one in the Spear Manual.
When you are ready to proceed press the RETURN key.
@@sum_rpt_q1
Summarize Report - Q1 of 8
True or False - If you are running on a TOPS-20 System, the "Monitor
Detected Errors and Reloads" section of the Summarize Report identifies
the number of BUGHLT, BUGCHK, and BUGINF that occurred during the summary
period?
@@sum_rpt_q1_at
That is correct.
The BUGHLTs, BUGCHKs, and BUGINFs are described in the TOPS-20 Software
Notebooks (Volume 16).
@@sum_rpt_q1_af
The statement is TRUE. You would have no way of knowing this, however,
if, during the summary period that you selected, there were no BUGHLTs,
BUGCHKs, or BUGINFs recorded. Summarize does not print this section of
the report unless there were BUGxxx events recorded during the summary
period.
@@sum_rpt_q2
Summarize Report - Q2 of 8
True or False - The "File Environment" section of the Summarize Report
always lists the total number and type of entries recorded in the system
event file that was submitted as input?
@@sum_rpt_q2_at
The statement is FALSE. The Summarize Report only lists the entries that
were recorded during the period of time being summarized. That period
can cover the entire event file, but it does not always do so.
@@sum_rpt_q2_af
That is correct.
Only the events that occurred on or between the times the user specified
at the "Time from" and the "Time to" prompts are summarized.
@@sum_rpt_q3
Summarize Report - Q3 of 8
True or False - Under the "File Environment" section of the Summarize
Report, the term "inconsistencies" refers to the number of unknown event
types that were found in the summarized period of the event file?
@@sum_rpt_q3_at
The statement is FALSE.
The term "inconsistencies" means that Spear encountered a nonrecoverable
read error while reading the event file. In such cases it loses sync and
must use the resynchronization word in the next data block to recover.
For further information about the resync process refer to the DEFINE.LIS
file and the Spear Manual.
@@sum_rpt_q3_af
That is correct.
The term "inconsistencies" refers to the number of times Summarize lost
sync reading the event file and had to use the resynchronization word
in the next data block to recover.
@@sum_rpt_q4
Summarize Report - Q4 of 8
True or False - The "Entry Occurrence Counts" section of the Summarize
Report lists the event code and the number of times each event type
appeared in the summarized period of the system event file?
@@sum_rpt_q4_at
That is correct.
The entry types are catalogued by entry code and described in Appendix B
of the Spear Manual. Sometime, when you get a chance, you should take a
look at Appendix B. It lists, in detail, the information recorded for
each entry type in the system event file.
@@sum_rpt_q4_af
The statement is TRUE. If you take a look at the report you'll see a
decimal number, followed by a name, followed by a number in parentheses.
The decimal number indicates the number of times a particular entry type
appeared in the file; the name refers to the entry type; and the number
in parentheses refers to the code assigned to the entry type by the
system software developers.
@@sum_rpt_q5
Summarize Report - Q5 of 8
True or False - Under the "RP04/RP05/RP06 Breakdown" section of the
Summarize Report only the contents of Error Register 1 are listed?
@@sum_rpt_q5_at
The statement is FALSE. If there are any error bits set in Error
Register 2 they will be listed also. However, if none of the disk
errors summarized had a bit set in Error Register 2 then, of course,
the contents of Error Register 2 would not be listed. If that's the
case, then you're correct.
@@sum_rpt_q5_af
That is correct.
Summarize does not try to hide information. However, because the report
was designed so that it could be displayed on a terminal (i.e., 72
columns), the contents of Error Register 2 are listed below the contents
of Error Register 1. The purpose of the question was to point that out
because, at a glance, you might think that Error Register 2 was part of
a different summary.
@@sum_rpt_q6
Summarize Report - Q6 of 8
True or False - For the most part the Summarize Report is easy to read
and understand?
@@sum_rpt_q6_at
We're glad that you're satisfied. However, if you have any suggestions or
ideas that will improve the format or content of the report please use
the FEEDBACK feature on the Main Course Menu to let us know.
@@sum_rpt_q6_af
OK. Changing the report format is a relatively easy task. If you would
take the time to let us know how the report could be improved we'll do
our best to make the changes in the next release. You will find our
address listed under FEEDBACK on the Main Course Menu.
@@sum_rpt_q7
Summarize Report - Q7 of 8
True or False - In a Summarize Report, asterisks (***) will be printed
if a number exceeds the maximum digits for a field?
@@sum_rpt_q7_at
That is correct.
Each asterisk represents one digit of the total spaces set aside for
a numeric value (that includes the decimal point, if the number is
decimal). In other words, if three spaces were set aside for a value
(say 99.), then three asterisks (***) will be printed should the value
exceed 99.
@@sum_rpt_q7_af
The statement is TRUE. The number of digits that can be printed is
limited to the space available in the report (i.e., 72 columns). Thus,
there is always a possibility that the number of digits necessary to
report a count will exceed the available space. When such a case occurs
a string of asterisks (***) will be printed.
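The overflow rule can be sketched in Python as follows. This is a
minimal illustration, not the Summarize formatting code: the function
name format_count is hypothetical, and the sketch assumes every count is
printed with a trailing decimal point that occupies one space of the
field.

```python
def format_count(value: int, width: int) -> str:
    """Format a count for a fixed-width report field."""
    text = f"{value}."          # the decimal point uses one space of the field
    if len(text) > width:
        return "*" * width      # value does not fit: one asterisk per space
    return text.rjust(width)    # right-justify values that do fit

print(format_count(99, 3))
print(format_count(100, 3))
```

With a three-space field, 99 prints as "99." and 100 prints as "***",
as in the example given in the text.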
@@sum_rpt_q8
Summarize Report - Q8 of 8
True or False - The following Summarize report indicates that DP160
experienced 5 errors: 2 Hard Errors and 3 Soft Errors?
RP04/RP05/RP06 Breakdown:
Error Register 1
D U O D W I A H H E W F P R I I
C N P T L A O C C C C E A M L L
K S I E E E E R E H F R R R R F
              C
S/N 1957
DP160 H 1. 1.
S 3.
@@sum_rpt_q8_at
The statement is FALSE. You cannot determine how many Hard and Soft
errors a device experienced by looking at the Breakdown section because
(and this is important to remember) the Breakdown section indicates the
number of times each error bit was set when Hard errors occurred, and the
number of times each error bit was set when Soft errors occurred.
The following RP04/RP05/RP06 Summary taken from the same Summarize
report that the Breakdown was taken from bears this out. It indicates
that DP160 experienced a total of 4 errors: 1 Hard and 3 Soft.
RP04/5/6 Summary:
            Hard   Soft
S/N 1957
DP160         1.     3.
The point is: don't be tricked into thinking that the system had more
errors than it actually had. When you want to know the total number of
errors experienced by a Channel or a Device go by the Summary NOT the
Breakdown.
@@sum_rpt_q8_af
That is correct.
The Breakdown reflects the number of times each bit was set during Hard
and Soft errors. If you want to know the total number of Hard and Soft
errors for a given Channel or Device refer to the Summaries.
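The difference between the two sections can be sketched in Python. This
is an illustration only. The function name tally is hypothetical, and
the bit names assigned to each error below are made up for the example;
they are chosen only so that the counts match the DP160 figures
(Breakdown: H 1. 1., S 3.; Summary: 1 Hard, 3 Soft).

```python
from collections import Counter

def tally(errors):
    """Return (summary, breakdown) counters for a list of (kind, bits)."""
    summary = Counter()      # one count per error, keyed by "H" or "S"
    breakdown = Counter()    # one count per error bit set, keyed by (kind, bit)
    for kind, bits in errors:
        summary[kind] += 1
        for bit in bits:
            breakdown[(kind, bit)] += 1
    return summary, breakdown

# Hypothetical events: one Hard error with two bits set, three Soft
# errors with one bit set each.
events = [("H", {"DCK", "ECH"}), ("S", {"DCK"}), ("S", {"DCK"}), ("S", {"DCK"})]
summary, breakdown = tally(events)
print(sum(summary.values()), sum(breakdown.values()))
```

The Summary counters total 4 errors, while the Breakdown counters total
5 bit occurrences, because the single Hard error set two bits.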
@@3.4.2.1.
Well, that's it. You have just completed the Summarize Report section
of Instruct. Assuming that you have also completed the Dialog section,
you should feel that you are a qualified Summarize user.
If for some reason you do not agree, or again, if you have any ideas or
suggestions that will make either Instruct or Summarize a better product
please let us know. You will find our mailing address listed under
FEEDBACK on the Course Menu.
Press the RETURN key to return to the Spear Library Menu.
@@SUMMARIZE INPUT
System Event File ___.   .----------.
                     !   !  Event   !
                     !___!  File    !____ Summary Report
Retrieve             !   ! Summary  !
(binary) File     ___!   !__________!
      INPUT             PROCESS          OUTPUT
Summarize reads the specified event file, summarizes its contents and
produces a report file. The contents are summarized by: event code,
STOPCODE or BUGxxx code types, front-end reloads, channel errors, disk
errors and magtape errors.
@@sum_dia_qx
Summarize Dialog - Qx of x
True or False -
@@sum_dia_qx_at
That is correct.
The statement is FALSE
@@sum_dia_qx_af
That is correct.
The statement is TRUE
@@sum_rpt_qx
Summarize Report - Qx of x
True or False -
@@sum_rpt_qx_at
That is correct.
The statement is FALSE
@@sum_rpt_qx_af
That is correct.
The statement is TRUE
@@3.5.1.
Spear Library Applications
The Spear Library can be used in conjunction with either the Systematic
Substitution Troubleshooting Approach, or the Formal Troubleshooting
Approach to isolate the cause of intermittent failures.
@@3.5.1.A.
The first thing you want to do is ensure that Summarize is run on a daily
basis. The best way to do this is to run it via a daily Batch job. If
you are not sure how to do that you can ask an experienced operator to
give you a hand or, if you're on a TOPS-20 system, you can try using this
Batch Control File:
@SUBMIT SPEAR /TIME:30 /AFTER:TODAY ! Resubmit SPEAR again tomorrow.
@RENAME *.RPT *.RPO ! Rename yesterday's report file.
@SPEAR ! Run SPEAR.
*SUMMARIZE /GO ! Summarize yesterday's errors.
*EXIT ! Then leave.
@IF (ERROR) ! Continue even if there's an error.
@PRINT *.RPT /NOTE:"SPEAR - F-S" ! Print two copies of the report:
@PRINT *.RPT /NOTE:"SPEAR - OPER" ! one for FS and one for Operations.
@@3.5.1.B.
Or, if you're on a TOPS-10 system, you can try using this Control File:
.SUBMIT SPEAR /TIME:30 /AFTER:23:59 ! Resubmit Spear again tomorrow.
.R SPEAR ! Run Spear.
*SUMMARIZE /GO ! SUMMARIZE yesterday's errors.
*EXIT ! Then leave.
.IF (ERROR) ! Continue even if there's an error.
.PRINT *.RPT /NOTE:"SPEAR - SITE" ! Print two copies of the report:
.PRINT *.RPT /NOTE:"SPEAR - F-S" ! one for FS and one for the Site.
.RENAME *.RPD = *.RPT ! Rename today's report so that it
! won't be printed again tomorrow.
@@3.5.1.C.
Once you have the Batch File running you can use the daily reports to
monitor the overall performance of the system. If the error rate for
a particular device or subsystem starts to go up, you will see it
reflected in the various summaries and histograms.
@@3.5.1.D.
Next, a few hours before you get the system for routine maintenance,
submit the last seven days or so of the event file for summarization. Allow
yourself about an hour to look over the report and decide on a fault
isolation strategy. For example, suppose the report indicates
that, among other things:
DP140 reported 5 recoverable Index Errors while PS1: was mounted.
Since intermittent Index Errors are generally caused by either a faulty
Servo Track or a faulty Index module, during the maintenance period you
could swap the Index module in DP140 with the Index module in another
drive (let's say DP220).
Then, when you return the system to operations you could ask that they
move PS1: to a different drive (perhaps DP110). The rest is a matter of
"wait and watch". You do the waiting and you use Summarize (on a daily or
weekly basis) to do the watching.
@@3.5.1.E.
1. If the problem moves to DP220, then you know that the Index module
was the cause of the failure.
2. If the problem moves to DP110, then you know that the medium (PS1:)
was the cause of the failure.
3. If the problem does not move, then you know that the cause was
neither the Index module nor the medium. So, the next chance you get, put
everything back the way it was and try something else. Sooner or
later the report is bound to reflect the fact that you have
identified and either moved or eliminated the cause of the problem.
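The three outcomes above amount to a small decision table, sketched
here in Python. The function name diagnose and its parameter names are
hypothetical; the drive names are the ones used in the example (Index
module swapped from DP140 into DP220, PS1: moved from DP140 to DP110).

```python
def diagnose(failing_drive: str,
             original: str = "DP140",
             module_now_in: str = "DP220",
             pack_now_on: str = "DP110") -> str:
    """Map the drive now reporting Index Errors to the likely cause."""
    if failing_drive == module_now_in:
        return "Index module"          # the errors followed the swapped module
    if failing_drive == pack_now_on:
        return "medium"                # the errors followed the pack (PS1:)
    if failing_drive == original:
        return "neither; put everything back and try something else"
    return "inconclusive"              # errors on a drive that was not involved

print(diagnose("DP220"))
print(diagnose("DP110"))
print(diagnose("DP140"))
```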
@@3.5.1.F.
When used in this manner SPEAR becomes a very powerful troubleshooting
tool. The principle is simple. If you move a faulty component from one
piece of equipment to another the error symptoms will move with it. If
they don't, then at least you know what the problem is not.
This particular isolation technique was developed, during product load
test, by the South Massachusetts Field Service Office. If you can come
up with any neat ways of using the Spear Library to simplify system
maintenance please let us know. We'd be glad to try and include it in
this Application Section. You'll find our address listed under Feedback
on the main course menu.
@@3.6.0
The KLERR function provides expanded reporting of the KL10 function
reads supplied by the Front-End on a monitor crash.
SPEAR can be used to generate detailed reports of and/or summaries of
KLERR data blocks. You can always get a summary, but you must select
one of three formats if you want a detailed report of each event.
@@3.6.1
The following summary options will be available:
o ALL -- This will result in a complete listing containing the
number of times each signal was true and false.
o ERRORS-ONLY -- This will result in a single-page list
containing the number of times an error signal was true and
the number of times it was false.
o NONE -- This will result in no summary at all.
@@3.6.2
The following report format options will be available:
o SUMMARY-ONLY -- This will result in no entry-by-entry output.
Only the final summary of signals will be printed.
o FULL -- The result will be a set of detailed reports that
list all of the registers and signals (true or false) as well
as fields.
o TRUE-SIGNALS -- The result will be a set of detailed reports
that list all of the registers but only the "true" signals
and not the fields.
o CRAM-BAD-WORD -- The result will be a set of reports,
consisting of one line for each record which included a CRAM
parity error. This line will report the CRAM location and
contents.
@@3.6.3
The following output formats will be available for the CRAM word:
o MICROCODE -- This format is used to compare the bad CRAM word
with the microcode listing.
o OCTAL -- This format matches the one shown in the KL10
Maintenance Handbook and can help isolate the failing CRAM
module.
o TRACON -- Used to compare with "TRACON" snapshots.
@@KLERR END
This concludes the KLERR section of the course. We hope you found it
useful. Also, if you have any comments about this section please get
in touch with us. Our address is found under FEEDBACK on the main
course menu.
@@4.0.
The Guaranteed Uptime Program is a service that allows you and DIGITAL to
work together to select and maintain the highest level of reliability for
your system.
Together you and DIGITAL determine the percentage of Uptime your site
requires, from 96% to 99%. Uptime is defined as any time the
system is NOT down - with downtime defined as:
(1) That time within the hours of contract coverage when the system is
turned over to DIGITAL for corrective maintenance due to operating
system malfunction resulting in a system crash and failure to restart.
(2) Failure of DIGITAL-supplied hardware which in your opinion makes the
system unavailable for use.
@@4.0.A.
The NOTIFY program and the SPEAR function COMPUTE are the two programs
that provide the tools to monitor the operation of the system and
calculate the statistics needed to measure uptime. NOTIFY is the program
that allows you to keep the current contract coverage in a file known as
the contract file.
The NOTIFY program also allows you to keep an outage log that contains the
date and time you report the system inoperable and the date and time you
accept the system back from DIGITAL as being fixed.
When you run NOTIFY, you input two types of information:
(1) The date and time you notified DIGITAL that the system was
down and the date and time DIGITAL returned the repaired
system to you.
(2) The number of hours a day that you have DIGITAL maintenance
coverage. The NOTIFY program then creates a binary file in
your area called NOTIFY.SYS.
This is the file COMPUTE uses to produce the system uptime statistics.
@@4.0.B.
The NOTIFY program contains three modes:
DISPLAY
PURGE
UPDATE
The DISPLAY mode allows you to translate NOTIFY.SYS into ASCII so you can
display all or part of the outage log or contract file.
The PURGE mode allows you to delete a portion of the data base in NOTIFY.SYS,
either from the contract file or from the outage log.
The UPDATE mode allows you to write log entries or to insert or modify
contract coverage into NOTIFY.SYS.
@@4.0.C.
To collect the data needed to measure uptime, do the following:
1. Run NOTIFY to establish a contract file containing the number
of hours you have DIGITAL coverage for corrective maintenance.
2. When you determine that the system is inoperable, call DIGITAL
to report the system-down condition and turn your system over
to DIGITAL for service.
3. When the system is returned to you, run NOTIFY from the same
directory containing the contract file to log:
a) reported time (the date and time you notified DIGITAL).
b) accepted time (the date and time DIGITAL returned the
system to you).
4. After collecting 13 weeks of data, run COMPUTE from the same
directory from which you have been running NOTIFY.
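As a rough illustration of how the collected data relates to an uptime
percentage, consider the Python sketch below. It is NOT the method used
by COMPUTE, which is not described here. The function name
uptime_percent is hypothetical, and the sketch assumes that every
outage falls entirely within the hours of contract coverage and that
uptime is simply coverage time minus downtime, divided by coverage time.

```python
from datetime import datetime

def uptime_percent(coverage_hours: float, outages) -> float:
    """Percent uptime over a period, given (reported, accepted) pairs."""
    down_seconds = sum(
        (accepted - reported).total_seconds()   # reported time to accepted time
        for reported, accepted in outages
    )
    down_hours = down_seconds / 3600
    return 100 * (coverage_hours - down_hours) / coverage_hours

# Hypothetical log: one 4-hour outage during 100 hours of coverage.
log = [(datetime(1985, 3, 4, 9, 0), datetime(1985, 3, 4, 13, 0))]
print(uptime_percent(100, log))
```

Under these assumptions the example works out to 96 percent, the low
end of the range quoted for the program.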
@@4.0.D.
To run the NOTIFY program, type one of the following:
$ RUN SYS$SYSTEM:NOTIFY<cr> on VAX/VMS,
@NOTIFY<cr> on TOPS-20,
.R NOTIFY<cr> on TOPS-10.
NOTIFY responds with the following prompt:
NOTIFY>
At this point, as well as after any other prompt, you can type ? or HELP
to get detailed information on both the prompt and on acceptable responses.
Type DISPLAY if you want to check the outage log or if
you want to check the contract.
Type UPDATE if you want to enter or revise contract coverage, or if you want
to report an outage.
Type PURGE if you want to delete entries from either the contract file or
from the outage log.
@@4.0.E.
The NOTIFY program and the SPEAR function COMPUTE look for the NOTIFY.SYS
file in your default directory. If more than one person will be using
NOTIFY and COMPUTE, you may want to agree on where the NOTIFY.SYS file will
reside. Or you may want to change the location of NOTIFY.SYS.
To change the location of NOTIFY.SYS, use a text editor to modify the file
called NOTIFY.SPE. You can modify the file specification for NOTIFY.SYS to
specify a specific device and directory, or you can even change the name of
the file itself. Both NOTIFY and COMPUTE will use this file specification.
For a more detailed explanation of the NOTIFY program refer to the
GUIDE TO MEASURING UPTIME document.
@@GUP END
This concludes the Guaranteed Uptime Program/NOTIFY section of the course.
We hope you found it useful. Also, if you have any comments about this
section please get in touch with us. Our address is found under FEEDBACK on
the main course menu.
@@
@@rec_alg
Recovery Algorithms - Most operating systems have some sort of algorithm
or procedure for error recovery. This section of Instruct explains the
algorithms used by TOPS-10 and TOPS-20 to recover from disk read errors.
@@R.T.rec_alg
STOP - You are moving in a reverse direction through the course. You are
about to back into the Introduction to the Recovery Algorithms.
@@rec_menu
Disk Read Error Recovery Algorithms
Topic Menu
0. Introduction
1. TOPS-10 Disk Recovery Algorithm
2. TOPS-20 Disk Recovery Algorithm
@@t10_dsk_rec_alg
TOPS-10 RP04/05/06 Disk Read Error Recovery Algorithm
TOPS-10 and TOPS-20 use a similar algorithm to recover from disk read
data errors. The algorithm involves 31 retry attempts. Under TOPS-10,
if an ECC correctable error is detected during a read header or data
operation the following occurs:
1. The transfer is terminated.
2. The software reconstructs the data using the calculated ECC value.
3. The transfer is restarted beginning at the next sector (i.e., the
sector following the sector in error).
If the read data error is not ECC correctable, however, the following
recovery algorithm is invoked.
@@t10_dsk_rec_alg_a
1. Non (ECC) recoverable read error
2. Repeat read operation (attempt ECC correction)
3. Repeat read operation (attempt ECC correction)
4. Repeat read operation (attempt ECC correction)
5. Repeat read operation (attempt ECC correction)
6. Repeat read operation (attempt ECC correction)
7. Repeat read operation (attempt ECC correction)
8. Repeat read operation (attempt ECC correction)
9. Repeat read operation (attempt ECC correction)
10. Repeat read operation (attempt ECC correction)
11. Repeat read operation (attempt ECC correction)
12. Repeat read operation (attempt ECC correction)
13. Repeat read operation (attempt ECC correction)
14. Repeat read operation (attempt ECC correction)
15. Repeat read operation (attempt ECC correction)
16. Repeat read operation (attempt ECC correction)
17. Repeat read operation (attempt ECC correction)
Next, offset is tried.
@@t10_dsk_rec_alg_b
Offset heads (+400 microinches if RP04/05, +200 if RP06).
18. Repeat read operation (attempt ECC correction).
19. Repeat read operation (attempt ECC correction).
Offset heads (-400 microinches if RP04/05, -200 if RP06).
20. Repeat read operation (attempt ECC correction).
21. Repeat read operation (attempt ECC correction).
Offset heads (+800 microinches if RP04/05, +400 if RP06).
22. Repeat read operation (attempt ECC correction).
23. Repeat read operation (attempt ECC correction).
Offset heads (-800 microinches if RP04/05, -400 if RP06).
24. Repeat read operation (attempt ECC correction).
25. Repeat read operation (attempt ECC correction).
@@t10_dsk_rec_alg_c
Offset heads (+1200 microinches if RP04/05, +600 if RP06).
26. Repeat read operation (attempt ECC correction).
27. Repeat read operation (attempt ECC correction).
Offset heads (-1200 microinches if RP04/05, -600 if RP06).
28. Repeat read operation (attempt ECC correction).
29. Repeat read operation (attempt ECC correction).
Return to center line.
Set Error Correction Inhibit (ECC INHIBIT = 1)
30. Repeat read operation.
Reset Error Correction Inhibit (ECC INHIBIT = 0)
31. Repeat read operation (attempt ECC correction).
If all 31 retries are unsuccessful, then the read error is defined as
non-recoverable (Hard) and an entry is made in the structure's BAT block.
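The 31-step TOPS-10 sequence listed above can be generated by a short
Python sketch. It is a model of the list in this section, not monitor
code. The names Step and tops10_schedule are hypothetical, and step 1
(the initial read that was not ECC correctable) is modeled as a
centerline read with ECC enabled.

```python
from typing import List, NamedTuple

class Step(NamedTuple):
    number: int   # step number, 1 through 31
    offset: int   # head offset in microinches (0 = centerline)
    ecc: bool     # True if ECC correction is attempted

def tops10_schedule(drive: str) -> List[Step]:
    """Build the TOPS-10 read-retry schedule for an RP04, RP05, or RP06."""
    if drive not in ("RP04", "RP05", "RP06"):
        raise ValueError("unsupported drive type")
    unit = 200 if drive == "RP06" else 400      # offset increment
    steps = [Step(n, 0, True) for n in range(1, 18)]   # steps 1-17, centerline
    number = 18
    for multiple in (1, 2, 3):                  # 1x, 2x, 3x the increment
        for sign in (+1, -1):                   # plus, then minus
            for _ in range(2):                  # two reads at each offset
                steps.append(Step(number, sign * multiple * unit, True))
                number += 1
    steps.append(Step(30, 0, False))            # centerline, ECC inhibited
    steps.append(Step(31, 0, True))             # centerline, ECC re-enabled
    return steps

for step in tops10_schedule("RP06")[17:]:
    print(step)
```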
@@t20_dsk_rec_alg
TOPS-20 RP04/5/6 Disk Read Error Retry Algorithm
TOPS-10 and TOPS-20 use a similar algorithm to recover from disk read
data errors. The algorithm involves 31 retry attempts.
If any of the retry attempts are successful, then the error is defined
as Soft (recoverable) and the system continues in a normal manner. If,
however, all 31 retries are unsuccessful, then the error is defined as
Hard (non-recoverable) and the system takes the appropriate action.
@@t20_dsk_rec_alg_a
The following details each of the 31 steps in the disk read error
retry algorithm. Assume that a read operation was initiated and
a read data error (DCK) was detected.
The first three retries do not attempt ECC correction.
1. Repeat read operation. Do not attempt ECC correction.
2. Repeat read operation. Do not attempt ECC correction.
3. Repeat read operation. Do not attempt ECC correction.
@@t20_dsk_rec_alg_b
The next 13 retries will attempt ECC correction if ECC Hard = zero (0).
4. Repeat read operation. Attempt ECC correction.
5. Repeat read operation. Attempt ECC correction.
6. Repeat read operation. Attempt ECC correction.
7. Repeat read operation. Attempt ECC correction.
8. Repeat read operation. Attempt ECC correction.
9. Repeat read operation. Attempt ECC correction.
10. Repeat read operation. Attempt ECC correction.
11. Repeat read operation. Attempt ECC correction.
12. Repeat read operation. Attempt ECC correction.
13. Repeat read operation. Attempt ECC correction.
14. Repeat read operation. Attempt ECC correction.
15. Repeat read operation. Attempt ECC correction.
16. Repeat read operation. Attempt ECC correction.
@@t20_dsk_rec_alg_c
The next 12 retries attempt offset and ECC correction.
The first offset value listed is used for RP04s and RP05s.
The second offset value listed is used for RP06s.
17. Offset (+400/+200). Repeat read operation. Attempt ECC correction.
18. Repeat read operation at this offset. Attempt ECC correction.
19. Offset (-400/-200). Repeat read operation. Attempt ECC correction.
20. Repeat read operation at this offset. Attempt ECC correction.
21. Offset (+800/+400). Repeat read operation. Attempt ECC correction.
22. Repeat read operation at this offset. Attempt ECC correction.
23. Offset (-800/-400). Repeat read operation. Attempt ECC correction.
24. Repeat read operation at this offset. Attempt ECC correction.
25. Offset (+1200/+600). Repeat read operation. Attempt ECC correction.
26. Repeat read operation at this offset. Attempt ECC correction.
27. Offset (-1200/-600). Repeat read operation. Attempt ECC correction.
28. Repeat read operation at this offset. Attempt ECC correction.
@@t20_dsk_rec_alg_d
The final three retries are a last ditch effort to get the data.
29. Return to centerline. Repeat read operation. Attempt ECC correction.
30. Set Error Correction Inhibit. Repeat read operation.
31. Set Error Correction Inhibit. Repeat read operation.
If all 31 retries are unsuccessful, then the read error is defined as
non-recoverable (Hard) and an entry is made in the structure's BAT block.
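The Soft/Hard classification that results from the TOPS-20 retry list
can be sketched in Python. This is a simulation under stated
assumptions, not monitor code. The names tops20_plan and classify are
hypothetical, read_attempt stands in for the actual disk read, and the
"ECC Hard = zero" condition on steps 4 through 16 is ignored.

```python
from typing import Callable, List, Tuple

def tops20_plan(drive: str) -> List[Tuple[int, bool]]:
    """Return 31 (offset in microinches, attempt ECC) pairs in retry order."""
    unit = 200 if drive == "RP06" else 400     # RP04/RP05 use 400
    plan = [(0, False)] * 3 + [(0, True)] * 13           # retries 1-3, 4-16
    for multiple in (1, 2, 3):
        for sign in (+1, -1):
            plan += [(sign * multiple * unit, True)] * 2 # retries 17-28
    plan += [(0, True), (0, False), (0, False)]          # retries 29, 30, 31
    return plan

def classify(read_attempt: Callable[[int, bool], bool],
             drive: str = "RP06") -> Tuple[str, int]:
    """Run the retries; return ('Soft', n) on the first success, else Hard."""
    plan = tops20_plan(drive)
    for number, (offset, ecc) in enumerate(plan, start=1):
        if read_attempt(offset, ecc):
            return "Soft", number
    return "Hard", len(plan)

# Hypothetical fault: the data is readable only at an offset of -400.
print(classify(lambda offset, ecc: offset == -400, "RP06"))
```

In this example the read first succeeds on retry 23, so the error would
be classified as Soft. A read that never succeeds exhausts all 31
retries and is classified as Hard.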
@@dialog_change
The following dialog changes must be made to all SPEAR version 1.x
command and control files in order for them to operate under SPEAR
version 2.0. Although the examples use TOPS-20 style commands, the
changes apply to the TOPS-10 and VMS versions of SPEAR version 2.0
as well.
Retrieve:
The only changes in the Retrieve dialog from version 1.x to version
2.0 are in the "Selection type" "Error" and "NonError" areas.
However, there are no changes for a "Selection type" of "Error" "All".
@@dialog_change_a
The following example illustrates changes in Retrieve
"Selection type" "Error".
SPEAR v1.x SPEAR V2.0 Comments
__________ __________ ________
*Error *Error Selection type
*Disk *Disk Device category
*RP06 *RP06 Specific device(s)
*All Device error type
*Finished *Finished End device selection
To retrieve the events for a specific device error type, replace
"*All" in the version 2.0 dialog above with one or more device
error types. For example, *Software, Bus, Channel-controller.
@@dialog_change_b
The following example illustrates changes in Retrieve
"Selection type" "NonError".
SPEAR v1.x SPEAR v2.0 Comments
__________ __________ ________
*NonError *Stat, Diag, Config, Other Device Category
*All Device Selection
To retrieve the events for a specific device or class of device,
replace the "*All" in the version 2.0 dialog above with one of
the following command sequences:
*Disk *Disk Device category
*All *RA60, RA80, RA81 Specific device(s)
*Finished *Finished End device selection
@@dialog_change_c
The same functionality in Summarize may be maintained by changing the
version 1.x dialog to the version 2.0 dialog below.
SPEAR v1.x SPEAR v2.0 Comments
__________ ___________ ________
@SPEAR @SPEAR Run SPEAR
*Summarize *Summarize Invoke Summarize
*SERR:ERROR.SYS *SERR:ERROR.SYS Event file
*All Device category
*Earliest *Earliest Time from
*Latest *Latest Time to
*Yes Error distribution
*DSK:SUMMAR.RPT *DSK:SUMMAR.RPT Report to
*/Go */Go Start processing
@@dialog_change_d
To get summaries for a specific device or class of device,
replace the "*All" in the version 2.0 dialog above with
one of the following command sequences:
*Disk *Disk Device category
*All *RA60, RA80, RA81 Specific device(s)
*Finished *Finished End device selection
To suppress the Error Distributions, change the "*Yes" to
"*No" in the version 2.0 dialog above.
@@dialog_change_e
There are no dialog changes in Compute.
@@