PDP-10 Archive: decus/20-0148/spell.rnd from decus_20tap5

Trailing-Edge - PDP-10 Archives - decus_20tap5_198111 - decus/20-0148/spell.rnd
There are 4 other files named spell.rnd in the archive. Click here to see a list.
.C;SPELL: Spelling Check and Correction Program
.hl1 Introduction
SPELL is a program designed to read text files and check them for
correctness of spelling. In addition to the spelling check, the
program provides a means for correcting words that it thinks are
misspelled. This program was written by Ralph E. Gorin of Stanford
University Artificial Intelligence Laboratory. It has been augmented
by William Plummer and Jerry Wolf of BBN and Marshall Abrams of NBS. 
.S
In its normal mode of usage, SPELL reads through an input text file,
asks the user about each word it does not recognize, and creates an
output file in which corrections have been made. Provisions exist
for:
.ls.le
Loading, incrementally augmenting, and dumping special
dictionaries. Such dictionary files are ordinary text
files which may be listed and edited. Arbitrarily
many dictionaries may be loaded, subject only to
availability of (virtual) main memory. 
.le
Training modes where SPELL scans an input file and
makes a list of all words it does not recognize. Such
a list can be used as an auxiliary dictionary.
.le
Termination of spelling checking part way through a
file and a way of picking up where you left off in a
later session.
.S.i-9
Other features of the program are:
.le
SPELL will read either SOS, TECO or E/TV files. The
corrected output will be written in the same mode,
except E/TV directories must be deleted. Dictionary
files may be SOS, TECO or E/TV format.
.le
An exception file may also produced. This file
contains all words SPELL did not recognize (and their
contexts), all corrections, plus all words which SPELL
recognized by stripping off prefixes and/or suffixes.
This last class of words appears marked with "[" or
"]" to denote prefix- or suffix- removal. E.g., 
FOOING]
[PREFOOED]
The affix-stripping algorithms are not foolproof
(e.g., CHOSES] ), so this gives a quick way to scan
for the exceptions which may slip through.
.le
SPELL keeps track of all word spellings which were
corrected by the user. Subsequent occurrences of such
are automatically corrected by the program. This is
reported to the user by a typeout of "MISSPELLING ==>
CORRECTION".
.le
When a word is corrected, the output file will be
rewritten with either upper case, lower case, or mixed
(first letter upper, the remainder in lower),
depending on the cases of the first two letters in the
original word. Note: this will be incorrect in some
cases (e.g., McCarthy).
.F.S

.els.hl1 Using SPELL
.hl2 Starting SPELL
Type the command R SPELL (under TENEX, type "SPELL" to the EXEC). 
All typeins to SPELL must be terminated by carriage return. (In the
TENEX version, editing by _^A, _^Q, and _^R may be done.)
(TOPS-10 accepts _^U, DEL, and, depending on MONGEN parameters,
BS.)
.hl2 Augmenting the built-in Dictionary
The non-TENEX version will first attempt to load the default private
dictionary file WORDS.DIC into dictionary 1 with incremental insertion set
(see Incremental Insertion below).
.s
Next, SPELL will ask: "Do you want to augment the dictionary?" If you
wish to use only the main dictionary (plus WORDS.DIC) 
presently in memory, type <cr>. You can then skip the next paragraph. 
.hl3 Private auxiliary dictionary
If in fact you have an additional auxiliary dictionary (of specific terms,
infrequently used words, etc.) you wish to use, type "Y" <cr>. You
will then be asked for the name of the dictionary file.
.hl4 Incremental insertions
After typing the file name you will be
given the option of marking the new entries as incremental
insertions. If the new entries are marked as incremental then they
will be included in an incremental dump of the dictionary. To have
the new entries marked as incremental, type "I" <cr>; otherwise, type
<cr>. (If any of the words in your auxiliary dictionary are already
in the main dictionary then no second copy of the word will be made.
Hence, if your words are marked as incremental then in a subsequent
incremental dump, any words that were already in the dictionary will
not be dumped.)
.hl4 Additional dictionaries
After loading an auxiliary dictionary the program
will type the new total number of words in the dictionary (and,
except under TENEX, the amount of core used). You will then have an
opportunity to save the new core image (normally you won't do this).
You will again be asked, "Do you want to augment the dictionary?",
thus allowing you to enter a number of auxiliary dictionaries
(limited only by the availability of {virtual} core). 
.hl4 Dictionary format
The format
of the file is one dictionary entry (word) on a line; words must be
composed of alphabetic characters or apostrophe and less than 40
letters long. The dictionary entries need not be in alphabetical
order. A misspelling-correction pair may occur one line in the form
"MISSPELLING>CORRECTION".
.hl2 Switches
You will then be given an opportunity to specify zero or more switch
options. The meanings of the switches are:
.s.lm10
.i-5
T####Training mode. SPELL will treat the input file as a
training set rather than a file to be corrected. All
words in the file which are unfamiliar to SPELL will be
entered in the dictionary as incremental insertions.
After SPELL finishes reading the file, the user has an
opportunity to dump all the words that were inserted in
this manner. The resulting list of words may be
edited, and any words which are incorrect may be
deleted. Then this file can be used as an auxiliary
dictionary while correcting the original source file.
.S
This feature is provided for the purpose of easing the
problem of creating a specialized dictionary of jargon
and infrequently used words.
.S
.i-5
Q####Q-Training mode. In this mode, all words in the source
file that are unfamiliar to SPELL will be added to the
dictionary; the difference is, if any "new" word is
"close to" some old word, the new word will be output
to the exception file. The exception file will contain
only such words. In this way, the spelling checker
calls to your attention the fact that these words may
be misspelled.
.S
.i-5
N####No suffix removal. This switch suppresses the attempt
to remove suffixes to recognize a correctly spelled
root word. SPELL will then find many more questionable
words, but it will work more correctly than the
heuristic affix removal. 
.S
.i-5
A####No prefix removal. This switch suppresses the attempt
to remove prefixes to recognize a correctly spelled
root word.
.S
.i-5
U####Accept Upper case mode. In this mode, all words that
are written entirely in upper case will be inserted in
the dictionary. This is useful when a manuscript file
contains jargon terms that are written in upper case,
and text-processor (e.g., PUB, TJ6, or RUNOFF) commands
in upper case. Before reading the input file, SPELL
will ask for a dictionary number to use for all words
inserted this way (see "How to Use Multiple
Dictionaries"). 
.S
.i-5
P####Pickup mode. After specifying input and output file
names, you will be asked to specify a page and line
number for pickup. The effect is to suspend spelling
checking until the page and line specified. When a
user has a partially corrected file, this mode will
enable him to skip over the portion of the file that
has already been corrected. The input file will be
copied without checking to the output until the page
and line specified, at which point spelling checking
begins.
.lm0.hl2 File to be corrected
Next you will be asked for the name of the file that you want
to check for spelling errors. File names are specified in the usual
format of "name.ext[prj,prg]" where name is the filename, ext is the
file extension, and [prj,prg] is the name of the file owner, which
may be omitted if the file is on the present user's disk area. If you
omit the file name then you will immediately enter the exit sequence
(see below).
.S
.hl2 Output file
You will be next asked to name the output file. Enter a file
name, or just a <cr> if you wish to use the default (see next
paragraph).
.hl3 Default output file(s)
In the TENEX version there are always output and
exception files; by default the output file will have the same name
and extension as the input, and the exception file will have the
extension "EXCEPTIONS". In non-TENEX versions the default output
is to rename the input file with extension .BAK and to place the
corrected text in the (previous) input file name.
 The exception file may be omitted. 
.hl3 Exception file
The exception file, should you chose to make one, will contain each
line on which an error was found, the indication of the page and line
number, and the suspect word. Words accepted via the affix removal
heuristics will also appear in the exception file. 
.hl2 Checking and Correcting
After you have specified all the files, the program will respond with
"Working..." and start checking the input file for spelling errors. 
.hl3 Choices when an unknown word is found
When the spelling checker encounters a word that isn't in the
dictionary, it will type the page and line number, the line in which
the word occurs, and the word itself.
.S
In general, when a word is found that is not in the dictionary, a
brief message will be typed to remind you of the possible choices. In the
special case where the program finds precisely one possible
correction for the word, you will be given the choice of typing C to accept the
"guess" or any other option. The options are:
.s.lm10.i-5
C####The "guess" is correct. Correct the word in the text. Enter misspelling
in dictionary so that it will be corrected if it appears again.
.s.i-5
A####Accept this word, this one time. 
.S
.i-5
I####Accept this word and insert it in the dictionary so
that subsequent occurrences of this word will be
recognized and accepted. Words that are inserted this
way are marked as incremental insertions and they may
be dumped to form an auxiliary dictionary.
.S
.i-5
R####Replace this word. Type "R" <cr> and the program will
ask you for the replacement word. If the replacement
word is not already in the dictionary, the program will
give you an opportunity to insert it.
.S
If the misspelling is due to an omitted space between
words, use the "R" command to retype the words with the
space.
.S
If the replacement word contains no spaces or illegal
characters, you will be asked if you wish to add this
replacement to the list of misspelling-correction
pairs. If you do add the replaced word as a misspelling
then all subsequent occurrences of that word will be
replaced with the replacement string. 
.S
.i-5
X####Accept this word and finish. The word will be
accepted. Then the remainder of the input file will be
copied without checking to the output file.
.S
.i-5
W####Save my incremental insertions. After you type
"W" <cr> you will be asked for a file name. Then an
incremental dump of the dictionary will be written into
the file. After the dump is complete you may then
decide what to do with the excepted word.
.S
.i-5
L####Load an auxiliary dictionary. The present word is
accepted and you will be asked for the name of the
dictionary file to load. This is useful if you
encounter a jargon term but forgot to load the
appropriate dictionary. 
.S
.i-5
D####Display the line and offending word again. The line
that is displayed will not have any corrections shown
in it. If a line has more than one error the line will
only be typed once. Subsequent errors on that line
will cause only the particular word to be typed, unless
this command is used.
.S
.i-5
S####If this choice is offered then the spelling checker has
discovered several words that could be possible
corrections of this word. If you type "S" <cr> then
you will enter a mode where you can look at the words
that were found by the program and (optionally) select
one of the words from the list.
.S
When you enter this selection submode, the first word
in the list of possible corrections will be typed
followed by an asterisk. Then you have the following
choices:
.S.lm17.i-7
C<cr>##Use this word as the Correction.
.S.i-7
<cr>###Show the next possible choice. When you
exhaust the choices you are returned to the
outer mode, and asked again.
.S.i-7
_^<cr>##Back up in the list.
.S.i-7
<esc>##Escape from this submode and return to the
outer command mode.
.lm0.F.S
Note that when you make a correction via the C command or by
selection from the list presented by the S command, that correction
is entered in the misspelling-correction list and subsequent
occurrences of the same misspelling will be corrected automatically. 
.hl2 Finished processing the file
When the input file is exhausted, all files are closed, the program
types "Finished.", and the exit sequence is entered.
Except on TENEX, you will be asked if you want the default dictionary
WORDS.DIC dumped. Answer "yes" or "no" (actually only a single letter
answer is required).  The user then
has several options:
.lm5.S
.i-5
E####Exit now.
.S
.i-5
S####Save this core image.
.S
.i-5
C####Go back and correct another file.
.S
.i-5
A####Augment the dictionary, set new switches, and correct
another file.
.S
.i-5
D####Complete dump of the dictionary. This will create a
very large file, and it is not usually recommended.
.S
.i-5
I####Incremental dump of the dictionary. All the words that
were read in from a private dictionary and marked with
an I plus those inserted while running the program are dumped to a
file. The user specifies a file name (the default is
WORDS.DIC). This incremental file is in a format
suitable for editing or for use as an auxiliary
dictionary. The words in this file are in alphabetical
order. 
.S
.i-5
X####This command is used to get a trace count of the
program. It is for diagnostic purposes only, and is
displayed as a possible choice only if the program has
been assembled as a debugging program.
.F.S

.lm0.hl1 How to Use Multiple Dictionaries
.S
SPELL has a set of features whereby the user can cause the creation
of several disjoint incremental dictionaries. In this way, the user
may collect several dictionaries of special terms. Internally, all
dictionary entries are considered equivalent as regards word
searches. The distinction between dictionaries becomes relevant when
doing incremental dumps (the I command during the exit sequence or
the W command while in the middle of execution). When an incremental
dump is requested, the user may specify a number, e.g., W9, which
selects the particular incremental dictionary to be dumped. In this
example, dictionary 9 will be dumped. 
.hl2 Dictionaries 0 and 1
Dictionary 0 is the main dictionary. Words cannot be added to this
dictionary, except by reading an auxiliary file. In general, words
that are inserted incrementally are marked as being in dictionary 1.
All words that are incremental insertions in the dictionary will be
marked in dictionary 1, unless the user specifies otherwise. 
.hl2 Specifying a dictionary
The following places are where the user may specify which dictionary
to add to:
.ls.le
When loading an auxiliary dictionary, if the user
responds with "In" to the question about marking new
entries as incremental, then the new entries will be
marked in dictionary number n (where n is
interpreted as decimal and should be less than 31).
.le
After a word has been rejected, type "In" to insert
the word in dictionary number n.
.le
After replacing a word, if the replacement is not in
the dictionary, then type "In" to insert the
replacement into dictionary n.
.els.s
When requesting an incremental dump, the user may specify the
particular dictionary to dump. This is allowed in two cases:
.ls.le
After some word has been rejected, the command "Wn"
will cause dictionary number n to be dumped.
.le
During the exit sequence, the command "In" will
cause dictionary number n to be dumped.
.els.F.S
In all five cases above, if n is either 0 or omitted, then it will be
taken as being 1. 
.S
.hl2 Caution!
There is no provision in SPELL for remembering which
dictionary numbers have been used. Therefore, it remains the
individual user's responsibility to remember the numbers of all the
dictionaries that he creates. (Forgetting the number will mean that
the forgotten dictionary can not be dumped incrementally. The words
in a forgotten dictionary will still be available, but the only way
to actually get them dumped out is to dump the entire dictionary). 
.S
.hl2 Hint:
In the course of correcting a file, it is likely that you will
be asked about words which you wish to have accepted during this
file, but which you don't wish to have saved in your incremental
dictionary(s). In these cases, simply insert them in a "throwaway"
incremental dictionary which you don't bother to dump when you're
finished. 
.S
.hl1 Abnormal Conditions
.S
While the program is running it is possible that certain abnormal
conditions may obtain. The usual response of the program is to type
some sort of error message. The following is a list of the error
messages in SPELL, with an indication of the severity of the error. 
.s.lm10.i-5

Illegal dictionary entry: <word>
.br
This error occurs if an entry in a dictionary file
exceeds 40 (decimal) characters. The word is
ignored.
.i-5

0 LENGTH WORD AT HASHCP
.br
Somebody just asked to compute the hash address of
an empty word. The program continues, but there
is a possibility of error.
.i-5

HASHING ERROR
.br
Somebody asked for the hash address of a word that
doesn't begin with letters or apostrophe as the
first two characters. This is a fatal error; the
program halts.
.i-5

DEVICE DATA ERROR (OUTPUT)
.br
This message means that while writing a file,
something screwed up. The program halts.
.i-5

DEVICE ERROR (INPUT)
.br
The input file is screwed up in some way. The
program halts.
.i-5

Internal confusion in the spelling checker.
.i-5
Called from location <loc>.
.br
The spelling checker
has discovered a (possible) bug in itself. The
program halts, but the user may type CONTINUE.
Please note the location mentioned and the
circumstances that evoked the message.
.i-5

Dictionary number too large, Maximum is 30.
.br
This message means that the user attempted to
select for insertion or dumping a dictionary
beyond the range of allowed numbers. The user
will get another chance to do the right thing.
.i-5

Unrecognized switch.
.br
The user asked for an unknown switch. He must
repeat the entire switch specification again. 
.s.i-10
The following messages occur only in the non-TENEX version:
.s.i-5

Illegal Character in Scan.
.br
This is a message from the routine that reads file
names. You will be asked to try retyping the name.
.i-5

File not Found. <filename>
.br
The indicated file could not be found. The user
gets to specify some other file. 
.i-5

Enter failed on: <file name>
.br
An enter uuo failed while trying to select the
indicated file for output. The user may specify
another name. 
.i-5

Open Failed on Device <dev>:
.br
You asked for a device that doesn't exist or isn't
available to the program. You will get a chance to
ask for something else. 
.i-5

Insufficient Core Available.
.br
SPELL requires more core while expanding the
dictionary but none is available. The program
exits. 
.F.S
.lm0.hl1 Internal Workings
.hl2 Data Structures - Hashing Function
.S
The data structure is the heart of the program, and any efficiency in
the program operation is due primarily to this choice of data
structure. The data structure is basically a hash coding scheme
where dictionary entries are accessed by both their alphabetic order
and by their length. There is a base table that contains 26 * 26 *
10 halfwords; this table gives anchors for 6760 chains. Each chain
contains exactly all words with the same two first letters and some
given length. To be precise, the hashing function is:
(L1*26+L2)*10+min(WL-2,9),
where L1 and L2 are numeric representations of the first and second
letters (A=0, B=1, ... Z=25, and apostrophe also is 25), and WL is
the length of the word in characters.
.S
This scheme was chosen since it provides both an efficient
way to probe the dictionary and a quick way to select a small subset
of all words that are close to a given input word.
.S
.hl2 Data Structures - Dictionary Entry Format
.S
Entries are added to the appropriate hash chain by the INSERT
subroutine. Entries are added to the head of the chain, saving the
time and effort of searching to the end of the chain. This scheme
means that the last item entered on a chain is the first item seen by
a search. The format of the entry is given by:
.NF.S
Word 0:	xwd flags,nextlk
Word 1: 5 bit representation
Word 2: 5 bit representation
	...
.F.S
There are precisely 1+ceiling(WL/7) machine words used for each
dictionary entry. WL is the length of the entry in characters.
Nextlk is the pointer to the next entry in the list, or zero if this
is the last in the chain. The left side contains flags; bits 13-17
specify the incremental dictionary number (0 for main, 1-30 for
incremental dictionaries, and 31 for misspellings). One can imagine
that bits 5-12 could be used to store semantic information about the
entry. The unused bytes in the last word of an entry must be zero,
since they are used to stop the routine that converts the five bit to
7 bit. 
.S
Misspellings are entered in the main dictionary with the incremental
dictionary number set to decimal 31. If a word is marked as a
misspelling then the word preceding the flags and link contains a
pointer to the flag and link word of the correction. Misspellings
may be deleted before the full dictionary is dumped, or whenever
SPELL asks about saving the core image. The space obtained by
deleting misspellings is not reutilized at present, although that
feature could be added without great difficulty. 
.hl2 Spelling Correction Heuristics
.S
There are four kinds of errors that the program attempts to correct:
.S.LM5.NF
1. one wrong letter.
2. one missing letter.
3. one extra letter.
4. two transposed letters.
.LM0.F.S
For a wrong letter in the third or subsequent character, all words
that are candidates must exist on the same chain that the suspect
word hashes to. Hence, each entry on that chain is inspected to
determine if the suspect differs from the entry by exactly one
character. This is accomplished by an exclusive-or (XOR) between the
suspect and the dictionary. Then a JFFO instruction selects the
first non zero byte in the XOR. This byte is zeroed and if the
result is all zero then the dictionary word differs from the suspect
in only one letter. All such words are listed at CANDBF, where they
can be inspected later. 
.S
For a wrong letter in the first or second character, the program
tries varying the second letter through all 26 possible values,
searching for an exact match. Then all 26 possible values of the
first letter are tried, after setting the second letter to its
original value. This means that 52 more chains are searched for
possible matches. 
.S
To correct transposed letters, all combinations of transposed letters
are tried. There are only WL-1 such combinations, so it is fairly
cheap to do that. 
.S
To correct one extra letter, wl copies of the word are made, each
with some letter removed. Each of these is looked up in the
dictionary. This takes WL searches. 
.S
To correct one missing letter, WL+1 copies of the word are made, each
time inserting a null character in a new position in the suspect.
The null character is never part of any word, so the suspect word
augmented by an embedded null can be thought of as a word with one
wrong letter (the null) then the algorithm for matching one wrong
letter is used. If the first character is omitted, all 26 possible
first characters are tried. Also, 26 more words are formed by
varying the second character in case that had been omitted. 
.S

.hl1 Assembly _& Loading Instructions
.S
There are three assembly time switches, TENEX, STANSW and SANSW. If
STANSW is set then there are SIXBIT ppn's and the SWAP UUO; if STANSW
is zero, then normally there are octal ppn's except if SANSW is set,
in which case, there are decimal ppn's. If the TENEX switch is set
it overrides the others and TENEX style file names are used. Compile
the program using MACRO or FAIL and load it.
The non-TENEX versions will produce a sharable high segment and a
non-sharable low segment. When you start the
program the first time after loading, it will demand a dictionary.
This dictionary will be read in as dictionary zero, which is the
built-in dictionary for future use. In the non-TENEX versions dictionary
zero is read into the sharable high segment. Use the largest dictionary
you have which meets your storage limitations. The file SPELLD.ALL is 
recommended.
.S
When dictionary zero is loaded, SPELL will ask "Do you wish to save
this core image?" Answer "Yes" and save the resulting core image.
On non-TENEX systems, do an SSAVE to produce a sharable high segment containing
dictionary zero.
.S
There are various other assembly switches, but they default to
reasonable settings, and you meddle with them at your peril. 

.hl1 Possible Expansions
.S
No program that is still in use is complete. The following
paragraphs include suggestions of possible future work in this area. 
.s
For non-paged systems, the main dictionary (0) should be in the
sharable segment and the private additions in the non-sharable segment. This
would require adjusting the hash chains to point across the gap.
.S
The dictionary should be expanded to include all suffixes for every
word. There is a feature that strips suffixes for the purpose of
finding the stem of the word in the dictionary, but this heuristic is
error prone and incompatible with later attempts to correct the word. 
.S
If semantic information were included in the dictionary, it could
help guide the selection of a correction. 
.S
Either of these would require a major restructuring of the program,
since the dictionary would no longer fit in core. Probably, the
dictionary should be kept on the disk, with a data structure similar
to the one used in core, but arranged to keep each hash chain in the
minimum number of disk pages, so that searches through the dictionary
can be made efficient. 
.S