Trailing-Edge
-
PDP-10 Archives
-
decuslib20-05
-
decus/20-0137/spell/spell.doc
There are 6 other files named spell.doc in the archive. Click here to see a list.
SPELL: Spelling Check and Correction Program
1.0 INTRODUCTION
SPELL is a program designed to read text files and check
them for correctness of spelling. In addition to the
spelling check, the program provides a means for correcting
words that it thinks are misspelled. This program was
written by Ralph E. Gorin of Stanford University Artificial
Intelligence Laboratory. It has been augmented by William
Plummer and Jerry Wolf of BBN and Marshall Abrams of NBS.
In its normal mode of usage, SPELL reads through an input
text file, asks the user about each word it does not
recognize, and creates an output file in which corrections
have been made. Provisions exist for:
1. Loading, incrementally augmenting, and dumping
special dictionaries. Such dictionary files are
ordinary text files which may be listed and edited.
Arbitrarily many dictionaries may be loaded,
subject only to availability of (virtual) main
memory.
2. Training modes where SPELL scans an input file and
makes a list of all words it does not recognize.
Such a list can be used as an auxiliary dictionary.
3. Termination of spelling checking part way through a
file and a way of picking up where you left off in
a later session.
Other features of the program are:
4. SPELL will read either SOS, TECO or E/TV files.
The corrected output will be written in the same
mode, except E/TV directories must be deleted.
Dictionary files may be SOS, TECO or E/TV format.
5. An exception file may also produced. This file
contains all words SPELL did not recognize (and
their contexts), all corrections, plus all words
which SPELL recognized by stripping off prefixes
and/or suffixes. This last class of words appears
marked with "[" or "]" to denote prefix- or suffix-
removal. E.g., FOOING] [PREFOOED] The
affix-stripping algorithms are not foolproof (e.g.,
CHOSES] ), so this gives a quick way to scan for
the exceptions which may slip through.
6. SPELL keeps track of all word spellings which were
corrected by the user. Subsequent occurrences of
such are automatically corrected by the program.
Page 2
This is reported to the user by a typeout of
"MISSPELLING ==> CORRECTION".
7. When a word is corrected, the output file will be
rewritten with either upper case, lower case, or
mixed (first letter upper, the remainder in lower),
depending on the cases of the first two letters in
the original word. Note: this will be incorrect
in some cases (e.g., McCarthy).
2.0 USING SPELL
2.1 Starting SPELL
Type the command R SPELL (under TENEX, type "SPELL" to the
EXEC). All typeins to SPELL must be terminated by carriage
return. (In the TENEX version, editing by ^A, ^Q, and ^R
may be done.) (TOPS-10 accepts ^U, DEL, and, depending on
MONGEN parameters, BS.)
2.2 Augmenting The Built-in Dictionary
The non-TENEX version will first attempt to load the default
private dictionary file WORDS.DIC into dictionary 1 with
incremental insertion set (see Incremental Insertion below).
Next, SPELL will ask: "Do you want to augment the
dictionary?" If you wish to use only the main dictionary
(plus WORDS.DIC) presently in memory, type <cr>. You can
then skip the next paragraph.
2.2.1 Private Auxiliary Dictionary - If in fact you have an
additional auxiliary dictionary (of specific terms,
infrequently used words, etc.) you wish to use, type "Y"
<cr>. You will then be asked for the name of the dictionary
file.
2.2.1.1 Incremental Insertions - After typing the file name
you will be given the option of marking the new entries as
incremental insertions. If the new entries are marked as
incremental then they will be included in an incremental
dump of the dictionary. To have the new entries marked as
incremental, type "I" <cr>; otherwise, type <cr>. (If any
of the words in your auxiliary dictionary are already in the
main dictionary then no second copy of the word will be
Page 3
made. Hence, if your words are marked as incremental then
in a subsequent incremental dump, any words that were
already in the dictionary will not be dumped.)
2.2.1.2 Additional Dictionaries - After loading an
auxiliary dictionary the program will type the new total
number of words in the dictionary (and, except under TENEX,
the amount of core used). You will then have an opportunity
to save the new core image (normally you won't do this).
You will again be asked, "Do you want to augment the
dictionary?", thus allowing you to enter a number of
auxiliary dictionaries (limited only by the availability of
{virtual} core).
2.2.1.3 Dictionary Format - The format of the file is one
dictionary entry (word) on a line; words must be composed
of alphabetic characters or apostrophe and less than 40
letters long. The dictionary entries need not be in
alphabetical order. A misspelling-correction pair may occur
one line in the form "MISSPELLING>CORRECTION".
2.3 Switches
You will then be given an opportunity to specify zero or
more switch options. The meanings of the switches are:
T Training mode. SPELL will treat the input file as
a training set rather than a file to be corrected.
All words in the file which are unfamiliar to
SPELL will be entered in the dictionary as
incremental insertions. After SPELL finishes
reading the file, the user has an opportunity to
dump all the words that were inserted in this
manner. The resulting list of words may be
edited, and any words which are incorrect may be
deleted. Then this file can be used as an
auxiliary dictionary while correcting the original
source file.
This feature is provided for the purpose of easing
the problem of creating a specialized dictionary
of jargon and infrequently used words.
Q Q-Training mode. In this mode, all words in the
source file that are unfamiliar to SPELL will be
added to the dictionary; the difference is, if
any "new" word is "close to" some old word, the
new word will be output to the exception file.
The exception file will contain only such words.
Page 4
In this way, the spelling checker calls to your
attention the fact that these words may be
misspelled.
N No suffix removal. This switch suppresses the
attempt to remove suffixes to recognize a
correctly spelled root word. SPELL will then find
many more questionable words, but it will work
more correctly than the heuristic affix removal.
A No prefix removal. This switch suppresses the
attempt to remove prefixes to recognize a
correctly spelled root word.
U Accept Upper case mode. In this mode, all words
that are written entirely in upper case will be
inserted in the dictionary. This is useful when a
manuscript file contains jargon terms that are
written in upper case, and text-processor (e.g.,
PUB, TJ6, or RUNOFF) commands in upper case.
Before reading the input file, SPELL will ask for
a dictionary number to use for all words inserted
this way (see "How to Use Multiple Dictionaries").
P Pickup mode. After specifying input and output
file names, you will be asked to specify a page
and line number for pickup. The effect is to
suspend spelling checking until the page and line
specified. When a user has a partially corrected
file, this mode will enable him to skip over the
portion of the file that has already been
corrected. The input file will be copied without
checking to the output until the page and line
specified, at which point spelling checking
begins.
2.4 File To Be Corrected
Next you will be asked for the name of the file that you
want to check for spelling errors. File names are specified
in the usual format of "name.ext[prj,prg]" where name is the
filename, ext is the file extension, and [prj,prg] is the
name of the file owner, which may be omitted if the file is
on the present user's disk area. If you omit the file name
then you will immediately enter the exit sequence (see
below).
Page 5
2.5 Output File
You will be next asked to name the output file. Enter a
file name, or just a <cr> if you wish to use the default
(see next paragraph).
2.5.1 Default Output File(s) - In the TENEX version there
are always output and exception files; by default the
output file will have the same name and extension as the
input, and the exception file will have the extension
"EXCEPTIONS". In non-TENEX versions the default output is
to rename the input file with extension .BAK and to place
the corrected text in the (previous) input file name. The
exception file may be omitted.
2.5.2 Exception File - The exception file, should you chose
to make one, will contain each line on which an error was
found, the indication of the page and line number, and the
suspect word. Words accepted via the affix removal
heuristics will also appear in the exception file.
2.6 Checking And Correcting
After you have specified all the files, the program will
respond with "Working..." and start checking the input file
for spelling errors.
2.6.1 Choices When An Unknown Word Is Found - When the
spelling checker encounters a word that isn't in the
dictionary, it will type the page and line number, the line
in which the word occurs, and the word itself.
In general, when a word is found that is not in the
dictionary, a brief message will be typed to remind you of
the possible choices. In the special case where the program
finds precisely one possible correction for the word, you
will be given the choice of typing C to accept the "guess"
or any other option. The options are:
C The "guess" is correct. Correct the word in the
text. Enter misspelling in dictionary so that it
will be corrected if it appears again.
A Accept this word, this one time.
I Accept this word and insert it in the dictionary
so that subsequent occurrences of this word will
Page 6
be recognized and accepted. Words that are
inserted this way are marked as incremental
insertions and they may be dumped to form an
auxiliary dictionary.
R Replace this word. Type "R" <cr> and the program
will ask you for the replacement word. If the
replacement word is not already in the dictionary,
the program will give you an opportunity to insert
it.
If the misspelling is due to an omitted space
between words, use the "R" command to retype the
words with the space.
If the replacement word contains no spaces or
illegal characters, you will be asked if you wish
to add this replacement to the list of
misspelling-correction pairs. If you do add the
replaced word as a misspelling then all subsequent
occurrences of that word will be replaced with the
replacement string.
X Accept this word and finish. The word will be
accepted. Then the remainder of the input file
will be copied without checking to the output
file.
W Save my incremental insertions. After you type
"W" <cr> you will be asked for a file name. Then
an incremental dump of the dictionary will be
written into the file. After the dump is complete
you may then decide what to do with the excepted
word.
L Load an auxiliary dictionary. The present word is
accepted and you will be asked for the name of the
dictionary file to load. This is useful if you
encounter a jargon term but forgot to load the
appropriate dictionary.
D Display the line and offending word again. The
line that is displayed will not have any
corrections shown in it. If a line has more than
one error the line will only be typed once.
Subsequent errors on that line will cause only the
particular word to be typed, unless this command
is used.
S If this choice is offered then the spelling
checker has discovered several words that could be
possible corrections of this word. If you type
"S" <cr> then you will enter a mode where you can
look at the words that were found by the program
and (optionally) select one of the words from the
Page 7
list.
When you enter this selection submode, the first
word in the list of possible corrections will be
typed followed by an asterisk. Then you have the
following choices:
C<cr> Use this word as the Correction.
<cr> Show the next possible choice. When you
exhaust the choices you are returned to the
outer mode, and asked again.
^<cr> Back up in the list.
<esc> Escape from this submode and return to the
outer command mode.
Note that when you make a correction via the C command or by
selection from the list presented by the S command, that
correction is entered in the misspelling-correction list and
subsequent occurrences of the same misspelling will be
corrected automatically.
2.7 Finished Processing The File
When the input file is exhausted, all files are closed, the
program types "Finished.", and the exit sequence is entered.
Except on TENEX, you will be asked if you want the default
dictionary WORDS.DIC dumped. Answer "yes" or "no" (actually
only a single letter answer is required). The user then has
several options:
E Exit now.
S Save this core image.
C Go back and correct another file.
A Augment the dictionary, set new switches, and correct
another file.
D Complete dump of the dictionary. This will create a
very large file, and it is not usually recommended.
I Incremental dump of the dictionary. All the words that
were read in from a private dictionary and marked with
an I plus those inserted while running the program are
dumped to a file. The user specifies a file name (the
default is WORDS.DIC). This incremental file is in a
format suitable for editing or for use as an auxiliary
dictionary. The words in this file are in alphabetical
order.
Page 8
X This command is used to get a trace count of the
program. It is for diagnostic purposes only, and is
displayed as a possible choice only if the program has
been assembled as a debugging program.
3.0 HOW TO USE MULTIPLE DICTIONARIES
SPELL has a set of features whereby the user can cause the
creation of several disjoint incremental dictionaries. In
this way, the user may collect several dictionaries of
special terms. Internally, all dictionary entries are
considered equivalent as regards word searches. The
distinction between dictionaries becomes relevant when doing
incremental dumps (the I command during the exit sequence or
the W command while in the middle of execution). When an
incremental dump is requested, the user may specify a
number, e.g., W9, which selects the particular incremental
dictionary to be dumped. In this example, dictionary 9 will
be dumped.
3.1 Dictionaries 0 And 1
Dictionary 0 is the main dictionary. Words cannot be added
to this dictionary, except by reading an auxiliary file. In
general, words that are inserted incrementally are marked as
being in dictionary 1. All words that are incremental
insertions in the dictionary will be marked in dictionary 1,
unless the user specifies otherwise.
3.2 Specifying A Dictionary
The following places are where the user may specify which
dictionary to add to:
1. When loading an auxiliary dictionary, if the user
responds with "In" to the question about marking
new entries as incremental, then the new entries
will be marked in dictionary number n (where n is
interpreted as decimal and should be less than 31).
2. After a word has been rejected, type "In" to insert
the word in dictionary number n.
3. After replacing a word, if the replacement is not
in the dictionary, then type "In" to insert the
replacement into dictionary n.
Page 9
When requesting an incremental dump, the user may specify
the particular dictionary to dump. This is allowed in two
cases:
1. After some word has been rejected, the command "Wn"
will cause dictionary number n to be dumped.
2. During the exit sequence, the command "In" will
cause dictionary number n to be dumped.
In all five cases above, if n is either 0 or omitted, then
it will be taken as being 1.
3.3 Caution!
There is no provision in SPELL for remembering which
dictionary numbers have been used. Therefore, it remains
the individual user's responsibility to remember the numbers
of all the dictionaries that he creates. (Forgetting the
number will mean that the forgotten dictionary can not be
dumped incrementally. The words in a forgotten dictionary
will still be available, but the only way to actually get
them dumped out is to dump the entire dictionary).
3.4 Hint:
In the course of correcting a file, it is likely that you
will be asked about words which you wish to have accepted
during this file, but which you don't wish to have saved in
your incremental dictionary(s). In these cases, simply
insert them in a "throwaway" incremental dictionary which
you don't bother to dump when you're finished.
4.0 ABNORMAL CONDITIONS
While the program is running it is possible that certain
abnormal conditions may obtain. The usual response of the
program is to type some sort of error message. The
following is a list of the error messages in SPELL, with an
indication of the severity of the error.
Illegal dictionary entry: <word>
This error occurs if an entry in a dictionary file
exceeds 40 (decimal) characters. The word is
Page 10
ignored.
0 LENGTH WORD AT HASHCP
Somebody just asked to compute the hash address of
an empty word. The program continues, but there
is a possibility of error.
HASHING ERROR
Somebody asked for the hash address of a word that
doesn't begin with letters or apostrophe as the
first two characters. This is a fatal error; the
program halts.
DEVICE DATA ERROR (OUTPUT)
This message means that while writing a file,
something screwed up. The program halts.
DEVICE ERROR (INPUT)
The input file is screwed up in some way. The
program halts.
Internal confusion in the spelling checker.
Called from location <loc>.
The spelling checker has discovered a (possible)
bug in itself. The program halts, but the user
may type CONTINUE. Please note the location
mentioned and the circumstances that evoked the
message.
Dictionary number too large, Maximum is 30.
This message means that the user attempted to
select for insertion or dumping a dictionary
beyond the range of allowed numbers. The user
will get another chance to do the right thing.
Unrecognized switch.
The user asked for an unknown switch. He must
repeat the entire switch specification again.
The following messages occur only in the non-TENEX version:
Illegal Character in Scan.
This is a message from the routine that reads file
names. You will be asked to try retyping the
name.
File not Found. <filename>
The indicated file could not be found. The user
gets to specify some other file.
Enter failed on: <file name>
An enter uuo failed while trying to select the
indicated file for output. The user may specify
another name.
Open Failed on Device <dev>:
You asked for a device that doesn't exist or isn't
available to the program. You will get a chance
to ask for something else.
Insufficient Core Available.
SPELL requires more core while expanding the
dictionary but none is available. The program
exits.
Page 11
5.0 INTERNAL WORKINGS
5.1 Data Structures - Hashing Function
The data structure is the heart of the program, and any
efficiency in the program operation is due primarily to this
choice of data structure. The data structure is basically a
hash coding scheme where dictionary entries are accessed by
both their alphabetic order and by their length. There is a
base table that contains 26 * 26 * 10 halfwords; this table
gives anchors for 6760 chains. Each chain contains exactly
all words with the same two first letters and some given
length. To be precise, the hashing function is:
(L1*26+L2)*10+min(WL-2,9), where L1 and L2 are numeric
representations of the first and second letters (A=0, B=1,
... Z=25, and apostrophe also is 25), and WL is the length
of the word in characters.
This scheme was chosen since it provides both an efficient
way to probe the dictionary and a quick way to select a
small subset of all words that are close to a given input
word.
5.2 Data Structures - Dictionary Entry Format
Entries are added to the appropriate hash chain by the
INSERT subroutine. Entries are added to the head of the
chain, saving the time and effort of searching to the end of
the chain. This scheme means that the last item entered on
a chain is the first item seen by a search. The format of
the entry is given by:
Word 0: xwd flags,nextlk
Word 1: 5 bit representation
Word 2: 5 bit representation
...
There are precisely 1+ceiling(WL/7) machine words used for
each dictionary entry. WL is the length of the entry in
characters. Nextlk is the pointer to the next entry in the
list, or zero if this is the last in the chain. The left
side contains flags; bits 13-17 specify the incremental
dictionary number (0 for main, 1-30 for incremental
dictionaries, and 31 for misspellings). One can imagine
that bits 5-12 could be used to store semantic information
about the entry. The unused bytes in the last word of an
entry must be zero, since they are used to stop the routine
that converts the five bit to 7 bit.
Misspellings are entered in the main dictionary with the
Page 12
incremental dictionary number set to decimal 31. If a word
is marked as a misspelling then the word preceding the flags
and link contains a pointer to the flag and link word of the
correction. Misspellings may be deleted before the full
dictionary is dumped, or whenever SPELL asks about saving
the core image. The space obtained by deleting misspellings
is not reutilized at present, although that feature could be
added without great difficulty.
5.3 Spelling Correction Heuristics
There are four kinds of errors that the program attempts to
correct:
1. one wrong letter.
2. one missing letter.
3. one extra letter.
4. two transposed letters.
For a wrong letter in the third or subsequent character, all
words that are candidates must exist on the same chain that
the suspect word hashes to. Hence, each entry on that chain
is inspected to determine if the suspect differs from the
entry by exactly one character. This is accomplished by an
exclusive-or (XOR) between the suspect and the dictionary.
Then a JFFO instruction selects the first non zero byte in
the XOR. This byte is zeroed and if the result is all zero
then the dictionary word differs from the suspect in only
one letter. All such words are listed at CANDBF, where they
can be inspected later.
For a wrong letter in the first or second character, the
program tries varying the second letter through all 26
possible values, searching for an exact match. Then all 26
possible values of the first letter are tried, after setting
the second letter to its original value. This means that 52
more chains are searched for possible matches.
To correct transposed letters, all combinations of
transposed letters are tried. There are only WL-1 such
combinations, so it is fairly cheap to do that.
To correct one extra letter, wl copies of the word are made,
each with some letter removed. Each of these is looked up
in the dictionary. This takes WL searches.
To correct one missing letter, WL+1 copies of the word are
made, each time inserting a null character in a new position
in the suspect. The null character is never part of any
word, so the suspect word augmented by an embedded null can
be thought of as a word with one wrong letter (the null)
then the algorithm for matching one wrong letter is used.
Page 13
If the first character is omitted, all 26 possible first
characters are tried. Also, 26 more words are formed by
varying the second character in case that had been omitted.
6.0 ASSEMBLY & LOADING INSTRUCTIONS
There are three assembly time switches, TENEX, STANSW and
SANSW. If STANSW is set then there are SIXBIT ppn's and the
SWAP UUO; if STANSW is zero, then normally there are octal
ppn's except if SANSW is set, in which case, there are
decimal ppn's. If the TENEX switch is set it overrides the
others and TENEX style file names are used. Compile the
program using MACRO or FAIL and load it. The non-TENEX
versions will produce a sharable high segment and a
non-sharable low segment. When you start the program the
first time after loading, it will demand a dictionary. This
dictionary will be read in as dictionary zero, which is the
built-in dictionary for future use. In the non-TENEX
versions dictionary zero is read into the sharable high
segment. Use the largest dictionary you have which meets
your storage limitations. The file SPELLD.ALL is
recommended.
When dictionary zero is loaded, SPELL will ask "Do you wish
to save this core image?" Answer "Yes" and save the
resulting core image. On non-TENEX systems, do an SSAVE to
produce a sharable high segment containing dictionary zero.
There are various other assembly switches, but they default
to reasonable settings, and you meddle with them at your
peril.
7.0 POSSIBLE EXPANSIONS
No program that is still in use is complete. The following
paragraphs include suggestions of possible future work in
this area.
For non-paged systems, the main dictionary (0) should be in
the sharable segment and the private additions in the
non-sharable segment. This would require adjusting the hash
chains to point across the gap.
The dictionary should be expanded to include all suffixes
for every word. There is a feature that strips suffixes for
the purpose of finding the stem of the word in the
dictionary, but this heuristic is error prone and
incompatible with later attempts to correct the word.
Page 14
If semantic information were included in the dictionary, it
could help guide the selection of a correction.
Either of these would require a major restructuring of the
program, since the dictionary would no longer fit in core.
Probably, the dictionary should be kept on the disk, with a
data structure similar to the one used in core, but arranged
to keep each hash chain in the minimum number of disk pages,
so that searches through the dictionary can be made
efficient.