PDP-10 Archive: more-uns/spell.doc from walsh_goodStuff

Trailing-Edge - PDP-10 Archives - walsh_goodStuff_1600 - more-uns/spell.doc
There are 6 other files named spell.doc in the archive. Click here to see a list.
	SPELL: Spelling Check and Correction Program


				Ralph E. Gorin
				Artificial Intelligence Laboratory
				Computer Science Department
				Stanford University
				Stanford, California 94305

				4 March 1971
				11 March 1974 revision
				Tenex Mods: March-October 1974
				   William Plummer, Jerry Wolf
				14 December 1974 revision
				CMU mods: April 9, 1975 by
				   Richard Johnsson and Philip Karlton
				19 May 1975 revision

---------------------------------------------------------------------



SPELL is  a program designed  to read text  files and check  them for
correctness  of spelling.   In  addition to  the spelling  check, the
program provides  a means  for correcting  words that  it thinks  are
misspelled.  This  program was written by Ralph E.  Gorin of Stanford
University Artificial Intelligence Laboratory.  It has been augmented
by  William Plummer  and Jerry Wolf  of BBN.   The CMU  interface was
written by Richard Johnsson and Philip Karlton. 

In its normal mode of usage, SPELL reads  through an input text file,
asks the  user about each word it does  not recognize, and creates an
output file in  which corrections have  been made.   Provisions exist
for:

   a. Loading, incrementally augmenting, and dumping special
      dictionaries.  Such dictionary files are ordinary text
      files  which may  be  listed and  edited.  Arbitrarily
      many dictionaries  may be loaded,  subject   only   to
      availability of (virtual) main memory. 


   b. Training modes where SPELL scans  an  input  file  and
      makes a list of all words it does not recognize.  Such
      a list can be used as an auxiliary dictionary.

   c. Termination of spelling checking part  way  through  a
      file  and  a way of picking up where you left off in a
      later session.

Other features of the program are:

   a. SPELL will read either  SOS, TECO or E/TV files.   The
      corrected  output will  be written  in the  same mode,
      except E/TV directories must  be deleted.   Dictionary
      files may be SOS, TECO or E/TV format.

   b. An  exception  file  may also  produced.   This   file
      contains  all words SPELL did not recognize (and their
      contexts), all corrections, plus all words which SPELL
      recognized  by stripping off prefixes and/or suffixes.
      This last class of words appears marked  with  "["  or
      "]" to denote prefix- or suffix- removal.  E.g., 
        FOOING]
        [PREFOOED]
      The  affix-stripping  algorithms  are  not   foolproof
      (e.g.,  CHOSES] ),  so  this gives a quick way to scan
      for the exceptions which may slip through.

   c. SPELL keeps track of all  word  spellings  which  were
      corrected by the user.  Subsequent occurrences of such
      are automatically corrected by the program.   This  is
      reported  to the user by a typeout of "MISSPELLING ==>
      CORRECTION".

   d. When a word is corrected,  the  output  file  will  be
      rewritten with either upper case, lower case, or mixed
      (first  letter  upper,  the   remainder   in   lower),
      depending on the cases of the first two letters in the
      original word.  Note: this will be incorrect  in  some
      cases (e.g., McCarthy).

---------------------------------------------------------------------


Using SPELL

Type the  command  R SPELL (under  TENEX,  type "SPELL" to the EXEC). 
All typeins to  SPELL must be terminated by carriage return.  (In the
TENEX version, editing by ^A, ^Q, and ^R may be done.)

It  will first ask: "Do  you want to augment  the dictionary?" If you
wish to use only  the main dictionary  presently in memory, type  "N"
<cr>.  You can then skip the next paragraph. 

If in  fact  you have  an auxiliary  dictionary  (of specific  terms,
infrequently used  words, etc.) you wish to use,  type "Y" <cr>.  You
will then be asked for the name of your dictionary file.   The format
of the file is  one dictionary entry (word) on a  line; words must be
composed  of alphabetic  characters  or apostrophe  and less  than 40
letters long.   The dictionary  entries need  not be in  alphabetical
order.  A misspelling-correction pair  may occur one line in the form
"MISSPELLING>CORRECTION".   After typing  the file  name you  will be
given  the  option   of  marking  the  new  entries   as  incremental
insertions.   If the new entries are  marked as incremental then they
will be included in an incremental  dump of the dictionary.  To  have
the new entries marked as incremental, type "I" <cr>; otherwise, type
<cr>.   (If any of the words in your auxiliary dictionary are already
in the main dictionary then no second copy of the  word will be made.
Hence, if your  words are marked as incremental  then in a subsequent
incremental dump, any words that were already in the dictionary  will
not be  dumped.)  After loading an  auxiliary dictionary the  program
will type  the  new total  number of  words in  the dictionary  (and,
except under TENEX, the  amount of core used).  You will then have an
opportunity to save the new core image (normally you won't  do this).
You will  again be asked, "Do  you want to augment  the dictionary?",
thus  allowing  you  to  enter  a  number  of auxiliary  dictionaries
(limited only by the availability of virtual core). 

You will then be given an opportunity to  specify zero or more switch
options.  The meanings of the switches are:

T    Training mode.  SPELL will treat the input  file  as  a
     training  set  rather than a file to be corrected.  All
     words in the file which are unfamiliar to SPELL will be
     entered  in  the  dictionary as incremental insertions.
     After SPELL finishes reading the file, the user has  an
     opportunity to dump all the words that were inserted in
     this manner.   The  resulting  list  of  words  may  be
     edited,  and  any  words  which  are  incorrect  may be
     deleted.  Then this file can be used  as  an  auxiliary
     dictionary while correcting the original source file.

     This feature is provided for the purpose of easing  the
     problem  of creating a specialized dictionary of jargon
     and infrequently used words.

Q    Q-Training mode.  In this mode, all words in the source
     file  that are unfamiliar to SPELL will be added to the
     dictionary; the difference is, if  any  "new"  word  is
     "close  to"  some old word, the new word will be output
     to the exception file.  The exception file will contain
     only  such  words.   In  this way, the spelling checker
     calls to your attention the fact that these  words  may
     be misspelled.

N    No suffix removal.  This  switch suppresses the attempt
     to  remove suffixes  to  recognize a  correctly spelled
     root word.  SPELL will then find many more questionable
     words,  but  it  will  work  more  correctly  than  the
     heuristic affix removal. 

A    No prefix removal.  This  switch suppresses the attempt
     to  remove prefixes  to  recognize a  correctly spelled
     root word.

U    Accept Upper case mode.   In this mode, all  words that
     are written entirely  in upper case will be inserted in
     the dictionary.  This is useful when a  manuscript file
     contains jargon  terms that are written  in upper case,
     and text-processor (e.g., PUB, TJ6, or RUNOFF) commands
     in upper case.   Before reading  the input file,  SPELL
     will ask  for a dictionary number to  use for all words
     inserted   this  way   (see   "How  to   Use   Multiple
     Dictionaries"). 

D    PUB mode.  In this mode, lines that begin with a period
     (.) will be passed to the  output  without  any attempt
     at correction being made.  Also, the percent  character 
     (%) denotes a font change, so the character immediately
     following the percent will be passed  to the output and
     will be considered as  delimiting a word.   Thus if the
     text "%Bimportant%*" appears in the input the "%B"  and
     the "%*" are passed to the  output  and the "important" 
     is checked and, if necessary, corrected.  If the  D  is
     followed by some non-alphanumeric character that is not
     a  carriage  control  character (e.g., "!" or "+") then
     that character will be treated as "%" was described and
     "%" will be treated normally.  In this way, if the text
     file turns on some character other  than "%"  for  font
     selection then that file can be handled by SPELL.  Note
     that only one such character can be turned on in SPELL.

P    Pickup mode.  After specifying input  and  output  file
     names,  you  will  be  asked to specify a page and line
     number for pickup.  The effect is to  suspend  spelling
     checking  until  the  page  and line specified.  When a
     user has a partially corrected  file,  this  mode  will
     enable  him  to  skip over the portion of the file that
     has already been corrected.  The  input  file  will  be
     copied  without  checking  to the output until the page
     and line specified, at which  point  spelling  checking
     begins.

	Next you will be asked for the name of the file that you want
to check for spelling errors.  File names are specified in the  usual
format  of "name.ext[prj,prg]" where name is the filename, ext is the
file extension, and [prj,prg] is the name of the  file  owner,  which
may be omitted if the file is on the present user's disk area. If you
omit the file name then you will immediately enter the exit  sequence
(see below).

After specifying the switches, you will be asked to specify the names
of the  file to be checked, the corrected file  to be output, and the
exception file.   In the TENEX  version there  are always output  and
exception  files.   In  the  TENEX  version,  if  no file  names  are
specified  for the output and exception  files they will default; the
output file will have the  same name and extension as the  input, and
the  exception  file  will  have  the  extension  "EXCEPTIONS".    In
non-TENEX  versions  there  are  no  default  names,  and  output  or
exception files may be omitted. 

The exception file, should  you chose to make one,  will contain each
line on which an error was found, the indication of the page and line
number, and the suspect word.   Words accepted via the affix  removal
heuristics will also appear in the exception file. 

After you have specified all the files, the program will respond with
"Working..." and start checking the input file for spelling errors. 

When  the spelling  checker  encounters  a  word that  isn't  in  the
dictionary, it will  type the page and line number, the line in which
the word occurs, and the word itself.  You will be given a  choice of
actions to take: 


A    Accept this word, this one time. 

I    Accept this word and insert it  in  the  dictionary  so
     that  subsequent  occurrences  of  this  word  will  be
     recognized and accepted.  Words that are inserted  this
     way  are  marked as incremental insertions and they may
     be dumped to form an auxiliary dictionary.

R    Replace this word.  Type "R" <cr> and the program  will
     ask  you  for the replacement word.  If the replacement
     word is not already in the dictionary, the program will
     give you an opportunity to insert it.

     If the misspelling is due to an omitted  space  between
     words, use the "R" command to retype the words with the
     space.

     If the replacement  word contains no spaces  or illegal
     characters, you  will be asked if you  wish to add this
     replacement  to  the  list  of   misspelling-correction
     pairs. If you do add the replaced word as a misspelling
     then all  subsequent occurrences  of that word  will be
     replaced with the replacement string. 

X    Accept  this  word  and  finish.   The  word  will   be
     accepted.  Then the remainder of the input file will be
     copied without checking to the output file.

W    Save  my  incremental  insertions.   After   you   type
     "W" <cr>  you  will  be asked for a file name.  Then an
     incremental dump of the dictionary will be written into
     the  file.   After  the  dump  is complete you may then
     decide what to do with the excepted word.

L    Load  an auxiliary  dictionary.   The  present word  is
     accepted  and you  will be  asked for  the name  of the
     dictionary  file to  load.    This  is  useful  if  you
     encounter  a  jargon  term   but  forgot  to  load  the
     appropriate dictionary. 

D    Display the line and offending word  again.   The  line
     that  is  displayed will not have any corrections shown
     in it.  If a line has more than one error the line will
     only  be  typed  once.   Subsequent errors on that line
     will cause only the particular word to be typed, unless
     this command is used.

S    If this choice is offered then the spelling checker has
     discovered   several   words  that  could  be  possible
     corrections of this word.  If you  type  "S" <cr>  then
     you  will  enter a mode where you can look at the words
     that were found by the program and (optionally)  select
     one of the words from the list.

     When you enter this selection submode, the  first  word
     in  the  list  of  possible  corrections  will be typed
     followed by an asterisk.  Then you have  the  following
     choices:

     C<cr>       Use this word as the Correction.

     <cr>        Show the next possible  choice.   When  you
                 exhaust the choices you are returned to the
                 outer mode, and asked again.

     ^<cr>       Back up in the list.

     <alt mode>  Escape from this submode and return to  the
                 outer command mode.

In general, when  a word is  found that is  not in the  dictionary, a
brief message,  either "Type A,I,R,X,W or D"  or "Type A,I,R,X,W,D or
S" will  be typed to  remind you  of the  possible choices.   In  the
special  case   where  the  program  finds   precisely  one  possible
correction  for the word, then the message,  "I guess <word>.  Type C
to make this  correction or A,I,R,X,W  or D" will  be typed.   If you
type "C" <cr> then the indicated substitution will be made; otherwise
you have the usual choices.  In the CMU version, slow  terminals (any
terminal not  an INFOTON, BEEHIVE,  GRAPICS or ARDS) will  have these
messages shortened to  "*", "S*",  and  "==><word> C*", respectively.
Also, on CMU slow terminals, the line on which an exception occurs is
not displayed (use the D command). 

When the input file is  exhausted, all files are closed, the  program
types "Finished.", and  the exit sequence is entered.   The user then
has several options:

E    Exit now.

S    Save this core image.

C    Go back and correct another file.

A    Augment the dictionary, set new switches,  and  correct
     another file.

D    Complete dump of the dictionary.  This  will  create  a
     very large file, and it is not usually recommended.

I    Incremental dump of the dictionary.  All the words that
     were inserted while running the program are dumped to a
     file.  The user specifies  a file name (the default  is
     WORDS.LST).    This incremental  file  is  in a  format
     suitable  for  editing  or  for  use  as  an  auxiliary
     dictionary.  The words in this file are in alphabetical
     order. 

X    This command is used  to  get  a  trace  count  of  the
     program.   It  is  for diagnostic purposes only, and is
     displayed as a possible choice only if the program  has
     been assembled as a debugging program.

---------------------------------------------------------------------


How to Use Multiple Dictionaries

SPELL has a set  of features whereby the user can  cause the creation
of several disjoint  incremental dictionaries.  In this way, the user
may collect several dictionaries  of special terms.  Internally,  all
dictionary  entries   are  considered  equivalent  as   regards  word
searches.  The distinction between dictionaries becomes relevant when
doing incremental dumps  (the I command  during the exit sequence  or
the W command while in the middle of execution).  When an incremental
dump is requested,  the user may  specify a number,  e.g., W9,  which
selects the particular incremental dictionary to be  dumped.  In this
example, dictionary 9 will be dumped. 

Dictionary 0  is the main dictionary.  Words  cannot be added to this
dictionary, except by reading an  auxiliary file.  In general,  words
that are inserted incrementally are  marked as being in dictionary 1.
All  words that are incremental insertions  in the dictionary will be
marked in dictionary 1, unless the user specifies otherwise. 

	When requesting an incremental dump, the user may specify the
particular dictionary to dump.  This is allowed in two cases:

The following places are where the user may  specify which dictionary
to add to:

    1.  When loading an auxiliary dictionary,  if  the  user
        responds with "In" to the question about marking new
        entries as incremental, then the new entries will be
        marked   in   dictionary   number   n  (where  n  is
        interpreted as decimal and should be less than 31).

    2.  After a word has been rejected, type "In" to  insert
        the word in dictionary number n.

    3.  After replacing a word, if the replacement is not in
        the   dictionary,  then  type  "In"  to  insert  the
        replacement into dictionary n.

When  requesting  an  incremental dump,  the  user  may  specify  the
particular dictionary to dump.  This is allowed in two cases:

    1.  After some word has been rejected, the command  "Wn"
        will cause dictionary number n to be dumped.

    2.  During the exit  sequence,  the  command  "In"  will
        cause dictionary number n to be dumped.

In all five cases above, if n is either 0 or omitted, then it will be
taken as being 1. 

Caution!  There  is  no  provision  in  SPELL for  remembering  which
dictionary  numbers have  been  used.    Therefore,  it  remains  the
individual user's responsibility  to remember the numbers  of all the
dictionaries that  he creates.  (Forgetting the number will mean that
the forgotten dictionary can not be dumped incrementally.   The words
in a forgotten  dictionary will still be available,  but the only way
to actually get them dumped out is to dump the entire dictionary). 

Hint: In the course of correcting a file, it is likely that you  will
be asked  about words  which you  wish to  have accepted during  this
file,  but which  you don't  wish to have  saved in  your incremental
dictionary(s).  In these cases,  simply insert them in a  "throwaway"
incremental dictionary  which you  don't bother  to dump  when you're
finished. 

---------------------------------------------------------------------


Abnormal Conditions

While  the program  is running it  is possible  that certain abnormal
conditions may obtain.  The usual response of the  program is to type
some sort  of error message.   The following  is a list  of the error
messages in SPELL, with an indication of the severity of the error. 

    Illegal dictionary entry: <word>
          This error occurs if an entry in a dictionary file
          exceeds  40  (decimal)  characters.   The  word is
          ignored.

    0 LENGTH WORD AT HASHCP
          Somebody just asked to compute the hash address of
          an  empty  word.  The program continues, but there
          is a possibility of error.

    HASHING ERROR
          Somebody asked for the hash address of a word that
          doesn't  begin  with  letters or apostrophe as the
          first two characters.  This is a fatal error;  the
          program halts.

    DEVICE DATA ERROR (OUTPUT)
          This message means  that  while  writing  a  file,
          something screwed up.  The program halts.

    DEVICE ERROR (INPUT)
          The input file is screwed up  in  some  way.   The
          program halts.

    Internal confusion in the spelling checker.
          Called from location <loc>.  The spelling  checker
          has  discovered  a  (possible) bug in itself.  The
          program halts, but the  user  may  type  CONTINUE.
          Please   note   the  location  mentioned  and  the
          circumstances that evoked the message.

    Dictionary number too large, Maximum is 30.
          This message means  that  the  user  attempted  to
          select  for  insertion  or  dumping  a  dictionary
          beyond the range of  allowed  numbers.   The  user
          will get another chance to do the right thing.

    Unrecognized switch.
         The user  asked  for an  unknown switch.   He  must
         repeat the entire switch specification again. 

The following messages occur only in the non-TENEX version:

    Illegal Character in Scan.
         This is a message from the  routine that reads file
         names.  You  will be asked to try retyping the name
         (non-TENEX only). 

    File not Found. <filename>
         The indicated  file could  not be  found. The  user
         gets to specify some other file. 

    Enter failed on: <file name>
         An  enter uuo  failed while  trying  to select  the
         indicated  file for output.   The  user may specify
         another name. 

    Open Failed on Device <dev>:
         You asked for a device that  doesn't exist or isn't
         available to the program.  You will get a chance to
         ask for something else. 

    Insufficient Core Available.
         SPELL  requires  more  core   while  expanding  the
         dictionary  but  none is  available.    The program
         exits. 

---------------------------------------------------------------------


		Appendix - Internal Workings

	Data Structures - Hashing Function

The data structure is the heart of the program, and any efficiency in
the  program  operation  is due  primarily  to  this  choice of  data
structure.   The data  structure is  basically a  hash coding  scheme
where dictionary entries are accessed  by both their alphabetic order
and  by their length.  There is a base  table that contains 26 * 26 *
10 halfwords; this table gives  anchors for 6760 chains.   Each chain
contains exactly all  words with the same two  first letters and some
given length.  To be precise, the hashing function is:
	(L1*26+L2)*10+min(WL-2,9),
where L1 and L2  are numeric representations of the  first and second
letters  (A=0, B=1, ...  Z=25, and apostrophe also is 25), and  WL is
the  length of the  word in characters.

This scheme was chosen since it provides  both  an  efficient
way  to probe the dictionary and a quick way to select a small subset
of all words that are close to a given input word.

	Data Structures - Dictionary Entry Format

Entries  are added  to  the  appropriate  hash chain  by  the  INSERT
subroutine.   Entries are added to the head  of the chain, saving the
time and effort of  searching to the end of  the chain.  This  scheme
means that the last item entered on a chain is the first item seen by
a search. The format of the entry is given by:

	Word 0:	xwd flags,nextlk
	Word 1: 5 bit representation
	Word 2: 5 bit representation
		...

There  are precisely  1+ceiling(WL/7)  machine  words used  for  each
dictionary  entry.   WL is  the  length of  the entry  in characters.
Nextlk is the pointer to the next entry in the list, or zero  if this
is the last in  the chain.  The left side  contains flags; bits 13-17
specify  the  incremental dictionary  number  (0 for  main,  1-30 for
incremental dictionaries, and 31 for misspellings).   One can imagine
that bits 5-12 could be  used to store semantic information about the
entry.  The unused bytes in the  last word of an entry must be  zero,
since they are used to stop the routine that converts the five bit to
7 bit. 

Misspellings are entered  in the main dictionary with the incremental
dictionary number  set to  decimal 31.   If  a word  is  marked as  a
misspelling then  the word  preceding the flags  and link  contains a
pointer  to the flag and  link word of  the correction.  Misspellings
may be  deleted before  the full  dictionary is  dumped, or  whenever
SPELL  asks about  saving the  core image.   The  spaced  obtained by
deleting misspellings  is not  reutilized at  present, although  that
feature could be added without great difficulty. 

Spelling Correction Heuristics

There are four kinds  of errors that the program attempts to correct:
	1. one wrong letter.
	2. one missing letter.
	3. one extra letter.
	4. two transposed letters.

For a wrong  letter in the  third or subsequent character,  all words
that  are candidates must  exist on the  same chain  that the suspect
word hashes to.   Hence,  each entry  on that chain  is inspected  to
determine  if the  suspect  differs  from the  entry  by exactly  one
character.  This is accomplished by an exclusive-or (XOR) between the
suspect and  the dictionary.   Then  a JFFO  instruction selects  the
first  non zero byte  in the  XOR.   This byte is  zeroed and  if the
result is all zero then the dictionary word differs from the  suspect
in only one letter.  All such words are  listed at CANDBF, where they
can be inspected later. 

For  a wrong  letter in  the first or  second character,  the program
tries varying  the  second letter  through  all 26  possible  values,
searching for  an exact match.   Then all  26 possible values  of the
first  letter  are tried,  after  setting  the second  letter  to its
original value.   This  means that 52  more chains  are searched  for
possible matches. 

To correct transposed letters, all combinations of transposed letters
are tried.  There  are only WL-1 such  combinations, so it is  fairly
cheap to do that. 

To correct  one extra letter,  wl copies of  the word are  made, each
with  some  letter  removed.   Each  of  these is  looked  up  in the
dictionary.  This takes WL searches. 

To correct one missing letter, WL+1 copies of the word are made, each
time inserting  a null character  in a  new position in  the suspect.
The  null character is  never part of  any word, so  the suspect word
augmented by an embedded  null can be thought  of as a word  with one
wrong  letter (the null)  then the  algorithm for matching  one wrong
letter is used.  If the  first character is omitted, all 26  possible
first  characters are  tried.   Also,  26 more  words  are formed  by
varying the second character in case that had been omitted. 

---------------------------------------------------------------------


		Appendix - Assembly Instructions

There  are four  assembly  time switches,  CMUSW,  TENEX, STANSW  and
SANSW.   If STANSW is  set then there  are SIXBIT ppn's  and the SWAP
UUO; if STANSW is zero, then normally there are octal ppn's except if
SANSW is set, in  which case, there are decimal ppn's.   If the TENEX
switch is  set it overrides the others and TENEX style file names are
used.  The CMUSW provides  a CMU-style PPN scanner and  provision for
CMU slow terminals. 

Compile the program using MACRO or FAIL and load  it.  When you start
the  program  the   first  time  after  loading,  it  will  demand  a
dictionary.  Use the file SLOVAR (at BBN,  SPELL.SLOVAR), or anything
else that's handy. 

Save the resulting core image when the dictionary load is complete.  

There  are various  other  assembly  switches,  but they  default  to
reasonable settings, and you meddle with them at your peril. 

---------------------------------------------------------------------


		Appendix - Possible Expansions

No  program  that  is  still in  use  is  complete.    The  following
paragraphs include suggestions of possible future work in this area. 

The dictionary  should be expanded to include  all suffixes for every
word.  There  is a feature  that strips suffixes  for the purpose  of
finding the stem of the word in the dictionary, but this heuristic is
error prone and incompatible with later attempts to correct the word. 

If  semantic information  were included  in the dictionary,  it could
help guide the selection of a correction. 

Either of these would  require a major restructuring of  the program,
since  the dictionary would  no longer  fit in  core.   Probably, the
dictionary should be kept on the disk, with a data structure  similar
to the one used in core, but arranged to  keep each hash chain in the
minimum number of disk pages, so that searches through the dictionary
can be made efficient.