PDP-10 Archive: decus/20-0148/spell.doc from decuslib20-05

Trailing-Edge - PDP-10 Archives - decuslib20-05 - decus/20-0148/spell.doc

There are 6 other files named spell.doc in the archive. Click here to see a list.



        SPELL: Spelling Check and Correction Program



1.0  INTRODUCTION

SPELL is a program designed to read  text  files  and  check
them  for  correctness  of  spelling.   In  addition  to the
spelling check, the program provides a means for  correcting
words  that  it  thinks  are  misspelled.   This program was
written by Ralph E.  Gorin of Stanford University Artificial
Intelligence  Laboratory.   It has been augmented by William
Plummer and Jerry Wolf of BBN and Marshall Abrams of NBS.  

In its normal mode of usage, SPELL reads  through  an  input
text  file,  asks  the  user  about  each  word  it does not
recognize, and creates an output file in  which  corrections
have been made.  Provisions exist for:

     1.  Loading,  incrementally  augmenting,  and   dumping
         special  dictionaries.   Such  dictionary files are
         ordinary text files which may be listed and edited.
         Arbitrarily   many   dictionaries  may  be  loaded,
         subject only  to  availability  of  (virtual)  main
         memory.  

     2.  Training modes where SPELL scans an input file  and
         makes  a  list  of all words it does not recognize.
         Such a list can be used as an auxiliary dictionary.

     3.  Termination of spelling checking part way through a
         file  and a way of picking up where you left off in
         a later session.

Other features of the program are:

     4.  SPELL will read either SOS,  TECO  or  E/TV  files.
         The  corrected  output  will be written in the same
         mode, except  E/TV  directories  must  be  deleted.
         Dictionary files may be SOS, TECO or E/TV format.

     5.  An exception file may  also  produced.   This  file
         contains  all  words  SPELL  did not recognize (and
         their contexts), all corrections,  plus  all  words
         which  SPELL  recognized  by stripping off prefixes
         and/or suffixes.  This last class of words  appears
         marked with "[" or "]" to denote prefix- or suffix-
         removal.     E.g.,    FOOING]    [PREFOOED]     The
         affix-stripping algorithms are not foolproof (e.g.,
         CHOSES] ), so this gives a quick way  to  scan  for
         the exceptions which may slip through.

     6.  SPELL keeps track of all word spellings which  were
         corrected  by  the user.  Subsequent occurrences of
         such are automatically corrected  by  the  program.

                                                      Page 2


         This  is  reported  to  the  user  by  a typeout of
         "MISSPELLING ==> CORRECTION".

     7.  When a word is corrected, the output file  will  be
         rewritten  with  either  upper case, lower case, or
         mixed (first letter upper, the remainder in lower),
         depending  on the cases of the first two letters in
         the original word.  Note:  this will  be  incorrect
         in some cases (e.g., McCarthy).





2.0  USING SPELL

2.1  Starting SPELL

Type the command R SPELL (under TENEX, type "SPELL"  to  the
EXEC).   All typeins to SPELL must be terminated by carriage
return.  (In the TENEX version, editing by ^A,  ^Q,  and  ^R
may  be  done.)  (TOPS-10 accepts ^U, DEL, and, depending on
MONGEN parameters, BS.)



2.2  Augmenting The Built-in Dictionary

The non-TENEX version will first attempt to load the default
private  dictionary  file  WORDS.DIC  into dictionary 1 with
incremental insertion set (see Incremental Insertion below).

Next,  SPELL  will  ask:   "Do  you  want  to  augment   the
dictionary?"  If  you  wish  to use only the main dictionary
(plus WORDS.DIC) presently in memory, type  <cr>.   You  can
then skip the next paragraph.  



2.2.1  Private Auxiliary Dictionary - If in fact you have an
additional   auxiliary   dictionary   (of   specific  terms,
infrequently used words, etc.) you wish  to  use,  type  "Y"
<cr>.  You will then be asked for the name of the dictionary
file.



2.2.1.1  Incremental Insertions - After typing the file name
you  will  be given the option of marking the new entries as
incremental insertions.  If the new entries  are  marked  as
incremental  then  they  will  be included in an incremental
dump of the dictionary.  To have the new entries  marked  as
incremental,  type "I" <cr>;  otherwise, type <cr>.  (If any
of the words in your auxiliary dictionary are already in the
main  dictionary  then  no  second  copy of the word will be

                                                      Page 3


made.  Hence, if your words are marked as  incremental  then
in  a  subsequent  incremental  dump,  any  words  that were
already in the dictionary will not be dumped.)



2.2.1.2  Additional Dictionaries - After     loading      an
auxiliary  dictionary  the  program  will type the new total
number of words in the dictionary (and, except under  TENEX,
the amount of core used).  You will then have an opportunity
to save the new core image (normally  you  won't  do  this).
You  will  again  be  asked,  "Do  you  want  to augment the
dictionary?",  thus  allowing  you  to  enter  a  number  of
auxiliary  dictionaries (limited only by the availability of
{virtual} core).  



2.2.1.3  Dictionary Format - The format of the file  is  one
dictionary  entry  (word) on a line;  words must be composed
of alphabetic characters or  apostrophe  and  less  than  40
letters  long.   The  dictionary  entries  need  not  be  in
alphabetical order.  A misspelling-correction pair may occur
one line in the form "MISSPELLING>CORRECTION".



2.3  Switches

You will then be given an opportunity  to  specify  zero  or
more switch options.  The meanings of the switches are:

     T    Training mode.  SPELL will treat the input file as
          a training set rather than a file to be corrected.
          All words in the  file  which  are  unfamiliar  to
          SPELL   will  be  entered  in  the  dictionary  as
          incremental  insertions.   After  SPELL   finishes
          reading  the  file, the user has an opportunity to
          dump all the words  that  were  inserted  in  this
          manner.   The  resulting  list  of  words  may  be
          edited, and any words which are incorrect  may  be
          deleted.   Then  this  file  can  be  used  as  an
          auxiliary dictionary while correcting the original
          source file.

          This feature is provided for the purpose of easing
          the  problem  of creating a specialized dictionary
          of jargon and infrequently used words.

     Q    Q-Training mode.  In this mode, all words  in  the
          source  file  that are unfamiliar to SPELL will be
          added to the dictionary;  the  difference  is,  if
          any  "new"  word  is "close to" some old word, the
          new word will be output  to  the  exception  file.
          The  exception  file will contain only such words.

                                                      Page 4


          In this way, the spelling checker  calls  to  your
          attention   the  fact  that  these  words  may  be
          misspelled.

     N    No suffix removal.   This  switch  suppresses  the
          attempt   to   remove   suffixes  to  recognize  a
          correctly spelled root word.  SPELL will then find
          many  more  questionable  words,  but it will work
          more correctly than the heuristic affix removal.  

     A    No prefix removal.   This  switch  suppresses  the
          attempt   to   remove   prefixes  to  recognize  a
          correctly spelled root word.

     U    Accept Upper case mode.  In this mode,  all  words
          that  are  written  entirely in upper case will be
          inserted in the dictionary.  This is useful when a
          manuscript  file  contains  jargon  terms that are
          written in upper case, and  text-processor  (e.g.,
          PUB,  TJ6,  or  RUNOFF)  commands  in  upper case.
          Before reading the input file, SPELL will ask  for
          a  dictionary number to use for all words inserted
          this way (see "How to Use Multiple Dictionaries").

     P    Pickup mode.  After specifying  input  and  output
          file  names,  you  will be asked to specify a page
          and line number for  pickup.   The  effect  is  to
          suspend  spelling checking until the page and line
          specified.  When a user has a partially  corrected
          file,  this  mode will enable him to skip over the
          portion  of  the  file  that  has   already   been
          corrected.   The input file will be copied without
          checking to the output until  the  page  and  line
          specified,   at   which  point  spelling  checking
          begins.



2.4  File To Be Corrected

Next you will be asked for the name of  the  file  that  you
want to check for spelling errors.  File names are specified
in the usual format of "name.ext[prj,prg]" where name is the
filename,  ext  is  the file extension, and [prj,prg] is the
name of the file owner, which may be omitted if the file  is
on  the present user's disk area.  If you omit the file name
then you will  immediately  enter  the  exit  sequence  (see
below).

                                                      Page 5


2.5  Output File

You will be next asked to name the  output  file.   Enter  a
file  name,  or  just  a <cr> if you wish to use the default
(see next paragraph).



2.5.1  Default Output File(s) - In the TENEX  version  there
are  always  output  and  exception  files;   by default the
output file will have the same name  and  extension  as  the
input,  and  the  exception  file  will  have  the extension
"EXCEPTIONS".  In non-TENEX versions the default  output  is
to  rename  the  input file with extension .BAK and to place
the corrected text in the (previous) input file  name.   The
exception file may be omitted.  



2.5.2  Exception File - The exception file, should you chose
to  make  one,  will contain each line on which an error was
found, the indication of the page and line number,  and  the
suspect   word.    Words  accepted  via  the  affix  removal
heuristics will also appear in the exception file.  



2.6  Checking And Correcting

After you have specified all the  files,  the  program  will
respond  with "Working..." and start checking the input file
for spelling errors.  



2.6.1  Choices When An Unknown Word Is Found - When      the
spelling  checker  encounters  a  word  that  isn't  in  the
dictionary, it will type the page and line number, the  line
in which the word occurs, and the word itself.

In general, when  a  word  is  found  that  is  not  in  the
dictionary,  a  brief message will be typed to remind you of
the possible choices.  In the special case where the program
finds  precisely  one  possible correction for the word, you
will be given the choice of typing C to accept  the  "guess"
or any other option.  The options are:

     C    The "guess" is correct.  Correct the word  in  the
          text.   Enter misspelling in dictionary so that it
          will be corrected if it appears again.

     A    Accept this word, this one time.  

     I    Accept this word and insert it in  the  dictionary
          so  that  subsequent occurrences of this word will

                                                      Page 6


          be  recognized  and  accepted.   Words  that   are
          inserted   this  way  are  marked  as  incremental
          insertions and they  may  be  dumped  to  form  an
          auxiliary dictionary.

     R    Replace this word.  Type "R" <cr> and the  program
          will  ask  you  for  the replacement word.  If the
          replacement word is not already in the dictionary,
          the program will give you an opportunity to insert
          it.

          If the misspelling is  due  to  an  omitted  space
          between  words,  use the "R" command to retype the
          words with the space.

          If the replacement  word  contains  no  spaces  or
          illegal  characters, you will be asked if you wish
          to  add  this   replacement   to   the   list   of
          misspelling-correction  pairs.   If you do add the
          replaced word as a misspelling then all subsequent
          occurrences of that word will be replaced with the
          replacement string.  

     X    Accept this word and finish.   The  word  will  be
          accepted.   Then  the  remainder of the input file
          will be copied  without  checking  to  the  output
          file.

     W    Save my incremental insertions.   After  you  type
          "W"  <cr> you will be asked for a file name.  Then
          an incremental dump  of  the  dictionary  will  be
          written into the file.  After the dump is complete
          you may then decide what to do with  the  excepted
          word.

     L    Load an auxiliary dictionary.  The present word is
          accepted and you will be asked for the name of the
          dictionary file to load.  This is  useful  if  you
          encounter  a  jargon  term  but forgot to load the
          appropriate dictionary.  

     D    Display the line and offending  word  again.   The
          line   that   is   displayed  will  not  have  any
          corrections shown in it.  If a line has more  than
          one  error  the  line  will  only  be  typed once.
          Subsequent errors on that line will cause only the
          particular  word  to be typed, unless this command
          is used.

     S    If  this  choice  is  offered  then  the  spelling
          checker has discovered several words that could be
          possible corrections of this word.   If  you  type
          "S"  <cr> then you will enter a mode where you can
          look at the words that were found by  the  program
          and  (optionally) select one of the words from the

                                                      Page 7


          list.

          When you enter this selection submode,  the  first
          word  in  the list of possible corrections will be
          typed followed by an asterisk.  Then you have  the
          following choices:

          C<cr>  Use this word as the Correction.

          <cr>   Show the next possible  choice.   When  you
                 exhaust the choices you are returned to the
                 outer mode, and asked again.

          ^<cr>  Back up in the list.

          <esc>  Escape from this submode and return to  the
                 outer command mode.

Note that when you make a correction via the C command or by
selection  from  the  list  presented by the S command, that
correction is entered in the misspelling-correction list and
subsequent  occurrences  of  the  same  misspelling  will be
corrected automatically.  



2.7  Finished Processing The File

When the input file is exhausted, all files are closed,  the
program types "Finished.", and the exit sequence is entered.
Except on TENEX, you will be asked if you want  the  default
dictionary WORDS.DIC dumped.  Answer "yes" or "no" (actually
only a single letter answer is required).  The user then has
several options:

E    Exit now.

S    Save this core image.

C    Go back and correct another file.

A    Augment the dictionary, set new switches,  and  correct
     another file.

D    Complete dump of the dictionary.  This  will  create  a
     very large file, and it is not usually recommended.

I    Incremental dump of the dictionary.  All the words that
     were  read in from a private dictionary and marked with
     an I plus those inserted while running the program  are
     dumped  to a file.  The user specifies a file name (the
     default is WORDS.DIC).  This incremental file is  in  a
     format  suitable for editing or for use as an auxiliary
     dictionary.  The words in this file are in alphabetical
     order.

                                                      Page 8


X    This command is used  to  get  a  trace  count  of  the
     program.   It  is  for diagnostic purposes only, and is
     displayed as a possible choice only if the program  has
     been assembled as a debugging program.




3.0  HOW TO USE MULTIPLE DICTIONARIES


SPELL has a set of features whereby the user can  cause  the
creation  of  several disjoint incremental dictionaries.  In
this way, the  user  may  collect  several  dictionaries  of
special  terms.   Internally,  all  dictionary  entries  are
considered  equivalent  as  regards  word   searches.    The
distinction between dictionaries becomes relevant when doing
incremental dumps (the I command during the exit sequence or
the  W  command  while in the middle of execution).  When an
incremental dump  is  requested,  the  user  may  specify  a
number,  e.g.,  W9, which selects the particular incremental
dictionary to be dumped.  In this example, dictionary 9 will
be dumped.  



3.1  Dictionaries 0 And 1

Dictionary 0 is the main dictionary.  Words cannot be  added
to this dictionary, except by reading an auxiliary file.  In
general, words that are inserted incrementally are marked as
being  in  dictionary  1.   All  words  that are incremental
insertions in the dictionary will be marked in dictionary 1,
unless the user specifies otherwise.  



3.2  Specifying A Dictionary

The following places are where the user  may  specify  which
dictionary to add to:

     1.  When loading an auxiliary dictionary, if  the  user
         responds  with  "In"  to the question about marking
         new entries as incremental, then  the  new  entries
         will  be  marked in dictionary number n (where n is
         interpreted as decimal and should be less than 31).

     2.  After a word has been rejected, type "In" to insert
         the word in dictionary number n.

     3.  After replacing a word, if the replacement  is  not
         in  the  dictionary,  then  type "In" to insert the
         replacement into dictionary n.

                                                      Page 9


When requesting an incremental dump, the  user  may  specify
the  particular  dictionary to dump.  This is allowed in two
cases:

     1.  After some word has been rejected, the command "Wn"
         will cause dictionary number n to be dumped.

     2.  During the exit sequence,  the  command  "In"  will
         cause dictionary number n to be dumped.


In all five cases above, if n is either 0 or  omitted,  then
it will be taken as being 1.  




3.3  Caution!

There  is  no  provision  in  SPELL  for  remembering  which
dictionary  numbers  have  been used.  Therefore, it remains
the individual user's responsibility to remember the numbers
of  all  the  dictionaries that he creates.  (Forgetting the
number will mean that the forgotten dictionary  can  not  be
dumped  incrementally.   The words in a forgotten dictionary
will still be available, but the only way  to  actually  get
them dumped out is to dump the entire dictionary).  




3.4  Hint:

In the course of correcting a file, it is  likely  that  you
will  be  asked  about words which you wish to have accepted
during this file, but which you don't wish to have saved  in
your  incremental  dictionary(s).   In  these  cases, simply
insert them in a "throwaway"  incremental  dictionary  which
you don't bother to dump when you're finished.  




4.0  ABNORMAL CONDITIONS


While the program is running it  is  possible  that  certain
abnormal  conditions  may obtain.  The usual response of the
program  is  to  type  some  sort  of  error  message.   The
following  is a list of the error messages in SPELL, with an
indication of the severity of the error.  

     Illegal dictionary entry:  <word>
          This error occurs if an entry in a dictionary file
          exceeds  40  (decimal)  characters.   The  word is

                                                     Page 10


          ignored.
     0 LENGTH WORD AT HASHCP
          Somebody just asked to compute the hash address of
          an  empty  word.  The program continues, but there
          is a possibility of error.
     HASHING ERROR
          Somebody asked for the hash address of a word that
          doesn't  begin  with  letters or apostrophe as the
          first two characters.  This is a fatal error;  the
          program halts.
     DEVICE DATA ERROR (OUTPUT)
          This message means  that  while  writing  a  file,
          something screwed up.  The program halts.
     DEVICE ERROR (INPUT)
          The input file is screwed up  in  some  way.   The
          program halts.
     Internal confusion in the spelling checker.
     Called from location <loc>.
          The spelling checker has discovered  a  (possible)
          bug  in  itself.   The program halts, but the user
          may  type  CONTINUE.   Please  note  the  location
          mentioned  and  the  circumstances that evoked the
          message.
     Dictionary number too large, Maximum is 30.
          This message means  that  the  user  attempted  to
          select  for  insertion  or  dumping  a  dictionary
          beyond the range of  allowed  numbers.   The  user
          will get another chance to do the right thing.
     Unrecognized switch.
          The user asked for an  unknown  switch.   He  must
          repeat the entire switch specification again.  

The following messages occur only in the non-TENEX version:

     Illegal Character in Scan.
          This is a message from the routine that reads file
          names.   You  will  be  asked  to try retyping the
          name.
     File not Found.  <filename>
          The indicated file could not be found.   The  user
          gets to specify some other file.  
     Enter failed on:  <file name>
          An enter uuo failed while  trying  to  select  the
          indicated  file  for output.  The user may specify
          another name.  
     Open Failed on Device <dev>:
          You asked for a device that doesn't exist or isn't
          available  to  the program.  You will get a chance
          to ask for something else.  
     Insufficient Core Available.
          SPELL  requires  more  core  while  expanding  the
          dictionary  but  none  is  available.  The program
          exits.

                                                     Page 11


5.0  INTERNAL WORKINGS

5.1  Data Structures - Hashing Function


The data structure is the heart  of  the  program,  and  any
efficiency in the program operation is due primarily to this
choice of data structure.  The data structure is basically a
hash  coding scheme where dictionary entries are accessed by
both their alphabetic order and by their length.  There is a
base table that contains 26 * 26 * 10 halfwords;  this table
gives anchors for 6760 chains.  Each chain contains  exactly
all  words  with  the  same two first letters and some given
length.   To  be   precise,   the   hashing   function   is:
(L1*26+L2)*10+min(WL-2,9),  where  L1  and  L2  are  numeric
representations of the first and second letters  (A=0,  B=1,
...   Z=25, and apostrophe also is 25), and WL is the length
of the word in characters.

This scheme was chosen since it provides both  an  efficient
way  to  probe  the  dictionary  and a quick way to select a
small subset of all words that are close to  a  given  input
word.




5.2  Data Structures - Dictionary Entry Format


Entries are added to  the  appropriate  hash  chain  by  the
INSERT  subroutine.   Entries  are  added to the head of the
chain, saving the time and effort of searching to the end of
the  chain.  This scheme means that the last item entered on
a chain is the first item seen by a search.  The  format  of
the entry is given by:

Word 0: xwd flags,nextlk
Word 1: 5 bit representation
Word 2: 5 bit representation
        ...

There are precisely 1+ceiling(WL/7) machine words  used  for
each  dictionary  entry.   WL  is the length of the entry in
characters.  Nextlk is the pointer to the next entry in  the
list,  or  zero  if this is the last in the chain.  The left
side contains flags;  bits  13-17  specify  the  incremental
dictionary   number   (0  for  main,  1-30  for  incremental
dictionaries, and 31 for  misspellings).   One  can  imagine
that  bits  5-12 could be used to store semantic information
about the entry.  The unused bytes in the last  word  of  an
entry  must be zero, since they are used to stop the routine
that converts the five bit to 7 bit.  

Misspellings are entered in the  main  dictionary  with  the

                                                     Page 12


incremental  dictionary number set to decimal 31.  If a word
is marked as a misspelling then the word preceding the flags
and link contains a pointer to the flag and link word of the
correction.  Misspellings may be  deleted  before  the  full
dictionary  is  dumped,  or whenever SPELL asks about saving
the core image.  The space obtained by deleting misspellings
is not reutilized at present, although that feature could be
added without great difficulty.  



5.3  Spelling Correction Heuristics


There are four kinds of errors that the program attempts  to
correct:

     1. one wrong letter.
     2. one missing letter.
     3. one extra letter.
     4. two transposed letters.

For a wrong letter in the third or subsequent character, all
words  that are candidates must exist on the same chain that
the suspect word hashes to.  Hence, each entry on that chain
is  inspected  to  determine if the suspect differs from the
entry by exactly one character.  This is accomplished by  an
exclusive-or  (XOR)  between the suspect and the dictionary.
Then a JFFO instruction selects the first non zero  byte  in
the  XOR.  This byte is zeroed and if the result is all zero
then the dictionary word differs from the  suspect  in  only
one letter.  All such words are listed at CANDBF, where they
can be inspected later.  

For a wrong letter in the first  or  second  character,  the
program  tries  varying  the  second  letter  through all 26
possible values, searching for an exact match.  Then all  26
possible values of the first letter are tried, after setting
the second letter to its original value.  This means that 52
more chains are searched for possible matches.  

To  correct  transposed   letters,   all   combinations   of
transposed  letters  are  tried.   There  are only WL-1 such
combinations, so it is fairly cheap to do that.  

To correct one extra letter, wl copies of the word are made,
each  with  some letter removed.  Each of these is looked up
in the dictionary.  This takes WL searches.  

To correct one missing letter, WL+1 copies of the  word  are
made, each time inserting a null character in a new position
in the suspect.  The null character is  never  part  of  any
word,  so the suspect word augmented by an embedded null can
be thought of as a word with one  wrong  letter  (the  null)
then  the  algorithm  for matching one wrong letter is used.

                                                     Page 13


If the first character is omitted,  all  26  possible  first
characters  are  tried.   Also,  26 more words are formed by
varying the second character in case that had been  omitted.




6.0  ASSEMBLY & LOADING INSTRUCTIONS


There are three assembly time switches,  TENEX,  STANSW  and
SANSW.  If STANSW is set then there are SIXBIT ppn's and the
SWAP UUO;  if STANSW is zero, then normally there are  octal
ppn's  except  if  SANSW  is  set,  in which case, there are
decimal ppn's.  If the TENEX switch is set it overrides  the
others  and  TENEX  style  file names are used.  Compile the
program using MACRO or FAIL  and  load  it.   The  non-TENEX
versions   will  produce  a  sharable  high  segment  and  a
non-sharable low segment.  When you start  the  program  the
first time after loading, it will demand a dictionary.  This
dictionary will be read in as dictionary zero, which is  the
built-in  dictionary  for  future  use.   In  the  non-TENEX
versions dictionary zero is  read  into  the  sharable  high
segment.   Use  the  largest dictionary you have which meets
your  storage   limitations.    The   file   SPELLD.ALL   is
recommended.

When dictionary zero is loaded, SPELL will ask "Do you  wish
to  save  this  core  image?"  Answer  "Yes"  and  save  the
resulting core image.  On non-TENEX systems, do an SSAVE  to
produce a sharable high segment containing dictionary zero.

There are various other assembly switches, but they  default
to  reasonable  settings,  and  you meddle with them at your
peril.  



7.0  POSSIBLE EXPANSIONS


No program that is still in use is complete.  The  following
paragraphs  include  suggestions  of possible future work in
this area.  

For non-paged systems, the main dictionary (0) should be  in
the  sharable  segment  and  the  private  additions  in the
non-sharable segment.  This would require adjusting the hash
chains to point across the gap.

The dictionary should be expanded to  include  all  suffixes
for every word.  There is a feature that strips suffixes for
the  purpose  of  finding  the  stem  of  the  word  in  the
dictionary,   but   this   heuristic   is  error  prone  and
incompatible with later attempts to correct the word.

                                                     Page 14


If semantic information were included in the dictionary,  it
could help guide the selection of a correction.  

Either of these would require a major restructuring  of  the
program,  since  the dictionary would no longer fit in core.
Probably, the dictionary should be kept on the disk, with  a
data structure similar to the one used in core, but arranged
to keep each hash chain in the minimum number of disk pages,
so   that  searches  through  the  dictionary  can  be  made
efficient.