Trailing-Edge
-
PDP-10 Archives
-
decuslib10-08
-
43,50506/keywrd.rnd
There are 2 other files named keywrd.rnd in the archive. Click here to see a list.
.title ^^KEYWRD, Word and Phrase Recognition Logic Generator
.nofill
KK KK EEEEEEEE YY YY WW WW RRRRRR DDDDD
KK KK EE YY YY WW WW RR RR DD DD
KK KK EE YYYY WW WW WW RR RR DD DD
KKKKK EEEEE YY WW WWWW WW RRRRRR DD DD
KKK KK EE YY WWWW WWWW RR RR DD DD
KK KK EE YY WWW WWW RR RR DD DD
KK KK EEEEEEEE YY WW WW RR RR DDDDD
.skip
10 August 1980
.fill
.skip
.center
KEYWRD, Word and Phrase Recognition Logic Generator
.center
------##---- --- ------ ----------- ----- ---------
.skip
The KEYWRD program produces a sequence of tests which can identify
the leading word or the leading phrase formed of a fixed sequence of
words in a line of text without ever having to test a character
which has already been identified. Such a leading word or phrase,
which will be referred to as a command in this document, does not
need to include any characters to the right of the first of the
characters which uniquely identify the command. The word or each of
the words in a phrase can be abbreviated by truncation, leaving at
least the left character in each word of a phrase if additional
words or their abbreviations appear to the right. Spaces are allowed
between the words in a phrase, but are not required. A single
sequence of tests is used to recognize the initial portions of
commands which start with a common series of characters, then the
unique portion of each command is identified by a separate sequence
of tests. After the unique portion of each command has been
identified by the separate sequence of tests, then a single sequence
of tests is similarly used to recognize the final portions of
commands which end with a common series of characters.
.skip
The KEYWRD program reads a single input file and produces an output
listing file and an output FORTRAN language file containing DATA
statements which represent the sequence of tests. The sequence of
tests could, of course, easily be ouput in a form other than FORTRAN
source code. These DATA statements must be merged into the FORTRAN
program by which these tests are to be used. The KEYWRD program is
written in FORTRAN, and is machine independent except for the short
subroutine which asks for the names of the files and then opens
these files.
.skip
The number of tests which are necessary to describe an entire
glossary is usually less than proportional to the number of
characters in the glossary, although it is possible to design a
small glossary for which each character requires a separate test.
The tests are linked together so that when a word or phrase is
identified using these tests, each character in the word or phrase
is tested only against those characters which could logically follow
the characters which have already been identified in the word or
phrase. The description of each test is packed into 2 array
locations and consists of 5 items: the serial location in the
alphabet of the letter to be matched, an indication of whether
spaces can appear before the letter, the location in the arrays of
the description of the next test to be applied if the current match
fails, the value to be associated with the command if the current
match succeeds, and the location in the arrays of the description of
the next test to be applied if the current match succeeds.
.skip
The KEYWRD program required 247 seconds on a DECsystem1091 computer
to produce 395 tests which could be used to identify 45 single words
and 84 phrases representing 85 different commands. The processing of
this glossary, which consisted of 249 words and 1094 letters,
required 5582 locations in each of 6 arrays. Each of these arrays is
dimensioned at 3000 in the distributed version of the KEYWRD
program. The sizes of these arrays can be changed if the value of
the variable named LMTPNT is also changed to the new number of
locations in each array. If the arrays are too small for the
glossary being processed, then the KEYWRD program will inform the
user of the required array dimensions. The number of locations which
must be available in each array for the processing of a glossary is
equal to no more than the sum of the lengths of the single word
commands and of the products of the lengths of each of the words in
each of the phrases times the lengths of all of the words appearing
to the left in the same phrases. The actual number of locations
needed will be smaller if some commands start with the same
character sequences as others. A glossary consisting of the single
word SPACING and the phrase NO FLAGS UPPER CASE would require
7+2+(2*5)+(2*5*5)+(2*5*5*4) or 269 locations in each array. The
execution time for processing a glossary is proportional to the
square of the maximum number of locations which are used in each
array. If the execution time and the size of the KEYWRD program are
to be minimized, then the glossary should contain relatively few
phrases, and any phrases which must be included should be kept
short. Of course, the program which uses the tests produced by the
KEYWRD program will identify phrases nearly as efficiently as single
words. On the DECsystem1091 computer, the excecution time for the
KEYWRD program is given by
.skip
seconds = 8E-6*(number of locations used in single array**2)
.skip
Each line in the input file which does not start with one of the
reserved characters *, /, ( or ), which are described later,
specifies a single word or phrase preceded by the nonzero value by
which the word or phrase is to be identified. The number should not
duplicate the number to be associated with any other command unless
these commands are synonyms or unless some of these commands are
abbreviations which would otherwise be ambiguous. The number can be
preceded by one or more spaces, but leading spaces are not required.
The number cannot contain any characters other than the digits 0
through 9 and a leading minus sign if the value is negative or an
optional leading plus sign if the value is positive. Spaces and/or a
single comma can appear between the leading number and the following
word or phrase, but are not required. The words within a phrase must
be separated by at least 1 space. Extra spaces are ignored. Words
and phrases can be constructed from any characters, but upper and
lower case alphabetic letters are considered to be equivalent. The
sequence of tests produced by the KEYWRD program is independent of
the order in which the commands are defined in the input file, but
with the exception that the numerical values assigned to
nonalphabetic characters are 26 plus the order in which these
unexpected characters are first encountered. The input file is
terminated by a line containing a number which is not followed on
the same line by any word or phrase. The following is an input file
which defines 4 very similar phrases.
.nofill
.skip
.test page 5
10 NEXT RUNE
20 NEAT RULE
30 NEXT RULE
40 NEAT RUNE
0
.skip
.fill
The listing file produced by the KEYWRD program summarizes each of
the words and phrases and phrase abbreviations which can be
recognized. The following listing file would be produced if the
above input file is processed.
.nofill
.skip
0 NE RULE
NE RUNE
N RULE
N RUNE
10 NEXT RU(N)E
NEX RU(N)E
20 NEAT RU(L)E
NEA RU(L)E
30 NEXT RU(L)E
NEX RU(L)E
40 NEAT RU(N)E
NEA RU(N)E
.fill
.skip
The numbers at the left in the listing file are the values which are
to be associated with the words and phrases which are shown to the
right. The unique portion of each word or phrase is enclosed in
parentheses. An abbreviation of a command is unambiguous if it
includes the first of the letters which are enclosed in parentheses.
Since the final letter E in each of the above phrases is not
necessary for the recognition of the phrases, the test sequences can
be merged at the end as well as at the start, so that the same test
for the final letter E can be included in all of the test sequences.
The tests for the next to the last character, the letter N, cannot
be merged in the 2 phrases which contain the word RUNE, and the
tests for the letter L cannot be merged in the 2 phrases which
contain the word RULE, since no tests for unique characters would
then remain in the sequences of tests which identify these phrases.
The abbreviations of all but the rightmost words in each phrase are
included in the listing file since the manner in which the previous
words in a phrase are abbreviated can change which of the following
characters are unique. Abbreviations such as NE RULE, NE RUNE, N
RULE and N RUNE, in the above example, contain no unique characters
and so are ambiguous. Ambiguous abbreviations are shown in the
listing file with the associated value being zero.
.skip
The FORTRAN statement file produced by the KEYWRD program contains a
description of the input file, a description of the tests, the DATA
statements defining 2 arrays, MCHPNT and NOTPNT, which specify the
tests, and a DATA statement specifying the sizes of these arrays,
KNTPNT. The names which are assigned to all of the arrays and
variables which are represented in the DATA statements in the
FORTRAN statement file can be specified by a line starting with an
initial slash as described later. The following FORTRAN statement
file would be produced if the above input file is processed. In the
comment lines, a minus sign to the left of a letter indicates that
the letter appears at the start of a word.
.skip
.nofill
C 10 NEXT RUNE
C 20 NEAT RULE
C 30 NEXT RULE
C 40 NEAT RUNE
C
C FINAL STORAGE USED= 19, MOST USED= 62, LIMIT= 3000
C
C CHECKSUMS 990, 32,2894,2575, 866
C
C COUNT 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
C COMMAND 0 0 0 0 0 0 20 40 0 0 0 0 30 10 0
C LETTER -N E A T -R U L N X T -R U L N -R
C SUCCESS 2 3 4 5 6 7 19 19 10 11 12 13 19 19 16
C FAILURE 0 15 9 5 0 0 8 0 15 11 0 0 14 0 0
C
C COUNT 16 17 18 19
C COMMAND 0 0 0 0
C LETTER U L N E
C SUCCESS 17 19 19 0
C FAILURE 0 18 0 0
C
C DIMENSION MCHPNT(19)
DIMENSION MCHPN1(19)
EQUIVALENCE (MCHPN1(1),MCHPNT(1))
C DIMENSION NOTPNT(19)
DIMENSION NOTPN1(19)
EQUIVALENCE (NOTPN1(1),NOTPNT(1))
DATA KNTPNT,KNTXTR/ 19, 0/
DATA MCHPN1/2,3,4,5,6,7,419,819,10,11,12,13,619,219,
116,17,19,19,0/
DATA NOTPN1/-280,115,29,405,-360,420,248,280,495,411,
1-360,420,254,280,-360,420,258,280,100/
.skip
.fill
If the words and phrases are constructed from characters other than
spaces and the alphabetic letters A through Z, then an additional
DATA statement is generated which specifies a third array, LTRXTR,
containing the unexpected characters. KNTXTR, which is specified by
one of the DATA statements which are always generated, is the size
of the LTRXTR array. If the words and phrases are constructed only
from spaces and the alphabetic letters A through Z, then KNTXTR has
the value zero and a DATA statement defining the LTRXTR array is not
generated.
.skip
The first location in the NOTPNT array describes the first test
which is to be performed. The absolute value of each entry in the
NOTPNT array is the sum of the location within the alphabet of the
letter to be matched times (KNTPNT+1), plus the subscript of the
location in the NOTPNT array which describes the next match which is
to be attempted if the current match is a failure. The subscript of
an array location is its serial position within the array, counting
the first value in the array as being at subscript 1, the second
value as being at subscript 2, and so on. If the entry in the NOTPNT
array is negative, then the character starts a word and any spaces
in the input line can be skipped. If the location within the
alphabet is greater than 26, then this minus 26 is the location
within the LTRXTR array of the character to be matched. If the match
succeeds and the value to be associated with the command is positive
or if the value is zero indicating that the match does not uniquely
identify a particular command, then the parallel location in the
MCHPNT array contains the sum of the value of the command times
(KNTPNT+1), plus the subscript of the location in the NOTPNT array
which describes the next test. If the match succeeds and the value
to be associated with the command is negative, then the parallel
location in the MCHPNT array instead contains the value of the
command times (KNTPNT+1), minus the subscript of the location in the
NOTPNT array which describes the next test. If the subscript of the
location in the NOTPNT array which describes the next test is
indicated to be zero, either by the MCHPNT array if the current
match is a success or by the NOTPNT array if the current match is a
failure, then no additional test remains to be performed, and the
command is identified by the last nonzero value encountered in the
MCHPNT array for a match which succeeded.
.skip
The sequence of tests which the KEYWRD program produces to identify
the commands in the above input file is diagrammed below. In this
diagram, the horizontal lines formed of minus signs indicate the
next tests to be performed if the matches succeed, and the diagonal
lines formed of slashes indicate the next tests to be performed if
the matches fail. The numbers above the letters are the locations in
the NOTPNT and MCHPNT arrays which describe the tests. The numbers
below the letters are the values of the associated commands. The
word "space" indicates a location at which an optional space
character can appear.
.nofill
.test page 22
18 19
N---E
/0 /0
15 16 17/ /
/---/-----------space---R---U---L---/
/ / 0 0 0 /
/ / /
/ / 14 /
/ / N---/
/ / /10 /
/ / 11 12 13/ /
/ / /---/---space---R---U---L---/
/ 9/ 10/ / 0 0 30 /
/ X---T---/ /
/ /0 0 8 /
/ / N---/
/ / /40 /
/ / 5 6 7/ /
/ / /---/---space---R---U---L---/
1 2/ 3/ 4/ / 0 0 20
N---E---A---T---/
0 0 0 0
.fill
.skip
The following input file defines 4 commands which start with the
same first 3 letters.
.skip
.nofill
.test page 5
10 FOGHORN
20 FOG HORN
30 FOG
40 FOGGY
0
.skip
.fill
The sequence of tests which the KEYWRD program produces to identify
the commands in the above input file is diagrammed below. This
sequence of tests is independent of the order in which the commands
appear in the input file. If a continuation of the current word
starts with the same letter as the next word of a phrase, then the
continuation is always sought first and the space which must precede
the next word is sought only if the test for the continuation fails.
Thus, FOGHORN without a space before the letter H will always be
identified as command 10 and FOG HORN with a space before the letter
H will always be identified as command 20. FHORN and FOHORN are not
abbreviations of the single word FOGHORN, and so are identified as
command 20 whether or not a space appears before the letter H.
.skip
.nofill
.test page 11
7 8 9 10
/---/---/---space---H---O---R---N
/ / / 20 /0 0 0
/ / / /
/ / 6/ /
/ / H---------------/
/ / /10
/ / /
1 2/ 3/ 4/ 5
F---O---G---G---Y
0 0 30 40 40
.fill
.skip
Undesired abbreviations can be made unidentifiable by including them
in the input file with zeroes as the associated values. If the
glossary defined by the input file shown above is to include the
abbreviations F HORN and FO HORN when a space or spaces appear
between the abbreviations of the words of the phrase, but FHORN and
FOHORN are not to be allowed as abbreviations, then the input file
might be changed to the following.
.skip
.nofill
.test page 7
10 FOGHORN
20 FOG HORN
30 FOG
40 FOGGY
0 FHORN
0 FOHORN
0
.skip
.fill
The following listing file would be produced if the above input file
is processed.
.skip
.nofill
0 FHORN
FOHORN
10 FOG(H)ORN
20 FOG (H)ORN
FO (H)ORN
F (H)ORN
30 FO(G)
40 FOG(GY)
.skip
.fill
The sequence of tests which the KEYWRD program produces to identify
the commands in the above input file is diagrammed below. The only
differences between this sequence of tests and that shown for the
previous example is that this sequence of tests does not assign a
nonzero value to the command when the letter H is found immediately
after an initial letter F or after the initial letters FO.
.skip
.nofill
.test page 19
9 10 11 12
/---/---/---space---H---O---R---N
/ / / 20 /0 0 0
/ / / /
/ 8/ /
/ H-------------------/
/ /0 /
/ / / /
/ / 7/ /
/ / H---------------/
/ / /10 /
/ / / /
3/ 4/ 5/ 6 /
O---G---G---Y /
/0 30 40 40 /
/ /
1 2/ /
F---H-----------------------/
0 0
.skip
.fill
If, instead, the glossary is to include the abbreviations FHORN and
FOHORN, but F HORN and FO HORN are not be be allowed as
abbreviations when a space or spaces appear between the words, then
the input file might be changed to the following.
.skip
.nofill
.test page 8
10 FOGHORN
20 FOG HORN
30 FOG
40 FOGGY
20 FHORN
20 FOHORN
0 FO HORN
0
.skip
.fill
No harm would result from explicitly assigning the value zero to the
phrase F HORN in the above input file, but this is not necessary
since this phrase is an abbreviation of the phrase FO HORN. The
following listing file would be produced if the above input file is
processed.
.skip
.nofill
0 FO HORN
F HORN
10 FOG(H)ORN
20 F(H)ORN
FOG (H)ORN
FO(H)ORN
30 FO(G)
40 FOG(GY)
.skip
.fill
The sequence of tests which the KEYWRD program produces to identify
the commands in the above input file is diagrammed below.
.skip
.nofill
.test page 23
10 11 12 13
/---/-------space---H---O---R---N
/ / 0 /0 0 0
/ / /
/ 9/ /
/ H-------------------/
/ /20 /
/ / /
/ / 8 /
/ / /---space---H---/
/ / / 20 /
/ / / /
/ / 7/ /
/ / H---------------/
/ / /10 /
/ / / /
3/ 4/ 5/ 6 /
O---G---G---Y /
/0 30 40 40 /
/ /
1 2/ /
F---H-----------------------/
0 20
.skip
.fill
When 2 or more commands start with the same sequence of characters
and the input file defines a shorter command which is itself an
abbreviation of the nonunique sequence, then the shorter command can
be followed by the rest of the nonunique characters and still be
identified as the shorter command. For example, if the above input
files defined FOG HIDDEN as having the value 50, then then both FOG
and FOG H would be identified as command 30. This could be prevented
by assigning FOG H and all other undesired abbreviations of this
sort some single nonzero value which is not used for any valid
command and which will indicate that these undesired commands are
illegal if they are ever encountered.
.skip
If a short command which is explicitly declared in the input file
would, if not declared, serve as an abbreviation of a longer command
having a different identifying value, then the abbreviations of the
shorter command will themselves be ambiguous. If the abbreviations
of the shorter command are to be recognized, then the shortest
possible abbreviations of the word, or of the rightmost words in the
phrases of each possible length, must be explicitly declared. For
example, if the input file contains only the following entries
.nofill
.skip
1 FOOT NOTE
2 FOOT NOTE GAP
3 FOOT NOTE HEADER
.skip
.fill
then abbreviations such as F, FO, FOO, FOOT, FN, FON and FNO will be
unrecognizable. Any abbreviation of the command FOOT NOTE will be
recognizable only if it contains the full spelling of the word NOTE.
This could be prevented by declaring F and FOOT N (the shortest
abbreviations of the rightmost words in the phrases of each possible
length) to have the value 1. If F already was declared as yet
another command, then it would be FO and FOOT N which would be
declared to have the value 1.
.skip 2
.fill
.test page 6
.center
Handling of Input Lines Starting with Special Characters
.center
-------- -- ----- ----- -------- ---- ------- ----------
.skip
Lines in the input file which start with a slash, an asterisk, a
left parenthesis or a right parenthesis are treated specially. These
initial characters cause the following actions to be performed.
.skip
.left margin 5
.test page 4
.indent -3
/##An initial slash indicates that the line specifies the names of
the arrays and variables which are to be represented in the DATA
statements which are to be written into the output FORTRAN statement
file.
.skip
.indent -3
*##An initial asterisk indicates that the line specifies 5 numbers
which characterize the sequence of tests produced by the KEYWRD
program. Such a line would appear in the input file only when the
result is already known and the operation of the KEYWRD program is
being verified.
.skip
.indent -3
(##An initial left parenthesis indicates that the rest of the
current line is to be copied into the output FORTRAN statement file
unchanged.
.skip
.indent -3
)##An initial right parenthesis indicates that the DATA statements
which represent the sequence of tests are to be written into the
output FORTRAN statement file.
.skip
.left margin 0
If a line starts with an slash, then the next 5 groups of printing
characters on the line are used as the names of the 3 arrays and 2
variables which are represented in the DATA statements when the next
line is encountered which starts with a left parenthesis or which
contains only a number. These groups of printing characters can be
separated by spaces and/or by single commas. The names of the 3
arrays must each differ from the others in their first 3 characters.
.left margin 5
.skip
.indent -3
1.#The first group of up to 6 characters is used as the name of the
array which specifies the next operation if a match fails. This name
is NOTPNT if a line starting with an slash does not appear in the
input file.
.skip
.indent -3
2.#The second group of up to 6 characters is used as the name of the
array which specifies the next operation if a match succeeds. This
name is MCHPNT if a line starting with an slash does not appear in
the input file.
.skip
.indent -3
3.#The third group of up to 6 characters is used as the name of the
nondimensioned variable which contains the number of tests specified
by the NOTPNT and MCHPNT arrays. This name is KNTPNT if a line
starting with an slash does not appear in the input file.
.skip
.indent -3
4.#The fourth group of up to 6 characters is used as the name of the
array which specifies all characters, other than spaces and the
letters A through Z, appearing in the commands. This name is LTRXTR
if a line starting with an slash does not appear in the input file.
.skip
.indent -3
5.#The fifth group of up to 6 characters is used as the name of the
nondimensioned variable which contains the number of characters in
the LTRXTR array. This name is KNTXTR if a line starting with an
slash does not appear in the input file.
.left margin 0
.skip
If a line starts with an asterisk, then the rest of the line
predicts the values of 5 checksums which are to be calculated from
the glossary which is being defined. These checksums can be
separated by spaces and/or by single commas. The user of the KEYWRD
program will be informed if the predicted values and the calculated
values do not agree. The predicted checksums are used in
standardized versions of the input file which are processed to
determine if the KEYWRD program still produces the same results
after the KEYWRD program itself has been changed. A sixth group of
up to 6 characters on the line beginning with an asterisk can define
a label which is to be displayed to the user.
.skip
The characters appearing to the right of an initial left parenthesis
are copied directly into the FORTRAN statement file. The appearance
of a line starting with a left parenthesis does not interrupt the
specification of the glossary of keywords, and, in particular, does
not cause the DATA statements to be generated. Lines with an initial
left parenthesis can be used to introduce FORTRAN statements and
FORTRAN comment lines into the FORTRAN statement file.
.skip
Lines which begin with a right parenthesis cause the DATA statements
representing the sequence of tests to be written into the FORTRAN
statement file, but do not indicate that the end of the input file
has been reached. All characters which follow the initial right
parenthesis on the same line are ignored. If the input file is used
for testing the KEYWRD program, then additional glossaries might be
defined on the lines which follow that which begins with the right
parenthesis, but, usually, only lines containing information to be
copied directly to the FORTRAN statement file would appear before
the final line which contains only a number. The final line will
cause the DATA statements to be written out if a line with an
initial right parenthesis has not already done so.
.skip
For example, the following input file
.skip
.nofill
.test page 11
( BLOCK DATA
( COMMON/NUMKEY/NOTPNT(300),MCHPNT(300),KNTPNT,KNTXTR
( COMMON/LTRKEY/LTRXTR(20)
10 FOGHORN
20 FOG HORN
(C COMMENT LINE TO BE COPIED DIRECTLY TO OUTPUT
30 FOG
40 FOGGY
)GENERATE THE DATA STATEMENTS
( END
0
.skip
.fill
would, when processed by the KEYWRD program, produce the following
FORTRAN statement file.
.skip
.nofill
BLOCK DATA
COMMON/NUMKEY/NOTPNT(300),MCHPNT(300),KNTPNT,KNTXTR
COMMON/LTRKEY/LTRXTR(20)
C 10 FOGHORN
C 20 FOG HORN
C COMMENT LINE TO BE COPIED DIRECTLY TO OUTPUT
C 30 FOG
C 40 FOGGY
C
C FINAL STORAGE USED= 10, MOST USED= 24, LIMIT= 3000
C
C CHECKSUMS 650, 8, 736, 306, 101
C
C COUNT 1 2 3 4 5 6 7 8 9 10
C COMMAND 0 0 30 40 40 10 20 0 0 0
C LETTER -F O G G Y H -H O R N
C SUCCESS 2 3 4 5 0 8 8 9 10 0
C FAILURE 0 7 7 6 0 7 0 0 0 0
C
C DIMENSION MCHPNT(10)
DIMENSION MCHPN1(10)
EQUIVALENCE (MCHPN1(1),MCHPNT(1))
C DIMENSION NOTPNT(10)
DIMENSION NOTPN1(10)
EQUIVALENCE (NOTPN1(1),NOTPNT(1))
DATA KNTPNT,KNTXTR/ 10, 0/
DATA MCHPN1/2,3,334,445,440,118,228,9,10,0/
DATA NOTPN1/-66,172,84,83,275,95,-88,165,198,154/
END
.skip 2
.fill
.test page 6
.center
A Sample Program Using the Output from the KEYWRD Program
.center
- ------ ------- ----- --- ------ ---- --- ------ -------
.skip
.fill
The following FORTRAN program demonstrates the manner in which the
variables and the arrays generated by the KEYWRD program are used.
This sample program accepts a short line of text typed by the user,
then informs the user what this line of text contains. The user can
type more than a single command on each line if these commands are
separated by semicolons.
.skip
.nofill
COMMON/NUMKEY/NOTPNT(300),MCHPNT(300),KNTPNT,KNTXTR
COMMON/LTRKEY/LTRXTR(20)
DIMENSION LTRBFR(30),LTRABC(26),LWRABC(26)
C NUMBER OF UNIT FROM WHICH TEXT IS READ
DATA ITTY/5/
C DIMENSION OF LTRBFR ARRAY, NUMBER OF LETTERS TO READ
DATA LMTBFR/30/
C UPPER CASE LETTERS A THROUGH Z
DATA LTRABC/1HA,1HB,1HC,1HD,1HE,1HF,1HG,1HH,1HI,1HJ,
11HK,1HL,1HM,1HN,1HO,1HP,1HQ,1HR,1HS,1HT,1HU,1HV,1HW,
21HX,1HY,1HZ/
C LOWER CASE LETTERS A THROUGH Z
DATA LWRABC/1Ha,1Hb,1Hc,1Hd,1He,1Hf,1Hg,1Hh,1Hi,1Hj,
11Hk,1Hl,1Hm,1Hn,1Ho,1Hp,1Hq,1Hr,1Hs,1Ht,1Hu,1Hv,1Hw,
21Hx,1Hy,1Hz/
C THE SPACE CHARACTER FOR FINDING SPACES IN TEXT
C ASTERISK FOR MARKING UNKNOWN LETTERS IN MESSAGE
DATA LTRSPC,LTRSEM/1H ,1H;/
C
C CALCULATE FACTOR FOR EXTRACTING BYTES FROM ARRAYS
IOFFST=KNTPNT+1
C
C GET NEXT LINE OF TEXT TO BE INTERPRETED
1001 WRITE(ITTY,1002)
1002 FORMAT(2H *,$)
READ(ITTY,1003)LTRBFR
1003 FORMAT(30A1)
C
C SET UP POINTERS WITHIN THE INPUT LINE BEING EVALUATED
LOCBFR=0
1004 MINPRT=LOCBFR+1
C
C SET UP POINTERS WITHIN THE TEST SEQUENCE
KMDNOW=0
LOCPNT=1
C
C GET NEXT CHARACTER TO BE TESTED
1005 MAXPRT=LOCBFR
LOCBFR=LOCBFR+1
IF(LOCBFR.GT.LMTBFR)GO TO 1014
C
C ATTEMPT TO IDENTIFY THE CHARACTER
1006 IF(LOCPNT.LE.0)GO TO 1014
IVALUE=NOTPNT(LOCPNT)
LOCABC=IVALUE
IF(IVALUE.LT.0)LOCABC=-LOCABC
LOCABC=LOCABC/IOFFST
1007 IF(LTRBFR(LOCBFR).EQ.LTRSEM)GO TO 1014
IF(LOCABC.LE.26)GO TO 1008
IF(LTRBFR(LOCBFR).EQ.LTRXTR(LOCABC-26))GO TO 1012
GO TO 1009
1008 IF(LTRBFR(LOCBFR).EQ.LTRABC(LOCABC))GO TO 1012
IF(LTRBFR(LOCBFR).EQ.LWRABC(LOCABC))GO TO 1012
C
C LETTERS DID NOT MATCH
1009 IF(IVALUE.GE.0)GO TO 1011
IVALUE=-IVALUE
C
C CHECK FOR SPACES BEFORE NEXT WORD
1010 IF(LTRBFR(LOCBFR).NE.LTRSPC)GO TO 1007
LOCBFR=LOCBFR+1
IF(LOCPNT.EQ.1)MINPRT=LOCBFR
IF(LOCBFR.LE.LMTBFR)GO TO 1010
GO TO 1014
C
C GET NEXT LETTER TO BE TESTED IF FAILURE
1011 LOCPNT=IVALUE-(IOFFST*LOCABC)
GO TO 1006
C
C LETTERS MATCHED
1012 IVALUE=MCHPNT(LOCPNT)
IF(IVALUE.GE.0)GO TO 1013
IVALUE=-IVALUE
KMDNEW=IVALUE/IOFFST
LOCPNT=IVALUE-(IOFFST*KMDNEW)
IF(KMDNEW.NE.0)KMDNOW=-KMDNEW
GO TO 1005
1013 KMDNEW=IVALUE/IOFFST
LOCPNT=IVALUE-(IOFFST*KMDNEW)
IF(KMDNEW.NE.0)KMDNOW=KMDNEW
GO TO 1005
C
C FIND RIGHTMOST PRINTING CHARACTER FOR USE IN MESSAGES
1014 MAXTST=LOCBFR
MAXSHO=MAXPRT
1015 IF(LOCBFR.GT.LMTBFR)GO TO 1017
IF(LTRBFR(LOCBFR).EQ.LTRSPC)GO TO 1016
IF(LTRBFR(LOCBFR).EQ.LTRSEM)GO TO 1017
MAXSHO=LOCBFR
1016 LOCBFR=LOCBFR+1
GO TO 1015
C
C CHECK IF KNOWN AND NOT FOLLOWED DIRECTLY BY LETTER
1017 IF(LOCPNT.EQ.1)GO TO 1024
IF(KMDNOW.EQ.0)GO TO 1022
IF(MAXTST.GT.MAXSHO)GO TO 1019
IF(MAXTST.GT.(MAXPRT+1))GO TO 1019
LTRNOW=LTRBFR(MAXTST)
DO 1018 I=1,26
IF(LTRNOW.EQ.LTRABC(I))GO TO 1022
IF(LTRNOW.EQ.LWRABC(I))GO TO 1022
1018 CONTINUE
C
C REPORT WHAT WAS FOUND
1019 WRITE(ITTY,1020)KMDNOW,(LTRBFR(I),I=MINPRT,MAXPRT)
1020 FORMAT(8H COMMAND,1I3,2H: ,31A1)
IF(MAXTST.GT.MAXSHO)GO TO 1026
WRITE(ITTY,1021)(LTRBFR(I),I=MAXTST,MAXSHO)
1021 FORMAT(13H ARGUMENTS: ,31A1)
GO TO 1026
1022 WRITE(ITTY,1023)(LTRBFR(I),I=MINPRT,MAXSHO)
1023 FORMAT(13H UNKNOWN: ,31A1)
GO TO 1026
1024 WRITE(ITTY,1025)
1025 FORMAT(11H MISSING)
C
C GET NEXT STATEMENT ON SAME LINE
1026 IF(LOCBFR.LE.LMTBFR)GO TO 1004
GO TO 1001
END
.skip
.fill
The following is a typical dialog between the user and the program
which is listed above. The glossary which is used consists of the
words FOGHORN, FOG HORN, FOG and FOGGY.
.skip
.nofill
*
MISSING
*;
MISSING
MISSING
*FOH;FOGH;FOG H
COMMAND 20: FOH
COMMAND 10: FOGH
COMMAND 20: FOG H
*FOG;FOGHORN;FOG HORN;FOGGY
COMMAND 30: FOG
COMMAND 10: FOGHORN
COMMAND 20: FOG HORN
COMMAND 40: FOGGY
*FOG123;FOG ABC
COMMAND 30: FOG
ARGUMENTS: 123
COMMAND 30: FOG
ARGUMENTS: ABC
*ABC;FO;FOGABC
UNKNOWN: ABC
UNKNOWN: FO
UNKNOWN: FOGABC
.fill
.skip 2
.test page 7
.center
A Description of the Algorithm Used by the KEYWRD Program
.center
- ----------- -- --- --------- ---- -- --- ------ -------
.skip
The first printing character in each new line read from the input
file is checked to determine if it is one of the reserved
characters. If the line instead begins with a number, then this
number is evaluated and the spaces and/or a single comma following
the number are discarded. A decision tree is then constructed which
describes the paths by which the characters in the word or in the
phrase could have been recognized. Each node in the decision tree
specifies the node from which it can be reached and whether it is
reached upon the success or upon the failure, never both, of the
test described by that node. The first character of the second and
of each of the subsequent words in a phrase is included in separate
tests which are reached on failures to match each appearance of the
second and of each of the subsequent characters of the previous
word, and is also included in separate tests which are reached upon
successful matches of each appearance of the final character of the
previous word. If the command consists of a single word, then the
decision tree is a simple chain of tests, each node indicating that
it is reached only if the test described by the previous node is
successful. If the command consists of more than a single word, then
each character in the second word appears in as many tests as there
are characters in the first word. Each character in a third word
would appear in a number of tests equal to the number of characters
in the first word times the number of characters in the second, and
so on. The decision tree which can be used to identify the new
command is appended to the arrays which are used to store the
decision trees already obtained for the previously read commands.
.skip
For example, the input file
.skip
.nofill
.test page 3
10 WHO ARE YOU
20 WHO AM I
0
.fill
.skip
would be converted into the 2 trees which are shown below after the
second line of the input file is read.
.skip
.nofill
.test page 27
17 26 35
Y--O--U
/10 10 10
/20 29 38
/ Y--O--U
/ /10 10 10
5 8/11/14 23 32
A--R--E--Y--O--U
/10 10 10 10 10 10
/ 18 27 36
/ Y--O--U
/ /10 10 10
/ /21 30 39
/ / Y--O--U
/ / /10 10 10
/ 6 9/12/15 24 33
/ A--R--E--Y--O--U
/ /10 10 10 10 10 10
/ / 16 25 34
/ / Y--O--U
/ / /10 10 10
/ / /19 28 37
/ / / Y--O--U
/ / / /10 10 10
1 2/ 3/ 4 7/10/13 22 31
W--H--------O--------A--R--E--Y--O--U
10 10 10 10 10 10 10 10 10
.skip
.test page 20
and
.skip
53
I
/20
44 47/50
A--M--I
/20 20 20
/ 54
/ I
/ /20
/ 45 48/51
/ A--M--I
/ /20 20 20
/ / 52
/ / I
/ / /20
40 41/ 42/ 43 46/49
W--H-----O-----A--M--I
20 20 20 20 20 20
.fill
.skip
After a new command has been read and converted to the back pointing
decision tree, nodes in the new tree which are identical to nodes in
the old trees, and which are reached under identical conditions, are
removed and all references to the removed nodes in the new tree are
revised to point to the nodes in the old trees to which the removed
nodes are identical. The portion of the new tree which is stored
above the node which is being removed is shifted downward 1 location
and all references in this portion of the new tree to the node being
removed are changed to instead point back to the node in the old
trees to which the removed node is identical. The decision tree
shown below would represent the input file shown above after the
redundant subphrase WHO A is removed from the second command. There
are then 2 nodes which branch from node 4 on a success, 2 from 5 and
2 from 6.
.skip
.nofill
.test page 45
17 26 35
Y--O--U
/10 10 10
/20 29 38
/ Y--O--U
/ /10 10 10
8/11/14 23 32
:--R--E--Y--O--U
: 10 10 10 10 10
: 47
: I
: /20
5 41/ 44
A--M--I
/0 20 20
/ 18 27 36
/ Y--O--U
/ /10 10 10
/ /21 30 39
/ / Y--O--U
/ / /10 10 10
/ 9/12/15 24 33
/ :--R--E--Y--O--U
/ : 10 10 10 10 10
/ : 48
/ : I
/ : /20
/ 6 42/ 45
/ A--M--I
/ /0 20 20
/ / 16 25 34
/ / Y--O--U
/ / /10 10 10
/ / /19 28 37
/ / / Y--O--U
/ / / /10 10 10
/ / 7/10/13 22 31
/ / :--R--E--Y--O--U
/ / : 10 10 10 10 10
/ / : 46
/ / : I
/ / : /20
1 2/ 3/ 4 40/ 43
W--H--------------O--------------A--M--I
0 0 0 0 20 20
.fill
.skip
The roots of the remaining trees, which will each begin with a
different character, are chained together in a series of failure
transfers when the end of the input file is read. The resulting
single back pointing decision tree must then be converted into a
forward pointing decision tree in which each node indicates the node
which is to follow it in the event of a success or of a failure. A
node in the back pointing decision tree can be reached only upon the
success or the failure of a single node which is stored lower in the
arrays since all linking and relinking has been to nodes in older
trees and since the nodes are still in the same relative order as
when they were constructed. Since nodes can only be reached from
nodes which are stored lower in the arrays, a single pass through
the arrays can be used to convert from the back pointing decision
tree to the forward pointing decision tree. If a node indicates that
it is reached by a failure of a node which represents a letter which
is higher in the alphabet, then the tree must be relinked to place
the new node earlier in the chain of failures than that which is
indicated to be its predecessor. In this reordering, as in all
situations in which characters are compared, characters which appear
at the start of words are considered to be in a second and higher
alphabet. The following forward pointing decision tree would
represent the sample input file shown above.
.skip
.nofill
.test page 36
17 26 35
Y--O--U
/10 10 10
47/20 29 38
I Y--O--U
/20/10 10 10
8/11/14 23 32
R--E--Y--O--U
/10 10 10 10 10
5 41/ 44
A--M--I
/0 20 20
/ 18 27 36
/ Y--O--U
/ /10 10 10
/ 48/21 30 39
/ I Y--O--U
/ /20/10 10 10
/ 9/12/15 24 33
/ R--E--Y--O--U
/ /10 10 10 10 10
/ 6 42/ 45
/ A--M--I
/ /0 20 20
/ / 16 25 34
/ / Y--O--U
/ / /10 10 10
/ / 46/19 28 37
/ / I Y--O--U
/ / /20/10 10 10
/ / 7/10/13 22 31
/ / R--E--Y--O--U
/ / /10 10 10 10 10
1 2/ 3/ 4 40/ 43
W--H-----------O-----------A--M--I
0 0 0 0 20 20
.fill
.skip
Each chain of failures is next purged of identical characters, since
the second appearance of a particular character in a chain of
failures can never be reached. When a duplicate is found, then the
chain of failures must be linked around the node being removed, and
the chain of failures extending from the success transfer from the
node being removed must be merged with the chain of failures
extending from the success transfer from the node being kept,
weaving the 2 chains together to obtain a single chain in which the
characters are tested in alphabetical order.
.skip
Once the tree has been purged of identical characters in the chains
of failures, the nodes which represent the unique characters in all
possible abbreviations of each command must be identified so that
these nodes can be preserved when identical branches of the decision
tree are merged. The nodes which must be preserved are those which
are on the chains of failures extending from the success transfers
from the nodes which are known to be ambiguous by their having been
on roots which were merged.
.skip
Identical nodes which are at the ends of the branches or which
transfer to the same nodes are merged if these identical nodes are
not marked as representing unique characters in different commands.
Since it is not necessary to preserve the relative order of the
storage of the nodes in the arrays, the node being removed is
interchanged with that at the upper end of the arrays and all
references to both nodes are patched. This interchange prevents
having to shift several nodes each time a node must be removed.
Pruning of the identical branches would convert the decision tree
shown above to that which is shown below.
.skip
.nofill
.test page 13
6 11 12
/--/--/--Y--O--U
5/ / / 10 10 10
I / /
/20/ /
7/10/ /
R--E--/
/10 10
4 9/ 8
/--/--/--A--M--I
1 2/ 3/ / 0 20 20
W--H--O--/
0 0 0
.fill
.skip
To make the decision tree easier to diagram, the nodes are
rearranged into the order in which the nodes are encountered when
the decision tree is climbed. The decision tree shown above would
finally be reduced to that which is shown below.
.skip
.nofill
.test page 13
10 11 12
/--/--/--Y--O--U
9/ / / 10 10 10
I / /
/20/ /
7/ 8/ /
R--E--/
/10 10
4 5/ 6
/--/--/--A--M--I
1 2/ 3/ / 0 20 20
W--H--O--/
0 0 0
.skip
.fill
When the optional spaces before each of the words of each of the
phrases are included, the decision tree would instead appear as is
shown below.
.skip
.nofill
.test page 13
10 11 12
/--/--/-space-Y--O--U
9/ / / 10 10 10
/-space-I / /
/ 20/ /
7/ 8/ /
R----------E--/
/10 10
4 5/ 6
/--/--/-space-A--M--space--I
1 2/ 3/ / 0 20 20
W--H--O--/
0 0 0
.fill
.skip
The following listing file would be produced when the above decision
tree is constructed.
.skip
.nofill
10 WHO A(RE YOU)
WHO A(R YOU)
WHO A (YOU)
WH A(RE YOU)
WH A(R YOU)
WH A (YOU)
W A(RE YOU)
W A(R YOU)
W A (YOU)
20 WHO A(M I)
WHO A (I)
WH A(M I)
WH A (I)
W A(M I)
W A (I)
.skip
.fill
.test page 10
The following FORTRAN statement file represents the decision tree
shown above.
.skip
.nofill
C 10 WHO ARE YOU
C 20 WHO AM I
C
C FINAL STORAGE USED= 12, MOST USED= 54, LIMIT= 3000
C
C CHECKSUMS 880, 30,1121, 448, 288
C
C COUNT 1 2 3 4 5 6 7 8 9 10 11 12
C COMMAND 0 0 0 0 20 20 10 10 20 10 10 10
C LETTER -W H O -A M -I R E -I -Y O U
C SUCCESS 2 3 4 5 6 0 8 10 0 11 12 0
C FAILURE 0 4 4 0 7 0 9 10 10 0 0 0
C
C DIMENSION MCHPNT(12)
DIMENSION MCHPN1(12)
EQUIVALENCE (MCHPN1(1),MCHPNT(1))
C DIMENSION NOTPNT(12)
DIMENSION NOTPN1(12)
EQUIVALENCE (NOTPN1(1),NOTPNT(1))
DATA KNTPNT,KNTXTR/ 12, 0/
DATA MCHPN1/2,3,4,5,266,260,138,140,260,141,142,130/
DATA NOTPN1/-299,108,199,-13,176,-117,243,75,-127,
1-325,195,273/
.skip 2
.fill
.test page 8
The KEYWRD program and this documentation were written at the
Harvard Business School by Donald E_. Barth, who can be reached at
the following address.
.skip
.nofill
Baker Library 21
Graduate School of Business Administration
Harvard University, Soldiers Field
Boston, Massachusetts 02163
.left margin 0
.fill
.skip 2
.center
Appendix: KEYWRD Development History
.center
--------##------ ----------- -------
.skip
.left margin 4
.indent -4
May 1980
.break
Original release of the KEYWRD program through DECUS library.
.skip
.indent -4
August 1980
.break
(modification) Listing indicates commands which are subsets of other
commands.
.break
(modification) Program estimates necessary array size if glossary is
too large.