PROGRAM DESCRIPTION:
THIS ROUTINE TAKES TWO FILES: A USER DEFINED STOP LIST
AND A FILE TO BE KEY-WORD-IN-CONTEXT INDEXED.
THE USER SUPPLIES THE LOCATION OF THE INPUT FILES AND
A PLACE TO WRITE THE INDEX FILE AND A TITLE FOR THE LISTING.
THIS ROUTINE READS THE ENTIRE MASTER FILE (DATA TO BE
INDEXED) INTO CORE AND MUST BE ABLE TO READ IT ALL
INTO CORE AT ONCE TO RUN. THE PROGRAM ALSO MAKES A FREQUENCY FILE
WHICH CONSISTS OF THE NUMBER OF TIMES EACH INDEX TERM
WAS USED.
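THE FOLLOWING PYTHON SKETCH ILLUSTRATES THE GENERAL KEY-WORD-IN-CONTEXT
TECHNIQUE DESCRIBED ABOVE. IT IS NOT TAKEN FROM THE KWIC SOURCE. THE
FUNCTION NAME, THE LINE LAYOUT (KEYWORD AT THE CENTER COLUMN, I.D. AT THE
END OF THE LINE) AND THE SORT ORDER ARE ASSUMPTIONS MADE FOR ILLUSTRATION.

```python
from collections import Counter


def kwic_index(items, stop_words, width=60):
    """Build a KWIC listing and a frequency table (illustrative sketch).

    items      -- list of (title, ident) pairs (hypothetical input shape)
    stop_words -- words that must not become index terms
    width      -- assumed number of print positions for the context part
    """
    stop = {w.upper() for w in stop_words}
    half = width // 2
    entries = []
    freq = Counter()
    for title, ident in items:
        words = title.split()
        for i, word in enumerate(words):
            key = word.upper()
            if key in stop:
                continue  # stop words never become index terms
            freq[key] += 1
            left = " ".join(words[:i])
            right = " ".join(words[i:])
            # Right-justify the left context so every keyword starts
            # at the same column (an assumed layout).
            context = left[-(half - 1):].rjust(half - 1) + " " + right[:width - half]
            entries.append((key, context.rstrip() + "  " + ident))
    entries.sort(key=lambda entry: entry[0])  # alphabetical by index term
    return [line for _, line in entries], dict(freq)


if __name__ == "__main__":
    demo = [("THE OHIO STATE UNIVERSITY", "1"), ("A HISTORY OF OHIO", "2")]
    lines, counts = kwic_index(demo, ["A", "OF", "THE"], width=40)
    for line in lines:
        print(line)
    print(counts)
```

THE SECOND RETURN VALUE CORRESPONDS TO THE FREQUENCY FILE: ONE COUNT PER
INDEX TERM.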
THIS PROGRAM WAS WRITTEN BY G.B. MOERSDORF
AT THE OHIO STATE UNIVERSITY. THE SYSTEM WAS DEVELOPED
ON OSU'S NON DISK NON SWAPPING 32K PDP-10. THE SYSTEM RUNS UNDER
A 4NN72 OR BETTER MONITOR. THE CODE WAS WRITTEN
TO BE COMPLETELY DEVICE INDEPENDENT. THE ONLY RESTRICTION ON THE
INPUT DEVICES IS THAT THEY CAN DO IMAGE BINARY MODE (10) INPUT.
THE RESTRICTION ON THE LISTING DEVICE IS THAT IT CAN DO
ASCII LINE MODE (1) OUTPUT. THE LISTING
WIDTH CAN BE ADJUSTED TO ANY SIZE LINE PRINTER
OR TELETYPE WHICH HAS MORE THAN 60 PRINT POSITIONS.
QUICK INSTRUCTIONS TO RUN KWIC:
THE BEST WAY TO DESCRIBE THE INDEX IS BY MAKING ONE. TO USE THE
DEMO DATA SUPPLIED, DO THE FOLLOWING:
(1) MOUNT THE DISTRIBUTION DECTAPE AND EITHER ASSIGN IT AS 'DSK'
OR 'PIP' THE TAPE TO YOUR DISK AREA.
(2) TYPE RUN DTA#:KWIC (FOR DISK RU KWIC)
(3) WHEN ASKED FOR 'MASTER FILE' TYPE CR TO USE DEFAULT
OR THE FILE NAME SPECIFICATION IN THE
FORM 'DEV:FILE.EXT'. I.E. DEVICE NAME: FILE NAME.
EXTENSION (CR). DEFAULTS: DEV=DSK,
FILE=KWIC, EXT=MAS.
DEFAULTS SPECIFY NAME OF TEST DATA SET.
(4) WHEN ASKED FOR 'STOP FILE' TYPE CR
OR FILE SPECIFICATION AS ABOVE.
DEFAULTS: DEV=DSK, FILE=KWIC, EXT=STP.
(5) WHEN PROMPTED FOR 'INDEX FILE' TYPE CR
OR FILE SPECIFICATION AS ABOVE
DEFAULTS: DEV=DSK, FILE=KWIC, EXT=NDX.
(THIS WILL WRITE LISTING ON DECTAPE OR DISK IF
YOU HAVE ONE, UNDER THE NAME 'KWIC.NDX' PPN 0,0).
(6) WHEN PROMPTED WITH 'FREQUENCY FILE' TYPE
A CARRIAGE RETURN TO DEFAULT TO 'DSK:KWIC.FRQ'.
THIS IS WHERE THE PROGRAM WILL WRITE THE
FREQUENCY FILE.
(7) WHEN PROMPTED WITH 'LISTING TITLE' TYPE YOUR
NIFTY COMPANY NAME OR SLOGAN. (MAX 80 CHARACTERS)
(8) WHEN IT PRINTS 'EXIT' THE INDEX HAS BEEN WRITTEN
ON THE FILE DESCRIBED IN STEP 5 AND THE
FREQUENCY LIST ON THE FILE SPECIFIED IN STEP 6.
(9) PRINT THE INDEX AND FREQUENCY FILES WITH 'PIP'. AREN'T
THEY BEAUTIFUL?
(10) IF IT IS NOT BEAUTIFUL GO TO 'IMPLEMENTATION
ON YOUR 10'.
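THE DEFAULTING OF FILE SPECIFICATIONS IN STEPS 3 THROUGH 6 CAN BE SKETCHED
AS FOLLOWS. THIS IS A PYTHON ILLUSTRATION ONLY. THE FUNCTION NAME IS
HYPOTHETICAL, AND TREATING AN EMPTY PART (FOR EXAMPLE 'FILE.' WITH NOTHING
AFTER THE PERIOD) AS "USE THE DEFAULT" IS AN ASSUMPTION, NOT DOCUMENTED
BEHAVIOR OF KWIC.

```python
def parse_filespec(text, default_ext, default_dev="DSK", default_file="KWIC"):
    """Split 'DEV:FILE.EXT' into (dev, file, ext), filling in defaults.

    An empty string stands for a bare carriage return at the prompt.
    """
    spec = text.strip().upper()
    dev, name, ext = default_dev, default_file, default_ext
    if ":" in spec:
        head, spec = spec.split(":", 1)
        dev = head or default_dev
    if "." in spec:
        spec, tail = spec.split(".", 1)
        ext = tail or default_ext   # assumption: empty extension means default
    if spec:
        name = spec
    return dev, name, ext


if __name__ == "__main__":
    # Bare carriage return at each of the four prompts.
    for ext in ("MAS", "STP", "NDX", "FRQ"):
        print(parse_filespec("", ext))
    print(parse_filespec("DTA1:BOOKS.TXT", "MAS"))
```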
FORMAT OF 'STOP LIST' FILE:
THE USER CREATES A 'STOP LIST' OF WORDS WHICH THE USER FEELS
HAVE NO USE AS INDEX TERMS FOR HIS PARTICULAR APPLICATION.
ONE SUCH 'STOP LIST' IS SUPPLIED WITH THE PACKAGE; IT IS CALLED
'KWIC.STP'. THE SUPPLIED LIST IS A GENERALIZED
STOP LIST WHICH CONTAINS 'LOW VALUE' KEYWORDS
SUCH AS A, AN, IN, AND THE. THIS FILE MUST BE IN ALPHABETICAL ORDER.
THE FILE MAY HAVE STANDARD D.E.C. SEQUENCE NUMBERS.
EACH WORD TO BE STOPPED MUST BE DELIMITED BY A CARRIAGE RETURN
LINE FEED. SPACES AND TABS ARE IGNORED.
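THE STOP LIST RULES ABOVE CAN BE CHECKED WITH A SHORT PYTHON SKETCH. IT
ASSUMES THAT A SEQUENCE NUMBER APPEARS IN THE TEXT AS FIVE DIGITS AT THE
START OF A LINE, OPTIONALLY FOLLOWED BY A TAB, AND THAT "IGNORED" MEANS
SPACES AND TABS ARE SIMPLY REMOVED. BOTH ARE ASSUMPTIONS, AND THE NAMES
USED ARE HYPOTHETICAL.

```python
import re

# Assumed text form of a sequence number: five digits, then an optional tab.
SEQ_NUMBER = re.compile(r"^\d{5}\t?")


def load_stop_list(text):
    """Return the stop words in file order, or raise ValueError if unsorted."""
    words = []
    for line in text.splitlines():
        line = SEQ_NUMBER.sub("", line)
        # Spaces and tabs are ignored, so drop them entirely.
        word = line.replace(" ", "").replace("\t", "").upper()
        if word:
            words.append(word)
    for prev, cur in zip(words, words[1:]):
        if prev > cur:
            raise ValueError(f"stop list not in alphabetical order: {prev} before {cur}")
    return words


if __name__ == "__main__":
    print(load_stop_list("00100\tA\r\n00200\tAN\r\n00300\tIN\r\n00400\tTHE\r\n"))
```

NOTE THAT UNDER THIS ASSUMPTION A STOP WORD MADE UP OF EXACTLY FIVE DIGITS
WOULD BE MISTAKEN FOR A SEQUENCE NUMBER.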
FORMAT OF 'MASTER' FILE:
THE MASTER FILE CONSISTS OF THE DATA TO BE INDEXED BY THE
KWIC PROGRAM. THIS MAY BE ANY TYPE OF ALPHANUMERICAL DATA.
THE USUAL DATA WOULD BE IN THE FORM OF MANY BOOK TITLES IN
A SPECIFIC AREA OF STUDY, OR POSSIBLY A WHOLE LIBRARY'S CATALOGUE.
BUT THE PROGRAM IS FLEXIBLE ENOUGH TO ALLOW KWIC INDEXING OF
A THESIS PAPER OR SIMILAR DOCUMENT (FOR WHAT IT'S WORTH).
THE DELIMITERS FOR EACH FIELD OF DATA (ALL 3 DELIMITERS)
ARE DECIDED UPON BY THE USER AT ASSEMBLE TIME.
THIS FILE SHOULD HAVE SEQUENCE NUMBERS AS THEY ARE USED IN THE
IDENTIFICATION OF SYNTAX ERROR LOCATIONS.
AFTER DEBUGGING THE DATA YOU MAY REMOVE
THE SEQUENCE NUMBERS TO SAVE DISK SPACE AND THE PROGRAM
WILL OPERATE NORMALLY.
THE GENERAL FORMAT OF AN ITEM IN THE MASTER FILE IS AS FOLLOWS:
1) STANDARD D.E.C. SEQUENCE NUMBER
2) FIELD OF DATA TO BE INDEXED (MAY BE CONTINUED
ON ANY NUMBER OF LINES. I.E. A CARRIAGE
RETURN LINE FEED IS IGNORED COMPLETELY)
3) THE DELIMITER FOR SORT FIELD ('=' ON THE DISTRIBUTED
VERSION).
4) NEXT ANY DATA TO BE TOTALLY IGNORED BY THE SYSTEM
SUCH AS COPYRIGHT DATE AND PUBLISHER. THIS WAS
DONE SO THAT THE SAME DATA BASE CAN BE USED FOR
THIS PROGRAM AS FOR OTHERS. I.E. ONE THAT
USES DATA NOT NORMALLY KWIC INDEXED.
5) THE I.D. DELIMITER CHARACTER (IN DISTRIBUTION IT IS A '[')
6) THE IDENTIFICATION NUMBER TO BE ASSOCIATED WITH
THE ITEM. THE MAXIMUM LENGTH OF THIS FIELD IS ALSO
ADJUSTABLE AT ASSEMBLE TIME. IN THE DISTRIBUTION IT
IS 10 DIGITS. WARNING! USE NO SPACES OR TABS IN THIS FIELD
7) THE END OF ITEM DELIMITER. IN THE DISTRIBUTION IT IS A ']'
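THE ITEM LAYOUT ABOVE CAN BE SKETCHED IN PYTHON AS FOLLOWS. THE DELIMITERS
AND THE 10 CHARACTER I.D. LIMIT ARE THE DISTRIBUTION VALUES. THE FUNCTION
NAME IS HYPOTHETICAL, SEQUENCE NUMBERS ARE ASSUMED TO HAVE BEEN STRIPPED
ALREADY, AND THE CHOICE OF WHICH MESSAGE GOES WITH WHICH CONDITION IS A
GUESS, NOT THE PROGRAM'S ACTUAL LOGIC.

```python
DELSRT, DELKEY, DELEOL = "=", "[", "]"   # distribution delimiters
IDLEN = 10                               # distribution I.D. length limit


def parse_item(item):
    """Split 'DATA = IGNORED [ID]' into (data, ident); raise ValueError if malformed."""
    text = item.replace("\r\n", "")      # carriage return line feed is ignored
    if DELSRT not in text:
        raise ValueError("NO SORT DELIM FOUND")
    data, rest = text.split(DELSRT, 1)
    if DELKEY not in rest:
        raise ValueError("NO I.D. NUMBER FOUND")
    _ignored, tail = rest.split(DELKEY, 1)   # text between delimiters is skipped
    if DELEOL not in tail:
        raise ValueError("SYNTAX ERROR")
    ident = tail.split(DELEOL, 1)[0]
    if not ident:
        raise ValueError("NO I.D. NUMBER FOUND")
    if len(ident) > IDLEN:
        raise ValueError("I.D. NUMBER TOO LONG")
    if " " in ident or "\t" in ident:
        raise ValueError("SYNTAX ERROR")     # no spaces or tabs in the I.D.
    return data.strip(), ident


if __name__ == "__main__":
    print(parse_item("THE OHIO STATE UNIVE\r\nRSITY = IGNORED TEXT [12345]"))
```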
NOTE 1: SINCE A CARRIAGE RETURN LINE FEED IS IGNORED,
TO CONTINUE A WORD ON ANOTHER LINE THE USER MERELY TYPES
THE REST OF THE WORD WITH NO SPACES. BUT IF HE WISHES TO
DELIMIT THE WORD WITH A SPACE HE MUST TYPE IT EITHER AT
THE END OF THE LINE OR THE BEGINNING OF THE CONTINUATION
LINE. EX:
THE OHIO STATE UNIVE
RSITY
(CONTINUATION OF SAME WORD)
THE OHIO STATE
UNIVERSITY
(TWO SEPARATE WORDS)
NOTE 2: A SPACE AND A TAB ARE THE ONLY CHARACTERS WHICH DELIMIT
A WORD FROM ITS NEIGHBOR. SEQUENTIAL SPACES OR
TABS ARE REDUCED TO ONE SPACE ON THE LISTING.
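NOTES 1 AND 2 TOGETHER AMOUNT TO A SIMPLE WORD SPLITTING RULE, SKETCHED
BELOW IN PYTHON. THE FUNCTION NAMES ARE HYPOTHETICAL AND THE CODE IS AN
ILLUSTRATION OF THE RULE AS STATED, NOT THE PROGRAM'S OWN ROUTINE.

```python
import re


def split_words(field):
    """Return the words of a data field.

    A carriage return line feed is dropped entirely, so only a space or
    a tab separates one word from the next.
    """
    joined = field.replace("\r\n", "")
    return [word for word in re.split(r"[ \t]+", joined) if word]


def listing_text(field):
    """Return the field as it would be listed: runs of spaces or tabs become one space."""
    return " ".join(split_words(field))


if __name__ == "__main__":
    print(split_words("THE OHIO STATE UNIVE\r\nRSITY"))     # word continued
    print(split_words("THE OHIO STATE\r\n UNIVERSITY"))     # space starts next line
    print(listing_text("THE  \t OHIO"))
```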
NOTE 3: TWO CONVENTIONS HAVE BEEN USED IN THE TEST DATA
WHICH YOU MAY WANT TO USE. THE FIRST IS TO PLACE ALL
THE AUTHORS' NAMES IN PARENS. THIS WILL MAKE ALL THE
AUTHORS' NAMES APPEAR IN ONE SPOT IN THE
INDEX. THE SECOND CONVENTION IS TO USE A '/' IN FRONT OF
ANY WORD WHICH IS NOT IN THE TITLE, BUT YOU FEEL
HAS VALUE AS AN INDEX TERM FOR THIS ITEM.
IMPLEMENTATION ON YOUR 10:
THERE ARE MANY ASSEMBLY PARAMETERS
AND THE ONES WHICH DIRECTLY AFFECT YOUR INSTALLATION ARE
AS FOLLOWS:
SWITCH OR   DEFAULT   DESCRIPTION OR
VARIABLE    VALUE     ACTION TAKEN
____________________________________________________________
LPTSIZ      132       THE WIDTH OR DESIRED WIDTH OF THE INDEX
                      LINE. YOU MAY WANT TO RESTRICT THIS
                      FOR DUPLICATION PURPOSES. IT MAY BE
                      ANY EVEN NUMBER FROM 60 TO THE WIDTH
                      OF THE OUTPUT DEVICE LINE.
DELSRT      "="       DELIMITER FOR THE SORTED DATA FIELD
                      (THE FIRST FIELD DELIMITER). THIS MAY
                      BE ANY CHARACTER GREATER THAN A SPACE
                      (OCTAL 40). PUT THE CHARACTER IN
                      DOUBLE QUOTES.
DELKEY      "["       DELIMITER FOR THE IDENTIFICATION
                      FIELD. SAME RESTRICTIONS AS FOR
                      DELSRT. DON'T THINK YOU'RE SMART
                      AND USE THE SAME CHARACTER FOR ALL
                      OR SOME DELIMITERS.
DELEOL      "]"       DELIMITER FOR THE END OF THE ITEM
                      (FOLLOWS THE IDENTIFICATION FIELD).
                      SAME RESTRICTIONS AS FOR DELKEY.
MAXLIN      ^D50      NUMBER OF LINES PUT ON A PAGE (NOT
                      INCLUDING THE HEADER).
SIZWRD      ^D50      MAXIMUM NUMBER OF CHARACTERS IN ANY
                      ONE WORD, I.E. BEFORE A SPACE OR TAB.
                      (THIS ALLOWS ALL THOSE ALL TIME
                      FAVORITES LIKE
                      ANTIDISESTABLISHMENTARIANISM.)
MAXSAM      ^D300     MAXIMUM NUMBER OF WORDS WHICH ARE NOT
                      STOP WORDS AND ARE IDENTICAL. THIS IS
                      THE SIZE OF THE HASH TABLE.
DEBUG       0         IF 1, MAKES A NON REENTRANT DEBUGGING
                      VERSION. (USED ONLY WHEN FIXING THE
                      PROGRAM.)
REENT       1         IF 1, GIVES REENTRANT CODE; IF 0,
                      MAKES A NON REENTRANT VERSION.
IDLEN       ^D10      MAXIMUM SIZE OF THE I.D. FIELD.
FREQSW      1         IF 1, ASSEMBLES THE FREQUENCY LIST
                      CODE. IF 0, NO FREQUENCY LIST IS
                      GENERATED.
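THE RESTRICTIONS ON THESE PARAMETERS CAN BE EXPRESSED AS A SMALL CHECK.
THE PYTHON SKETCH BELOW MIRRORS THE DEFAULTS IN THE TABLE. THE CLASS AND
METHOD NAMES ARE HYPOTHETICAL, AND THE REAL PARAMETERS ARE ASSEMBLY TIME
SYMBOLS IN THE SOURCE, NOT RUN TIME SETTINGS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KwicParams:
    """Assembly parameters with the distribution defaults (decimal values)."""
    lptsiz: int = 132
    delsrt: str = "="
    delkey: str = "["
    deleol: str = "]"
    maxlin: int = 50
    sizwrd: int = 50
    maxsam: int = 300
    debug: int = 0
    reent: int = 1
    idlen: int = 10
    freqsw: int = 1

    def problems(self):
        """Return a list of rule violations; an empty list means the values are usable."""
        found = []
        if self.lptsiz < 60 or self.lptsiz % 2:
            found.append("LPTSIZ must be an even number of at least 60")
        delims = (self.delsrt, self.delkey, self.deleol)
        for d in delims:
            # Each delimiter must be one character greater than a space (octal 40).
            if len(d) != 1 or ord(d) <= 0o40:
                found.append(f"bad delimiter {d!r}")
        if len(set(delims)) != 3:
            found.append("delimiters must all be different")
        return found


if __name__ == "__main__":
    print(KwicParams().problems())              # distribution values
    print(KwicParams(lptsiz=72).problems())     # narrower listing
    print(KwicParams(delkey="=").problems())    # reused delimiter
```

THE UPPER LIMIT ON LPTSIZ (THE WIDTH OF THE OUTPUT DEVICE LINE) DEPENDS ON
THE DEVICE AND IS NOT CHECKED HERE.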
NOTES AND RANDOM INFO:
1) DO NOT (NOT!) USE A STRING OF 5 OR MORE "_" CHARACTERS
IN SEQUENCE ON ANY DATA FILE.
2) THE USER CANNOT SPECIFY A PPN IN A FILE SPECIFICATION.
3) DO NOT END THE STOP LIST WITHOUT A
CARRIAGE RETURN LINE FEED. EX:
NOT THIS^Z
^
WRONG WAY
BUT THIS
^Z
^
CORRECT WAY
4) IF NO SEQUENCE NUMBERS ARE ON THE MASTER FILE THE
ERROR MESSAGES WILL NOT LOCATE THE LINES IN ERROR ON THE FILE
BUT MERELY PRINT THE FACT THAT THEY EXIST.
5) THE SYSTEM RUNS UNDER 4NN72 OR BETTER MONITORS (THERE SHOULD BE
NO MONITOR RESTRICTIONS IF IT RUNS UNDER OUR MONITOR)
6) ON OUR SYSTEM USING ALL OF USER CORE (23K) WE CAN HOLD AND
KWIC INDEX A LIBRARY CATALOGUE OF 4000 ITEMS.
7) USING THE SAME DATA (A SMALL AMOUNT) THIS PROGRAM HAS
RUN FASTER ON THE 10 THAN ON OUR 370/165.
8) IF A WORD IN THE STOP LIST IS LONGER THAN 12 CHARACTERS
IT WILL BE TRUNCATED IN THE LISTING BUT ITS VALUE WILL
BE UNCHANGED.
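NOTES 1 AND 3 CAN BE CHECKED BEFORE A RUN WITH A FEW LINES OF PYTHON. THIS
IS A SKETCH ONLY. THE FUNCTION NAMES ARE HYPOTHETICAL AND THE FILE CONTENTS
ARE ASSUMED TO HAVE BEEN READ INTO A STRING ALREADY.

```python
def has_underscore_run(text):
    """True if the data contains 5 or more '_' characters in sequence (note 1)."""
    return "_____" in text


def ends_with_crlf(text):
    """True if the text ends with a carriage return line feed (note 3)."""
    return text.endswith("\r\n")


if __name__ == "__main__":
    print(has_underscore_run("FILL IN: _____"))
    print(ends_with_crlf("NOT\r\nTHIS"))
    print(ends_with_crlf("BUT\r\nTHIS\r\n"))
```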
ERROR MESSAGES:
THE FOLLOWING IS A LIST OF ERROR MESSAGES AND THEIR MEANING.
1)
CANNOT INIT XXXXX DEVICE
DEVICE SPECIFIED IN AN INPUT PARM OR A DEFAULT SPECIFICATION
WAS NOT CORRECT OR AVAILABLE TO THE USER.
2)
CANNOT FIND XXXXX FILE
THE FILE SPECIFIED (TYPE 'XXXXX') COULD NOT BE FOUND.
3)
CANNOT ENTER XXXXX FILE
THE DIRECTORY ON THE DEVICE SPECIFIED TO
WRITE THE 'XXXXX' LISTING ON WAS FULL.
4)
?READ ERROR ON 'XXXXX' FILE
A DEVICE ERROR OCCURRED ON THE 'XXXXX' FILE WHILE READING.
5)
?WRITE ERROR ON 'XXXXX' FILE
A DEVICE ERROR OCCURRED ON THE 'XXXXX' FILE WHILE WRITING.
6)
?MASTER FILE NO LONGER AVAILABLE
THE PROGRAM RELEASES THE MASTER FILE FOR A SHORT PERIOD
WHILE IT READS IN THE STOP LIST FILE. THIS IS SO THAT ON A
DECTAPE SYSTEM THESE TWO FILES MAY BE ON THE SAME DRIVE. THIS ERROR
OCCURS WHEN IT LOOKS FOR THE FILE THE SECOND TIME (AFTER THE
STOP LIST IS READ IN) AND CANNOT FIND IT. THIS SHOULD NEVER
HAPPEN. IF IT DOES, THE JOB BOMBS OFF.
7)
?FATAL UUO FAILURE -BADFAL-
A CORE UUO FAILED WHILE DE-ALLOCATING CORE.
THIS SHOULD BE AN IMPOSSIBLE CONDITION.
THE JOB BOMBS OFF.
8)
?MAXIMUM SIZE WORD EXCEEDED
WORD=CCCCCCCCCC
A WORD LONGER THAN THE LENGTH SPECIFIED BY THE
'SIZWRD' ASSEMBLY CONSTANT WAS FOUND. JOB
BOMBS OFF. THE 'CCCCCC' WILL BE THE WORD IN
ERROR.
9)
?TOO MANY MATCHES FOR ARRAY
WORD=CCCCCCCCC
MORE THAN THE NUMBER OF IDENTICAL INDEX ITEMS SPECIFIED
BY THE 'MAXSAM' ASSEMBLY CONSTANT WERE FOUND.
JOB BOMBS OFF. THE 'CCCCC' WILL BE THE WORD WHICH
OCCURRED TOO MANY TIMES.
10)
?CORE UUO FAILED--TRYING AGAIN
IF THE CORE UUO FAILS WHILE TRYING TO READ IN DATA
THIS MESSAGE IS PRINTED. 30 SECONDS LATER THE PROGRAM
WILL TRY TO ALLOCATE THE CORE AGAIN. IT CONTINUES
LOOPING TILL IT GETS THE CORE. THIS IS USEFUL ON
NON SWAPPING SYSTEMS WHERE A USER CAN WAIT FOR THE CORE TO
BECOME FREE.
11)
?ERROR IN LINE NNNNN---
LINE NNNNN, OR ONE OF THE LINES NEAR IT, IS BAD.
THE SPECIFIC ERROR FOLLOWS. IF ANY OF THESE
ERRORS (THE ONES IN SECTION 11) OCCUR THE KWIC INDEX
AND FREQUENCY LIST ARE NOT GENERATED, ONLY THE STOP LIST.
---I.D. NUMBER TOO LONG
MEANS THE IDENTIFICATION NUMBER IS LONGER THAN
THE 'IDSIZ' ASSEMBLY CONSTANT (LISTED AS 'IDLEN'
IN THE ASSEMBLY PARAMETER TABLE).
---NO I.D. NUMBER FOUND
MEANS JUST WHAT IT SAYS.
---NO SORT DELIM FOUND
MEANS JUST WHAT IT SAYS.
---SYNTAX ERROR
UNDIAGNOSABLE ERROR. (YOUR GUESS)
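THE WAIT AND RETRY BEHAVIOR DESCRIBED UNDER MESSAGE 10 IS A COMMON PATTERN
AND CAN BE SKETCHED IN PYTHON AS FOLLOWS. THE FUNCTION AND PARAMETER NAMES
ARE HYPOTHETICAL, AND A PYTHON MEMORYERROR STANDS IN FOR A FAILED CORE UUO.
THE 30 SECOND WAIT AND THE MESSAGE TEXT ARE TAKEN FROM THE DESCRIPTION ABOVE.

```python
import time


def allocate_with_retry(allocate, wait_seconds=30, sleep=time.sleep, report=print):
    """Call allocate() until it succeeds, waiting between failed attempts.

    allocate -- hypothetical callable that returns the storage or
                raises MemoryError when none is available
    sleep    -- injectable so the loop can be exercised without waiting
    """
    while True:
        try:
            return allocate()
        except MemoryError:
            report("?CORE UUO FAILED--TRYING AGAIN")
            sleep(wait_seconds)   # wait, then loop and try again


if __name__ == "__main__":
    attempts = []

    def fake_allocate():
        # Fails twice, then succeeds, to show the loop in action.
        attempts.append(1)
        if len(attempts) < 3:
            raise MemoryError
        return bytearray(16)

    core = allocate_with_retry(fake_allocate, sleep=lambda seconds: None)
    print(len(core), "bytes after", len(attempts), "attempts")
```

THE LOOP HAS NO ATTEMPT LIMIT, MATCHING THE DESCRIPTION: IT KEEPS TRYING
UNTIL THE STORAGE IS GRANTED.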
HAVING PROBLEMS:
IF YOU FIND ANY BUGS OR HAVE ANY
SUGGESTIONS PLEASE USE THE ADDRESS BELOW.
G.B. MOERSDORF
PDP-10 ROOM
CALDWELL LAB.
OHIO STATE UNIVERSITY
COLUMBUS, OHIO 43210
614-422-8039