Trailing-Edge
-
PDP-10 Archives
-
decuslib20-05
-
decus/20-0137/clustr/clustr.doc
There are 2 other files named clustr.doc in the archive. Click here to see a list.
WESTERN MICHIGAN UNIVERSITY
COMPUTER CENTER
LIBRARY PROGRAM #1.12.1
CALLING NAME: CLUSTR
PROGRAMMED AND ADAPTED BY: RUSSELL BARR III
PREPARED BY: BILL GRANET & RUSSELL BARR III
STATISTICAL CONSULTANT: DR. MICHAEL STOLINE
APPROVED BY: JACK R. MEAGHER
DATE: FEBRUARY 17, 1976
SINGLE LINK CLUSTER ANALYSIS PROGRAM
TABLE OF CONTENTS
1.0 INTRODUCTION
2.0 OPTIONS
3.0 LIMITATIONS
4.0 RESPONSES FOR QUESTIONS-INPUT? AND OUTPUT?
5.0 RESPONSES FOR THE QUESTION-FORMAT?
6.0 PROGRAM QUESTIONS AND HOW TO ANSWER THEM
7.0 SAMPLE TERMINAL AND BATCH RUNS
8.0 REFERENCES
9.0 DELTA ONE-HAT
1.0 INTRODUCTION
THIS PROGRAM IS A COMBINATION OF PROGRAMMING FROM COMMUNICATIONS OF THE ACM
(JUNE 1973, PAGES 355-361), COMPUTER JOURNAL (FEBRUARY 1973, PAGES 30-34), AND
SUBSTANTIAL PROGRAMMING BY MR. RUSSELL R. BARR III.
GIVEN M OBSERVATIONS ON N VARIABLES, OR A MATRIX WHOSE ELEMENTS ARE MEASURES
OF DEGREE OF ASSOCIATION BETWEEN EACH PAIR OF A SET OF N OBJECTS, THIS PROGRAM
GIVES INDICATIONS OF SIMILARITY WITHIN GROUPS AND DISSIMILARITY BETWEEN GROUPS.
THE SINGLE-LINK, OR NEAREST NEIGHBOR, METHOD OF CLUSTERING IS USED. THE OUTPUT
OF THE CLUSTERING IS GIVEN EITHER AS A DENDROGRAM OR AS A SHADE PLOT OR BOTH AS
THE USER SPECIFIES.
2.0 INPUT OPTIONS AND CLUSTER LIMITS
2.1 RAW DATA INPUT OPTIONS
IF RAW DATA IS SUBMITTED, THIS PROGRAM GENERATES A MATRIX FROM THE M
OBSERVATIONS ON N VARIABLES BY ONE OF FIVE DIFFERENT METHODS SPECIFIED BY THE
USER.
THESE MATRICES ARE CALLED:
1) CORRELATION (PEARSON PRODUCT MOMENT)
2) EUCLIDEAN DISTANCE
3) MANHATTAN METRIC
4) MINKOWSKI METRIC
5) CHI SQUARE DISTANCE
THESE FORMULAS WILL BE ILLUSTRATED WITH A SMALL DATA SET. PEARSON PRODUCT
MOMENT CALCULATIONS WILL NOT BE ILLUSTRATED AS IT IS WIDELY KNOWN. LET THE
FOLLOWING 3 OBSERVATIONS ON 2 VARIABLES BE GIVEN:
3,2
1,1
0,4
NOTE: (A) A**2 MEANS A*A. (B) ABS(A) MEANS ABSOLUTE VALUE OF A.
(C)<= MEANS LESS THAN OR EQUAL. (D)< MEANS LESS THAN.
(A) EUCLIDEAN DISTANCES MATRIX ON THE VARIABLES
A(11) A(12) 0 17
=
A(21) A(22) 17 0
WHERE 17 = (3-2)**2 + (1-1)**2 + (0-4)**2 = 1+0+16 = A(12) = A(21)
EUCLIDEAN DISTANCES MATRIX ON THE OBSERVATIONS
A(11) A(12) A(13) 0 5 13
A(21) A(22) A(23) = 5 0 10
A(31) A(32) A(33) 13 10 0
A(12) = (3-1)**2 + (2-1)**2 = 4+1 = 5
A(13) = (3-0)**2 + (2-4)**2 = 9+4 = 13
A(23) = (1-0)**2 + (1-4)**2 = 1+9 = 10
(B) MANHATTAN METRIC MATIRX ON THE VARIABLES
A(11) A(12) 0 5
A(21) A(22) 5 0
A(21) = A(12) = 3-2 + 1-1 + 0-4 = 1+0+4 = 5
MANHATTAN METRIC MATRIX ON THE OBSERVATIONS
A(11) A(12) A(13) 0 3 5
A(21) A(22) A(23) = 3 0 4
A(31) A(32) A(33) 5 4 0
A(12) = ABS(3-1) + ABS(2-1) = (2+1) = 3
A(13) = ABS(3-0) + ABS(2-4) = (3+2) = 5
A(23) = ABS(1-0) + ABS(1-4) = (1+3) = 4
(C) MINKOWSKI METRIC MATRIX FOR BOTH VARIABLES AND OBSERVATIONS IS A VARIATION
OF THE MANHATTAN METRIC WHEREBY THE ABSOLUTE VALUES ARE RAISED TO A
POWER.
(D) CHI SQUARE DISTANCES MATRIX ON THE VARIABLES
A(11) A(12) 0 3.65
=
A(21) A(22) 3.65 0
(3-E(11))**2 (1-E(12))**2 (0-E(3))**2 (2-E(21))**2
WHERE A(12) = ----------- + --------- + ---------- + ----------- +
E(11) E(12) E(13) E(21)
(1-E(22))**2 (4-E(23))**2
----------- + ---------
E(22) E(23)
(3+1+0) 5 (4) 20
WHERE E(11) = (3+2) ------------- = ----- = --
(3+1+0+2+1+4) 11 11
4 8 4 16
E(12) = (1+1) -- = -- E(13) = (0+4) -- = --
11 11, 11 11,
(2+1+4) 7 7 14
E(21) = (3+2) ------- = 5 -- E(22) = (1+1) -- = --
11 11, 11 11,
7 28
E(23) = (0+4) -- = --
11 11
CHI SQUARE DISTANCES MATRIX ON THE OBSERVATIONS
A(11) A(12) A(13) 0 .058 3.6
A(21) A(22) A(23) = .058 0 2.4
A(31) A(32) A(33) 3.6 2.4 0
(3-E(11))**2 (2-E(12))**2 (1-E(21))**2 (1-E(22))**2
A(12) = ----------- + ----------- + ----------- + -----------
E(11) E(12) E(21) E(22)
(3+2) 5 15
E(11) = (3+1) --------- = 4 - E(12) = (2+1) (5/7) = --
(3+2+1+1) 7, 7,
(1+1) 2 2 6
E(21) = (3+1) ----- = 4 - E(22) = (2+1) - = -
7 7, 7 7
(3-E(11))**2 (2-E(12))**2 (0-E(21))**2 (4-E(22))**2
A(13) = ----------- + ----------- + ----------- + -----------
E(11) E(12) E(21) E(22)
(3+2) 5
E(11) = (3+0) --------- = 3 - E(12) = (2+4) (5/9) = 40/9,
(3+2+0+4) 9
(0+4) 12 4
E(21) = (3+0) ----- = -- E(22) = (2+4) - = 32/9
9 9, 9
THEREFORE (3-15/9)**2 (2-40/9)**2 (12/9)**2 (4-32/9)**2
A(13) = ---------- + ---------- = -------- + ----------
15/9 40/9 12/9 32/9
(1-E(11))**2 (1-E(12))**2 (0-E(21))**2 (4-E(22))**2
A(23) = ------------ + ------------ + ------------ + ------------
E(11) E(12) E(21) E(22)
(1+1) 2 2 10
E(11) = (1+0) --------- = 1 - E(12) = (1+4) - = --
(1+1+0+4) 6, 6 6,
(0+4) 4 4 20
E(21) = (1+0) ----- = 1 -, E(22) = (1+4) - = --
6 6 6 6
2.2 CLUSTER LIMITS
THE FOLLOWING OPTIONS DEFINE HOW THE 15 DEGREES OF SHADE ARE DEFINED FROM THE
DISTANCE MATRIX.
OPTION 1: THE RANKS OF THE DISTANCES ARE SORTED AND DIVIDED INTO 15 EQUAL
INTERVALS.
OPTION 2: THE RANGE OF THE ACTUAL VALUE OF THE DISTANCES IS DIVIDED IN 15
EQUAL INTERVALS.
OPTION 3: THE USER SUPPLIES THE MINIMUM AND MAXIMUM VALUES FOR THE RANGE
DESIRED AND THIS IS DIVIDED INTO 15 EQUAL INTERVALS.
OPTION 4: THE USER ENTERS THE STARTING VALUE OF THE FIRST CLUSTER FOLLOWED
BY THE UPPER LIMITS OF ALL 15 CLUSTERS.
3.0 LIMITATIONS
IF THE METHOD OF INPUT IS 1,2,3,4,5, (SEE SECTION 2.1) THE MAXIMUM DATA SIZE IS
(A) K*K+6K+M*N+2L <= 26,000. K=MAX0(M,N). L=K*(K-1)/2. WHERE M IS THE
NUMBER OF COLUMNS AND N IS THE NUMBER OF ROWS.
(B) IF (A) IS NOT ATTAINABLE FOR METHOD 1 BUT THE DATA NEED NOT BE
TRANSPOSED, THE MAXIMUM IS
M<=95(N IS NOT LIMITED).
THIS IS USEFUL WHERE N IS LARGE WITH RESPECT TO M.
(C) IF METHOD 6 OR 7 IS USED THE MAXIMUM MATRIX IS <= 95.
IF THE ORDER OF THE DISTANCE MATRIX FORMED BY THE INPUT IS LARGER THAN 100,
NEITHER THE SHADE DIAGRAM NOR THE DENDROGRAM CAN BE PRINTED. IF THE ORDER OF
THE DISTANCE MATRIX IS LARGER THAN 50, THE DENDROGRAM CANNOT BE PRINTED.
WHENEVER THE DENDROGRAM IS NOT PRINTED, THE ORDERED INPUT AND THEIR COMPUTED
DISTANCES OR HEIGHTS ARE PRINTED SO THE USER MAY CREATE HIS OWN.
NOTE: THESE LIMITATIONS ARE BASED ON THE CURRENT USER CORE AREA OF 40K.
ANY INCREASE ABOVE 40K SHOULD BE ADDED TO 26000 IN THE ABOVE FORMULA.
4.0 RESPONSES FOR QUESTIONS-INPUT? AND OUTPUT?
THE RESPONSES TO THESE QUESTIONS DEFINE WHERE THE USER INTENDS TO WRITE HIS
OUTPUT FILE (OUTPUT?) AND FROM WHERE THE USER EXPECTS TO READ HIS INPUT DATA
(INPUT?).
THE NORMAL RESPONSE TO EACH OF THESE QUESTIONS CONSISTS OF THREE BASIC PARTS:
A DEVICE, A FILENAME, AND A PROJECT-PROGRAMMER NUMBER.
THE GENERAL FORMAT FOR THESE THREE PARTS IS AS FOLLOWS:
DEV:FILE.EXT[PROJ,PROG]
1) DEV: ANY OF THE FOLLOWING DEVICES ARE APPROPRIATE WHERE INDICATED:
DEVICE LIST DEFINITION STATEMENT USE
TTY: TERMINAL INPUT OR OUTPUT
DSK: DISK INPUT OR OUTPUT
CDR: CARD READER INPUT ONLY
LPT: LINE PRINTER OUTPUT ONLY
DTA0: DECTAPE 0 INPUT OR OUTPUT
DTA1: DECTAPE 1 INPUT OR OUTPUT
DTA2: DECTAPE 2 INPUT OR OUTPUT
DTA3: DECTAPE 3 INPUT OR OUTPUT
DTA4: DECTAPE 4 INPUT OR OUTPUT
DTA5: DECTAPE 5 INPUT OR OUTPUT
DTA6: DECTAPE 6 INPUT OR OUTPUT
DTA7: DECTAPE 7 INPUT OR OUTPUT
MTA0: MAGNETIC TAPE 0 INPUT OR OUTPUT
MTA1: MAGNETIC TAPE 1 INPUT OR OUTPUT
DEVICES MAY BE SPECIFIED BY LOGICAL OR PHYSICAL NAMES. THE DEVICE LIST COLUMN
HAS PHYSICAL NAMES.
INPUT MAY NOT BE DONE FROM THE LINE PRINTER NOR MAY OUTPUT GO TO THE CARD
READER.
2) FILE.EXT IS THE NAME AND EXTENSION OF THE FILE TO BE USED. THIS PART OF
SPECIFICATION IS USED ONLY IF DISK OR DECTAPE IS USED.
3) [PROJ,PROG] IF A DISK IS USED AND THE USER WISHES TO READ A FILE IN
ANOTHER PERSON'S DIRECTORY, HE MAY DO SO BY SPECIFYING THE PROJECT-
PROGRAMMER NUMBER OF THE DIRECTORY FROM WHICH HE WISHES TO READ. THE
PROJECT NUMBER AND THE PROGRAMMER NUMBER MUST BE SEPARATED BY A COMMA AND
ENCLOSED IN BRACKETS. OUTPUT MUST GO TO YOUR OWN AREA.
EXAMPLE:
OUTPUT? LPT:/COPIES:2
INPUT? DSK:DATA.DAT[71171,571026]
IN THE EXAMPLE, TWO COPIES OF THE OUTPUT ARE TO BE PRINTED BY THE HIGH
SPEED LINE PRINTER. THE INPUT DATA IS A DISK FILE OF NAME DATA.DAT IN USER
DIRECTORY [71171,71026].
DEFAULTS:
1) IF NO DEVICE IS SPECIFIED BUT A FILENAME IS SPECIFIED, THE DEFAULT
DEVICE WILL BE DSK:.
2) IF NO FILENAME IS SPECIFIED AND A DISK OR DECTAPE IS USED, THE DEFAULT
ON INPUT WILL BE FROM INPUT.DAT: ON OUTPUT IT WILL BE OUTPT.DAT.
3) IF THE PROGRAM IS RUN FROM THE TERMINAL AND NO SPECIFICATION IS GIVEN
(JUST A CARRIAGE RETURN), BOTH INPUT AND OUTPUT DEVICES WILL BE THE
TERMINAL.
4) IF THE PROGRAM IS RUN THROUGH BATCH AND NO SPECIFICATION IS GIVEN (A
BLANK CARD), THE INPUT DEVICE WILL BE CDR: AND THE OUTPUT DEVICE WILL
BE LPT:.
5) IF NO PROJECT-PROGRAMMER NUMBER IS GIVEN, THE USER'S OWN NUMBER WILL BE
ASSUMED.
NOTE: (1) THE FOLLOWING RESPONSES TO OUTPUT? AND INPUT? ARE VALID AFTER THEIR
FIRST OCCURRENCE.
(A) SAME
UPON RETURNING TO INPUT?, IF THE SAME DATA FILE IS TO BE USED
AGAIN, SIMPLY ENTER "SAME"; OTHERWISE, EITHER USE THE "FINISH"
OPTION OR ENTER ANOTHER FILENAME, ETC.
(B) CONTINUE
FOR MAGTAPES, THIS MEANS DO NOT REREAD THE SAME PART OF TAPE
AS BEFORE RATHER READ THE NEXT PART OF TAPE. FOR DISK OR
DECTAPE, THE RESULT OF USING CONTINUE IS THE SAME AS USING THE
SAME OPTION.
(C) FINISH
THE USER MUST ENTER EITHER FINISH, FINI, OR ^Z TO BRANCH OUT OF
THIS PROGRAM. FAILURE TO DO SO MIGHT RESULT IN LOSING THE
ENTIRE OUTPUT FILE.
(D) /OUTPUT
IF ONE ENTERS THIS, THE NEXT PROMPTING WILL BE OUTPUT? NOW
ONE MAY SPECIFY AN OUTPUT DEVICE AND/OR FILENAME DIFFERENT
FROM THE ONE USED IN RESPONSE TO THE PREVIOUS OUTPUT?. UNLESS
A "CONTINUE" IS SPECIFIED FOR THE "OUTPUT?", THE PREVIOUS
OUTPUT FILE WILL BE CLOSED (AND PRINTED IF LPT).
(2) (FOR HELP TYPE HELP)-- WILL ALSO BE PRINTED THE FIRST TIME OUTPUT?
AND INPUT? ARE PRINTED. AFTER THAT, THIS MESSAGE WILL NOT BE
PRINTED.
5.0 RESPONSES FOR THE QUESTION-FORMAT
THERE ARE 3 OPTIONS AVAILABLE FOR THE FORMAT; NAMELY:
(A) STANDARD FORMAT OPTION
UNLESS OTHERWISE SPECIFIED, THE PROGRAM ASSUMES THE STANDARD OPTION.
IN THIS OPTION, THE DATA IS REARRANGED IN GROUPS OF 10 PER LINE,
TWO VALUES BEING SEPARATED BY A COMMA.
(B) OBJECT TIME FORMAT OPTION
IF THE DATA IS SUCH THAT A USER'S OWN FORM IS REQUIRED, SIMPLY
ENTER A LEFT PARENTHESIS FOLLOWED BY THE FIRST FORMAT SPECIFICATION,
A COMMA AND THEN THE SECOND SPECIFICATION, ETC. WHEN YOU FINISH,
ENTER A RIGHT PARENTHESIS AND THEN A CARRIAGE RETURN. THERE CAN BE
A MAXIMUM OF 5 LINES FOR THE FORMAT, EACH LINE BEING 80 COLUMNS
LONG.
(C) SAME OPTION
IF THE FORMAT TO BE USED DURING THE SECOND OR SUBSEQUENT OCCURRENCE
OF THE LINE, FORMAT (F-TYPE ONLY) IS THE SAME AS THE PREVIOUS
FORMAT, SIMPLY ENTER "SAME<CR>".
6.0 PROGRAM QUESTIONS AND HOW TO ANSWER THEM
IN THE FOLLOWING SECTION, THE SYMBOL "<CR> MEANS PRESS THE "RETURN" BUTTON.
6.1 OUTPUT?
SEE SECTION 4.0. THE NEXT QUESTION IS 6.2.
6.2 INPUT?
SEE SECTION 4.0. THE NEXT QUESTION IS 6.3.
6.3 ENTER OUTPUT IDENTIFICATION IF DESIRED, ELSE TYPE <CR>
TYPE DESIRED OUTPUT IDENTIFICATION. IF NONE IS DESIRED, SIMPLY PRESS THE
RETURN BUTTON. THE NEXT QUESTION IS 6.4.
6.4 ENTER TYPE OF INPUT, TYPE <CR> FOR HELP
THIS QUESTION REQUESTS THE NUMBER OF THE DESIRED INPUT METHOD. TYPING <CR>
PRINTS THE FOLLOWING EXPLANATION:
TYPE:
1 FOR RAW DATA, CORRELATION COMPUTED
2 FOR RAW DATA, EUCLIDEAN DISTANCES COMPUTED
3 FOR RAW DATA, MANHATTAN METRIC COMPUTED
4 FOR RAW DATA, MINKOWSKI METRIC COMPUTED
5 FOR RAW DATA, CHI SQUARE COMPUTED
6 FOR LOWER TRIANGULAR MATRIX OF DISTANCES
7 FOR LOWER TRIANGULAR MATRIX OF CORRELATION
ENTER TYPE OF INPUT, TYPE <CR> FOR HELP
SEE SECTION 7.1 FOR AN EXAMPLE OF HOW TO ENTER DATA IN METHOD 6 OR 7. WHEN
YOU HAVE RESPONDED WITH ONE OF THE METHODS ABOVE, THE NEXT QUESTION IS 6.5.
6.5 ENTER CLUSTER LIMITS CONTROL CODE, TYPE <CR> FOR HELP
THE QUESTION REQUESTS THE NUMBER OF THE METHOD YOU WOULD LIKE TO USE TO SPECIFY
THE 15 INTERVALS. TYPING <CR> PRINTS THE FOLLOWING EXPLANATION:
TYPE:
1 TO DIVIDE THE RANKS OF DISTANCES INTO 15 EQUAL INTERVALS
2 TO DIVIDE THE ACTUAL DISTANCES INTO 15 EQUAL INTERVALS
3 TO SPECIFY THE INTERVAL MAX AND MIN
4 TO ENTER THE ENDPOINTS OF THE INTERVALS
ENTER CLUSTER LIMITS CONTROL CODE, TYPE <CR> FOR HELP
THESE OPTIONS ARE EXPLAINED IN SECTION 2.2. WHEN YOU HAVE RESPONDED WITH ONE
OF THE CODES ABOVE, THE NEXT QUESTION IS 6.6.
6.6 ENTER METHOD(S) OF OUTPUT, TYPE <CR> FOR HELP
TYPING <CR> PRINTS THE FOLLOWING EXPLANATION:
TYPE:
1 FOR SHADE DIAGRAM
2 FOR DENDROGRAM
3 FOR BOTH
WHEN YOU HAVE RESPONDED WITH ONE OF THE METHODS ABOVE, THE NEXT QUESTION IS 6.7.
6.7 FORMAT (F-TYPE ONLY)
SEE SECTION 5.0. IF YOUR RESPONSE TO QUESTION 6.4 WAS '6' OR '7', THE NEXT
QUESTION IS 6.10. OTHERWISE, IT IS 6.8.
6.8 ENTER NUMBER OF ROWS AND COLUMNS (I.E., 13,6)
ENTER THE SIZE OF THE INPUT DATA MATRIX. IF THE RESPONSE TO QUESTION 6.4 WAS
'4', THE NEXT QUESTION IS 6.9; OTHERWISE, GO TO LINE 6.13.
6.9 ENTER POWER FOR MINKOWSKI METRIC
SEE SECTION 2.0 FOR EXPLANATION. IF YOU RESPONDED WITH A '3' TO QUESTION 6.5,
THE NEXT QUESTION IS 6.11. IF YOU RESPONDED TO QUESTION 6.5 WITH A '4', THE
NEXT QUESTION IS 6.12; OTHERWISE, GO TO LINE 6.13.
6.10 HOW MANY VARIABLES
ENTER THE SIZE OR "ORDER" OF THE DISTANCE OR CORRELATION MATRIX (WHICHEVER IS
APPLICABLE). IF YOU RESPONDED TO QUESTION 6.5 WITH A '3', THE NEXT QUESTION
IS 6.11. IF YOU RESPONDED TO QUESTION 6.5 WITH A '4', THE NEXT QUESTION IS
6.12; OTHERWISE, GO TO LINE 6.13.
6.11 ENTER CLUSTER LIMITS (MIN,MAX)
ENTER A PAIR OF NUMBERS SEPARATED BY A COMMA REPRESENTING THE RANGE FROM THE
BOTTOM OF THE LOWEST CLUSTER TO THE TOP OF THE HIGHEST CLUSTER. GO TO LINE
6.13.
6.12 ENTER END POINTS OF CLUSTERS (8 PER LINE)
SEE SECTION 2.2. GO TO LINE 6.13.
6.13 IF "?NOT ENOUGH CORE AVAILABLE" IS PRINTED, THE JOB IS TOO LARGE AND THE
NEXT QUESTION IS 6.1. SEE SECTION 5.0. IF THE SINGLE MESSAGE IS "?NOT ENOUGH
CORE AVAILABLE TO KEEP DATA IN CORE-CONTINUING" IS PRINTED, THE NEXT QUESTION
IS 6.15. IF YOU RESPONDED WITH A '6' OR '7' TO QUESTION 6.4, THE NEXT QUESTION
IS 6.15; OTHERWISE, THE NEXT QUESTION IS 6.14.
6.14 TYPE:
1 FOR CLUSTERING BY VARIABLE
2 FOR CLUSTERING BY OBSERVATION
3 FOR BOTH
SEE SECTION 2.0. IF INPUT IS FROM THE TERMINAL, THE NEXT LINE IS 6.15;
OTHERWISE, IT IS 6.16.
6.15 ENTER DATA
ENTER YOUR DATA ACCORDING TO THE FORMAT YOU SPECIFIED AT LINE 6.7. IF YOU
SPECIFIED '2' TO QUESTION 6.6, THE NEXT QUESTION IS 6.18; OTHERWISE, IT IS 6.17.
6.17 DO YOU WISH TO HAVE A SHADE DIAGRAM BEFORE CLUSTERING?(YES OR NO)
RESPOND WITH A "YES" IF YOU WISH TO HAVE A SHADE DIAGRAM BOTH BEFORE AND AFTER
CLUSTERING. THE NEXT QUESTION IS 6.18.
6.18 DO YOU WISH TO HAVE LABELS IN THE OUTPUT?(YES OR NO)
IF YOU WANT TO ENTER A LABEL FOR EACH COLUMN ENTER A "YES"; OTHERWISE, ENTER A
"NO". IF YOU RESPONDED WITH A "YES" TO THIS QUESTION, THE NEXT QUESTION IS
6.19; OTHERWISE, THE NEXT QUESTION FOLLOWS THE OUTPUT AND IS 6.1.
6.19 ENTER THE ### LABELS, ONE PER LINE (10 CHARACTERS MAX)
### STANDS FOR THE NUMBER OF COLUMNS YOU ENTERED IN EITHER QUESTION 6.8 OR 6.10.
IF YOU ANSWERED "3" TO QUESTION 6.14, THIS QUESTION IS ASKED TWICE. THE NEXT
QUESTION IS 6.1.
7.0 SAMPLE TERMINAL AND BATCH RUNS
7.1 SAMPLE TERMINAL RUN
IN THIS EXAMPLE INPUT IS FROM THE TERMINAL AND OUTPUT IS TO THE TERMINAL.
"<CR>" MEANS PRESS THE RETURN BUTTON. EXCEPT FOR PROMPTINGS ALL
INFORMATION ON THE SAME LINE WITH <CR> AND PRECEDING <CR> ARE
ENTERED BY THE USER. 28 LINES OF DATA FOLLOWING ENTER DATA ARE
ENTERED BY USER.
.R CLUSTR<CR>
WMU SINGLE-LINK CLUSTER ANALYSIS
OUTPUT? (for help type HELP) TTY:<CR>
INPUT? (for help type HELP) TTY:<CR>
ENTER OUTPUT IDENTIFICATION IF DESIRED, ELSE TYPE <RETURN>
TERMINAL TEST<CR>
ENTER TYPE OF INPUT, TYPE <RETURN> FOR HELP
7<CR>
ENTER CLUSTER LIMITS CONTROL CODE, TYPE <RETURN> FOR HELP
3<CR>
ENTER METHOD(S) OF OUTPUT, TYPE <RETURN> FOR HELP
3<CR>
FORMAT: (F-TYPE ONLY)
(18F4.3)<CR>
HOW MANY VARIABLES? 24<CR>
ENTER CLUSTER LIMITS(MIN,MAX) 0,.75<CR>
ENTER DATA
318
403 317
468 230 305
321 285 247 227
335 234 268 327 622
304 157 223 335 656 722
332 157 382 391 578 527 619
326 195 184 325 723 714 685 532
116 057 075 099 311 203 246 285 170
308 150 091 110 344 353 232 300 280 484
314 145 140 160 215 095 181 271 113 585 428
489 239 321 327 344 309 345 395 280 408 535 512
125 103 177 066 280 292 236 252 260 172 350 131 195
238 131 065 127 229 251 172 175 248 154 240 173 139 370
414 272 263 322 187 291 180 296 242 124 314 119 281 412 325
176 005 177 187 208 273 228 255 274 289 362 278 194 341 345 324
368 255 211 251 263 167 159 250 208 317 350 349 323 201 334 344 448
270 112 312 137 190 251 226 274 274 190 290 110 263 206 192 258 324 358
365 292 297 339 398 435 451 427 446 173 202 246 241 302 272 388 262 301
167
369 306 165 349 318 263 314 362 266 405 399 355 425 183 232 348 173 357
331 413
413 232 250 380 441 386 396 357 483 160 304 193 279 243 246 283 273 317
342 463 374
474 348 383 335 435 431 405 501 504 262 251 350 382 242 256 360 287 272
303 509 451 503
282 211 203 248 420 433 437 388 424 531 412 414 358 304 165 262 326 405
374 366 448 375 434
DO YOU WISH TO HAVE A SHADE DIAGRAM
BEFORE CLUSTERING?(YES OR NO) YES<CR>
DO YOU WISH TO HAVE LABELS IN THE OUTPUT?(YES OR NO) NO<CR>
TERMINAL TEST
SHADE DIAGRAM BEFORE CLUSTERING
1 H
2 *H
3 X*H
4 X+*H
5 *=++H
6 *+=*HH
7 *=+*HHH
8 *=XXHHHH
9 *==*HHHHH
10 ----*++==H
11 *---*X+==XH
12 *--=+-==-HXH
13 X+*****X=XHHH
14 --=-==+===*-=H
15 +---+===+=+=-XH
16 X==*====+-*-=X*H
17 =.==+=+===X==***H
18 X=+====++****+**XH
19 =-*-==+====-=+==*XH
20 X==*XXXXX=+++*=X=*=H
21 X*=**=*X=XXXX=+*=X*XH
22 X++XXXXXX=*==++==**XXH
23 X*X*XXXHH==*X+=X==*HXHH
24 =+++XXXXXHXXX*==*XXXXXXH
SYMBOLS . - - = + = * X X X H H H H H
TERMINAL TEST
DELTA-ONE-HAT = 0.1921630E+00
SHADE DIAGRAM AFTER CLUSTERING
AT 2ND SHADE M= 24
2 H
15 -H
19 -=H
3 *-*H
17 .**=H
18 =*X+XH
14 -X+=*+H
16 =*==**XH
21 *+*==X=*H
4 +--*==-**H
22 ++*+=*+=XXH
8 ===X=+==XXXH
6 +========*XHH
7 ==+++=+=**XHHH
5 =+=++===*+XHHHH
9 =+===+=+=*XHHHHH
20 =====**XX*XXXXXXH
23 *=*X==+XX*HHXXXHHH
1 *+=X=X-XXXX*****XXH
11 -+=-X***X-*=X+*=+=*H
13 +-=*=*==X*=X***=+XXHH
10 -==-=*=-X-==++*===-XXH
12 -=--=*--X===-=+-+**XHHH
24 +=X+*X*=X+XXXXXXXX=XXHXH
SYMBOLS . - - = + = * X X X H H H H H
DENDROGRAM
CASE
NUMBER 1 1 1 1 1 1 2 2 2 2 1 1 1 1 2
2 5 9 3 7 8 4 6 1 4 2 8 6 7 5 9 0 3 1 1 3 0 2 4
I I I I I I I I I I I I I I I I I I I I I I I I
0.277 I I I I I I I I I I I I I I -+- I I I I I I I I
I I I I I I I I I I I I I I I I I I I I I I I
0.278 I I I I I I I I I I I I -+- I I I I I I I I I
I I I I I I I I I I I I I I I I I I I I I I
0.286 I I I I I I I I I I I I --+-- I I I I I I I I
I I I I I I I I I I I I I I I I I I I I I
0.381 I I I I I I I I I I I --+--- I I I I I I I I
I I I I I I I I I I I I I I I I I I I I
0.415 I I I I I I I I I I I I I I I I I -+- I
I I I I I I I I I I I I I I I I I I I
0.465 I I I I I I I I I I I I I I I -+- I I
I I I I I I I I I I I I I I I I I I
0.469 I I I I I I I I I I I I I I I I -+--
I I I I I I I I I I I I I I I I I
0.488 I I I I I I I I I I I I I I I --+---
I I I I I I I I I I I I I I I I
0.491 I I I I I I I I I I I I -+- I I
I I I I I I I I I I I I I I I
0.496 I I I I I I I I I I I ----+----- I I
I I I I I I I I I I I I I I
0.497 I I I I I I I I I I ----+---- I I
I I I I I I I I I I I I I
0.511 I I I I I I I I I I I --+---
I I I I I I I I I I I I
0.526 I I I I I I I I I I -------+-------
I I I I I I I I I I I
0.532 I I I I I I I I I ------+-------
I I I I I I I I I I
0.549 I I I I I I I I ----+----
I I I I I I I I I
0.552 I I I I -+- I I I
I I I I I I I I
0.586 I I I I I I ---+---
I I I I I I I
0.588 I I I I I --+---
I I I I I I
0.595 I I I I --+---
I I I I I
0.597 I I I --+---
I I I I
0.626 I I --+--
I I I
0.630 I --+--
I I
0.652 --+--
INPUT? FINISH<CR>
END OF EXECUTION
CPU TIME: 2.31 ELAPSED TIME: 11.05
EXIT
7.2 SAMPLE BATCH SETUP
IN THIS EXAMPLE THE DATA IS ON DISK IN THE FILE SLINK.DAT AND THE OUTPUT IS TO
THE LINE PRINTER. THE COMMENTS TO THE RIGHT ARE NOT PART OF THE CONTROL DECK.
$JOB[###,###] ###,### IS REPLACED BY THE USER'S OWN
PROJECT-PROGRAMMER NUMBER.
$PASSWORD #### #### STANDS FOR THE USER'S PASSWORD
.R CLUSTR PROGRAM NAME
LPT: OUTPUT TO LINE PRINTER
SLINK.DAT INPUT FILE
BATCH EXAMPLE IDENTIFICATION LINE
7 INPUT IS A CORRELATION MATRIX
1 INTERVAL CONTROL CODE
3 TYPE OF OUTPUT
(9F8.2) FORMAT
18 SIZE OF MATRIX
YES SHADE DIAGRAM BEFORE CLUSTERING
NO NO LABELS
FINISH NO MORE DATA SETS
(EOF) END OF FILE CARD
8.0 REFERENCES
(A) "MATHEMATICAL TAXONOMY" BY JARDINE AND SIBSON; PUBLISHED BY JOHN WILEY,
1971.
(B) COMMUNICATIONS OF THE ACM, JUNE 1973, PAGES 355-361.
(C) THE COMPUTER JOURNAL, FEBRUARY 1973, PAGES 30-34.
9.0 DELTA-ONE-HAT
DELTA-ONE-HAT = SUM(D(I,J)-D*(I,J))/ SUM D(I,J). WHERE D(I,J) IS THE DISTANCE
(I<J) (I<J)
BETWEEN THE ITH AND JTH OBJECT (OBSERVATION OR VARIABLE) AND D*(I,J) IS THE
SMALLEST HEIGHT SUCH THAT OBJECTS (OBSERVATIONS OR VARIABLES) I AND J ARE
INCLUDED IN A CLUSTER OF THAT HEIGHT. DELTA-ONE-HAT SATISFIES THE RELATIONSHIP
0<= DELTA-ONE-HAT <= 1. A SMALL DELTA-ONE-HAT INDICATES THAT THE DATA IS AMENABLE
TO SINGLE-LINK CLASSIFICATION. SEE THE REFERENCES (A) AND (C) FOR MORE DETAILS.