There are 2 other files named mulreg.doc in the archive. Click here to see a list.

First page

A USER PROGRAM FOR MULTIPLE LINEAR REGRESSION ANALYSIS ====================================================== Marten van Gelderen Version 5H(246) 16-Feb-80

Copyright (C) 1975, 1979, 1980 by Foundation Mathematical Centre, Amsterdam Institute for Nuclear Physics Research, Amsterdam General permission to make fair use in teaching or research of all or part of this material is granted to individual readers and to nonprofit organizations, provided that the copyright notice of either the Foundation Mathematical Centre or the Institute for Nuclear Physics Research is given and that reference is made to this publication and to the fact that reprinting privileges were granted by permission of one of the above mentioned organizations.

CONTENTS Page Abstract 1 Introduction 2 Chapter 1 - Multiple linear regression analysis 4 1. The regression model 4 2. Least squares 5 2.1 Weighted least squares 8 2.2 Residual analysis 9 3. Tests of hypotheses 10 3.1 A particular regression coefficient 10 3.2 Analysis of variance 10 Chapter 2 - The input to the program 14 1. The model specification 14 1.1 Transformations 15 2. The input specification 17 3. The option specification 19 4. The data specification 21 5. The user program 22 6. Examples 23 Chapter 3 - The output from the program 25 1. Standard and optional printed output 25 2. Standard and optional data output 27 3. Error messages 28 4. Examples 30 Appendices 1. Technical remarks 31 2. Definition of the syntax of a user program 34 3. Technical description of the program 37 References 41

Page 1 ABSTRACT Performing multiple linear regression analysis on a electronic computer can be very laborious when preparation of the input is complicated. The program introduced in this document is especially designed to have flexible and comprehensible model and input specifications. It accepts a "Model" formula which resembles the notation of regression models in common statistical literature quite closely. An accompanying "Input" formula provides the program with information about the arrangement of the observations in the input "Data", which consists of a series of numbers in freefield format. A "Run" command activates the program, while an "Exit" command causes the program to stop. Extensive runtime "Help" information is available. The following piece of program may serve as an example of some of these ideas: "Model" y = alfa0 + alfa1 * x + alfa2 * x ^ 2; "Input" 5 * ([x], n, n * [y]); "Options" Transformed data matrix, Process submodels (1); "Data" 1 4 1.1 0.7 1.8 0.4 3 5 3.0 1.4 4.9 4.4 4.5 5 3 7.3 8.2 6.2 10 4 12.0 13.1 12.6 13.2 15 4 18.7 19.7 17.4 17.1 "Run" "Exit"

Page 2 INTRODUCTION Many computer programs or subroutines exist, which in one way or another calculate estimates for the parameters of a multiple linear regression model, but most of them can hardly be used by the layman. Subroutines have to be embedded in a higher level language program to perform the necessary input and output; as a layman however, one is not always acquainted with these languages. The programs usually require the input data to be presented in a specific standard form, which in practice means that input data that are already available in a machine readable form must be transformed into that standard form or that the input data should be punched exactly in that standard form. More difficult still and often more confusing, is the way in which the program is told which regression model the user wants to consider. For instance, for the program for multiple polynomial regression analysis of the Mathematical Centre a code matrix of zeros and ones had to be given in order to indicate the form of the regression polynomial. Transformations of one or more variables are hardly ever automatically possible and require a separate program to be run in advance to prepare the transformed data matrix. In this document a program is described that will make it possible for almost everyone to obtain his desired results with no more than a superficial knowledge of the underlying programming system. It has been recognized recently that standard programs should not bother the user too much with awkward input specifications. A statistician is more interested in the results of a program and in how he can obtain them without interference of software specialists of the computer centre, than in a few seconds gain in actual computing time. Therefore one of the objectives was to develop a program which would enable the user to specify the form of the regression model and the arrangement of the numbers in his input data, in a straightforward manner, even if he has hardly any knowledge of programming languages at all. The program thus accepts a model formula which resembles the notation of regression models in common statistical literature quite closely; the accompanying input formula indicates which numbers or series of numbers in the input data belong to which variable in the model formula.

Page 3 This input scheme gives the user the opportunity to work with existing data and to process possible transformations of the variables in the model formula without the need for external data adjustment. However, the model formula must be given explicitly and completely and the structure of the input data must be known exactly before an input formula can be constructed. In the output the results are identified with the names given by the user in the model formula. Any technical (machine dependent) action required to run the program is described in appendix 1. ACKNOWLEDGEMENTS The basic ideas for the input arrangement presented in this document originated from A.P.B.M. VEHMEYER. The article 'ALGOL 60 translation for everybody' by F.E.J. KRUSEMAN ARETZ [6] provided the foundation for the translator and the execution section of the program, while the techniques used in the storage section were designed by L.J. OOSTRIJK. The matrix and least squares routines used in the program were copied from 'ALGOL 60 procedures in numerical algebra, part 1', by T.J. DEKKER [2]. The classic book 'Applied Regression Analysis' by N.R. DRAPER & H. SMITH [1] served as a frame of reference throughout the design and implementation of the program. Numerous recently incorporated statistical improvements were suggested by R.D. GILL.

Page 4 CHAPTER 1 MULTIPLE LINEAR REGRESSION ANALYSIS 1.1 THE REGRESSION MODEL In a regression problem the researcher postulates a certain relation- ship between a random variable y (the realizations of which are subject to some form of disturbance) on the one side and a number of variables x1,...,xp (which are without or at least almost without disturbances) on the other side. This relationship is expressed by a mathematical formula, which is called the (linear) regression model, for instance: y = a0 + a1 * x1 + ... + ap * xp + e (1) in which a0,...,ap represent unknown regression coefficients (parameters) which are to be estimated and e represents the disturbance. If a constant term is present in the model formula (in (1) the a0), the model is said to be an 'intercept model', if no constant term is present, the model is called a 'no-intercept model'. The variables x1,...,xp and the variable y can also represent (other) transformed variables. The researcher might have reasons to believe (from background information concerning the experiment) that transformations are necessary, for instance: 1) to obtain normally distributed disturbances, 2) to obtain a greater homogeneity of the variances of the disturbances, 3) to linearize non-linear regression models (if possible). The transformed regression model can be written as: G(y) = a0 + a1 * F1(x1,...,xm) + ... + ap * Fp(x1,...,xm) + e (2) in which G, F1,...,Fp represent the transformations, a0,...,ap represent the parameters to be estimated, y represents the dependent variable, x1,...,xm represent the independent variables, e represents the disturbance.

Page 5 The choice of a transformation by means of 'trial and error' is rather time consuming and costly. The importance of the location parameter makes for the difficulty. It is not unusual that Log (x) yields no improvement, but that Log (c+x) gives better results for a particular choice of c. Because this holds for almost any transformation of some importance, we must actually solve in each case a nonlinear adjustment problem. Often though, a simple form of the transformation is suggested by the researcher who is better acquainted with the peculiarities of the experiment. 1.2 LEAST SQUARES Regression analysis consists in fact of the adjustment of a hyperplane of the required dimension to the data. The fitting is done with the method of least squares, which means that the sum of the squares of the differences between the observed values for y and the estimated values for the expectation of y, are minimized. This sum of squares is also called the residual sum of squares. In matrix notation the regression model can be written as (cf. DRAPER & SMITH [3] pp. 58-62): Y = Xa + e (3) in which Y is a (n*1) random vector of observations, X is a (n*p) matrix of known (fixed) values, a is a (p*1) vector of (unknown) parameters, and e is a (n*1) random vector of disturbances. It is supposed that E(e) = 0 and var(e) = Isigma^2, in which I is the unit matrix, thus: E(Y) = Xa (4) The sum of squares of the differences between the observed values of Y and the estimated values for the expectation of Y thus equals: (Y-Xa)'(Y-Xa) = Y'Y - 2a'X'Y + a'X'Xa (5) (for a'X'Y is a scalar and therefore equal to Y'Xa).

Page 6 Choosing as least squares estimator b that value of a which minimizes (5), involves differentiating with respect to the elements of a and equating the result to zero: -2X'Y + 2X'Xb = 0, thus: X'Y = X'Xb (6) This system is called the normal equations. If the rank of X equals p, X'X is nonsingular and the inverse of X'X exists. In that case the solution of the normal equations can be written as: b = inv(X'X)X'Y (7) Observe that p <= n must hold, in order that the rank of X can be p at all. Therefore at least as many observations must be made, as there are parameters in the model. Also observe that E(b) = inv(X'X)X'E(Y) = a, thus b is an unbiased estimator of a. The least squares estimator has the following properties: 1. It is an estimator which minimizes the sum of squares of deviations, irrespective of any distribution properties of the disturbances. The assumption that the disturbances are normally distributed is, of course, necessary for tests which depend on this assumption, such as t- or F- tests, or for obtaining confidence intervals based on t- or F- distributions. 2. According to the Gauss-Markov theorem, the elements of b are unbiased estimators, which have minimum variance (of any linear function of the Y's which provides unbiased estimators), again irrespective of the distribution properties of the disturbances. 3. If the disturbances are mutually independent and normally distributed (with E(e) = 0 and var(e) = Isigma^2), then b is also the maximum likelihood estimator. The variance-covariance matrix of b is: var(b) = inv(X'X)sigma^2 (8) The variances are the diagonal and the covariances the off-diagonal elements.

Page 7 An unbiased estimator for sigma^2 is given by: s^2 = (Y'Y - b'X'Y) / (n-p) (9) The square root of this estimator is frequently called 'standard error of estimate'. In the printed output of the program it is indicated more properly as 'standard deviation of the error term'. Let vij be the element in the i-th row and j-th column of inv(X'X), then sdi = s * Sqrt(vii) estimates the standard deviation of bi, and cij = vij / Sqrt(vii * vjj) gives the correlation coefficient between bi and bj for i = 1,...,p and j = 1,...,p. Thus: vii = (sdi / s)^2 (10) and vij = cij * Sqrt(vii * vjj) = cij * (sdi * sdj) / s (11) A frequently used statistical measure for evaluating regression models is the multiple correlation coefficient R which is defined in the intercept model as the square root of the proportion of the corrected total sum of squares accounted for by the model. If the correction for means is denoted by nu^2, with u = Sum(i,1,n,yi)/n, then R can be defined by: R^2 = (b'X'Y-nu^2)/(Y'Y-nu^2) = 1 - (Y'Y-b'X'Y)/(Y'Y-nu^2) (12) However, we must divide Y'Y-b'X'Y by n-p, not by n, to obtain an unbiased estimator of sigma^2, moreover it is customary to divide Y'Y-nu^2 by n-1, not by n. If we adopt both modifications we obtain the adjusted multiple correlation coefficient, which can thus be defined by: adj(R)^2 = 1 - (n-1)/(n-p) * (Y'Y-b'X'Y)/(Y'Y-nu^2) (13) In the no-intercept model the correction for means is ignored, giving as definition of R^2: b'X'Y/Y'Y = 1 - (Y'Y-b'X'Y)/Y'Y, while the adj(R)^2 is defined correspondingly as: 1 - n/(n-p) * (Y'Y-b'X'Y)/Y'Y. R^2 itself is often called the 'proportion of variation explained' (cf. THEIL [11] pp. 178-179).

Page 8 1.2.1 WEIGHTED LEAST SQUARES It sometimes happens that some of the observations for the dependent variable are 'less reliable' than others. This usually means that the variances of the observations are not all equal; in other words the matrix V = var(e) is not of the form Isigma^2, but is diagonal with unequal diagonal elements. The basic idea to solve this problem is, to transform Y to other variables, which do appear to satisfy the usual tentative model assumptions, and then apply the usual (unweighted) analysis to the variables so obtained. The estimates can then be re-expressed in terms of the original variables Y (cf. DRAPER & SMITH [3] pp. 77-81). Let the original regression model be: Y = Xa + e, with E(e) = 0 and var(e) = Vsigma^2, with V diagonal with unequal diagonal elements, and let P = inv(V). Premultiplying the original regression model with Q = Sqrt(P) gives as transformed regression model: QY = QXa + Qe (14) with E(Qe) = 0 and var(Qe) = Isigma^2. The normal equations then become: (QX)'QY = (QX)'QXa (15) giving as solution if the indicated inverse matrix exists: b = inv((QX)'QX)(QX)'QY = inv(X'PX)X'PY (16) with variance-covariance matrix: var(b) = inv(X'PX)sigma^2 (17) In practical situations it is often difficult to obtain specific information on the form of V at first. For this reason it is sometimes necessary to make the (known to be erroneous) assumption V = I and then attempt to discover something about the form of V by examining the residuals from the regression analysis.

Page 9 1.2.2 RESIDUAL ANALYSIS The vector of residuals D is defined as the difference between the vector of observations Y and the vector of fitted values Z, obtained by using the regression equation Z = Xb. So D = Y - Z or di = yi - zi for i = 1,...,n. If the model is correct, the residual mean square MSE = s^2 estimates sigma^2, and the estimated standard deviation of the fitted value zi at xi = (xi1,...,xip)' is: sd(zi) = s * Sqrt(xi'inv(X'X)xi) (18) which can be used to construct a confidence interval for the expected value of yi: E(yi) at xi = (xi1,...,xip)', or to construct a prediction interval for the mean of h new observations at this point (cf. DRAPER & SMITH [3] pp. 121-122). In the first case the confidence interval is: zi +- t(n-p-1,1-alpha/2) * s * Sqrt(xi'inv(X'X)xi) (19) and in the second case the prediction interval is: zi +- t(n-p-1,1-alpha/2) * s * Sqrt(1/h + xi'inv(X'X)xi) (20) Researchers often divide the residuals di by s, resulting in the standardized residuals, which can be examined to see if they make it appear that the assumption ei/sigma ~ N(0,1) is violated (cf. DRAPER & SMITH [3] pp. 86-97). It might be expected that roughly 95% of the di/s were between the limits (-2,2). However, the variances of the residuals are not constant but a function of the X matrix (see (18)), which suggests as standardization: ti = di / s / Sqrt(1 - xi'inv(X'X)xi) (21) giving the studentized residual. The maximum studentized residual can be used in a test for detecting outliers, as follows: let t^2 = max(ti^2), then min(1, n * (1-Fisher(1, n-p-1, t^2*(n-p-1)/(n-p-t^2)))) is an 'upper bound for the right tail probability of the largest absolute studentized residual' (cf. LUND [9] pp. 473-474).

Page 10 1.3 TESTS OF HYPOTHESES 1.3.1 A PARTICULAR REGRESSION COEFFICIENT If the disturbances are mutually independent and normally distributed (with E(e) = 0 and var(e) = Isigma^2) and with a (preset) level of signifi- cance alpha, a significance test for a particular regression coefficient can be performed, or more specifically: the null hypothesis is: H0: ai = 0 (given that all other aj are in the model), which is tested against the alternative hypothesis: H1: ai is not equal to zero, by treating FRi = bi^2/var(bi) as a realization of a Fisher(1,n-p) variate. However, this test must be used with caution, because with the (preset) level of significance alpha, only one coefficient can be tested properly, while the computer output lists statistics for all coefficients. It seems very tempting to test the coefficients serially one at a time, but one must keep in mind that in doing so the level of significance of the whole test rises above the nominal value (cf. DRAPER & SMITH [3] p. 65). 1.3.2 ANALYSIS OF VARIANCE In the analysis of variance table the different contributions to the total uncorrected sum of squares Y'Y (which is the first part in the table) are given (cf. DRAPER & SMITH [3] pp. 57 & 62). The second part of the table assumes the presence of an (unknown) constant term in the model; if this term is absent, the 'mean'-line disappears and in the 'regression'-line p-1 changes into p and b'X'Y-nu^2 changes into b'X'Y. The third part of the table is only present when repeated observations for the dependent variable are available, in which case:

Page 11 k is the number of groups of replications, mi is the number of replications in group i, and W = (w1,...,wk)', with wi = Sum(j,1,mi,yij) / Sqrt(mi), for i = 1,...,k. The fourth part of the table is only present if a reduction is requested and possible. SSQ then stands for the residual sum of squares from a regression analysis with the first p-q out of the p independent variables (1 <= q <= p-1), while SSE stands for the residual sum of squares from a regression analysis with the original p independent variables. Analysis of variance source of right tail variation df sum of squares mean square F-ratio probability --------------------------------------------------------------------------- total n Y'Y --------------------------------------------------------------------------- mean 1 nu^2 MSM = nu^2 FRM P(FM>=FRM) regression p-1 b'X'Y - nu^2 MSR FRR P(FR>=FRR) residual n-p Y'Y - b'X'Y MSE = s^2 --------------------------------------------------------------------------- lack of fit k-p W'W - b'X'Y MSL FRL P(FL>=FRL) pure error n-k Y'Y - W'W MSP --------------------------------------------------------------------------- reduction q SSQ - SSE MSQ FRQ P(FQ>=FRQ) --------------------------------------------------------------------------- The column 'mean square' is obtained by division of the sums of squares by their corresponding degrees of freedom. The column 'F-ratio' is obtained by division of the mean squares by the residual mean square, except for the lack of fit F-ratio, which is obtained by division of the lack of fit mean square by the pure error mean square, thus: MSM = nu^2/1, MSR = (b'X'Y-nu^2)/(p-1), MSE = s^2 = (Y'Y-b'X'Y)/(n-p), MSL = (W'W-b'X'Y)/(k-p), MSP = (Y'Y-W'W)/(n-k), MSQ = (SSQ-SSE)/q, FRM = MSM/MSE, FRR = MSR/MSE, FRL = MSL/MSP and FRQ = MSQ/MSE.

Page 12 If the disturbances are mutually independent and normally distributed (with E(e) = 0 and var(e) = Isigma^2) and with a (preset) level of signifi- cance alpha, a significance tests can be performed for: 1. The mean of the observations for the dependent variable, or more speci- fically: the null hypothesis is: H0: E(u) = 0, which is tested against the alternative hypothesis: H1: E(u) is not equal to zero, by treating FRM as a realization of a Fisher(1,n-p) variate. 2. The regression equation, or more specifically: the null hypothesis is: H0: a1 = ... = ap = 0, except for the ai that denotes the constant term (if present), which is tested against the alternative hypothesis: H1: at least one of a1,...,ap is not equal to zero, by treating FRR as a realization of a Fisher(p-1,n-p) variate. 3. The adequacy (linearity) of the model, or more specifically: the null hypothesis is: H0: the linear model is adequate (that is, no model signifi- cantly improves the prediction of Y over the linear model), which is tested against the alternative hypothesis: H1: the linear model is not adequate, by treating FRL as a realization of a Fisher(k-p,n-k) variate.

Page 13 4. A subset of regression coefficients, or more specifically: suppose without loss of generality that the subset consists of the last q coefficients, then the null hypothesis is: H0: ar = ... = ap = 0, with r = p-q+1, which is tested against the alternative hypothesis: H1: at least one of ar,...,ap is not equal to zero, by treating FRQ as a realization of a Fisher(q,n-p) variate. 5. A linear combination of the regression coefficients, or more specifi- cally: the null hypothesis is: H0: c'a = m, in which c is a vector of constants with order q+1, which is tested against the alternative hypotheses: H1: c'a is not equal to m, by substituting c'a = m in the original model, shifting the known terms to the left hand part, combining the corresponding terms in the right hand part, and testing the thus derived so called 'reduced model' by treating FRQ as a realization of a Fisher(q,n-p) variate. In each case the right tail probability P(F >= FR) can be found in the last column of the analysis of variance table (cf. DRAPER & SMITH [3] pp. 63-64, 68 & 72-75).

Page 14 CHAPTER 2 THE INPUT TO THE PROGRAM 2.0 The purpose of the input system is to give the user a simple and adequate formalism to tell the program what he wants. In order to specify the regression model and the correspondence between the variables in the model and the numbers in the input data, the user must provide a so called user program, which consists of one or more jobs which in their turn are made up out of one or more specifications. Each specification starts with a keyword and terminates at the next keyword. The keywords are: "Model", "Input", "Options", "Data", "Run", "Exit" and "Help" (quotes included). 2.1 THE MODEL SPECIFICATION To let the program know between which variables the statistician expects a certain kind of relationship, he must provide a model specifi- cation, which consists of the keyword "Model" followed by a formula (the model statement), which resembles the notation of regression models in common statistical literature quite closely. For instance: "Model" y = alpha0 + alpha1 * x1 + alpha2 * x2; A model formula consists of an identifier to denote the dependent variable (the left hand part), followed by an '=' (equal), followed by the sum of a number of terms (the right hand part), while it is terminated with a ';' (semicolon). Each term must be the product of an identifier to denote the parameter (which is to be estimated) and an identifier to denote the independent variable. An exception is made for the optional constant term, which is given as a single identifier denoting that constant term, and which may be placed anywhere in the model. Each identifier must start with a letter and is allowed to contain any number of letters, digits and blanks. As most peripheral equipment of a computer is unable to process sub- or superscriptions or Greek letters, we write alpha0, alpha1 and alpha2. Identifiers have no inherent meaning, but

Page 15 serve for the identification of variables, parameters and functions. They may be chosen freely (except for the twentyone standard function names and the ten option names, cf. section 2.1.1 and section 2.3). It is advised not to use the same identifier to denote two (or more) different quantities; for regression parameters, however, it will not lead to fatal errors, whereas for the dependent and independent variables distinguishable identifiers must be used indeed. Correct model formulae are for instance: "Model" y variable = constant + parameter * x variable; and "Model" depvar = const + beta1 * xvar1 + beta2 * xvar2; 2.1.1 TRANSFORMATIONS Almost all transformations a user would like to perform on his input data fit quite naturally in the model formula: each transformation is expressed as a formula itself. If, for instance, the user wants to include in the model formula as an independent variable the natural logarithm of the sum of two other variables, called xvar1 and xvar2, he writes: Ln (xvar1 + xvar2) In model formulae the operators '+' (plus), '-' (minus), '*' (asterisk), and '/' (slash) are allowed, all with their conventional meaning of addition, subtraction, multiplication and division respectively. Of course the normal operator precedence rules are obeyed. Special operators are: ':' (colon), integer division and '^' (uparrow), exponentiation. The operation term : factor is defined only for operands both of type integer and will yield a result of type integer, with the same sign as would be obtained by normal division, while the magnitude is found by dividing the two quantities and taking the whole part; mathematically it can be defined as: a : b = Sign (a / b) * Entier (Abs (a / b)), for instance: 5 : 2 = 2 and -7 : 2 = -3. The operation factor ^ primary denotes exponentiation, where the factor is the base and the primary is the exponent, for instance: 5 ^ 2 = 25 and 2 ^ 3 ^ 2 = 64 but 2 ^ (3 ^ 2) = 512.

Page 16 Also the following twentyone standard functions are allowed: Abs (E), Sign (E), Sqrt (E), Sin (E), Cos (E), Tan (E), Ln (E), Log (E), Exp (E), Entier (E), Round (E), Mod (E1, E2), Min (E1, E2), Max (E1, E2), Arcsin (E), Arccos (E), Arctan (E), Sinh (E), Cosh (E), Tanh (E) and Indicator (E1, E2, E3) in which E, E1, E2 and E3 are expressions in terms of variables, operators and standard functions. Round (E) is defined as: Entier (E + 0.5) and Indicator (E1, E2, E3) is defined as: IF E1 <= E2 <= E3 THEN 1 ELSE 0. The dependent variable may be transformed in a similar way and as a consequence the model formula in its most general form looks like: "Model" G (y) = b0 + b1 * F1 (x1,...,xm) + ... + bp * Fp (x1,...,xm); Some examples of transformed model formulae are: "model" y = a0 + a1 * Sqrt (x1 + x2) + a2 * Sqrt (x3); and "MODEL" Arcsin (Sqrt (Y)) = A0 + A1 * X + A2 * X ^ 2; A user can specify model formulae in which terms with known regression coefficients appear, by subtracting those terms from the left hand part of the model formula, for instance: "Model" y - 5.4321 * x3 = a0 + a1 * x + a2 * x ^ 2; This applies especially to the constant term; if this term is known it must be shifted to the left hand part. If weights are present in the input data (or can be computed out of the input data), to indicate that the variances of the observations are not all equal (cf. section 1.2.1), the left hand part of the model formula can be expanded with a so called weight part (which can be an expression), preceeded by a '&' (ampersand), for instance: "Model" Depvar & Max (Abs (Weight), 10) = Const + Param * Indepvar;

Page 17 2.2 THE INPUT SPECIFICATION To indicate which numbers or series of numbers in the input data belong to which variable in the model formula and which numbers can be skipped, the program expects an input specification. It consists of the keyword "Input", followed by a formula (the input statement) which describes the arrangement of the observations in the input data, while it is terminated with a ';' (semicolon). The basic idea is that numbers from the input data are identified with the names from the input formula in such a way that (in order of entry) numbers belonging to the same name are put in a queue appended to that name. For instance: "Input" 100 * (codenr, 10 * [yvar], [xvar1, xvar2], -1); means that one hundred series of numbers (each, as a check, terminated in this example by -1) are present in the input data. Each series consists of fourteen numbers: first one value which is read and assigned to the name codenr, then ten values for the name yvar, then one value for the name xvar1, followed by one value for the name xvar2 and finally the value -1. The basic constituent of an input formula is a variable enclosed in square brackets, in the example: [yvar]. The corresponding number from the input data will be appended to the queue for that name. Several variables can be put together in a variable list by separating them by commas and enclosing them in square brackets, in the example: [xvar1, xvar2]. This only serves to save the writing of several opening and closing brackets. Separate numbers, series or blocks of numbers can be treated by putting a repetition factor (control) followed by an asterisk in front of a variable list (or in front of an input formula which must then be enclosed in parentheses), in the example: 100 * and 10 * . If a repetition factor is 1, it may be omitted together with the asterisk and a parentheses pair, but square bracket pairs must remain. When a name is used as a repetition factor, a value must already have been assigned to it, which is done by giving that name, without square brackets and followed by a comma (or closing parenthesis), earlier in the input formula than the use of that name as a repetition factor. The corres-

Page 18 ponding number from the input data is then assigned as a value to that name. If such names are used repeatedly in the input formula, the corresponding numbers from the input data are compared with the first one and, in the case of inequality, an error message is supplied. This may serve as a check against shifted data reading. A similar check can be obtained by giving an explicit number followed by a comma (or closing parenthesis), in the example: the -1. The corresponding number from the input data is then compared with that given number and, in the case of inequality, an error message is produced. Also an expression is allowed as a repetition factor, or for that matter, as a check value, provided that it is enclosed in angle brackets, for instance: <k+n>. As in the case of single names used as a repetition factor each (non-standard function and non-option) name used in such a (special) expression must have been given, followed by a comma, earlier in the input formula than the use of that name in the expression. The linkage between the model formula and the input formula is established by using the same names in the model terms and in the input variable lists. Numbers from the input data that belong to such input names will be treated as observations for the model variables, while numbers that belong to input names between square brackets which do not appear in the model formula, are skipped. Often, repeated observations for the dependent variable are available. In order to be able to process these observations automatically, it is necessary that a variable list consisting entirely of dependent variables is preceeded by a repetition factor (followed by an asterisk) indicating the number of repetitions. If a variable list contains independent as well as dependent variables, the number of replications is assumed to be 1. A series of (say 100) observations for a dependent variable with no replications is denoted as: 100 * ([dep var]) The repetition factor in front of the opening square bracket is omitted (because it is 1), although the parentheses are not. Without the parentheses it would mean 100 replications of [dep var].

Page 19 EXAMPLE "Input" k, n, <k+n> * (c, m, m * [y], [x1,x2,x3,x4], c), -99; means that: the first number is read and its value assigned to k, the next number is read and its value assigned to n, then k+n times the following happens: a number is read and its value assigned to c, the next number is read and its value assigned to m, then the m replications for y are read, next the observations for x1, x2, x3 and x4 are read, then a number is read and its value compared with c, finally a number is read and its value compared with -99. If the comparisons fail, an error message is supplied and execution of the job is terminated, otherwise (k+n) observations for x1, x2, x3, x4 and for each quadruple m replications for y, have been identified. 2.3 THE OPTION SPECIFICATION It is possible to have the program perform some tasks optionally by providing an option specification in a job. It consists of the keyword "Options" followed by a list of option identifiers or corresponding option numbers (the option statement), separated by commas and terminated with a ';' (semicolon). The following ten options are available: option number option name 1 Transformed data matrix 2 Correlation matrix 3 Residual analysis 4 No regression analysis 5 Process submodels 6 Print input data 7 No input data rewind 8 Save original model 9 Test reduced model 10 Missing values

Page 20 Options 1, 2, 3 and 6 cause the corresponding piece of information to be printed. However, option 1 lists only those (possibly transformed) variables that are present in the model formula in a neat tabular form, while option 6 lists all the original input data serially (eleven numbers per line) without any special layout, because the input data consists (by definition) of an unstructured series of numbers (cf. section 2.4). Option 4 suppresses the regression analysis; it is meant to be used in combination with option 1 and/or 2. Option 5 causes the program to process submodels, which are formed by a form of backward elimination: each time the last term from the right hand part from the model formula is omitted, by deleting the last column from the design matrix, and a regression analysis is performed with the reduced design matrix. Messages are generated about which terms are omitted, while further processing of the job ceases when the resulting model formula is of the form: y = c. Moreover a test is made (under the usual assumptions) whether the omitted terms did contribute significantly to the regression sum of squares (cf. section 1.3.2.4). To option 5 a specifier list may be appended, to prevent the pro- duction of waste output for unwanted submodels. In this list the number of terms to be omitted from the model formula (counting backwards, starting at the end) must be given enclosed in parentheses. For example the option: process submodels (6, 10) instructs the program to process only two submodels, one with the last six terms omitted and one with the last ten terms omitted (from the original model formula). If the user asks for more terms to be omitted than are present in the model formula, an error message is supplied and the execution of that job is terminated. Moreover, if no explicit specifier list is appended to option 5, the options 2 and 3 yield no effect (even if specified), which is also to prevent the production of waste output for the submodels. Option 7 gives the user the opportunity to process consecutive pieces of input data in consecutive jobs. Normally the processing of the input data for each job starts with the first number in the data specification (or with the first number in the datastream), and the program gives a (warning) message if the input formula does not match the input data

Page 21 precisely. This option disengages the message and causes the program to continue processing input data where the previous job had finished. Option 8 causes the residual degrees of freedom and residual sum of squares from the current job to be saved, in order to be able in the next job, by means of specifying option 9, to test whether the model under consideration in that next job, shows a significant increase in residual sum of squares in comparison with the model in the previous job. In effect this gives the possibility of testing a hypothesis concerning a linear combination of the parameters from a model (cf. section 1.3.2.5), for instance (cf. SEARLE [11] pp. 121-122): "Model 1" y = b1 * x1 + b2 * x2 + b3 * x3; "Options" Save original model; "Run" "Model 2" y - 4 * x1 = b2 * (x1 + x2) + b3 * x3; "Options" Test reduced model; "Run" causes the null hypothesis: b1 = b2 + 4 to be tested (in the second job). Option 10 may be used to identify some observations or repetitions as 'missing'. In a specifier list, appended to this option, the missing values must be given enclosed in parentheses. When a repetition equal to a missing value is encountered in the input data, the corresponding set of repetitions for the dependent variable(s) is not included in the design matrix. When an observation equal to a missing value is encountered in the input data, or when none of the repetitions are included in the design matrix, the corresponding set of observations for the independent variables together with the (possibly empty) set of repetitions for the dependent variable(s) (i.e. the 'case') is not included in the design matrix. 2.4 THE DATA SPECIFICATION The data specification consists of an unstructured series of numbers (the data statement), preceeded by the keyword "Data" and terminated at the next keyword. The structure is imposed onto it by the input formula.

Page 22 A sequence of symbols is considered a number when it satisfies the definition of number in appendix 2, together with the machine dependent restrictions imposed by the underlying SIMULA programming system. Note that the definition of number does not allow FORTRAN-like numbers as for instance 10. or 1# (at least one digit must follow). It is recommended always to use blanks as delimiters between numbers, but the use of other non-numerical symbols as delimiters will not lead to fatal errors, only to a slight increase in processing time. If a datafile is specified in response to the datastream request, or is appended to the "Run" keyword (cf. appendix 1), the program will try to read a record of input data, which then do not have to be preceeded by the keyword "Data", from that file. The record ends at the next "Eor" keyword. However, a non empty data specification in the user program will get priority over reading input data from the specified file, while an empty data specification causes the program to start reading the next record of input data from the specified file. If the data specification as well as the next nonempty record in the data file does not contain any numerical information, an error message is supplied. EXAMPLES real number value 1.234 1.234 .98 0.98 -0.5673#2 -56.73 +.02#-1 0.002 #+3 1000.0 2.5 THE USER PROGRAM In a user program several jobs can be submitted to the program. Each job is separated from its preceding one by the keyword "Run", while the entire user program is terminated with the keyword "Exit". In the first job, the model, input and data specification must be given in some order. The option specification is, of course, optional. In each following job a specification which is not changed may be omitted, the program then retains the last given specification. If options have been specified in a previous

Page 23 job and one wants to delete them, this is done by providing a new option specification which may be empty if no options are to be executed (that is by only providing: "Options";). In front of each job or in front of the keyword "Exit" a text may be given for further identification of the output of a job or of the output of the entire user program. The use of quotes in that text should be avoided in view of confusion with the keywords. The program starts reading a (possibly empty) text at the beginning of the next line after the keyword "Run" of the previous job (with the first job the program starts with the first line in the inputstream). 2.6 EXAMPLES The following user program can be submitted without any modification to the Multiple Linear Regression Analysis program. It consists of four jobs, three of which are preceeded by an identifying header, while the whole user program ends with an identifying trailer: ********************************** * Example 1 originates from: * * DE JONGE [4], pp. 472 & 479. * ********************************** "Model" y = c * Log (x) + a + b * x; "Input" 5 * ([x], 10 * [y]); "Options" Transformed data matrix, Correlation matrix, Residual analysis, Process submodels (1, 2); "Data" 25 0.67 0.70 0.75 0.76 0.78 0.80 0.83 0.84 0.88 0.89 50 0.88 0.92 0.93 0.96 0.98 1.00 1.01 1.03 1.06 1.07 80 0.96 0.98 0.99 1.03 1.05 1.06 1.08 1.11 1.15 1.17 130 1.07 1.09 1.11 1.13 1.14 1.14 1.19 1.22 1.25 1.29 180 1.10 1.13 1.17 1.19 1.20 1.21 1.23 1.25 1.28 1.33 "Run"

Page 24 ********************************** * Example 2 originates from: * * SEARLE [11], pp. 121-123 * ********************************** "Input" 5 * [y, x1, x2, x3]; "Data" 8 2 1 4 10 -1 2 1 9 1 -3 4 6 2 1 2 12 1 4 6 "Model 1" y = a3 * x3 + a2 * x2 + a1 * x1; "Options" Save original model, Process submodels (1); "Run" "Model 2" y - 4 * x1 = b2 * (x1 + x2) + b3 * x3; (eqn. 118, p. 121) "Options" Test reduced model, Transformed data matrix; "Run" **************************************** * Example 3 originates from: * * AFIFI & AZEN [1], pp. 88 & 93-100. * **************************************** "Model" y = alfa0 + alfa1 * x; "Input" 5 * ([x], n, n * [y]); "Option" Transformed data matrix, Print input data; "Data" 1 4 1.1 0.7 1.8 0.4 3 5 3.0 1.4 4.9 4.4 4.5 5 3 7.3 8.2 6.2 10 4 12.0 13.1 12.6 13.2 15 4 18.7 19.7 17.4 17.1 "Run" *** Marten van Gelderen; Mathematisch Centrum *** "Exit"

Page 25 CHAPTER 3 THE OUTPUT FROM THE PROGRAM 3.1 STANDARD AND OPTIONAL PRINTED OUTPUT After having read the keyword "Run", the processing of the job is initiated. First the model, input and option texts are printed in this order. Next an attempt is made at translating the specifications. Errors against syntax or semantics cause error messages to be printed below each specification, while further processing of that job ceases. Note that the processing of the next job, if present, will be of little or no use unless the specification which developed the error(s) is changed. Next the (transformed) data matrix is formed and passed to the regression routines, which supply the following printed output in the order indicated: 1) a listing of the original input data (option 6), 2) the (transformed) data matrix (option 1), 3) per (transformed) variable the: mean, standard deviation, minimum and maximum, 4) the correlation matrix of the (transformed) variables (option 2), 5) the multiple correlation coefficient (with adjustment), 6) the proportion of variation explained (with adjustment), 7) the standard deviation of the error term, 8) the estimates for the regression parameters with estimated standard deviation, F-ratio and right tail probability, 9) the correlation matrix of the estimates (option 2), 10) the analysis of variance table, 11) the residual analysis (option 3). Ad 1) cf. section 2.4. Ad 2) The transformed data matrix gives the input data after possible transformations according to the model specifications have been applied. If the model formula contains no transformations, the original input data are given. The dependent variable is given as a separate column. In the case of replications for the dependent variable, the mean value of them is given, and the number of

Page 26 replications is given as an extra (last) column. If a weight- variable (or -expression) is specified in the model formula, the (transformed) data comprising the weights are given as an extra (last) column. Each (transformed) independent variable is indicated by its corresponding parameter. This originated from the fact that it is not obvious how to denote a variable which is transformed like: Arcsin (Sqrt (y+25)), with 'Arcsin', with 'Sqrt' or perhaps with 'y' itself. The dependent variable is indicated by 'dep.var.'. Ad 4) and 9) The matrix of the estimated correlation coefficients of the variables and of the estimates are both supplied depending on whether option 2 is specified or not. Ad 5), 6) and 7) cf. section 1.2. Ad 8) The F-ratio and right tail probability give the user the opportunity to test the significance of a particular regression coefficient (cf. section 1.3.1). Ad 10) The layout of the table closely resembles that of the table in section 1.3.2. The F-ratios and right tail probabilities give the user the opportunity to test the significance of all the regression coefficients or of a subset or combination thereof or to test the adequacy of the (linear) model (cf. section 1.3.2). Ad 11) A table of observations, fitted values, standard deviations of the fitted values, residuals, standardized residuals and studentized residuals is provided (cf. section 1.2.2). As a check on computa- tions, the sum of the residuals is also given. If an unknown constant term is present in the model formula, this sum should be zero. Furthermore the upperbound for the right tail probability of the largest absolute studentized residual is given. Without options specified, the printed output from the program consists of 3), 5), 6), 7), 8) and 10). If option 5 is specified, the output for the model itself is given as specified by the other options, but for the submodels it depends on the use of a submodel specifier list. Without that list the output from the options 1, 2 and 3 is suppressed (even if those options are specified). With that list only the superfluous parts of the output (that is the transformed data matrix and the correlation matrix of the variables) are suppressed.

Page 27 3.2 STANDARD AND OPTIONAL DATA OUTPUT If an outputfile is specified in response to the outputstream request, or is appended to the "Run" keyword (cf. appendix 1), the program writes the following pieces of information in one record to that file: 1) if option 1 is specified: the transformed data matrix, preceded by the number of rows and columns respectively, 2) if option 2 is specified: the corrrelation matrix of the variables, preceded by its order, 3) the number of submodels specified in the list appended to option 5, if that option is specified at all; otherwise the number 1. It is followed by (for each (sub) model): the number of estimated parameters, the estimates for the parameters of the (sub) model, and: a) if option 2 is specified: the variance-covariance matrix of the estimates, preceded by its order, (BE CAREFUL: this is NOT the correlation matrix of the estimates, which is printed; however, the correspondence between the two matrices is established by the relations (10) and (11) in "Help"/Theory), b) if option 3 is specified: the number of respondents, followed by for each respondent the: observation, fitted value, standard deviation, residual, standardized residual and studentized residual, and finishes by writing an "Eor" keyword. As in the case of printed output, the output described in 3) is only effected for submodels, if an explicit submodel specifier list is appended to option 5. An input specification to describe one record of data written to the outputstream when options 1, 2, 3 and 5 (with a submodel specifier list appended to it) are specified, could read: "Input" n, m, n * (m * [transformed data element]), p, <p * (p+1) : 2> * [correlation element], s, s * (t, t * [estimate], q, <q * (q+1) : 2> * [covariance element], r, r * (6 * [residual element]) );

Page 28 For the original model the following relations hold: q = t, t = m-1, r = n and p = m (or p = m-1 if replications and/or weights are specified); s is the number of processed (sub)models; for each submodel t and q are decreased with the number of terms that are omitted from the original model. Real numbers in the printed output are given in fixed point format with a six decimal fractional part, the only exceptions are the estimates for the regression parameters with their standard deviations, which have a ten decimal fractional part and the numbers in the listings of the input data and the transformed data matrix, which have a three decimal fractional part. Real numbers in the data output are given in floating point format with a sixteen decimal mantissa and a two decimal exponent part. 3.3 ERROR MESSAGES Error messages against syntax or semantics have the following layout: Error : <error text> or <error number> The error text corresponding to the error numbers is: 1 No input data given. 2 All input data has been skipped. 3 Attempt to process more input data than provided. 4 Number in the input data is incorrect or too large. 5 In a number '.' is not followed by a digit. 6 In a number '#' is not followed by '+', '-' or a digit. 10 No model formula given. 11 Left hand part is not followed by '='. 12 Expression is not followed by ')'. 13 Option name used in a primary in an expression. 14 Incorrect primary in a factor in an expression. 15 Incorrect (control) identifier in an expression. 16 Parameter list of a standard function is not followed by ')'. 17 Standard function call with incorrect number of parameters.

Page 29 20 No input formula given. 21 Expression in a control is not followed by '>'. 22 Option name used in a control in an input statement. 23 Input statement in a description is not followed by ')'. 24 Variable list in a description is not followed by ']'. 25 Incorrect description in an input statement. 26 Incorrect identifier in a variable list. 27 Item in a variable list is not an identifier. 30 Incorrect option number in an option statement. 31 Incorrect option name in an option statement. 32 Specifier list is not followed by ')'. 33 Number in a specifier list is incorrect or too large. 34 Specifier list is appended to incorrect option. 35 Specifier is not a number. 36 Specification is not properly continued. 37 Specification is not terminated with ';'. 40 No defined (independent) identifier to the right of '='. 41 Incorrect use of a parameter in a regression term. 42 Undefined (weight) identifier to the left of '='. 43 Undefined (dependent) identifier to the left of '='. 44 Number in a regression term is incorrect or too large. 45 Term does not have the form: param * factor or factor * param. 46 Undefined (independent) identifier in a regression term. 47 No regression parameter in a regression term. 50 Division by zero. 51 Integer division by zero. 52 Observation for dependent variable is in absolute value too large. 53 Observation for independent variable is in absolute value too large. 54 Exponentiation with zero base and non positive exponent. 55 Exponentiation with negative base and real exponent. 56 Weight factor is not positive. 60 Argument of 'Sqrt' is negative. 61 Argument of 'Ln' is not positive. 62 Argument of 'Log' is not positive.

Page 30 63 Argument of 'Exp' is too large. 64 Argument of 'Arcsin' is in absolute value larger than one. 65 Argument of 'Arccos' is in absolute value larger than one. 66 Argument of 'Sinh' is in absolute value too large. 67 Argument of 'Cosh' is in absolute value too large. 70 Number of observations for the first dependent variable is zero. 71 Numbers of observations for the dependent variables are not equal. 72 Number of observations for the first independent variable is zero. 73 Numbers of observations for the independent variables are not equal. 74 Control reads an incorrect number in the input data. 75 Numbers of replications for the dependent variables are not equal. 76 Given, read or computed replication factor is not integral. If the error number lies between: 5 and 37, it is followed by the most recently processed identifier, number and symbol. Only the first eight characters of each name are displayed. 41 and 47, it is followed by the number of the right hand part regression term which causes the error, or a zero if the left hand part is at fault. 50 and 67, it is followed by the wrong value and the number of the line in the transformed data matrix which causes the error. Instead of the wrong value, the number of the right hand part regression term which causes the error is displayed when the error number lies between 50 and 53. 70 and 76, it is followed by the check value and the wrong value. Instead of the wrong value, the value of the controlling variable of the next enclosing repetition loop is displayed when the error number is 76. 3.4 EXAMPLES An impression of the printed output of the Multiple Linear Regression Analysis program may be obtained by actually submitting the user program in section 2.6 (which resides on: Hlp: Mulexa.hlp) to the MULREG program.

Page 31 APPENDIX 1 TECHNICAL REMARKS The following technical remarks reflect the SIMULA implementation of the Multiple Linear Regression Analysis program as of version 5H(246). Any comments or queries concerning the functioning of the software described in this document should be addressed to: Marten van Gelderen, IKO Computer Systems Group, Postbox 4395, 1009 AJ Amsterdam, The Netherlands. (telephone: 31-(0)20-930951). The program resides on a device called USR:. It is started by typing: .R MULREG, and responds to the standard output device (usually TTY:) with an identifying header and requests to specify four files, as follows: Multiple Linear Regression Analysis Enter file specifications Inputstream : Printstream : Datastream : Outputstream : The inputstream serves to read the user program from; the printstream receives the printed output from the program; the datastream serves to read the separate input data records from; the outputstream receives the data output records from the program. If the default carriage return is responded to the data- and output- stream requests, the program assumes that no separate input data are present and that no data output is required. If the default carriage return is responded to the input- and print- stream requests, the program connects the standard input and output devices (usually TTY:) to the input- and printstream respectively. To notify that the inputstream is connected to the standard input device, the program dis- plays the prompting character '*' (asterisk).

Page 32 If both the input- and printstream are connected to the standard input and output devices, the program echoes every text it cannot interpret properly, preceeded by the error character '?' (question mark). However, in response to a single carriage return the text 'For help type: "Help"' is displayed. To the "Run" keyword a file specification list may be appended, pre- ceeded by the character '/' (slash), with the following general format: /print-spec;output-spec=input-spec;data-spec Each of the specifications will be connected to the corresponding program stream respectively. Also, each of the specifications may be omitted, which means that nothing will be changed to the corresponding stream at all. However, if the character ' ' (blank) is substituted for one (or more) of the specifications, the defaults - as described previously - will be connected to the corresponding streams respectively. If the specified files do not exist (for input- or datastreams) or cannot be created (for print- or outputstreams), the corresponding streams are displayed again, followed by the character '?' (question mark), to indicate the erroneous situation and to enable the specification of other files (or defaults). If the program encounters a premature end-of-file condition in the inputstream, it will connect the inputstream to the standard input device and thus respond with the prompting character. New specifications for the "Model", "Input", "Options" or "Data" (or keywords like "Run" or "Exit") may then be entered. In the inputstream, the program does not discriminate between upper and lower case letters. Identifiers may contain any number of blanks, however, a carriage return is not permitted, restricting the maximum length of identifiers to the maximum number of characters in one input line. In front of and following the opening quote of keywords only non-printing ASCII characters like tabs and/or blanks are permitted, otherwise the keyword (and the whole line following it) is not recognized.

Page 33 Two implementation dependent restrictions are imposed on user pro- grams: the maximum number of differently spelled identifiers and numbers is 789 and the maximum number of nested parentheses is 62. In addition two machine dependent restrictions are imposed: the maximum number of characters in one input line is 132 and the number of significant digits in computations is 18. In response to the keyword "Help", the information in this appendix (which resides on HLP:) is copied to the printstream. Each of the following switches may be appended to the "Help" keyword, preceeded by the character '/' (slash), in order to obtain more detailed information (which also resides on HLP:). /Theory Regression model & least squares (section 1.1 & 1.2), /Tests Possible tests of hypotheses (section 1.3), /Model Specification of the model formula (section 2.1), /Input Specification of the input formula (section 2.2), /Options Possible options and their effects (section 2.3), /Data Acceptable numbers and their delimiters (section 2.4), /User Setup of a complete user program (section 2.5), /Example Example of a complete user program (section 2.6), /Print Standard and optional printed output (section 3.1), /Output Standard and optional data output (section 3.2), /Errors Meaning of the error numbers (section 3.3), /Syntax Definition of the syntax of a user program (appendix 2). If no help information is available an appropriate message is displayed.

Page 34 APPENDIX 2 DEFINITION OF THE SYNTAX OF A USER PROGRAM The syntax of a user program is defined in an extended version of a notation known as the Backus Naur Form, in short: BNF (cf. NAUR [10]). The extensions comprise an explicit repetition and optionality construct together with the possibility of factorization. The BNF may be regarded as a metalanguage for the description of a user program. In addition to the symbols that are admissible in a user program, the metalanguage requires a number of extra symbols, called metasymbols. The ten metasymbols used in extended BNF are: ::=, |, <, >, {, }, [, ], ( and ). The , and . are part of the metalanguage English in which we are describing BNF. We write: <expression>::= ['+' | '-'] <term> { ('+' | '-') <term> } The metasymbols < and > are used as delimiters to enclose the name of a class. The metasymbol ::= may be read as 'is defined as' or as 'consists of'. The metasymbol | is read as 'or'. Repetition is denoted by curly brackets, i.e. { a } stands for e | a | aa | ... Optionality is expressed by square brackets, i.e. [ a ] stands for e | a. Parentheses merely serve for grouping (factorization) i.e. (a | b) c stands for ab | ac. Terminal symbols appear enclosed in single apostrophes. The above phrase defines an expression as a term, optionally preceeded by a '+' or a '-' and followed by an arbitrary repetition of terms, each preceeded by a '+' or a '-'. The syntax of a user program can thus be defined as follows:

Page 35 <letter>::= 'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|'I'|'J'|'K'|'L'|'M'| 'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z'| 'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'| 'n'|'o'|'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z' <digit>::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' <model keyword>::= '"Model"' | '"MO"' <input keyword>::= '"Input"' | '"IN"' <option keyword>::= '"Options"' | '"OP"' <data keyword>::= '"Data"' | '"DA"' <run keyword>::= '"Run"' | '"RU"' <exit keyword>::= '"Exit"' | '"EX"' <function name>::= 'Abs' | 'Sign' | 'Sqrt' | 'Sin' | 'Cos' | 'Tan' | 'Ln' | 'Log' | 'Exp' | 'Entier' | 'Round' | 'Mod' | 'Min' | 'Max' | 'Arcsin' | 'Arccos' | 'Arctan' | 'Sinh' | 'Cosh' | 'Tanh' | 'Indicator' <option name>::= 'Transformed data matrix' | 'Correlation matrix' | 'Residual analysis' | 'No regression analysis' | 'Process submodels' | 'Print input data' | 'No input data rewind' | 'Save original model' | 'Test reduced model' | 'Missing values' <option number>::= '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | '10' <number>::= ['+' | '-'] <unsigned number> <unsigned number>::= <decimal number> | <exponent part> | <decimal number> <exponent part> <decimal number>::= <unsigned integer> | <fractional part> | <unsigned integer> <fractional part> <exponent part>::= '#' <integer> <fractional part>::= '.' <unsigned integer> <integer>::= ['+' | '-'] <unsigned integer> <unsigned integer>::= <digit> { <digit> }

Page 36 <identifier>::= <letter> { <letter> | <digit> } <data specification>::= <data keyword> [ <input data> ] <input data>::= <number> { <number> } <option specification>::= <option keyword> [ <option statement> ] ';' <option statement>::= <option> { ',' <option> } <option>::= <simple option> [ '(' <specifier list> ')' ] <simple option>::= <option name> | <option number> <specifier list>::= <specifier> { ',' <specifier> } <specifier>::= <number> <input specification>::= <input keyword> <input statement> ';' <input statement>::= <input part> { ',' <input part> } <input part>::= <control> | <description> | <control> '*' <description> <control>::= <number> | <identifier> | '<' <expression> '>' <description>::= '(' <input statement> ')' | '[' <variable list> ']' <variable list>::= <variable> { ',' <variable> } <variable>::= <identifier> <model specification>::= <model keyword> <model statement> ';' <model statement>::= <left hand part> '=' <right hand part> <left hand part>::= <expression> [ '&' <weight part> ] <weight part>::= <expression> <right hand part>::= ['+'] <term> { '+' <term> } <expression>::= ['+' | '-'] <term> { ('+' | '-') <term> } <term>::= <factor> { ('*' | '/' | ':') <factor> } <factor>::= <primary> { '^' <primary> } <primary>::= <unsigned number> | <identifier> | <function designator> | '(' <expression> ')' <function designator>::= <function name> [ '(' <parameter list> ')' ] <parameter list>::= <parameter> { ',' <parameter> } <parameter>::= <expression> <user program>::= { <job> } <exit keyword> <job>::= { <specification> } <run keyword> <specification>::= <model specification> | <input specification> | <option specification> | <data specification>

Page 37 APPENDIX 3 TECHNICAL DESCRIPTION OF THE PROGRAM In this appendix a more or less technical description of the program is given in terms of variables, procedures and control flow. Basically the program logic is as follows: Init program Enter & open files Job: Read job Run: Init compiler tables Compile model & input Init data buffers Execute (to produce design matrix) Regression analysis Print results GOTO Job Exit: Close files In case of an errorsituation in one of the sections, no further action is undertaken and control is transferred to Job. 1. The program is initialized by setting various declared system and compiler constants to their appropriate values. Some of them are highly machine dependent constants, others are internal codings or table limits. The four basic informationstreams in the program: the input-, print-, data- and outputstream are connected to the external file specifications, entered by the user, in the procedure enterfiles. 2. A job is read specification by specification (after printing and skipping leading text by a call on echotext) by means of calls on readtext or readdata, depending on which specification from the job is read. Both procedures first set up a buffer administration as will be described in section 3, then process an input line or input numbers and finally call on readline to obtain further information. Their task is finished when the

Page 38 variable endtext becomes true, that is when the next input line starts with a '"' (quote), which is the beginning of what can be a keyword, or when the end-of-file condition is met. If an input data record is to be read from a file, the same administration is setup and the same routine for reading the actual numbers is used, only information is obtained from the datastream rather than the inputstream. The echoing of input text and input data is done via calls on the procedures printtext and printdata respectively. 3. The text and data storage section provides two kinds of SIMULA classes, one for text storage and another for data storage. The class textstorage provides a procedure to store lines of text into a linked list of textbuffers (starting at base). The procedure nextline retrieves those lines (after a reset). The class datastorage provides a procedure to store real numbers into a databuffer which is linked into a linked list of such buffers (starting at base). In case of buffer overflow a new buffer is created and linked into the list. A super class inputstorage provides the reset procedure and a procedure nextnumber to retrieve numbers out of the buffers in a FIFO (first in, first out) manner. Another super class rightstorage provides a procedure lastnumber, which is quite similar to nextnumber, except that it retrieves numbers out of the buffers in a LIFO (last in, first out) manner. 4. The compiler uses two tables: one called program and another called hashtable. The first table accepts the (macro) instructions generated by the compiler via calls of load. The second table is used by the procedure nextatom to identify the various items in the model and input formulae. The recognition method used is twin-prime-hashing described in KNUTH [5] p. 522. Alphanumerical items can belong to one of the following classes: identifier, number, functionname or optionname, which are all superclasses of atomcell. The procedure nextatom calls on nextchar and delivers a pointer to an atomcell of the appropriate type containing, among others, the actual text of the item and its index in the hashtable. The compiler assumes that the running system has at its disposal a (programmed pseudo) register F which is capable of handling both integer and real numbers. Furthermore a (programmed pseudo) memory organization known as a stack must be available to the running system. A stackpointer refers to the first free position in the stack. All binary operations will

Page 39 take place with the top of the stack as first operand and F as the second. The result is delivered in F and as a side effect the stackpointer is decreased by 1. When the contents of F is saved in the stack, the stackpointer is increased by 1. The fundamental idea behind most procedures for translating the model and input formulae is, that the first atom of the syntactical unit to be processed by that procedure has been read already (its 'value' being assigned to lastatom). The procedure considers itself to have finished its task after reading the first atom that no longer can belong to that unit syntactically. Meanwhile the translation of that unit has been produced. A more elaborate description of the procedure system (arithmetic) expression can be found in KRUSEMAN ARETZ [6] and [7]. Reference [8] provides the description of a complete ALGOL 60 compiler. We only mention here that every expression is transformed into a macro program that corresponds to the reversed polish form, thus: (a+b) * (c-d) ^ e becomes: ab+ cd- e ^ *. The procedure for translating the input formula must, among others, generate instructions to perform the linkage between identifiers from the model formula and numbers from the input data. While translating a model formula, identifiers to the right of the equal sign are assigned type 1, those to the left of the equal sign type 2. If these identifiers appear in a variable list the types are changed into 3 and 5 respectively. While translating an input specification, identifiers in a variable list not appearing in the model formula are assigned type 4, those in the input formula not appearing in a variable list are assigned type 6. Meanwhile instructions are generated to put the next number from the input data in the appropriate column of the (yet untransformed) design matrix, or to skip that number. For variable lists that consists entirely of identifiers not appearing in the model formula, special instructions to skip the corresponding numbers all in one, are generated. 5. In the procedure check model a check is made if the model formula, after the linkage to the input data (by means of the input formula), still satisfies some elementary statistical conditions, like:

Page 40 a) each term must be the product of a parameter and a factor, b) in that factor no identifier may appear that is not present in a variable list in the input formula. (An attempt to perform regression analysis with variables for which no input data is present may not succeed.) In the procedure check input a check is made if for each variable in the model formula an equal amount of numbers is present, moreover a check is made if all numbers in the input data have actually been processed (option 7 disengages this check). 6. The execution section of the program is activated by a call of the procedure execute which, among other things, simulates the basic cycle of a computer: next: get the instruction indicated by the programcounter increase the programcounter with 1 isolate the instruction and address part execute the instruction GOTO next This cycle ends when the programcounter tries to leave the (translated macro) program. The (macro) instructions itself are coded via the switch lists macro and macro2. 7. After the execution of the input- and modelinstructions, the (transformed) design matrix is delivered to the regression routine(s). The actual computation of the regression coefficients is done via a call of lsqdec followed by a call of lsqsol and lsqinv. The first two of these procedures are described extensively in DEKKER [2] pp. 65-69. The vector and matrix multiplication is done via calls of vecvec, matvec, tamvec and tammat, described in DEKKER [2] pp. 8-9. The algorithms for phi and Fisher are copied from CACM: algorithms 209 and 322 respectively. All other computations are straightforward.

Page 41 REFERENCES [1] AFIFI, A.A. & S.P. AZEN, Statistical Analysis; A Computer Oriented Approach; Academic Press, (1972). [2] DEKKER, T.J., ALGOL 60 procedures in numerical algebra, part 1; MC Tract 22, Mathematisch Centrum, Amsterdam. [3] DRAPER, N.R. & H. SMITH, Applied Regression Analysis; John Wiley & Sons, (1966). [4] JONGE, H. DE, Inleiding tot de Medische Statistiek, deel II; Nederlands Instituut voor Praeventieve Geneeskunde, (1960). [5] KNUTH, D.E., The Art of Computer Programming, Vol. 3, Sorting and Searching; Addison Wesley, (1973). [6] KRUSEMAN ARETZ, F.E.J., ALGOL 60 translation for everybody; Elektronische Datenverarbeitung, Vol. 6 (1964), 6, p. 233-244. [7] KRUSEMAN ARETZ, F.E.J., Programmeren voor rekenautomaten, (De MC ALGOL 60 vertaler voor de EL X8); MC Syllabus 13, Mathematisch Centrum, (1972). [8] KRUSEMAN ARETZ, F.E.J., P.J.W. TEN HAGEN & H.L. OUDSHOORN, An ALGOL 60 compiler in ALGOL 60; MC Tract 48, Mathematisch Centrum, (1973). [9] LUND, R.E., Tables for an approximate test for outliers in linear models; Technometrics, Vol. 17 (1975), 4, p. 473-476. [10] NAUR, P. (ed.), Revised report on the algorithmic language ALGOL 60; Regnecentralen, Copenhagen, (1964). [11] SEARLE, S.R., Linear Models; John Wiley & Sons, (1971). [12] THEIL, H., Principles of Econometrics; John Wiley & Sons, (1971).

Page 42