Trailing-Edge
-
PDP-10 Archives
-
decus_20tap5_198111
-
decus/20-0149/mulreg.doc
There are 2 other files named mulreg.doc in the archive. Click here to see a list.
First page
A USER PROGRAM FOR MULTIPLE LINEAR REGRESSION ANALYSIS
======================================================
Marten van Gelderen
Version 5H(246)
16-Feb-80
Copyright (C) 1975, 1979, 1980 by
Foundation Mathematical Centre, Amsterdam
Institute for Nuclear Physics Research, Amsterdam
General permission to make fair use in teaching or research of all
or part of this material is granted to individual readers and to
nonprofit organizations, provided that the copyright notice of
either the Foundation Mathematical Centre or the Institute for
Nuclear Physics Research is given and that reference is made to
this publication and to the fact that reprinting privileges were
granted by permission of one of the above mentioned organizations.
CONTENTS Page
Abstract 1
Introduction 2
Chapter 1 - Multiple linear regression analysis 4
1. The regression model 4
2. Least squares 5
2.1 Weighted least squares 8
2.2 Residual analysis 9
3. Tests of hypotheses 10
3.1 A particular regression coefficient 10
3.2 Analysis of variance 10
Chapter 2 - The input to the program 14
1. The model specification 14
1.1 Transformations 15
2. The input specification 17
3. The option specification 19
4. The data specification 21
5. The user program 22
6. Examples 23
Chapter 3 - The output from the program 25
1. Standard and optional printed output 25
2. Standard and optional data output 27
3. Error messages 28
4. Examples 30
Appendices 1. Technical remarks 31
2. Definition of the syntax of a user program 34
3. Technical description of the program 37
References 41
Page 1
ABSTRACT
Performing multiple linear regression analysis on a electronic
computer can be very laborious when preparation of the input is
complicated. The program introduced in this document is especially
designed to have flexible and comprehensible model and input
specifications. It accepts a "Model" formula which resembles the notation
of regression models in common statistical literature quite closely. An
accompanying "Input" formula provides the program with information about
the arrangement of the observations in the input "Data", which consists of
a series of numbers in freefield format. A "Run" command activates the
program, while an "Exit" command causes the program to stop. Extensive
runtime "Help" information is available. The following piece of program
may serve as an example of some of these ideas:
"Model" y = alfa0 + alfa1 * x + alfa2 * x ^ 2;
"Input" 5 * ([x], n, n * [y]);
"Options" Transformed data matrix, Process submodels (1);
"Data" 1 4 1.1 0.7 1.8 0.4
3 5 3.0 1.4 4.9 4.4 4.5
5 3 7.3 8.2 6.2
10 4 12.0 13.1 12.6 13.2
15 4 18.7 19.7 17.4 17.1
"Run"
"Exit"
Page 2
INTRODUCTION
Many computer programs or subroutines exist, which in one way or
another calculate estimates for the parameters of a multiple linear
regression model, but most of them can hardly be used by the layman.
Subroutines have to be embedded in a higher level language program to
perform the necessary input and output; as a layman however, one is not
always acquainted with these languages. The programs usually require the
input data to be presented in a specific standard form, which in practice
means that input data that are already available in a machine readable form
must be transformed into that standard form or that the input data should
be punched exactly in that standard form. More difficult still and often
more confusing, is the way in which the program is told which regression
model the user wants to consider. For instance, for the program for
multiple polynomial regression analysis of the Mathematical Centre a code
matrix of zeros and ones had to be given in order to indicate the form of
the regression polynomial. Transformations of one or more variables are
hardly ever automatically possible and require a separate program to be run
in advance to prepare the transformed data matrix.
In this document a program is described that will make it possible for
almost everyone to obtain his desired results with no more than a
superficial knowledge of the underlying programming system. It has been
recognized recently that standard programs should not bother the user too
much with awkward input specifications. A statistician is more interested
in the results of a program and in how he can obtain them without
interference of software specialists of the computer centre, than in a few
seconds gain in actual computing time. Therefore one of the objectives was
to develop a program which would enable the user to specify the form of the
regression model and the arrangement of the numbers in his input data, in a
straightforward manner, even if he has hardly any knowledge of programming
languages at all. The program thus accepts a model formula which resembles
the notation of regression models in common statistical literature quite
closely; the accompanying input formula indicates which numbers or series
of numbers in the input data belong to which variable in the model formula.
Page 3
This input scheme gives the user the opportunity to work with existing
data and to process possible transformations of the variables in the model
formula without the need for external data adjustment. However, the model
formula must be given explicitly and completely and the structure of the
input data must be known exactly before an input formula can be
constructed. In the output the results are identified with the names given
by the user in the model formula. Any technical (machine dependent) action
required to run the program is described in appendix 1.
ACKNOWLEDGEMENTS
The basic ideas for the input arrangement presented in this document
originated from A.P.B.M. VEHMEYER. The article 'ALGOL 60 translation for
everybody' by F.E.J. KRUSEMAN ARETZ [6] provided the foundation for the
translator and the execution section of the program, while the techniques
used in the storage section were designed by L.J. OOSTRIJK. The matrix and
least squares routines used in the program were copied from 'ALGOL 60
procedures in numerical algebra, part 1', by T.J. DEKKER [2]. The classic
book 'Applied Regression Analysis' by N.R. DRAPER & H. SMITH [1] served as
a frame of reference throughout the design and implementation of the
program. Numerous recently incorporated statistical improvements were
suggested by R.D. GILL.
Page 4
CHAPTER 1
MULTIPLE LINEAR REGRESSION ANALYSIS
1.1 THE REGRESSION MODEL
In a regression problem the researcher postulates a certain relation-
ship between a random variable y (the realizations of which are subject to
some form of disturbance) on the one side and a number of variables
x1,...,xp (which are without or at least almost without disturbances) on
the other side. This relationship is expressed by a mathematical formula,
which is called the (linear) regression model, for instance:
y = a0 + a1 * x1 + ... + ap * xp + e (1)
in which a0,...,ap represent unknown regression coefficients (parameters)
which are to be estimated and e represents the disturbance. If a constant
term is present in the model formula (in (1) the a0), the model is said to
be an 'intercept model', if no constant term is present, the model is
called a 'no-intercept model'.
The variables x1,...,xp and the variable y can also represent (other)
transformed variables. The researcher might have reasons to believe (from
background information concerning the experiment) that transformations are
necessary, for instance:
1) to obtain normally distributed disturbances,
2) to obtain a greater homogeneity of the variances of the disturbances,
3) to linearize non-linear regression models (if possible).
The transformed regression model can be written as:
G(y) = a0 + a1 * F1(x1,...,xm) + ... + ap * Fp(x1,...,xm) + e (2)
in which G, F1,...,Fp represent the transformations,
a0,...,ap represent the parameters to be estimated,
y represents the dependent variable,
x1,...,xm represent the independent variables,
e represents the disturbance.
Page 5
The choice of a transformation by means of 'trial and error' is rather
time consuming and costly. The importance of the location parameter makes
for the difficulty. It is not unusual that Log (x) yields no improvement,
but that Log (c+x) gives better results for a particular choice of c.
Because this holds for almost any transformation of some importance, we
must actually solve in each case a nonlinear adjustment problem. Often
though, a simple form of the transformation is suggested by the researcher
who is better acquainted with the peculiarities of the experiment.
1.2 LEAST SQUARES
Regression analysis consists in fact of the adjustment of a hyperplane
of the required dimension to the data. The fitting is done with the method
of least squares, which means that the sum of the squares of the
differences between the observed values for y and the estimated values for
the expectation of y, are minimized. This sum of squares is also called
the residual sum of squares. In matrix notation the regression model can
be written as (cf. DRAPER & SMITH [3] pp. 58-62):
Y = Xa + e (3)
in which Y is a (n*1) random vector of observations,
X is a (n*p) matrix of known (fixed) values,
a is a (p*1) vector of (unknown) parameters,
and e is a (n*1) random vector of disturbances.
It is supposed that E(e) = 0 and var(e) = Isigma^2, in which I is the unit
matrix, thus:
E(Y) = Xa (4)
The sum of squares of the differences between the observed values of Y
and the estimated values for the expectation of Y thus equals:
(Y-Xa)'(Y-Xa) = Y'Y - 2a'X'Y + a'X'Xa (5)
(for a'X'Y is a scalar and therefore equal to Y'Xa).
Page 6
Choosing as least squares estimator b that value of a which minimizes
(5), involves differentiating with respect to the elements of a and
equating the result to zero:
-2X'Y + 2X'Xb = 0, thus: X'Y = X'Xb (6)
This system is called the normal equations. If the rank of X equals p, X'X
is nonsingular and the inverse of X'X exists. In that case the solution of
the normal equations can be written as:
b = inv(X'X)X'Y (7)
Observe that p <= n must hold, in order that the rank of X can be p at all.
Therefore at least as many observations must be made, as there are
parameters in the model. Also observe that E(b) = inv(X'X)X'E(Y) = a, thus
b is an unbiased estimator of a.
The least squares estimator has the following properties:
1. It is an estimator which minimizes the sum of squares of deviations,
irrespective of any distribution properties of the disturbances. The
assumption that the disturbances are normally distributed is, of course,
necessary for tests which depend on this assumption, such as t- or F-
tests, or for obtaining confidence intervals based on t- or F-
distributions.
2. According to the Gauss-Markov theorem, the elements of b are unbiased
estimators, which have minimum variance (of any linear function of the
Y's which provides unbiased estimators), again irrespective of the
distribution properties of the disturbances.
3. If the disturbances are mutually independent and normally distributed
(with E(e) = 0 and var(e) = Isigma^2), then b is also the maximum
likelihood estimator.
The variance-covariance matrix of b is:
var(b) = inv(X'X)sigma^2 (8)
The variances are the diagonal and the covariances the off-diagonal
elements.
Page 7
An unbiased estimator for sigma^2 is given by:
s^2 = (Y'Y - b'X'Y) / (n-p) (9)
The square root of this estimator is frequently called 'standard error of
estimate'. In the printed output of the program it is indicated more
properly as 'standard deviation of the error term'.
Let vij be the element in the i-th row and j-th column of inv(X'X),
then sdi = s * Sqrt(vii) estimates the standard deviation of bi, and cij =
vij / Sqrt(vii * vjj) gives the correlation coefficient between bi and bj
for i = 1,...,p and j = 1,...,p. Thus:
vii = (sdi / s)^2 (10)
and
vij = cij * Sqrt(vii * vjj) = cij * (sdi * sdj) / s (11)
A frequently used statistical measure for evaluating regression models
is the multiple correlation coefficient R which is defined in the intercept
model as the square root of the proportion of the corrected total sum of
squares accounted for by the model. If the correction for means is denoted
by nu^2, with u = Sum(i,1,n,yi)/n, then R can be defined by:
R^2 = (b'X'Y-nu^2)/(Y'Y-nu^2) = 1 - (Y'Y-b'X'Y)/(Y'Y-nu^2) (12)
However, we must divide Y'Y-b'X'Y by n-p, not by n, to obtain an unbiased
estimator of sigma^2, moreover it is customary to divide Y'Y-nu^2 by n-1,
not by n. If we adopt both modifications we obtain the adjusted multiple
correlation coefficient, which can thus be defined by:
adj(R)^2 = 1 - (n-1)/(n-p) * (Y'Y-b'X'Y)/(Y'Y-nu^2) (13)
In the no-intercept model the correction for means is ignored, giving
as definition of R^2: b'X'Y/Y'Y = 1 - (Y'Y-b'X'Y)/Y'Y, while the adj(R)^2
is defined correspondingly as: 1 - n/(n-p) * (Y'Y-b'X'Y)/Y'Y. R^2 itself
is often called the 'proportion of variation explained' (cf. THEIL [11]
pp. 178-179).
Page 8
1.2.1 WEIGHTED LEAST SQUARES
It sometimes happens that some of the observations for the dependent
variable are 'less reliable' than others. This usually means that the
variances of the observations are not all equal; in other words the matrix
V = var(e) is not of the form Isigma^2, but is diagonal with unequal
diagonal elements. The basic idea to solve this problem is, to transform Y
to other variables, which do appear to satisfy the usual tentative model
assumptions, and then apply the usual (unweighted) analysis to the
variables so obtained. The estimates can then be re-expressed in terms of
the original variables Y (cf. DRAPER & SMITH [3] pp. 77-81).
Let the original regression model be: Y = Xa + e, with E(e) = 0 and
var(e) = Vsigma^2, with V diagonal with unequal diagonal elements, and let
P = inv(V). Premultiplying the original regression model with Q = Sqrt(P)
gives as transformed regression model:
QY = QXa + Qe (14)
with E(Qe) = 0 and var(Qe) = Isigma^2. The normal equations then become:
(QX)'QY = (QX)'QXa (15)
giving as solution if the indicated inverse matrix exists:
b = inv((QX)'QX)(QX)'QY = inv(X'PX)X'PY (16)
with variance-covariance matrix:
var(b) = inv(X'PX)sigma^2 (17)
In practical situations it is often difficult to obtain specific
information on the form of V at first. For this reason it is sometimes
necessary to make the (known to be erroneous) assumption V = I and then
attempt to discover something about the form of V by examining the
residuals from the regression analysis.
Page 9
1.2.2 RESIDUAL ANALYSIS
The vector of residuals D is defined as the difference between the
vector of observations Y and the vector of fitted values Z, obtained by
using the regression equation Z = Xb. So D = Y - Z or di = yi - zi for
i = 1,...,n. If the model is correct, the residual mean square MSE = s^2
estimates sigma^2, and the estimated standard deviation of the fitted value
zi at xi = (xi1,...,xip)' is:
sd(zi) = s * Sqrt(xi'inv(X'X)xi) (18)
which can be used to construct a confidence interval for the expected value
of yi: E(yi) at xi = (xi1,...,xip)', or to construct a prediction interval
for the mean of h new observations at this point (cf. DRAPER & SMITH [3]
pp. 121-122). In the first case the confidence interval is:
zi +- t(n-p-1,1-alpha/2) * s * Sqrt(xi'inv(X'X)xi) (19)
and in the second case the prediction interval is:
zi +- t(n-p-1,1-alpha/2) * s * Sqrt(1/h + xi'inv(X'X)xi) (20)
Researchers often divide the residuals di by s, resulting in the
standardized residuals, which can be examined to see if they make it appear
that the assumption ei/sigma ~ N(0,1) is violated (cf. DRAPER & SMITH [3]
pp. 86-97). It might be expected that roughly 95% of the di/s were between
the limits (-2,2).
However, the variances of the residuals are not constant but a
function of the X matrix (see (18)), which suggests as standardization:
ti = di / s / Sqrt(1 - xi'inv(X'X)xi) (21)
giving the studentized residual. The maximum studentized residual can be
used in a test for detecting outliers, as follows: let t^2 = max(ti^2),
then min(1, n * (1-Fisher(1, n-p-1, t^2*(n-p-1)/(n-p-t^2)))) is an 'upper
bound for the right tail probability of the largest absolute studentized
residual' (cf. LUND [9] pp. 473-474).
Page 10
1.3 TESTS OF HYPOTHESES
1.3.1 A PARTICULAR REGRESSION COEFFICIENT
If the disturbances are mutually independent and normally distributed
(with E(e) = 0 and var(e) = Isigma^2) and with a (preset) level of signifi-
cance alpha, a significance test for a particular regression coefficient
can be performed, or more specifically: the null hypothesis is:
H0: ai = 0 (given that all other aj are in the model),
which is tested against the alternative hypothesis:
H1: ai is not equal to zero,
by treating FRi = bi^2/var(bi) as a realization of a Fisher(1,n-p) variate.
However, this test must be used with caution, because with the
(preset) level of significance alpha, only one coefficient can be tested
properly, while the computer output lists statistics for all coefficients.
It seems very tempting to test the coefficients serially one at a time, but
one must keep in mind that in doing so the level of significance of the
whole test rises above the nominal value (cf. DRAPER & SMITH [3] p. 65).
1.3.2 ANALYSIS OF VARIANCE
In the analysis of variance table the different contributions to the
total uncorrected sum of squares Y'Y (which is the first part in the table)
are given (cf. DRAPER & SMITH [3] pp. 57 & 62).
The second part of the table assumes the presence of an (unknown)
constant term in the model; if this term is absent, the 'mean'-line
disappears and in the 'regression'-line p-1 changes into p and b'X'Y-nu^2
changes into b'X'Y.
The third part of the table is only present when repeated observations
for the dependent variable are available, in which case:
Page 11
k is the number of groups of replications,
mi is the number of replications in group i, and
W = (w1,...,wk)', with wi = Sum(j,1,mi,yij) / Sqrt(mi), for i = 1,...,k.
The fourth part of the table is only present if a reduction is
requested and possible. SSQ then stands for the residual sum of squares
from a regression analysis with the first p-q out of the p independent
variables (1 <= q <= p-1), while SSE stands for the residual sum of squares
from a regression analysis with the original p independent variables.
Analysis of variance
source of right tail
variation df sum of squares mean square F-ratio probability
---------------------------------------------------------------------------
total n Y'Y
---------------------------------------------------------------------------
mean 1 nu^2 MSM = nu^2 FRM P(FM>=FRM)
regression p-1 b'X'Y - nu^2 MSR FRR P(FR>=FRR)
residual n-p Y'Y - b'X'Y MSE = s^2
---------------------------------------------------------------------------
lack of fit k-p W'W - b'X'Y MSL FRL P(FL>=FRL)
pure error n-k Y'Y - W'W MSP
---------------------------------------------------------------------------
reduction q SSQ - SSE MSQ FRQ P(FQ>=FRQ)
---------------------------------------------------------------------------
The column 'mean square' is obtained by division of the sums of
squares by their corresponding degrees of freedom. The column 'F-ratio' is
obtained by division of the mean squares by the residual mean square,
except for the lack of fit F-ratio, which is obtained by division of the
lack of fit mean square by the pure error mean square, thus:
MSM = nu^2/1, MSR = (b'X'Y-nu^2)/(p-1), MSE = s^2 = (Y'Y-b'X'Y)/(n-p),
MSL = (W'W-b'X'Y)/(k-p), MSP = (Y'Y-W'W)/(n-k), MSQ = (SSQ-SSE)/q,
FRM = MSM/MSE, FRR = MSR/MSE, FRL = MSL/MSP and FRQ = MSQ/MSE.
Page 12
If the disturbances are mutually independent and normally distributed
(with E(e) = 0 and var(e) = Isigma^2) and with a (preset) level of signifi-
cance alpha, a significance tests can be performed for:
1. The mean of the observations for the dependent variable, or more speci-
fically: the null hypothesis is:
H0: E(u) = 0,
which is tested against the alternative hypothesis:
H1: E(u) is not equal to zero,
by treating FRM as a realization of a Fisher(1,n-p) variate.
2. The regression equation, or more specifically: the null hypothesis is:
H0: a1 = ... = ap = 0, except for the ai that denotes
the constant term (if present),
which is tested against the alternative hypothesis:
H1: at least one of a1,...,ap is not equal to zero,
by treating FRR as a realization of a Fisher(p-1,n-p) variate.
3. The adequacy (linearity) of the model, or more specifically: the null
hypothesis is:
H0: the linear model is adequate (that is, no model signifi-
cantly improves the prediction of Y over the linear model),
which is tested against the alternative hypothesis:
H1: the linear model is not adequate,
by treating FRL as a realization of a Fisher(k-p,n-k) variate.
Page 13
4. A subset of regression coefficients, or more specifically: suppose
without loss of generality that the subset consists of the last q
coefficients, then the null hypothesis is:
H0: ar = ... = ap = 0, with r = p-q+1,
which is tested against the alternative hypothesis:
H1: at least one of ar,...,ap is not equal to zero,
by treating FRQ as a realization of a Fisher(q,n-p) variate.
5. A linear combination of the regression coefficients, or more specifi-
cally: the null hypothesis is:
H0: c'a = m, in which c is a vector of constants with order q+1,
which is tested against the alternative hypotheses:
H1: c'a is not equal to m,
by substituting c'a = m in the original model, shifting the known terms
to the left hand part, combining the corresponding terms in the right
hand part, and testing the thus derived so called 'reduced model' by
treating FRQ as a realization of a Fisher(q,n-p) variate.
In each case the right tail probability P(F >= FR) can be found in the last
column of the analysis of variance table (cf. DRAPER & SMITH [3] pp. 63-64,
68 & 72-75).
Page 14
CHAPTER 2
THE INPUT TO THE PROGRAM
2.0 The purpose of the input system is to give the user a simple and
adequate formalism to tell the program what he wants. In order to specify
the regression model and the correspondence between the variables in the
model and the numbers in the input data, the user must provide a so called
user program, which consists of one or more jobs which in their turn are
made up out of one or more specifications. Each specification starts with
a keyword and terminates at the next keyword. The keywords are: "Model",
"Input", "Options", "Data", "Run", "Exit" and "Help" (quotes included).
2.1 THE MODEL SPECIFICATION
To let the program know between which variables the statistician
expects a certain kind of relationship, he must provide a model specifi-
cation, which consists of the keyword "Model" followed by a formula (the
model statement), which resembles the notation of regression models in
common statistical literature quite closely. For instance:
"Model" y = alpha0 + alpha1 * x1 + alpha2 * x2;
A model formula consists of an identifier to denote the dependent
variable (the left hand part), followed by an '=' (equal), followed by the
sum of a number of terms (the right hand part), while it is terminated with
a ';' (semicolon). Each term must be the product of an identifier to
denote the parameter (which is to be estimated) and an identifier to denote
the independent variable. An exception is made for the optional constant
term, which is given as a single identifier denoting that constant term,
and which may be placed anywhere in the model.
Each identifier must start with a letter and is allowed to contain any
number of letters, digits and blanks. As most peripheral equipment of a
computer is unable to process sub- or superscriptions or Greek letters, we
write alpha0, alpha1 and alpha2. Identifiers have no inherent meaning, but
Page 15
serve for the identification of variables, parameters and functions. They
may be chosen freely (except for the twentyone standard function names and
the ten option names, cf. section 2.1.1 and section 2.3). It is advised
not to use the same identifier to denote two (or more) different
quantities; for regression parameters, however, it will not lead to fatal
errors, whereas for the dependent and independent variables distinguishable
identifiers must be used indeed. Correct model formulae are for instance:
"Model" y variable = constant + parameter * x variable;
and
"Model" depvar = const + beta1 * xvar1 + beta2 * xvar2;
2.1.1 TRANSFORMATIONS
Almost all transformations a user would like to perform on his input
data fit quite naturally in the model formula: each transformation is
expressed as a formula itself. If, for instance, the user wants to include
in the model formula as an independent variable the natural logarithm of
the sum of two other variables, called xvar1 and xvar2, he writes:
Ln (xvar1 + xvar2)
In model formulae the operators '+' (plus), '-' (minus), '*' (asterisk),
and '/' (slash) are allowed, all with their conventional meaning of
addition, subtraction, multiplication and division respectively. Of course
the normal operator precedence rules are obeyed. Special operators are:
':' (colon), integer division and '^' (uparrow), exponentiation.
The operation term : factor is defined only for operands both of type
integer and will yield a result of type integer, with the same sign as
would be obtained by normal division, while the magnitude is found by
dividing the two quantities and taking the whole part; mathematically it
can be defined as: a : b = Sign (a / b) * Entier (Abs (a / b)),
for instance: 5 : 2 = 2 and -7 : 2 = -3.
The operation factor ^ primary denotes exponentiation, where the factor is
the base and the primary is the exponent,
for instance: 5 ^ 2 = 25 and 2 ^ 3 ^ 2 = 64 but 2 ^ (3 ^ 2) = 512.
Page 16
Also the following twentyone standard functions are allowed:
Abs (E), Sign (E), Sqrt (E), Sin (E), Cos (E), Tan (E), Ln (E),
Log (E), Exp (E), Entier (E), Round (E), Mod (E1, E2), Min (E1, E2),
Max (E1, E2), Arcsin (E), Arccos (E), Arctan (E), Sinh (E), Cosh (E),
Tanh (E) and Indicator (E1, E2, E3)
in which E, E1, E2 and E3 are expressions in terms of variables, operators
and standard functions. Round (E) is defined as: Entier (E + 0.5) and
Indicator (E1, E2, E3) is defined as: IF E1 <= E2 <= E3 THEN 1 ELSE 0.
The dependent variable may be transformed in a similar way and as a
consequence the model formula in its most general form looks like:
"Model" G (y) = b0 + b1 * F1 (x1,...,xm) + ... + bp * Fp (x1,...,xm);
Some examples of transformed model formulae are:
"model" y = a0 + a1 * Sqrt (x1 + x2) + a2 * Sqrt (x3);
and
"MODEL" Arcsin (Sqrt (Y)) = A0 + A1 * X + A2 * X ^ 2;
A user can specify model formulae in which terms with known regression
coefficients appear, by subtracting those terms from the left hand part of
the model formula, for instance:
"Model" y - 5.4321 * x3 = a0 + a1 * x + a2 * x ^ 2;
This applies especially to the constant term; if this term is known it must
be shifted to the left hand part.
If weights are present in the input data (or can be computed out of
the input data), to indicate that the variances of the observations are not
all equal (cf. section 1.2.1), the left hand part of the model formula can
be expanded with a so called weight part (which can be an expression),
preceeded by a '&' (ampersand), for instance:
"Model" Depvar & Max (Abs (Weight), 10) = Const + Param * Indepvar;
Page 17
2.2 THE INPUT SPECIFICATION
To indicate which numbers or series of numbers in the input data
belong to which variable in the model formula and which numbers can be
skipped, the program expects an input specification. It consists of the
keyword "Input", followed by a formula (the input statement) which
describes the arrangement of the observations in the input data, while it
is terminated with a ';' (semicolon). The basic idea is that numbers from
the input data are identified with the names from the input formula in such
a way that (in order of entry) numbers belonging to the same name are put
in a queue appended to that name. For instance:
"Input" 100 * (codenr, 10 * [yvar], [xvar1, xvar2], -1);
means that one hundred series of numbers (each, as a check, terminated in
this example by -1) are present in the input data. Each series consists of
fourteen numbers: first one value which is read and assigned to the name
codenr, then ten values for the name yvar, then one value for the name
xvar1, followed by one value for the name xvar2 and finally the value -1.
The basic constituent of an input formula is a variable enclosed in
square brackets, in the example: [yvar]. The corresponding number from the
input data will be appended to the queue for that name. Several variables
can be put together in a variable list by separating them by commas and
enclosing them in square brackets, in the example: [xvar1, xvar2]. This
only serves to save the writing of several opening and closing brackets.
Separate numbers, series or blocks of numbers can be treated by
putting a repetition factor (control) followed by an asterisk in front of a
variable list (or in front of an input formula which must then be enclosed
in parentheses), in the example: 100 * and 10 * .
If a repetition factor is 1, it may be omitted together with the
asterisk and a parentheses pair, but square bracket pairs must remain.
When a name is used as a repetition factor, a value must already have been
assigned to it, which is done by giving that name, without square brackets
and followed by a comma (or closing parenthesis), earlier in the input
formula than the use of that name as a repetition factor. The corres-
Page 18
ponding number from the input data is then assigned as a value to that
name. If such names are used repeatedly in the input formula, the
corresponding numbers from the input data are compared with the first one
and, in the case of inequality, an error message is supplied. This may
serve as a check against shifted data reading. A similar check can be
obtained by giving an explicit number followed by a comma (or closing
parenthesis), in the example: the -1. The corresponding number from the
input data is then compared with that given number and, in the case of
inequality, an error message is produced.
Also an expression is allowed as a repetition factor, or for that
matter, as a check value, provided that it is enclosed in angle brackets,
for instance: <k+n>. As in the case of single names used as a repetition
factor each (non-standard function and non-option) name used in such a
(special) expression must have been given, followed by a comma, earlier in
the input formula than the use of that name in the expression.
The linkage between the model formula and the input formula is
established by using the same names in the model terms and in the input
variable lists. Numbers from the input data that belong to such input
names will be treated as observations for the model variables, while
numbers that belong to input names between square brackets which do not
appear in the model formula, are skipped.
Often, repeated observations for the dependent variable are available.
In order to be able to process these observations automatically, it is
necessary that a variable list consisting entirely of dependent variables
is preceeded by a repetition factor (followed by an asterisk) indicating
the number of repetitions. If a variable list contains independent as well
as dependent variables, the number of replications is assumed to be 1. A
series of (say 100) observations for a dependent variable with no
replications is denoted as:
100 * ([dep var])
The repetition factor in front of the opening square bracket is omitted
(because it is 1), although the parentheses are not. Without the
parentheses it would mean 100 replications of [dep var].
Page 19
EXAMPLE "Input" k, n, <k+n> * (c, m, m * [y], [x1,x2,x3,x4], c), -99;
means that: the first number is read and its value assigned to k,
the next number is read and its value assigned to n,
then k+n times the following happens:
a number is read and its value assigned to c,
the next number is read and its value assigned to m,
then the m replications for y are read,
next the observations for x1, x2, x3 and x4 are read,
then a number is read and its value compared with c,
finally a number is read and its value compared with -99.
If the comparisons fail, an error message is supplied and execution of the
job is terminated, otherwise (k+n) observations for x1, x2, x3, x4 and for
each quadruple m replications for y, have been identified.
2.3 THE OPTION SPECIFICATION
It is possible to have the program perform some tasks optionally by
providing an option specification in a job. It consists of the keyword
"Options" followed by a list of option identifiers or corresponding option
numbers (the option statement), separated by commas and terminated with a
';' (semicolon). The following ten options are available:
option number option name
1 Transformed data matrix
2 Correlation matrix
3 Residual analysis
4 No regression analysis
5 Process submodels
6 Print input data
7 No input data rewind
8 Save original model
9 Test reduced model
10 Missing values
Page 20
Options 1, 2, 3 and 6 cause the corresponding piece of information to
be printed. However, option 1 lists only those (possibly transformed)
variables that are present in the model formula in a neat tabular form,
while option 6 lists all the original input data serially (eleven numbers
per line) without any special layout, because the input data consists (by
definition) of an unstructured series of numbers (cf. section 2.4).
Option 4 suppresses the regression analysis; it is meant to be used
in combination with option 1 and/or 2.
Option 5 causes the program to process submodels, which are formed by
a form of backward elimination: each time the last term from the right
hand part from the model formula is omitted, by deleting the last column
from the design matrix, and a regression analysis is performed with the
reduced design matrix. Messages are generated about which terms are
omitted, while further processing of the job ceases when the resulting
model formula is of the form: y = c. Moreover a test is made (under the
usual assumptions) whether the omitted terms did contribute significantly
to the regression sum of squares (cf. section 1.3.2.4).
To option 5 a specifier list may be appended, to prevent the pro-
duction of waste output for unwanted submodels. In this list the number of
terms to be omitted from the model formula (counting backwards, starting at
the end) must be given enclosed in parentheses. For example the option:
process submodels (6, 10) instructs the program to process only two
submodels, one with the last six terms omitted and one with the last ten
terms omitted (from the original model formula). If the user asks for more
terms to be omitted than are present in the model formula, an error message
is supplied and the execution of that job is terminated. Moreover, if no
explicit specifier list is appended to option 5, the options 2 and 3 yield
no effect (even if specified), which is also to prevent the production of
waste output for the submodels.
Option 7 gives the user the opportunity to process consecutive pieces
of input data in consecutive jobs. Normally the processing of the input
data for each job starts with the first number in the data specification
(or with the first number in the datastream), and the program gives a
(warning) message if the input formula does not match the input data
Page 21
precisely. This option disengages the message and causes the program to
continue processing input data where the previous job had finished.
Option 8 causes the residual degrees of freedom and residual sum of
squares from the current job to be saved, in order to be able in the next
job, by means of specifying option 9, to test whether the model under
consideration in that next job, shows a significant increase in residual
sum of squares in comparison with the model in the previous job. In effect
this gives the possibility of testing a hypothesis concerning a linear
combination of the parameters from a model (cf. section 1.3.2.5), for
instance (cf. SEARLE [11] pp. 121-122):
"Model 1" y = b1 * x1 + b2 * x2 + b3 * x3;
"Options" Save original model;
"Run"
"Model 2" y - 4 * x1 = b2 * (x1 + x2) + b3 * x3;
"Options" Test reduced model;
"Run"
causes the null hypothesis: b1 = b2 + 4 to be tested (in the second job).
Option 10 may be used to identify some observations or repetitions
as 'missing'. In a specifier list, appended to this option, the missing
values must be given enclosed in parentheses. When a repetition equal to a
missing value is encountered in the input data, the corresponding set of
repetitions for the dependent variable(s) is not included in the design
matrix. When an observation equal to a missing value is encountered in the
input data, or when none of the repetitions are included in the design
matrix, the corresponding set of observations for the independent variables
together with the (possibly empty) set of repetitions for the dependent
variable(s) (i.e. the 'case') is not included in the design matrix.
2.4 THE DATA SPECIFICATION
The data specification consists of an unstructured series of numbers
(the data statement), preceeded by the keyword "Data" and terminated at the
next keyword. The structure is imposed onto it by the input formula.
Page 22
A sequence of symbols is considered a number when it satisfies the
definition of number in appendix 2, together with the machine dependent
restrictions imposed by the underlying SIMULA programming system. Note
that the definition of number does not allow FORTRAN-like numbers as for
instance 10. or 1# (at least one digit must follow). It is recommended
always to use blanks as delimiters between numbers, but the use of other
non-numerical symbols as delimiters will not lead to fatal errors, only to
a slight increase in processing time.
If a datafile is specified in response to the datastream request, or
is appended to the "Run" keyword (cf. appendix 1), the program will try to
read a record of input data, which then do not have to be preceeded by the
keyword "Data", from that file. The record ends at the next "Eor" keyword.
However, a non empty data specification in the user program will get
priority over reading input data from the specified file, while an empty
data specification causes the program to start reading the next record of
input data from the specified file. If the data specification as well as
the next nonempty record in the data file does not contain any numerical
information, an error message is supplied.
EXAMPLES real number value
1.234 1.234
.98 0.98
-0.5673#2 -56.73
+.02#-1 0.002
#+3 1000.0
2.5 THE USER PROGRAM
In a user program several jobs can be submitted to the program. Each
job is separated from its preceding one by the keyword "Run", while the
entire user program is terminated with the keyword "Exit". In the first
job, the model, input and data specification must be given in some order.
The option specification is, of course, optional. In each following job a
specification which is not changed may be omitted, the program then retains
the last given specification. If options have been specified in a previous
Page 23
job and one wants to delete them, this is done by providing a new option
specification which may be empty if no options are to be executed (that is
by only providing: "Options";).
In front of each job or in front of the keyword "Exit" a text may be
given for further identification of the output of a job or of the output of
the entire user program. The use of quotes in that text should be avoided
in view of confusion with the keywords. The program starts reading a
(possibly empty) text at the beginning of the next line after the keyword
"Run" of the previous job (with the first job the program starts with the
first line in the inputstream).
2.6 EXAMPLES
The following user program can be submitted without any modification
to the Multiple Linear Regression Analysis program. It consists of four
jobs, three of which are preceeded by an identifying header, while the
whole user program ends with an identifying trailer:
**********************************
* Example 1 originates from: *
* DE JONGE [4], pp. 472 & 479. *
**********************************
"Model" y = c * Log (x) + a + b * x;
"Input" 5 * ([x], 10 * [y]);
"Options" Transformed data matrix, Correlation matrix,
Residual analysis, Process submodels (1, 2);
"Data" 25 0.67 0.70 0.75 0.76 0.78 0.80 0.83 0.84 0.88 0.89
50 0.88 0.92 0.93 0.96 0.98 1.00 1.01 1.03 1.06 1.07
80 0.96 0.98 0.99 1.03 1.05 1.06 1.08 1.11 1.15 1.17
130 1.07 1.09 1.11 1.13 1.14 1.14 1.19 1.22 1.25 1.29
180 1.10 1.13 1.17 1.19 1.20 1.21 1.23 1.25 1.28 1.33
"Run"
Page 24
**********************************
* Example 2 originates from: *
* SEARLE [11], pp. 121-123 *
**********************************
"Input" 5 * [y, x1, x2, x3];
"Data" 8 2 1 4
10 -1 2 1
9 1 -3 4
6 2 1 2
12 1 4 6
"Model 1" y = a3 * x3 + a2 * x2 + a1 * x1;
"Options" Save original model, Process submodels (1);
"Run"
"Model 2" y - 4 * x1 = b2 * (x1 + x2) + b3 * x3; (eqn. 118, p. 121)
"Options" Test reduced model, Transformed data matrix;
"Run"
****************************************
* Example 3 originates from: *
* AFIFI & AZEN [1], pp. 88 & 93-100. *
****************************************
"Model" y = alfa0 + alfa1 * x;
"Input" 5 * ([x], n, n * [y]);
"Option" Transformed data matrix, Print input data;
"Data" 1 4 1.1 0.7 1.8 0.4
3 5 3.0 1.4 4.9 4.4 4.5
5 3 7.3 8.2 6.2
10 4 12.0 13.1 12.6 13.2
15 4 18.7 19.7 17.4 17.1
"Run"
*** Marten van Gelderen; Mathematisch Centrum ***
"Exit"
Page 25
CHAPTER 3
THE OUTPUT FROM THE PROGRAM
3.1 STANDARD AND OPTIONAL PRINTED OUTPUT
After having read the keyword "Run", the processing of the job is
initiated. First the model, input and option texts are printed in this
order. Next an attempt is made at translating the specifications. Errors
against syntax or semantics cause error messages to be printed below each
specification, while further processing of that job ceases. Note that the
processing of the next job, if present, will be of little or no use unless
the specification which developed the error(s) is changed. Next the
(transformed) data matrix is formed and passed to the regression routines,
which supply the following printed output in the order indicated:
1) a listing of the original input data (option 6),
2) the (transformed) data matrix (option 1),
3) per (transformed) variable the:
mean, standard deviation, minimum and maximum,
4) the correlation matrix of the (transformed) variables (option 2),
5) the multiple correlation coefficient (with adjustment),
6) the proportion of variation explained (with adjustment),
7) the standard deviation of the error term,
8) the estimates for the regression parameters with
estimated standard deviation, F-ratio and right tail probability,
9) the correlation matrix of the estimates (option 2),
10) the analysis of variance table,
11) the residual analysis (option 3).
Ad 1) cf. section 2.4.
Ad 2) The transformed data matrix gives the input data after possible
transformations according to the model specifications have been
applied. If the model formula contains no transformations, the
original input data are given. The dependent variable is given as a
separate column. In the case of replications for the dependent
variable, the mean value of them is given, and the number of
Page 26
replications is given as an extra (last) column. If a weight-
variable (or -expression) is specified in the model formula, the
(transformed) data comprising the weights are given as an extra
(last) column. Each (transformed) independent variable is indicated
by its corresponding parameter. This originated from the fact that
it is not obvious how to denote a variable which is transformed
like: Arcsin (Sqrt (y+25)), with 'Arcsin', with 'Sqrt' or perhaps
with 'y' itself. The dependent variable is indicated by 'dep.var.'.
Ad 4) and 9) The matrix of the estimated correlation coefficients of the
variables and of the estimates are both supplied depending on
whether option 2 is specified or not.
Ad 5), 6) and 7) cf. section 1.2.
Ad 8) The F-ratio and right tail probability give the user the opportunity
to test the significance of a particular regression coefficient
(cf. section 1.3.1).
Ad 10) The layout of the table closely resembles that of the table in
section 1.3.2. The F-ratios and right tail probabilities give the
user the opportunity to test the significance of all the regression
coefficients or of a subset or combination thereof or to test the
adequacy of the (linear) model (cf. section 1.3.2).
Ad 11) A table of observations, fitted values, standard deviations of the
fitted values, residuals, standardized residuals and studentized
residuals is provided (cf. section 1.2.2). As a check on computa-
tions, the sum of the residuals is also given. If an unknown
constant term is present in the model formula, this sum should be
zero. Furthermore the upperbound for the right tail probability of
the largest absolute studentized residual is given.
Without options specified, the printed output from the program
consists of 3), 5), 6), 7), 8) and 10). If option 5 is specified, the
output for the model itself is given as specified by the other options, but
for the submodels it depends on the use of a submodel specifier list.
Without that list the output from the options 1, 2 and 3 is suppressed
(even if those options are specified). With that list only the superfluous
parts of the output (that is the transformed data matrix and the
correlation matrix of the variables) are suppressed.
Page 27
3.2 STANDARD AND OPTIONAL DATA OUTPUT
If an outputfile is specified in response to the outputstream request,
or is appended to the "Run" keyword (cf. appendix 1), the program writes
the following pieces of information in one record to that file:
1) if option 1 is specified: the transformed data matrix, preceded by the
number of rows and columns respectively,
2) if option 2 is specified: the corrrelation matrix of the variables,
preceded by its order,
3) the number of submodels specified in the list appended to option 5, if
that option is specified at all; otherwise the number 1. It is
followed by (for each (sub) model): the number of estimated parameters,
the estimates for the parameters of the (sub) model, and:
a) if option 2 is specified: the variance-covariance matrix of the
estimates, preceded by its order, (BE CAREFUL: this is NOT the
correlation matrix of the estimates, which is printed; however, the
correspondence between the two matrices is established by the
relations (10) and (11) in "Help"/Theory),
b) if option 3 is specified: the number of respondents, followed by for
each respondent the: observation, fitted value, standard deviation,
residual, standardized residual and studentized residual,
and finishes by writing an "Eor" keyword.
As in the case of printed output, the output described in 3) is only
effected for submodels, if an explicit submodel specifier list is appended
to option 5.
An input specification to describe one record of data written to the
outputstream when options 1, 2, 3 and 5 (with a submodel specifier list
appended to it) are specified, could read:
"Input" n, m, n * (m * [transformed data element]),
p, <p * (p+1) : 2> * [correlation element],
s, s * (t, t * [estimate],
q, <q * (q+1) : 2> * [covariance element],
r, r * (6 * [residual element]) );
Page 28
For the original model the following relations hold: q = t, t = m-1,
r = n and p = m (or p = m-1 if replications and/or weights are specified);
s is the number of processed (sub)models; for each submodel t and q are
decreased with the number of terms that are omitted from the original
model.
Real numbers in the printed output are given in fixed point format
with a six decimal fractional part, the only exceptions are the estimates
for the regression parameters with their standard deviations, which have a
ten decimal fractional part and the numbers in the listings of the input
data and the transformed data matrix, which have a three decimal fractional
part. Real numbers in the data output are given in floating point format
with a sixteen decimal mantissa and a two decimal exponent part.
3.3 ERROR MESSAGES
Error messages against syntax or semantics have the following layout:
Error : <error text> or <error number>
The error text corresponding to the error numbers is:
1 No input data given.
2 All input data has been skipped.
3 Attempt to process more input data than provided.
4 Number in the input data is incorrect or too large.
5 In a number '.' is not followed by a digit.
6 In a number '#' is not followed by '+', '-' or a digit.
10 No model formula given.
11 Left hand part is not followed by '='.
12 Expression is not followed by ')'.
13 Option name used in a primary in an expression.
14 Incorrect primary in a factor in an expression.
15 Incorrect (control) identifier in an expression.
16 Parameter list of a standard function is not followed by ')'.
17 Standard function call with incorrect number of parameters.
Page 29
20 No input formula given.
21 Expression in a control is not followed by '>'.
22 Option name used in a control in an input statement.
23 Input statement in a description is not followed by ')'.
24 Variable list in a description is not followed by ']'.
25 Incorrect description in an input statement.
26 Incorrect identifier in a variable list.
27 Item in a variable list is not an identifier.
30 Incorrect option number in an option statement.
31 Incorrect option name in an option statement.
32 Specifier list is not followed by ')'.
33 Number in a specifier list is incorrect or too large.
34 Specifier list is appended to incorrect option.
35 Specifier is not a number.
36 Specification is not properly continued.
37 Specification is not terminated with ';'.
40 No defined (independent) identifier to the right of '='.
41 Incorrect use of a parameter in a regression term.
42 Undefined (weight) identifier to the left of '='.
43 Undefined (dependent) identifier to the left of '='.
44 Number in a regression term is incorrect or too large.
45 Term does not have the form: param * factor or factor * param.
46 Undefined (independent) identifier in a regression term.
47 No regression parameter in a regression term.
50 Division by zero.
51 Integer division by zero.
52 Observation for dependent variable is in absolute value too large.
53 Observation for independent variable is in absolute value too large.
54 Exponentiation with zero base and non positive exponent.
55 Exponentiation with negative base and real exponent.
56 Weight factor is not positive.
60 Argument of 'Sqrt' is negative.
61 Argument of 'Ln' is not positive.
62 Argument of 'Log' is not positive.
Page 30
63 Argument of 'Exp' is too large.
64 Argument of 'Arcsin' is in absolute value larger than one.
65 Argument of 'Arccos' is in absolute value larger than one.
66 Argument of 'Sinh' is in absolute value too large.
67 Argument of 'Cosh' is in absolute value too large.
70 Number of observations for the first dependent variable is zero.
71 Numbers of observations for the dependent variables are not equal.
72 Number of observations for the first independent variable is zero.
73 Numbers of observations for the independent variables are not equal.
74 Control reads an incorrect number in the input data.
75 Numbers of replications for the dependent variables are not equal.
76 Given, read or computed replication factor is not integral.
If the error number lies between:
5 and 37, it is followed by the most recently processed identifier, number
and symbol. Only the first eight characters of each name are
displayed.
41 and 47, it is followed by the number of the right hand part regression
term which causes the error, or a zero if the left hand part is
at fault.
50 and 67, it is followed by the wrong value and the number of the line in
the transformed data matrix which causes the error. Instead of
the wrong value, the number of the right hand part regression
term which causes the error is displayed when the error number
lies between 50 and 53.
70 and 76, it is followed by the check value and the wrong value. Instead
of the wrong value, the value of the controlling variable of the
next enclosing repetition loop is displayed when the error
number is 76.
3.4 EXAMPLES
An impression of the printed output of the Multiple Linear Regression
Analysis program may be obtained by actually submitting the user program in
section 2.6 (which resides on: Hlp: Mulexa.hlp) to the MULREG program.
Page 31
APPENDIX 1
TECHNICAL REMARKS
The following technical remarks reflect the SIMULA implementation of
the Multiple Linear Regression Analysis program as of version 5H(246).
Any comments or queries concerning the functioning of the software
described in this document should be addressed to:
Marten van Gelderen, IKO Computer Systems Group, Postbox 4395,
1009 AJ Amsterdam, The Netherlands. (telephone: 31-(0)20-930951).
The program resides on a device called USR:. It is started by typing:
.R MULREG, and responds to the standard output device (usually TTY:) with
an identifying header and requests to specify four files, as follows:
Multiple Linear Regression Analysis
Enter file specifications
Inputstream :
Printstream :
Datastream :
Outputstream :
The inputstream serves to read the user program from; the printstream
receives the printed output from the program; the datastream serves to
read the separate input data records from; the outputstream receives the
data output records from the program.
If the default carriage return is responded to the data- and output-
stream requests, the program assumes that no separate input data are
present and that no data output is required.
If the default carriage return is responded to the input- and print-
stream requests, the program connects the standard input and output devices
(usually TTY:) to the input- and printstream respectively. To notify that
the inputstream is connected to the standard input device, the program dis-
plays the prompting character '*' (asterisk).
Page 32
If both the input- and printstream are connected to the standard input
and output devices, the program echoes every text it cannot interpret
properly, preceeded by the error character '?' (question mark). However,
in response to a single carriage return the text 'For help type: "Help"' is
displayed.
To the "Run" keyword a file specification list may be appended, pre-
ceeded by the character '/' (slash), with the following general format:
/print-spec;output-spec=input-spec;data-spec
Each of the specifications will be connected to the corresponding program
stream respectively. Also, each of the specifications may be omitted,
which means that nothing will be changed to the corresponding stream at
all. However, if the character ' ' (blank) is substituted for one (or
more) of the specifications, the defaults - as described previously - will
be connected to the corresponding streams respectively.
If the specified files do not exist (for input- or datastreams) or
cannot be created (for print- or outputstreams), the corresponding streams
are displayed again, followed by the character '?' (question mark), to
indicate the erroneous situation and to enable the specification of other
files (or defaults).
If the program encounters a premature end-of-file condition in the
inputstream, it will connect the inputstream to the standard input device
and thus respond with the prompting character. New specifications for the
"Model", "Input", "Options" or "Data" (or keywords like "Run" or "Exit")
may then be entered.
In the inputstream, the program does not discriminate between upper
and lower case letters. Identifiers may contain any number of blanks,
however, a carriage return is not permitted, restricting the maximum length
of identifiers to the maximum number of characters in one input line. In
front of and following the opening quote of keywords only non-printing
ASCII characters like tabs and/or blanks are permitted, otherwise the
keyword (and the whole line following it) is not recognized.
Page 33
Two implementation dependent restrictions are imposed on user pro-
grams: the maximum number of differently spelled identifiers and numbers
is 789 and the maximum number of nested parentheses is 62. In addition two
machine dependent restrictions are imposed: the maximum number of
characters in one input line is 132 and the number of significant digits in
computations is 18.
In response to the keyword "Help", the information in this appendix
(which resides on HLP:) is copied to the printstream. Each of the
following switches may be appended to the "Help" keyword, preceeded by the
character '/' (slash), in order to obtain more detailed information (which
also resides on HLP:).
/Theory Regression model & least squares (section 1.1 & 1.2),
/Tests Possible tests of hypotheses (section 1.3),
/Model Specification of the model formula (section 2.1),
/Input Specification of the input formula (section 2.2),
/Options Possible options and their effects (section 2.3),
/Data Acceptable numbers and their delimiters (section 2.4),
/User Setup of a complete user program (section 2.5),
/Example Example of a complete user program (section 2.6),
/Print Standard and optional printed output (section 3.1),
/Output Standard and optional data output (section 3.2),
/Errors Meaning of the error numbers (section 3.3),
/Syntax Definition of the syntax of a user program (appendix 2).
If no help information is available an appropriate message is displayed.
Page 34
APPENDIX 2
DEFINITION OF THE SYNTAX OF A USER PROGRAM
The syntax of a user program is defined in an extended version of a
notation known as the Backus Naur Form, in short: BNF (cf. NAUR [10]).
The extensions comprise an explicit repetition and optionality construct
together with the possibility of factorization.
The BNF may be regarded as a metalanguage for the description of a
user program. In addition to the symbols that are admissible in a user
program, the metalanguage requires a number of extra symbols, called
metasymbols. The ten metasymbols used in extended BNF are: ::=, |, <, >,
{, }, [, ], ( and ). The , and . are part of the metalanguage English in
which we are describing BNF. We write:
<expression>::= ['+' | '-'] <term> { ('+' | '-') <term> }
The metasymbols < and > are used as delimiters to enclose the name of
a class. The metasymbol ::= may be read as 'is defined as' or as 'consists
of'. The metasymbol | is read as 'or'. Repetition is denoted by curly
brackets, i.e. { a } stands for e | a | aa | ... Optionality is expressed
by square brackets, i.e. [ a ] stands for e | a. Parentheses merely serve
for grouping (factorization) i.e. (a | b) c stands for ab | ac. Terminal
symbols appear enclosed in single apostrophes.
The above phrase defines an expression as a term, optionally preceeded
by a '+' or a '-' and followed by an arbitrary repetition of terms, each
preceeded by a '+' or a '-'.
The syntax of a user program can thus be defined as follows:
Page 35
<letter>::= 'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|'I'|'J'|'K'|'L'|'M'|
'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z'|
'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'|
'n'|'o'|'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z'
<digit>::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
<model keyword>::= '"Model"' | '"MO"'
<input keyword>::= '"Input"' | '"IN"'
<option keyword>::= '"Options"' | '"OP"'
<data keyword>::= '"Data"' | '"DA"'
<run keyword>::= '"Run"' | '"RU"'
<exit keyword>::= '"Exit"' | '"EX"'
<function name>::= 'Abs' | 'Sign' | 'Sqrt' | 'Sin' | 'Cos' | 'Tan' |
'Ln' | 'Log' | 'Exp' | 'Entier' | 'Round' | 'Mod' |
'Min' | 'Max' | 'Arcsin' | 'Arccos' | 'Arctan' |
'Sinh' | 'Cosh' | 'Tanh' | 'Indicator'
<option name>::= 'Transformed data matrix' | 'Correlation matrix' |
'Residual analysis' | 'No regression analysis' |
'Process submodels' | 'Print input data' |
'No input data rewind' | 'Save original model' |
'Test reduced model' | 'Missing values'
<option number>::= '1' | '2' | '3' | '4' | '5' |
'6' | '7' | '8' | '9' | '10'
<number>::= ['+' | '-'] <unsigned number>
<unsigned number>::= <decimal number> | <exponent part> |
<decimal number> <exponent part>
<decimal number>::= <unsigned integer> | <fractional part> |
<unsigned integer> <fractional part>
<exponent part>::= '#' <integer>
<fractional part>::= '.' <unsigned integer>
<integer>::= ['+' | '-'] <unsigned integer>
<unsigned integer>::= <digit> { <digit> }
Page 36
<identifier>::= <letter> { <letter> | <digit> }
<data specification>::= <data keyword> [ <input data> ]
<input data>::= <number> { <number> }
<option specification>::= <option keyword> [ <option statement> ] ';'
<option statement>::= <option> { ',' <option> }
<option>::= <simple option> [ '(' <specifier list> ')' ]
<simple option>::= <option name> | <option number>
<specifier list>::= <specifier> { ',' <specifier> }
<specifier>::= <number>
<input specification>::= <input keyword> <input statement> ';'
<input statement>::= <input part> { ',' <input part> }
<input part>::= <control> | <description> | <control> '*' <description>
<control>::= <number> | <identifier> | '<' <expression> '>'
<description>::= '(' <input statement> ')' | '[' <variable list> ']'
<variable list>::= <variable> { ',' <variable> }
<variable>::= <identifier>
<model specification>::= <model keyword> <model statement> ';'
<model statement>::= <left hand part> '=' <right hand part>
<left hand part>::= <expression> [ '&' <weight part> ]
<weight part>::= <expression>
<right hand part>::= ['+'] <term> { '+' <term> }
<expression>::= ['+' | '-'] <term> { ('+' | '-') <term> }
<term>::= <factor> { ('*' | '/' | ':') <factor> }
<factor>::= <primary> { '^' <primary> }
<primary>::= <unsigned number> | <identifier> |
<function designator> | '(' <expression> ')'
<function designator>::= <function name> [ '(' <parameter list> ')' ]
<parameter list>::= <parameter> { ',' <parameter> }
<parameter>::= <expression>
<user program>::= { <job> } <exit keyword>
<job>::= { <specification> } <run keyword>
<specification>::= <model specification> | <input specification> |
<option specification> | <data specification>
Page 37
APPENDIX 3
TECHNICAL DESCRIPTION OF THE PROGRAM
In this appendix a more or less technical description of the program
is given in terms of variables, procedures and control flow.
Basically the program logic is as follows:
Init program
Enter & open files
Job: Read job
Run: Init compiler tables
Compile model & input
Init data buffers
Execute (to produce design matrix)
Regression analysis
Print results
GOTO Job
Exit: Close files
In case of an errorsituation in one of the sections, no further action is
undertaken and control is transferred to Job.
1. The program is initialized by setting various declared system and
compiler constants to their appropriate values. Some of them are highly
machine dependent constants, others are internal codings or table limits.
The four basic informationstreams in the program: the input-, print-,
data- and outputstream are connected to the external file specifications,
entered by the user, in the procedure enterfiles.
2. A job is read specification by specification (after printing and
skipping leading text by a call on echotext) by means of calls on readtext
or readdata, depending on which specification from the job is read. Both
procedures first set up a buffer administration as will be described in
section 3, then process an input line or input numbers and finally call on
readline to obtain further information. Their task is finished when the
Page 38
variable endtext becomes true, that is when the next input line starts with
a '"' (quote), which is the beginning of what can be a keyword, or when the
end-of-file condition is met. If an input data record is to be read from a
file, the same administration is setup and the same routine for reading the
actual numbers is used, only information is obtained from the datastream
rather than the inputstream. The echoing of input text and input data is
done via calls on the procedures printtext and printdata respectively.
3. The text and data storage section provides two kinds of SIMULA classes,
one for text storage and another for data storage. The class textstorage
provides a procedure to store lines of text into a linked list of
textbuffers (starting at base). The procedure nextline retrieves those
lines (after a reset). The class datastorage provides a procedure to store
real numbers into a databuffer which is linked into a linked list of such
buffers (starting at base). In case of buffer overflow a new buffer is
created and linked into the list. A super class inputstorage provides the
reset procedure and a procedure nextnumber to retrieve numbers out of the
buffers in a FIFO (first in, first out) manner. Another super class
rightstorage provides a procedure lastnumber, which is quite similar to
nextnumber, except that it retrieves numbers out of the buffers in a LIFO
(last in, first out) manner.
4. The compiler uses two tables: one called program and another called
hashtable. The first table accepts the (macro) instructions generated by
the compiler via calls of load. The second table is used by the procedure
nextatom to identify the various items in the model and input formulae.
The recognition method used is twin-prime-hashing described in KNUTH [5]
p. 522. Alphanumerical items can belong to one of the following classes:
identifier, number, functionname or optionname, which are all superclasses
of atomcell. The procedure nextatom calls on nextchar and delivers a
pointer to an atomcell of the appropriate type containing, among others,
the actual text of the item and its index in the hashtable.
The compiler assumes that the running system has at its disposal a
(programmed pseudo) register F which is capable of handling both integer
and real numbers. Furthermore a (programmed pseudo) memory organization
known as a stack must be available to the running system. A stackpointer
refers to the first free position in the stack. All binary operations will
Page 39
take place with the top of the stack as first operand and F as the second.
The result is delivered in F and as a side effect the stackpointer is
decreased by 1. When the contents of F is saved in the stack, the
stackpointer is increased by 1.
The fundamental idea behind most procedures for translating the model
and input formulae is, that the first atom of the syntactical unit to be
processed by that procedure has been read already (its 'value' being
assigned to lastatom). The procedure considers itself to have finished its
task after reading the first atom that no longer can belong to that unit
syntactically. Meanwhile the translation of that unit has been produced.
A more elaborate description of the procedure system (arithmetic)
expression can be found in KRUSEMAN ARETZ [6] and [7]. Reference [8]
provides the description of a complete ALGOL 60 compiler. We only mention
here that every expression is transformed into a macro program that
corresponds to the reversed polish form, thus:
(a+b) * (c-d) ^ e becomes: ab+ cd- e ^ *.
The procedure for translating the input formula must, among others,
generate instructions to perform the linkage between identifiers from the
model formula and numbers from the input data.
While translating a model formula, identifiers to the right of the
equal sign are assigned type 1, those to the left of the equal sign type 2.
If these identifiers appear in a variable list the types are changed into 3
and 5 respectively. While translating an input specification, identifiers
in a variable list not appearing in the model formula are assigned type 4,
those in the input formula not appearing in a variable list are assigned
type 6. Meanwhile instructions are generated to put the next number from
the input data in the appropriate column of the (yet untransformed) design
matrix, or to skip that number. For variable lists that consists entirely
of identifiers not appearing in the model formula, special instructions to
skip the corresponding numbers all in one, are generated.
5. In the procedure check model a check is made if the model formula,
after the linkage to the input data (by means of the input formula), still
satisfies some elementary statistical conditions, like:
Page 40
a) each term must be the product of a parameter and a factor,
b) in that factor no identifier may appear that is not present in a
variable list in the input formula. (An attempt to perform regression
analysis with variables for which no input data is present may not
succeed.)
In the procedure check input a check is made if for each variable in
the model formula an equal amount of numbers is present, moreover a check
is made if all numbers in the input data have actually been processed
(option 7 disengages this check).
6. The execution section of the program is activated by a call of the
procedure execute which, among other things, simulates the basic cycle of a
computer:
next: get the instruction indicated by the programcounter
increase the programcounter with 1
isolate the instruction and address part
execute the instruction
GOTO next
This cycle ends when the programcounter tries to leave the (translated
macro) program. The (macro) instructions itself are coded via the switch
lists macro and macro2.
7. After the execution of the input- and modelinstructions, the
(transformed) design matrix is delivered to the regression routine(s). The
actual computation of the regression coefficients is done via a call of
lsqdec followed by a call of lsqsol and lsqinv. The first two of these
procedures are described extensively in DEKKER [2] pp. 65-69. The vector
and matrix multiplication is done via calls of vecvec, matvec, tamvec and
tammat, described in DEKKER [2] pp. 8-9. The algorithms for phi and Fisher
are copied from CACM: algorithms 209 and 322 respectively.
All other computations are straightforward.
Page 41
REFERENCES
[1] AFIFI, A.A. & S.P. AZEN, Statistical Analysis; A Computer Oriented
Approach; Academic Press, (1972).
[2] DEKKER, T.J., ALGOL 60 procedures in numerical algebra, part 1; MC
Tract 22, Mathematisch Centrum, Amsterdam.
[3] DRAPER, N.R. & H. SMITH, Applied Regression Analysis; John Wiley &
Sons, (1966).
[4] JONGE, H. DE, Inleiding tot de Medische Statistiek, deel II;
Nederlands Instituut voor Praeventieve Geneeskunde, (1960).
[5] KNUTH, D.E., The Art of Computer Programming, Vol. 3, Sorting and
Searching; Addison Wesley, (1973).
[6] KRUSEMAN ARETZ, F.E.J., ALGOL 60 translation for everybody;
Elektronische Datenverarbeitung, Vol. 6 (1964), 6, p. 233-244.
[7] KRUSEMAN ARETZ, F.E.J., Programmeren voor rekenautomaten, (De MC
ALGOL 60 vertaler voor de EL X8); MC Syllabus 13, Mathematisch
Centrum, (1972).
[8] KRUSEMAN ARETZ, F.E.J., P.J.W. TEN HAGEN & H.L. OUDSHOORN, An ALGOL
60 compiler in ALGOL 60; MC Tract 48, Mathematisch Centrum,
(1973).
[9] LUND, R.E., Tables for an approximate test for outliers in linear
models; Technometrics, Vol. 17 (1975), 4, p. 473-476.
[10] NAUR, P. (ed.), Revised report on the algorithmic language ALGOL 60;
Regnecentralen, Copenhagen, (1964).
[11] SEARLE, S.R., Linear Models; John Wiley & Sons, (1971).
[12] THEIL, H., Principles of Econometrics; John Wiley & Sons, (1971).
Page 42