PDP-10 Archive: kcc-4/kcc/intern.doc from SRI_NIC_PERM_FS_1

Trailing-Edge - PDP-10 Archives - SRI_NIC_PERM_FS_1_19910112 - kcc-4/kcc/intern.doc

There are 7 other files named intern.doc in the archive. Click here to see a list.

This file attempts to document the internal workings of KCC.  It is
still incomplete.

Current holes in documentation:
	Need to explain what pseudo-code looks like, what types there are,
and which of them use which parts of a pseudo-code entry.  CCCODE etc.
	Need to explain KCC's notion of "register" and how these registers
are allocated and handled.  CCGEN, CCREG etc.

This diagram (based on a msg from David Eppstein) lists each
KCC module with a summary of what its routines do.
  
    CCDATA -- allocation for global variables, tables.  Used in various places
    CC -- main control -- parses command line and calls parser for each file
	CCINP -- file input and preprocessor.  Provides input chars.
	CCLEX -- lexical analysis.  Provides input tokens.
	CCDECL, CCSTMT -- parsing (recursive descent) and semantic analysis
data		CCDECL handles declarations; CCSTMT handles statments & exprs.
flows	    CCSYM -- symbol table manipulation
this	    CCTYPE -- type checking and automatic coercion
way	    CCERR -- error messages
 |	CCEVAL, CCFOLD -- parse tree optimization and constant folding
 |	CCGEN, CCGEN1, CCGEN2 -- code generation.  Decls, Stmts, Exprs.
 V	    CCGSWI -- code generation for "switch" statements
	    CCLAB - label management
	    CCREG -- register allocation
	    CCCODE -- subroutines to send pseudo-codes to optimizer
	CCOPT, CCJSKP, CCCSE -- peephole optimization (some elsewhere too)
	    CCCREG -- retroactively change registers
	CCOUT -- emission to assembly file
	CCASMB -- invoke assembler/linker on output file
	CCDUMP -- various debugging output routines

--------------------------
Message from David Eppstein, 4-Dec-85:

As indicated, you can think of the data as moving through successive
stages of interpretation -- input files get read into chars and munged
by the preprocesser, turned into tokens by the lexer, turned into
parse trees by the parser, munged by the parse tree optimizer, turned
into pseudo-codes by the generator, munged by the peepholer and then
turned into chars again as assembly language by the emitter.

(this parag modified --KLH) The control structure is a little different
-- the file argument handler is the main control, and it calls the
parser which calls the lexer which calls the preprocessor to pull data
in from the file.  The parser returns a parse tree, which the main loop
then pushes out in the other direction by calling the code generator
which calls the peepholer and emitter.
	The parse tree optimizer is called (for ccopt)
by the parser and (for cceval, whose sole purpose is to see if a for
loop test can be moved to the end of the loop) by code generation.
The symbol table is used and maintained by many different phases.
Also, the preprocessor can call the parser recursively to handle #if.

General overview: data flow

[NOTE: for clarity the following discussion talks only about "the" input
file.  KCC can compile multiple files, but each one is compiled
separately and handled in the same way.]

KCC is a one-pass compiler; it reads its input only once, and does so
sequentially without ever looking back.  It outputs an intermediate
assembly-language file which is then processed by the system's
assembler, in effect adding a second pass.  Each top-level declaration
statement (a function is a single such declaration statement) is handled
separately, and each goes through the following complete cycle:
	(1) The statement is completely parsed into a single "parse tree",
	which is a linked list of "NODE" structures.  (CCDECL, CCSTMT)
	(2) Some optimization is done on the parse tree. (CCOPT, CCEVAL)
	(3) The code generator is called on this parse tree, to fill a
	"peephole" buffer with what are called "pseudo-code" instructions.
	The buffer is a linear array of "PSEUDO" structures.  The code
	generator is subdivided into two general phases:
		(3a) CCGEN - Determine from parse-tree what pseudo-code to use.
			This phase knows about BOTH parse trees and
			pseudo-code, but does not manipulate the buffer.
		(3b) CCCODE - Put pseudo-code in buffer, applying optimization
			and forcing out buffer when there is no more need to
			look back at buffered pseudo-code.  (or when full!)
			This phase only knows about pseudo-code.
	(4) Whenever the buffer is forced out, the output processor (CCOUT) is
	called on each pseudo-code instruction to generate the actual
	character strings needed for the assembler file.  The output
	processor never sees the parse tree, only the pseudo-code
	as it comes from the buffer.

Optimization occurs at various places.  Some optimization can be done
directly on the parse tree itself, but it looks as if most actual optimization
comes during (3b) -- as pseudo-code is being added to the peephole buffer,
various checks are made trying to combine the current op with previous
ops in more efficient ways.  A few optimizations are done in (4) when
CCOUT selects the best way to do some minor things.
This description can use some more details.

Since the entire parse tree for a declaration is kept in memory, it
may be possible that extremely large functions or data initializations
could cause problems for KCC.  However, such things are considered
bad practice and are rarely if ever seen; this is unlikely to ever be
a problem.  Note this only applies to single declaration statements, not
to the entire file.

More about input processing:
	All input is handled by the CCINP module, which provides the
"nextc" function to read the next character.  CCINP handles all of the
pre-processor '#' commands as they are encountered; the rest of KCC
never sees them.  CCINP also provides macro-expansion text just as if
it was part of the regular input stream.
	The only thing it cannot do is start macro expansions; this is
taken care of by the next level of input, CCLEX, which provides the
"nextoken" function.  This is THE main input function as far as the
rest of KCC is concerned; nothing else but CCLEX calls CCINP's
"nextc".  "nextoken" simply reads the next C-language token and
returns its code (codes defined in CC.H).  Nothing outside of CCLEX
should be concerned with input characters; only with tokens.
"nextoken" does set certain globals as a side effect, since values may
be associated with some types of tokens (notably constants).  CCLEX
does not know anything about parse trees.  The comments at the start
of CCLEX pretty much explain what variables nextoken() can set as
side effects.

CCDECL and CCSTMT perform the actual semantic analysis of the tokens
that "nextoken" returns, building the parse tree.  Nothing else calls
"nextoken", except CCERR in order to attempt recovery from some kinds
of errors.

How a program gets compiled (control flow):

main()	does simple switch inits, and scans all arguments once in order to
	process all switches furnished by the user.  Then scans the arguments
	once more, and calls cfile() to process each remaining file argument.
	After all arguments are processed, the linking loader
	(LINK) may or may not be invoked.

    cfile()	Hacks an argument filename.  Processing depends on current
		switch values.  Does necessary per-file initializations,
		then normally tries to compile the file by first
		calling entdef() (a special hack, can be ignored) and
		then (TA DA!) loops calling extdef() to process external
		definitions as long as input exists, feeding the results
		into the code generator via gencode().
		When input is done, postamble() takes care of final output
		to the assembler file.  An additional "prefix" assembler
		file may also be generated at this point.  Finally,
		asmb() is used to invoke the assembler on the file.

So, extdef() and gencode() actually do all the real work; everything
seen in the file is some kind of top-level declaration, either of data
or of a function.
	extdef() initializes a few things (see note below re ENTRY hacking)
	and then calls declarator() to parse as far as the first identifier
	(symbol name).
	If it was anything but a function, datadef() is called to
		generate a parse tree from not only this definition but
		all succeeding definitions until the end of the declaration
		statement is seen.
	If it was a function, extdef will parses the parameter list
		itself and then invoke funcdef().  This routine parses the
		parameter declarations and the function body.

Note on ENTRY and entdef():
	In order to remain compatible with older source files used with
	KCC which used the "entry" statement, a special routine called
	entdef() is invoked at the start of compilation.  This scans
	ENTRY statements, which must precede any other kind
	of declaration.  Thereafter ENTRY is ignored.
	The ENTRY statement is peculiar to KCC.  It was needed
	only because the PDP10 assembler insists that any library
	entry points be specified BEFORE any code is produced!
	This was fixed by having KCC generate an auxiliary "prefix"
	assembler file and feeding that to the assembler just ahead of
	the main file.  ENTRY should no longer be used, and KCC should
	eventually not support it.

In more code-like form:

CC main()			- Main program
CC	<Initialize compiler switch values>
CC	Scan all arguments for switches.  For each switch, call
CC		cswitch() - process a switch.
CC	Scan all arguments again for filenames.  For each file, call
CC		cfile() - process one file.
CC			<initialize all per-file stuff>
CCDECL			Invoke entdef() for "entry" stmt compatibility
CC			Until EOF on input, call
CCDECL				extdef() - process one external def.
CCGEN				gencode() - generate code for it.
CCOUT			postamble()	- output assembler-file postamble.
CCOUT			makprefile()	- Make assembler prefix file if needed.
CCASMB			asmb()		- Invoke assembler.
CC	If invoking linking loader, call
CCASMB		runlnk().
	

Where the real work happens:

CCDECL extdef()	- Read one external definition, and return parse tree if any.
CCDECL	Initialize for a declaration statement, then call
CCDECL		declarator()	- Parse one identifier
CCDECL	If got non-function, call
CCDECL		datadef()	- Parse rest of declaration, return it.
CCDECL			Until semicolon seen, call
CCSTMT				defnode()	- Make parse-tree node
CCDECL				declarator()	- parse another identifier
CCDECL	else got function.
CCDECL		<parse parameter list>, then call
CCDECL		funcdef()	- Parse rest of function, return it.
CCDECL			<Set up various things>
CCDECL			Parse parameter declarations, by calling
CCDECL				declarator()
CCDECL			<Set up parameter (local var) stuff>
CCDECL			Then parse the function body, by calling
CCSTMT				statement()	- Parse a statement
CCDECL			Then finish consing together the resulting parse-tree
CCSTMT				with defnode().

CCGEN	gencode() is fairly easy to follow.


This page describes some of the basic KCC data structures and constant types.

In general, structures are declared in the following way:
	#define NAME _name
	struct _name {...};
and references to the structure tag should use NAME.  Pointers should
be declared as NAME *.

Important structures:
	TOKEN	Token definition (entry in table)
	  T_	  Token identifiers (codes)
	  TKTY_	  Token types
	TYPE	Type Definition
	  TS_	  Type Specifier codes
	SYMBOL	Symbol Definition
	  SC_	  Symbol Classes
	NODE	Parse tree node
	  N_	  Node identifiers (opcodes)
	  NF_	  Node flags
	  CAST_	  Cast (coercion) types
	PCODE	Pseudo-Code entry (in peephole buffer)
	  P_	  Opcode values (almost == to PDP10 instructions)
	  POF_	  Opcode flags
	  POS_	  Opcode skip values
	  POSF_	  Opcode skip value bits
	  PTF_	  Type Flags
	  PTA_	  Type Addressing modes
	  PTV_	  Type Values (flag/addr combos)

Minor structures:
	SMEM	Structure Member
	RESWD	Reserved Word table entry
	TYNAME	Built-in Type table entry

NOTE: Currently, for historical reasons the token types, node op-codes, and
a few other things are all mixed together.  The following prefixes indicate
symbols which used to all be "token constants":
	N_	- Only used as a node op-code
	T_	- Only used as a token value
	Q_	- Used as BOTH a node op-code and token value!
		(Why Q?  Well, it's exactly halfway between N and T...
		 also, think about a Q-ship, which isn't what it seems.)

See CCSYM.H for details on the format of symbol table entries (SYMBOL)
and type specifiers (TYPE).

---------------------- PARSE TREE STRUCTURE ---------------------

The parse tree generated by KCC's parser (and passed to the code generator)
is made of NODE structures.  The format and many details are explained
in CCNODE.H.  Specifics on the way each node-op uses the NODE structure
are contained in CCTOKS.H.

	Only the parser creates parse trees, and the CCDECL and CCSTMT
modules contain all of the parser.  The modules which form the code
generator and which read the parse tree are CCGEN, CCGEN1, CCGEN2 and
CCGSWI.

CODE GENERATOR

Virtual Registers:

	The code generator translates the parse-tree nodes into
pseudo-code operations on "virtual registers".  A virtual register
(vreg) is treated as if it was a machine register, but whether its
contents are actually in a machine register, or on the stack, is
generally unspecified; the register allocation routines in CCREG are
responsible for saving virtual registers on the stack as needed.
	A vreg may consist of one or two words, and this is sufficient
to hold an object of any type except structures/unions larger than 2 words.
Not all of the bits in a vreg are actually significant; this depends on
the type.  For example, types which are smaller than 1 word (such as
chars and bit-fields) can only treat the rightmost N bits as significant,
where N is the number of bits in the type.  Cast conversions can change
the contents of a vreg in two ways:
	(1) Change the number of significant bits (or # of words)
	(2) Change the representation (for those bits significant in the
		new type)

Note that, in general, cast conversion to a smaller type does not result
in any code being produced.  This is because all that changes is the
number of bits considered significant, and the representation of the
significant bits remains the same.  Conversion to a larger type, however,
will enlarge the number of significant bits, and these new bits have
to be set to some value, so some code is likely to be generated to
accomplish this.

For example, in the expression:
	int a, b;
	a = (char)b;
The value of b is first retrieved into a vreg, which amounts to a "MOVE vreg,b"
instruction, and the cast to (char) is then applied.  However, this does NOT
generate any code, because all that happens is that only the rightmost 9
bits of the vreg contents become significant.  It is only when this value
is implicitly converted back to (int) for the store into "a" that the 
cast from (char) to (int) takes effect and the non-significant bits are
zapped.