Before
discussing the problem of recognizing lexemes in the input, let us examine some
ways that the simple but important task of reading the source program can be
speeded. This task is made difficult by the fact that we often have to look one
or more characters beyond the next lexeme before we can be sure we have the
right lexeme.
Buffer
Pairs
Because
of the amount of time taken to process characters and the large number of
characters that must be processed during the compilation of a large source
program, specialized buffering techniques have been developed to reduce the
amount of overhead required to process a single input character.
Look
ahead of characters in the input is necessary to identify tokens. Specified
buffering techniques have been developed to reduce the large amount of time
consumed in moving characters. A buffer (array) divided into two N-character
halves, where N=number of characters on one disk block ‘eof’ marks the end of
source file and it is different from input character
E=M*C**2eof
Two
pointers are maintained: beginning of the lexeme pointer and forward pointer.
Initially,
both pointers point to the first character of the next lexeme to be found.
Forward pointer scans ahead until a match for a pattern is found. Once the next
lexeme is determined, processed and both pointers are set to the character
immediately past the lexeme. If the forward pointer moves halfway mark, the
right N half is filled with new characters. If the forward pointer moves right
end of the buffer then left N half is filled with new characters. The
disadvantage is look ahead is limited and it is impossible to recognize tokens,
when distance between the two pointers is more than the length of the buffer.
Advancing
forward requires that we first test whether we have reached the end of one of
the buffers, and if so, we must reload the other buffer from the input, and move
forward to the beginning of the newly loaded buffer. As long as we never need
to look so far ahead of the actual lexeme that the sum of the lexeme's length
plus the distance we look ahead is greater than N,
we shall never overwrite the lexeme in its buffer before
determining it.
Sentinels
It
is an extra key inserted at the end of the array. It is a special, dummy
character that can’t be part of source program. With respect to buffer pairs,
the code for advancing forward pointer is:
If forward is at the end of first half then
begin reload
second half
forward=forward+1
end
elseif forward is at the end of second half
then begin
reload first half
move forward to beginning of first half
end
else
forward=forward+1
Instead
of this, we provide an extra character, sentinel at the end of each half of the
buffer. Sentinel is not part of our source program and works as ‘eof’. Now only
one test is sufficient, that is, if forward=‘eof’ or not.
0 comments