Tokens, patterns and lexemes

A token is a terminal symbol in the grammar for the source language. It is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier. For example, when the character sequence ‘pi’ appears in the source program, a token representing an identifier is returned to the parser. The token names are the input symbols that the parser processes. In what follows, we shall generally write the name of a token in boldface, and we will often refer to a token by its token name.
              
A pattern is a rule describing the form that the lexemes of a token may take, i.e., the set of lexemes that can represent that token in source programs. In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.

A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
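
The relationship between the three terms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Token` type, the `ID_PATTERN` regular expression (a letter or underscore followed by letters, digits, or underscores), and the helper `match_identifier` are hypothetical names chosen for this example, and the choice to store the lexeme itself as the attribute value is one possible convention, not something the text prescribes.

```python
import re
from typing import NamedTuple, Optional


class Token(NamedTuple):
    """A token: a token name plus an optional attribute value."""
    name: str                  # abstract symbol seen by the parser, e.g. "id"
    attribute: Optional[str]   # here (by assumption) the matched lexeme


# Assumed pattern for identifiers; real languages define their own rule.
ID_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def match_identifier(source: str, pos: int = 0) -> Optional[Token]:
    """Try to match the identifier pattern at position `pos` of `source`.

    Returns an "id" token whose attribute is the lexeme, or None when the
    characters at `pos` do not match the pattern.
    """
    m = ID_PATTERN.match(source, pos)
    if m is None:
        return None
    lexeme = m.group()         # the actual characters found in the source
    return Token("id", lexeme)
```

Calling `match_identifier("pi = 3.14")` yields a token named `id` whose lexeme is `pi`: the pattern is the regular expression, the lexeme is the two characters matched, and the token is the pair handed to the parser.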

In many programming languages, the following classes cover most or all of the tokens:

1. One token for each keyword. The pattern for a keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in classes such as the token comparison.
3. One token representing all identifiers.
4. One or more tokens representing constants, such as numbers and literal strings.
5. Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.
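
The five classes above can be sketched as a small regular-expression tokenizer. This is a minimal illustration, not a definitive lexer: the keyword set, the concrete patterns, the token names (`number`, `literal`, `id`, `comparison`, `assign`), and the decision to skip whitespace silently are all assumptions made for the example.

```python
import re
from typing import Iterator, NamedTuple, Optional


class Token(NamedTuple):
    name: str
    attribute: Optional[str] = None


# Assumed keyword set; each keyword becomes its own token (class 1).
KEYWORDS = {"if", "else", "while"}

# (token name, pattern) pairs; earlier entries win when several could match.
TOKEN_SPEC = [
    ("number",     r"\d+(?:\.\d+)?"),           # class 4: numeric constants
    ("literal",    r'"[^"\n]*"'),               # class 4: literal strings
    ("id",         r"[A-Za-z_][A-Za-z0-9_]*"),  # class 3: all identifiers
    ("comparison", r"<=|>=|==|!=|<|>"),         # class 2: operators as a class
    ("assign",     r"="),                       # class 2: an individual operator
    ("punct",      r"[(),;]"),                  # class 5: punctuation symbols
    ("ws",         r"\s+"),                     # whitespace, skipped (assumption)
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source: str) -> Iterator[Token]:
    """Yield tokens for `source`, raising SyntaxError on an unknown character."""
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at position {pos}")
        pos = m.end()
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue
        if kind == "id" and lexeme in KEYWORDS:
            yield Token(lexeme)        # the keyword is its own token name
        elif kind == "punct":
            yield Token(lexeme)        # one token per punctuation symbol
        else:
            yield Token(kind, lexeme)  # keep the lexeme as the attribute value
```

For instance, `list(tokenize("if (pi <= 3.14)"))` produces a keyword token `if`, the punctuation token `(`, an `id` token with lexeme `pi`, a `comparison` token with lexeme `<=`, a `number` token with lexeme `3.14`, and the punctuation token `)`. Keywords are recognized by first matching the identifier pattern and then checking the keyword set, which is one common design choice among several.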
