Specification of tokens

An alphabet is any finite set of symbols. Typical examples of symbols are letters, digits, and punctuation. The set {0,1} is the binary alphabet.

A string over an alphabet is a finite sequence of symbols drawn from that alphabet. In language theory, the terms "sentence" and "word" are often used as synonyms for "string." The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For example, banana is a string of length six. The empty string, denoted e, is the string of length zero.

The following string-related terms are commonly used:

1. A prefix of string s is any string obtained by removing zero or more. symbols from the end of s. For example, ban, banana, and e are prefixes of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from the beginning of s. For example, nana, banana, and e are suffixes of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s. For instance,banana, nan, and e are substrings of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those, prefixes, suffixes, and substrings, respectively, of s that are not e or not equal to s itself.

5. A subsequence of s is any string formed by deleting zero or more not necessarily consecutive positions of s. For example, baan is a subsequence of banana.
A language is any countable set of strings over some fixed alphabet. This definition is very broad. Abstract languages like 0, the empty set, or {e}, the set containing only the empty string, are languages under this definition. So too are the set of all syntactically well-formed C programs and the set of all grammatically correct English sentences, although the latter two languages are difficult to specify exactly.
Operations on languages
LUD: Union operation, where L=set of alphabets
{A..Z,a..z} and D=set of digits {0..9}
LD: Concatenation
L4:exponentiation: set of strings with 4 letters
L0= _
Li=Li-1L
L*=all strings with _: Kleene closure of L
D+: set of all strings of digits of one or more
1. L U D is the set of letters and digits — strictly speaking the language with 62 strings of
length one, each of which strings is either one letter or one digit.
2. LD is the set of strings of length two, each consisting of one letter followed by one digit.
3. L4 is the set of all 4-letter strings.
4. L* is the set of all strings of letters, including e, the empty string.
5. L(L U D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.

0 comments