The Scanner
Overview
- use a scanner generator such as
lex
orjlex
- The input to a scanner includes
one regular expression for each token
Finite State Machine
- S is the start state, each FSM has exactly one.
- A is a final state, drawn using a double circle. A FSM may have more than one final state.
- The finite-state machine stops when it gets stuck or when it has consumed all of the input characters.
An input string is accepted by a FSM if
* The entire string is consumed, and
* The machine ends in a final state.
Formal Definition
An FSM is a 5-tuple: (Q,Σ,δ,q,F)
- Q is a finite set of states ({S,A,B} in the above example).
- Σ (an uppercase sigma) is the alphabet of the machine, a finite set of characters that label the edges ({+,−,0,1,...,9} in the above example).
- q is the start state, an element of Q (S in the above example).
- F is the set of final states, a subset of Q ({B} in the above example).
- δ is the state transition relation: Q×Σ→Q (i.e., it is a function that takes two arguments -- a state in Q and a character in Σ -- and returns a state in Q).
Regular Expressions
Pay attention to the precedence of operators.
"|" < "." < "*"
for example: letter(letter|digit)* is different with
letter.letter|digit*, which is (letter.letter)|(digit*).
Using Regular Expressions and FSMs to Define a Scanner
There is a theorem that says that for every regular expression, there is a finite-state machine that defines the same language, and vice versa.