Chapter 12 Wielding Lexical Black Magic
(wield /wiːld/: to use, handle, or exercise skillfully)
Introduction: after the previous chapters, we should be able to execute arbitrary code while parsing and alter syntax recognition with semantic predicates.
The fundamental problem is that the lexer does the tokenizing, but sometimes only the parser has the context information needed to make tokenizing decisions.
island languages: sentences have islands of interesting bits surrounded by a sea of stuff we don’t care about. To parse island languages, we need island grammars and lexical modes.
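Section 12.1, Broadcasting Tokens on Different Channels, covers sending whitespace and comments to a separate channel instead of passing them to the parser, so that this information can still be queried and used during the parse phase.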
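The channel idea can be illustrated with a small plain-Python simulation. This is only a sketch and not ANTLR-generated code: the `Token` class, the channel constants, the token spec, and the helper names `parser_view` and `hidden_tokens_to_left` are all assumptions made up for this example.

```python
import re
from dataclasses import dataclass

# Hypothetical channel numbers for this sketch.
DEFAULT_CHANNEL = 0
HIDDEN_CHANNEL = 1

@dataclass
class Token:
    type: str
    text: str
    channel: int = DEFAULT_CHANNEL

# Assumed toy token set; comments and whitespace go to the hidden channel.
TOKEN_SPEC = [
    ("COMMENT", r"//[^\n]*"),
    ("WS", r"\s+"),
    ("ID", r"[A-Za-z_]\w*"),
    ("INT", r"\d+"),
    ("OP", r"[=;+]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Keep every token, but tag whitespace and comments as hidden."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        kind = m.lastgroup
        channel = HIDDEN_CHANNEL if kind in ("WS", "COMMENT") else DEFAULT_CHANNEL
        tokens.append(Token(kind, m.group(), channel))
        pos = m.end()
    return tokens

def parser_view(tokens):
    """What the parser sees: only tokens on the default channel."""
    return [t for t in tokens if t.channel == DEFAULT_CHANNEL]

def hidden_tokens_to_left(tokens, index):
    """Collect the hidden tokens immediately before tokens[index]."""
    result = []
    i = index - 1
    while i >= 0 and tokens[i].channel == HIDDEN_CHANNEL:
        result.append(tokens[i])
        i -= 1
    result.reverse()
    return result
```

The point of the design is that nothing is thrown away: the parser works on the filtered view, while a later pass (for example, one that needs the comment in front of a declaration) can still look up the hidden tokens by index.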
Three lexical problems fit into the context-sensitivity bucket:
a) The same token character sequence can mean different things to the parser.
b) The same character sequence can be one token or multiple tokens.
c) The same token must sometimes be ignored and sometimes be recognized by the parser.
There are two approaches to allow keywords to act as identifiers in some syntactic contexts. The first approach has the lexer pass all keywords to the parser as keyword token types, and then we create a parser id rule that matches ID and any of the keywords. The second approach has the lexer pass keywords as identifiers, and then we use predicates to test identifier names in the parser.
The first approach creates an id rule that matches ID and the keywords, like this:
id : 'if' | 'call' | 'then' | ID;
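To see why such an id rule works, here is a hand-written recursive-descent sketch in plain Python. It is not ANTLR output. The mini-language (`stat : 'if' id 'then' stat | 'call' id ';'`), the token names, and the `Parser` class are assumptions chosen only to exercise the idea.

```python
KEYWORDS = {"if", "then", "call"}

def tokenize(text):
    """The lexer always reports keywords as keyword token types."""
    tokens = []
    for word in text.replace(";", " ; ").split():
        if word in KEYWORDS:
            tokens.append((word.upper(), word))
        elif word == ";":
            tokens.append(("SEMI", word))
        else:
            tokens.append(("ID", word))
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else "EOF"

    def match(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected}, found {self.peek()}")
        text = self.tokens[self.pos][1]
        self.pos += 1
        return text

    def id(self):
        # id : 'if' | 'call' | 'then' | ID ;
        kind = self.peek()
        if kind in ("IF", "CALL", "THEN", "ID"):
            return self.match(kind)
        raise SyntaxError(f"expected an identifier, found {kind}")

    def stat(self):
        # Assumed rule: stat : 'if' id 'then' stat | 'call' id ';' ;
        if self.peek() == "IF":
            self.match("IF")
            cond = self.id()
            self.match("THEN")
            return ("if", cond, self.stat())
        self.match("CALL")
        name = self.id()
        self.match("SEMI")
        return ("call", name)
```

With this, `Parser(tokenize("if if then call call;")).stat()` treats the second `if` and the second `call` as identifiers, because the parser, not the lexer, decides which role a keyword token plays at each position.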
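Avoiding the Maximal Munch Ambiguity covers cases where matching the longest possible character sequence is the wrong choice. Normally the lexer should match an input such as += as a single token, but for >> in nested generic types such as List<List<String>>, the lexer should pass two separate > tokens to the parser, and the parser then combines them into a shift operator when the context calls for it.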
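A classic instance of this ambiguity is `>>`, which closes two nested generic types in one context and is a shift operator in another. The following plain-Python sketch (not ANTLR code) shows a lexer that never emits `>>` as one token and two parsing functions that interpret the pair by context. The grammar fragments in the docstrings and the function names are assumptions for illustration.

```python
import re

def tokenize(text):
    """Never emit '>>' as a single token: every '>' stands alone."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[<>]", text)

def parse_type(tokens, pos=0):
    """type : ID ('<' type '>')? ;  Returns (tree, next_position)."""
    name = tokens[pos]
    pos += 1
    if pos < len(tokens) and tokens[pos] == "<":
        arg, pos = parse_type(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ">":
            raise SyntaxError("expected '>'")
        return (name, arg), pos + 1
    return name, pos

def parse_shift(tokens):
    """expr : operand ('>' '>' operand)* ;  The parser pairs up the '>' tokens."""
    result = tokens[0]
    pos = 1
    while pos + 1 < len(tokens) and tokens[pos] == ">" and tokens[pos + 1] == ">":
        if pos + 2 >= len(tokens):
            raise SyntaxError("missing right operand")
        result = (">>", result, tokens[pos + 2])
        pos += 3
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing input")
    return result
```

One limitation of this sketch: because the tokenizer drops whitespace, `x > > 2` is also accepted as a shift. A real implementation would need an extra check that the two `>` tokens are adjacent in the input.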
Fun with Python Newlines covers newlines in Python, where context determines whether a newline ends a statement. The fix is to add the following IGNORE_NEWLINE rule before the NEWLINE rule:
IGNORE_NEWLINE : '\r' ? '\n' {nesting>0}? -> skip ;
Then nesting is incremented on '(' or '[' and decremented on ')' or ']'.
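The nesting counter can be simulated with a small hand-written tokenizer in plain Python. This is a sketch of the idea only, not ANTLR-generated code, and the simplified token set (words, single-character punctuation, and a `NEWLINE` marker) is an assumption.

```python
def tokenize(text):
    """Emit NEWLINE only when not nested inside () or []."""
    tokens = []
    nesting = 0  # how many '(' or '[' are currently open
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in "([":
            nesting += 1
            tokens.append(ch)
            i += 1
        elif ch in ")]":
            nesting -= 1
            tokens.append(ch)
            i += 1
        elif ch == "\n" or text.startswith("\r\n", i):
            # Mirrors '\r'? '\n': skipped while nested, a real token otherwise.
            if nesting == 0:
                tokens.append("NEWLINE")
            i += 2 if ch == "\r" else 1
        elif ch in " \t":
            i += 1  # plain whitespace is always skipped
        elif ch.isalnum() or ch == "_":
            j = i
            while j < len(text) and (text[j].isalnum() or text[j] == "_"):
                j += 1
            tokens.append(text[i:j])
            i = j
        else:
            tokens.append(ch)  # any other single character, e.g. ',' or '='
            i += 1
    return tokens
```

For an input like `"f(1,\n  2)\ng = 3\n"`, the newline inside the parentheses disappears while the two statement-ending newlines are kept, so the parser only ever sees newlines that actually terminate a statement.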
Section 12.3, Islands in the Stream, covers languages whose input mixes several formats, with regions of one format surrounding regions of another, as in XML.
ANTLR provides lexical modes that let lexers switch between contexts (modes).
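Mode switching can be sketched as a lexer that keeps a stack of modes and picks its rule set from the top of the stack. The plain-Python simulation below is not ANTLR code. The XML-like token names, the two modes `DEFAULT` and `INSIDE`, and the action strings are assumptions for illustration.

```python
import re

# Each mode has its own rules: (token type, regex, action).
# Hypothetical actions: "push:<mode>", "pop", "skip", or None.
MODES = {
    "DEFAULT": [
        ("OPEN", r"<", "push:INSIDE"),   # entering a tag: switch context
        ("TEXT", r"[^<]+", None),        # the "sea" outside tags
    ],
    "INSIDE": [
        ("CLOSE", r">", "pop"),          # leaving the tag: back to previous mode
        ("SLASH", r"/", None),
        ("EQUALS", r"=", None),
        ("STRING", r'"[^"]*"', None),
        ("NAME", r"[A-Za-z_][\w.-]*", None),
        ("WS", r"\s+", "skip"),
    ],
}

def tokenize(text):
    mode_stack = ["DEFAULT"]
    tokens, pos = [], 0
    while pos < len(text):
        # Only the rules of the current mode are tried.
        for kind, pattern, action in MODES[mode_stack[-1]]:
            m = re.compile(pattern).match(text, pos)
            if m:
                break
        else:
            raise SyntaxError(f"no rule matches at offset {pos} in mode {mode_stack[-1]}")
        pos = m.end()
        if action != "skip":
            tokens.append((kind, m.group()))
        if action == "pop":
            mode_stack.pop()
        elif action and action.startswith("push:"):
            mode_stack.append(action[len("push:"):])
    return tokens
```

Running it on `'<a id="1">x = y</a>'` shows the context sensitivity: the `=` inside the tag becomes an `EQUALS` token, while the `=` between the tags is just part of a `TEXT` token.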