Chapter 1 & 2: Language Processing and Python & Accessing Text Corpora and Lexical Resources
- NLTK
- concordance( ) function
- Word Sense Disambiguation & Pronoun Resolution
- Text Corpus Structure
- WordNet
1. Key Points:
What's NLTK?
NLTK is the Natural Language Toolkit, originally created in 2001 as part of a computational linguistics course in the Department of Computer and Information Science at the University of Pennsylvania. It is the first tool most people reach for when getting started in NLP research.
The book itself is an introduction to natural language processing with Python and can largely be read as a handbook for the NLTK library; every method it uses comes from nltk. If you want the API documentation, or need to download and install NLTK, go to the official website: the API documentation there covers every module, class, and function in the toolkit, with detailed parameter descriptions and usage examples, so they are not repeated here.
- A brief overview of the most important NLTK modules and what they do:
Language processing task | NLTK module | Functionality |
---|---|---|
Accessing corpora | nltk.corpus | Standardized interfaces to corpora and lexicons |
String processing | nltk.tokenize, nltk.stem | Word and sentence tokenizers, stemmers |
Collocation discovery | nltk.collocations | t-test, chi-squared, pointwise mutual information |
Part-of-speech tagging | nltk.tag | n-gram, backoff, Brill, HMM, TnT taggers |
Classification | nltk.classify, nltk.cluster | Decision tree, maximum entropy, naive Bayes, EM, k-means |
Chunking | nltk.chunk | Regular expressions, n-grams, named entities |
Parsing | nltk.parse | Chart, feature-based, unification, probabilistic, dependency parsers |
Semantic interpretation | nltk.sem, nltk.inference | Lambda calculus, first-order logic, model checking |
Evaluation metrics | nltk.metrics | Precision, recall, agreement coefficients |
Probability and estimation | nltk.probability | Frequency distributions, smoothed probability distributions |
Applications | nltk.app, nltk.chat | Graphical concordancer, parsers, WordNet browser, chatbots |
Linguistic fieldwork | nltk.toolbox | Manipulation of data in SIL Toolbox format |
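As a quick illustration of how a few of these modules fit together, here is a minimal sketch (not from the book; the sample sentence and the download calls are only for demonstration):

import nltk
nltk.download('punkt')        # tokenizer models
nltk.download('stopwords')    # stopword lists

raw = "NLTK provides interfaces to corpora, tokenizers, taggers and classifiers."
tokens = nltk.word_tokenize(raw)                       # nltk.tokenize
fdist = nltk.FreqDist(tokens)                          # nltk.probability
print(fdist.most_common(3))
print(nltk.corpus.stopwords.words('english')[:5])      # nltk.corpus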
concordance( ) function
- The concordance function: an interesting method on nltk Text objects that shows every occurrence of a given word together with some surrounding context (matching is case-insensitive). Here is a typical use, where text1 is the text1 object loaded via nltk.book:
>>> text1.concordance("monstrous")
Building index...
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
The implementation of this function in the NLTK source looks like this:
def concordance(self, word, width=79, lines=25):
    """
    Print a concordance for ``word`` with the specified context window.
    Word matching is not case-sensitive.
    :seealso: ``ConcordanceIndex``
    """
    if '_concordance_index' not in self.__dict__:
        print("Building index...")
        self._concordance_index = ConcordanceIndex(self.tokens,
                                                   key=lambda s: s.lower())
    self._concordance_index.print_concordance(word, width, lines)
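The same method works on your own text once it is wrapped in an nltk.Text object; a minimal sketch (document.txt is a hypothetical plain-text file of your own):

import nltk
nltk.download('punkt')

raw = open('document.txt').read()              # hypothetical file; any plain text will do
my_text = nltk.Text(nltk.word_tokenize(raw))
my_text.concordance("monstrous")               # same case-insensitive matching as above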
Word Sense Disambiguation & Pronoun Resolution
- Word Sense Disambiguation
Word sense disambiguation means working out which sense of a word is intended in a given context. Each of the words below, for example, has several distinct senses (a small automatic sketch follows after this list):
a. serve: help with food or drink; hold an office; put ball into play
b. dish: plate; course of a meal; communications device
- Pronoun Resolution
Pronoun resolution (anaphora resolution) tackles a related, deeper kind of language understanding: working out "who did what to whom", i.e., detecting the subjects and objects of verbs and deciding what a pronoun refers back to. A closely related task is semantic role labeling: determining how a noun phrase relates to the verb (as agent, patient, instrument, and so on).
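For word sense disambiguation, recent NLTK releases ship a simple Lesk implementation in nltk.wsd (not covered in these two chapters); a minimal sketch:

import nltk
nltk.download('wordnet')
nltk.download('punkt')
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

sent = word_tokenize("He served the ball into the opponent's court")
# Lesk picks the synset whose dictionary gloss overlaps most with the context words
print(lesk(sent, 'serve', 'v'))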
Text Corpus Structure
The most common corpus structures are:
- the simplest kind: a collection of isolated texts with no particular organization;
- corpora organized into categories such as genre (the Brown Corpus);
- categorizations that overlap, such as topic categories (the Reuters Corpus);
- corpora that represent language use changing over time (the Inaugural Address Corpus).
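These structures map directly onto the corpus readers; a minimal sketch (the corpora must first be fetched with nltk.download()):

from nltk.corpus import gutenberg, brown, reuters, inaugural

print(gutenberg.fileids()[:3])                 # isolated texts
print(brown.categories()[:5])                  # organized by genre
print(reuters.categories('training/9865'))     # overlapping topic categories
print(sorted(set(fileid[:4] for fileid in inaugural.fileids()))[:5])   # usage over time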
WordNet
- Senses and Synonyms.
- Synsets and Lemmas.
- The WordNet Hierarchy.
WordNet synsets correspond to abstract concepts, and they don't always have corresponding words in English. These concepts are linked together in a hierarchy. Some concepts are very general, such as Entity, State, Event; these are called unique beginners or root synsets. Others, such as gas guzzler and hatchback, are much more specific.
Fragment of the WordNet concept hierarchy: each node corresponds to a synset; edges indicate hypernym/hyponym relations, i.e., the relation between superordinate and subordinate concepts.
- Hyponyms and Hypernyms.
- Antonyms.
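These relations correspond directly to methods on the WordNet interface; a minimal sketch using the book's motorcar example:

from nltk.corpus import wordnet as wn

motorcar = wn.synset('car.n.01')
print(motorcar.definition())                       # the gloss of the synset
print(motorcar.hypernyms())                        # more general concepts
print(motorcar.hyponyms()[:3])                     # more specific concepts (e.g. hatchback)
print(motorcar.root_hypernyms())                   # the unique beginner, entity.n.01
print(wn.lemma('supply.n.02.supply').antonyms())   # antonymy holds between lemmas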
2. Errata in the Printed Book:
P19.
In the [Your Turn] box:
Try the previous frequency distribution example with text2. ... If you get the error message NameError: name 'FreqDist' is not defined, you need to start your work with **from nltk.book import ***.
Should be corrected to:
Try the previous frequency distribution example with text2. ... If you get the error message NameError: name 'FreqDist' is not defined, you need to start your work with **from nltk import ***.
Reason: FreqDist( ) is not a function defined in nltk.book.
P48.
In the code block of the [Inaugural Address Corpus] section:
>>> cfd = nltk.ConditionalFreqDist(
... (target, file[:4])
... for fileid in inaugural.fileids()
Should be corrected to:
>>> cfd = nltk.ConditionalFreqDist(
... (target, fileid[:4])
... for fileid in inaugural.fileids()
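For context, the complete corrected block in that section reads approximately as follows (quoted from memory, so treat it as a sketch rather than an exact transcription):

>>> cfd = nltk.ConditionalFreqDist(
...           (target, fileid[:4])
...           for fileid in inaugural.fileids()
...           for w in inaugural.words(fileid)
...           for target in ['america', 'citizen']
...           if w.lower().startswith(target))
>>> cfd.plot()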
3. Practice:
6.○ In the discussion of comparative wordlists, we created an object called translate, which you could look up using words in both German and Italian in order to get corresponding words in English. What problem might arise with this approach? Can you suggest a way to avoid this problem?
- Looking up a word that does not exist (or a word from a language that was never merged into the dictionary via translate.update(dict(xx))) raises a KeyError. One way to avoid this is to add error handling, e.g., look words up with dict.get and a default value, as sketched below.
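A minimal sketch of that fix, building the translate dictionary from the swadesh wordlists as in the book and using dict.get so a missing word returns a default instead of raising KeyError:

from nltk.corpus import swadesh

translate = dict(swadesh.entries(['fr', 'en']))          # French -> English, as in the book
translate.update(dict(swadesh.entries(['de', 'en'])))    # add German -> English
translate.update(dict(swadesh.entries(['it', 'en'])))    # add Italian -> English

print(translate.get('Hund', 'unknown word'))    # 'dog'
print(translate.get('xyzzy', 'unknown word'))   # default value, no KeyError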
8.◑ Define a conditional frequency distribution over the Names Corpus that allows you to see which initial letters are more frequent for males versus females (see Figure 2-7).
import nltk
from nltk.corpus import names
cfd = nltk.ConditionalFreqDist(
    (fileid, name[0])                 # name[0] is the initial letter ('male.txt' / 'female.txt' are the conditions)
    for fileid in names.fileids()
    for name in names.words(fileid))
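Plotting the distribution then shows which initial letters are more frequent for males versus females, analogous to the book's Figure 2-7:

cfd.plot()    # one line per fileid ('male.txt', 'female.txt'); x-axis: initial letter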
14.◑ Define a function supergloss(s) that takes a synset s as its argument and returns a string consisting of the concatenation of the definition of s, and the definitions of all the hypernyms and hyponyms of s.
from nltk.corpus import wordnet as wn

def supergloss(s):
    # s is already a Synset object, so there is no need to look it up again
    glosses = [s.definition()]                            # definition of s itself
    glosses += [h.definition() for h in s.hypernyms()]    # definitions of all hypernyms
    glosses += [h.definition() for h in s.hyponyms()]     # definitions of all hyponyms
    return '\n'.join(glosses)
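Example call, using an arbitrarily chosen synset and the wn alias imported above:

print(supergloss(wn.synset('car.n.01')))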
17.◑ Write a function that finds the 50 most frequently occurring words of a text that are not stopwords.
import nltk
from nltk import FreqDist

def most_fifty_words(text):
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.isalpha() and w.lower() not in stopwords]
    fdist = FreqDist(content)
    # fdist.keys() is no longer frequency-sorted in NLTK 3; use most_common() instead
    return [word for word, count in fdist.most_common(50)]
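Example call on text1 from nltk.book (the nltk.book texts and the stopwords corpus must be downloaded first):

from nltk.book import text1
print(most_fifty_words(text1))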
4. Open Questions:
- None for now.