Chapter 1 & 2: Language Processing and Python & Accessing Text Corpora and Lexical Resources
- NLTK
- concordance( ) function
- Word Sense Disambiguation & Pronoun Resolution
- Text Corpus Structure
- WordNet
1. Key Points:
What's NLTK?
NLTK is the Natural Language Toolkit, originally created in 2001 as part of a computational linguistics course in the Department of Computer and Information Science at the University of Pennsylvania. It is the first tool most people reach for when getting started in NLP research.
The book itself is an introduction to natural language processing with Python and can largely be read as a handbook for the NLTK library; every method it uses comes from nltk. If you want the API documentation, or need to download and install NLTK, go to the official website: the API documentation there covers every module, class, and function in the toolkit, with detailed parameter descriptions and usage examples, so they are not repeated here.
- A brief overview of the most important NLTK modules and what they do:
Language processing task | NLTK module | Functionality |
---|---|---|
Accessing corpora | nltk.corpus | Standardized interfaces to corpora and lexicons |
String processing | nltk.tokenize, nltk.stem | Word and sentence tokenizers, stemmers |
Collocation discovery | nltk.collocations | t-test, chi-squared, pointwise mutual information |
Part-of-speech tagging | nltk.tag | n-gram, backoff, Brill, HMM, TnT taggers |
Classification | nltk.classify, nltk.cluster | Decision tree, maximum entropy, naive Bayes, EM, k-means |
Chunking | nltk.chunk | Regular expressions, n-grams, named entities |
Parsing | nltk.parse | Chart, feature-based, unification, probabilistic, dependency parsers |
Semantic interpretation | nltk.sem, nltk.inference | Lambda calculus, first-order logic, model checking |
Evaluation metrics | nltk.metrics | Precision, recall, agreement coefficients |
Probability and estimation | nltk.probability | Frequency distributions, smoothed probability distributions |
Applications | nltk.app, nltk.chat | Graphical concordancer, parsers, WordNet browser, chatbots |
Linguistic fieldwork | nltk.toolbox | Manipulation of data in SIL Toolbox format |
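As a quick illustration of how a few of these modules fit together, here is a minimal sketch (not from the book; the sample sentence and the download calls are only for demonstration):

import nltk
nltk.download('punkt')        # tokenizer models
nltk.download('stopwords')    # stopword lists

raw = "NLTK provides interfaces to corpora, tokenizers, taggers and classifiers."
tokens = nltk.word_tokenize(raw)                       # nltk.tokenize
fdist = nltk.FreqDist(tokens)                          # nltk.probability
print(fdist.most_common(3))
print(nltk.corpus.stopwords.words('english')[:5])      # nltk.corpus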
concordance( ) function
- The concordance function: an interesting method on nltk Text objects that shows every occurrence of a given word together with some surrounding context (matching is case-insensitive). Here is a typical use, where text1 is the text1 object loaded via nltk.book:
>>> text1.concordance("monstrous")
Building index...
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
The implementation of this function in the NLTK source looks like this:
def concordance(self, word, width=79, lines=25):
    """
    Print a concordance for ``word`` with the specified context window.
    Word matching is not case-sensitive.
    :seealso: ``ConcordanceIndex``
    """
    if '_concordance_index' not in self.__dict__:
        print("Building index...")
        self._concordance_index = ConcordanceIndex(self.tokens,
                                                   key=lambda s: s.lower())
    self._concordance_index.print_concordance(word, width, lines)
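The same method works on your own text once it is wrapped in an nltk.Text object; a minimal sketch (document.txt is a hypothetical plain-text file of your own):

import nltk
nltk.download('punkt')

raw = open('document.txt').read()              # hypothetical file; any plain text will do
my_text = nltk.Text(nltk.word_tokenize(raw))
my_text.concordance("monstrous")               # same case-insensitive matching as above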
Word Sense Disambiguation & Pronoun Resolution
- Word Sense Disambiguation
Word sense disambiguation means working out which sense of a word is intended in a given context. Each of the words below, for example, has several distinct senses (a small automatic sketch follows after this list):
a. serve: help with food or drink; hold an office; put ball into play
b. dish: plate; course of a meal; communications device
- Pronoun Resolution
Pronoun resolution (anaphora resolution) tackles a related, deeper kind of language understanding: working out "who did what to whom", i.e., detecting the subjects and objects of verbs and deciding what a pronoun refers back to. A closely related task is semantic role labeling: determining how a noun phrase relates to the verb (as agent, patient, instrument, and so on).
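For word sense disambiguation, recent NLTK releases ship a simple Lesk implementation in nltk.wsd (not covered in these two chapters); a minimal sketch:

import nltk
nltk.download('wordnet')
nltk.download('punkt')
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

sent = word_tokenize("He served the ball into the opponent's court")
# Lesk picks the synset whose dictionary gloss overlaps most with the context words
print(lesk(sent, 'serve', 'v'))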
Text Corpus Structure
The most common corpus structures are:
- the simplest kind: a collection of isolated texts with no particular organization;
- corpora organized into categories such as genre (the Brown Corpus);
- categorizations that overlap, such as topic categories (the Reuters Corpus);
- corpora that represent language use changing over time (the Inaugural Address Corpus).
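These structures map directly onto the corpus readers; a minimal sketch (the corpora must first be fetched with nltk.download()):

from nltk.corpus import gutenberg, brown, reuters, inaugural

print(gutenberg.fileids()[:3])                 # isolated texts
print(brown.categories()[:5])                  # organized by genre
print(reuters.categories('training/9865'))     # overlapping topic categories
print(sorted(set(fileid[:4] for fileid in inaugural.fileids()))[:5])   # usage over time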
WordNet
- Senses and Synonyms.
- Synsets and Lemmas.
- The WordNet Hierarchy.
WordNet synsets correspond to abstract concepts, and they don't always have corresponding words in English. These concepts are linked together in a hierarchy. Some concepts are very general, such as Entity, State, Event; these are called unique beginners or root synsets. Others, such as gas guzzler and hatchback, are much more specific.
Fragment of the WordNet concept hierarchy: each node corresponds to a synset; edges indicate hypernym/hyponym relations, i.e., the relation between superordinate and subordinate concepts.
- Hyponyms and Hypernyms.
- Antonyms.
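These relations correspond directly to methods on the WordNet interface; a minimal sketch using the book's motorcar example:

from nltk.corpus import wordnet as wn

motorcar = wn.synset('car.n.01')
print(motorcar.definition())                       # the gloss of the synset
print(motorcar.hypernyms())                        # more general concepts
print(motorcar.hyponyms()[:3])                     # more specific concepts (e.g. hatchback)
print(motorcar.root_hypernyms())                   # the unique beginner, entity.n.01
print(wn.lemma('supply.n.02.supply').antonyms())   # antonymy holds between lemmas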
2. Errata in the Printed Book:
P19.
In the [Your Turn] box:
Try the previous frequency distribution example with text2. ... If you get the error message NameError: name 'FreqDist' is not defined, you need to start your work with **from nltk.book import ***.
Should be corrected to:
Try the previous frequency distribution example with text2. ... If you get the error message NameError: name 'FreqDist' is not defined, you need to start your work with **from nltk import ***.
Reason: FreqDist( ) is not a function defined in nltk.book.
P48.
In the code block of the [Inaugural Address Corpus] section:
>>> cfd = nltk.ConditionalFreqDist(
... (target, file[:4])
... for fileid in inaugural.fileids()
Should be corrected to:
>>> cfd = nltk.ConditionalFreqDist(
... (target, fileid[:4])
... for fileid in inaugural.fileids()
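For context, the complete corrected block in that section reads approximately as follows (quoted from memory, so treat it as a sketch rather than an exact transcription):

>>> cfd = nltk.ConditionalFreqDist(
...           (target, fileid[:4])
...           for fileid in inaugural.fileids()
...           for w in inaugural.words(fileid)
...           for target in ['america', 'citizen']
...           if w.lower().startswith(target))
>>> cfd.plot()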
3. Practice:
6.○ In the discussion of comparative wordlists, we created an object called translate, which you could look up using words in both German and Italian in order to get corresponding words in English. What problem might arise with this approach? Can you suggest a way to avoid this problem?
- Looking up a word that does not exist (or a word from a language that was never merged into the dictionary via translate.update(dict(xx))) raises a KeyError. One way to avoid this is to add error handling, e.g., look words up with dict.get and a default value, as sketched below.
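A minimal sketch of that fix, building the translate dictionary from the swadesh wordlists as in the book and using dict.get so a missing word returns a default instead of raising KeyError:

from nltk.corpus import swadesh

translate = dict(swadesh.entries(['fr', 'en']))          # French -> English, as in the book
translate.update(dict(swadesh.entries(['de', 'en'])))    # add German -> English
translate.update(dict(swadesh.entries(['it', 'en'])))    # add Italian -> English

print(translate.get('Hund', 'unknown word'))    # 'dog'
print(translate.get('xyzzy', 'unknown word'))   # default value, no KeyError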
8.◑ Define a conditional frequency distribution over the Names Corpus that allows you to see which initial letters are more frequent for males versus females (see Figure 2-7).
import nltk
from nltk.corpus import names
cfd = nltk.ConditionalFreqDist(
    (fileid, name[0])                 # name[0] is the initial letter ('male.txt' / 'female.txt' are the conditions)
    for fileid in names.fileids()
    for name in names.words(fileid))
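Plotting the distribution then shows which initial letters are more frequent for males versus females, analogous to the book's Figure 2-7:

cfd.plot()    # one line per fileid ('male.txt', 'female.txt'); x-axis: initial letter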
14.◑ Define a function supergloss(s) that takes a synset s as its argument and returns a string consisting of the concatenation of the definition of s, and the definitions of all the hypernyms and hyponyms of s.
from nltk.corpus import wordnet as wn

def supergloss(s):
    # s is already a Synset object, so there is no need to look it up again
    glosses = [s.definition()]                            # definition of s itself
    glosses += [h.definition() for h in s.hypernyms()]    # definitions of all hypernyms
    glosses += [h.definition() for h in s.hyponyms()]     # definitions of all hyponyms
    return '\n'.join(glosses)
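Example call, using an arbitrarily chosen synset and the wn alias imported above:

print(supergloss(wn.synset('car.n.01')))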
17.◑ Write a function that finds the 50 most frequently occurring words of a text that are not stopwords.
import nltk
from nltk import FreqDist

def most_fifty_words(text):
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.isalpha() and w.lower() not in stopwords]
    fdist = FreqDist(content)
    # fdist.keys() is no longer frequency-sorted in NLTK 3; use most_common() instead
    return [word for word, count in fdist.most_common(50)]
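Example call on text1 from nltk.book (the nltk.book texts and the stopwords corpus must be downloaded first):

from nltk.book import text1
print(most_fifty_words(text1))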
4. Open Questions:
- None for now.