Comprehensive project: word frequency counting
The original code in the book is:
```python
import string

@count_time  # count_time is the timing decorator defined earlier in the book
def test1():
    with open('Walden.txt', 'r') as text:
        # Split on whitespace, strip punctuation, and lowercase every word
        words = [raw_word.strip(string.punctuation).lower()
                 for raw_word in text.read().split()]
    words_set = set(words)
    counts_dict = {word: words.count(word) for word in words_set}
    print(len(counts_dict))
```
This takes roughly 23 seconds.
The improved code is:
```python
@count_time
def test2():
    """Performance-optimized version of test1."""
    counts_dict = {}
    with open('Walden.txt', 'r') as text:
        words = [word.strip(string.punctuation).lower()
                 for raw_word in text
                 for word in raw_word.split()]
    for word_2 in words:
        if word_2 in counts_dict:
            counts_dict[word_2] += 1
        else:
            counts_dict[word_2] = 1
    print(len(counts_dict))
```
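The if/else counting loop above can also be written with `dict.get`, which supplies a default of 0 for words not seen yet. A minimal sketch (the `count_words` helper and the sample lines are my own, not from the book):

```python
import string

def count_words(lines):
    """Count word frequencies across an iterable of text lines."""
    counts = {}
    for line in lines:
        for raw in line.split():
            word = raw.strip(string.punctuation).lower()
            # get() returns 0 for a word we haven't counted yet
            counts[word] = counts.get(word, 0) + 1
    return counts

print(count_words(["The pond, the pond!", "Walden pond."]))
# → {'the': 2, 'pond': 3, 'walden': 1}
```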
A more Pythonic version:
```python
import collections

@count_time
def test3():
    """An even more Pythonic version of test2."""
    with open('Walden.txt', 'r') as text:
        words = [word.strip(string.punctuation).lower()
                 for raw_word in text
                 for word in raw_word.split()]
    counts_dict = collections.Counter(words)
    print(len(counts_dict))
```
The timings are as follows:
![image.png](http://upload-images.jianshu.io/upload_images/4131789-018d62d69b23f23b.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)![image.png](http://upload-images.jianshu.io/upload_images/4131789-6e8df883acae84ce.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
test1() takes more than two hundred times as long as test2() and test3().
Most of test1's time is spent on this line:

```python
counts_dict = {word: words.count(word) for word in words_set}
```

Each call to `words.count(word)` scans the entire `words` list from the beginning. With a large list this is extremely slow: roughly O(n) per distinct word, or O(n·m) overall for n words and m distinct words.
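A small self-contained benchmark (synthetic word list, my own construction, not from the book) makes the repeated-scan cost visible and confirms both approaches agree:

```python
import collections
import timeit

words = ['walden', 'pond', 'woods', 'thoreau'] * 500  # 2000 words, 4 distinct

def count_with_list_count():
    # rescans the whole list once per distinct word
    return {w: words.count(w) for w in set(words)}

def count_with_counter():
    # a single pass over the list
    return dict(collections.Counter(words))

assert count_with_list_count() == count_with_counter()
print(timeit.timeit(count_with_list_count, number=200))
print(timeit.timeit(count_with_counter, number=200))
```

With only four distinct words the gap is modest; on a real text with thousands of distinct words, the `count()` version degrades by orders of magnitude, which matches the measurements above.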
Avoid doing frequent lookups with `count()`. Instead, use:

```python
for word_2 in words:
    if word_2 in counts_dict:
        counts_dict[word_2] += 1
    else:
        counts_dict[word_2] = 1
```
Or the more Pythonic version:

```python
counts_dict = collections.Counter(words)
```
In timing tests these two differ by only about 0.2 seconds.
Counter is a class in the collections module. Used as a container, a Counter keeps track of how many times each value occurs.
It can be initialized from a sequence, a dict, or keyword arguments:
```python
collections.Counter(['a', 'b', 'c', 'a', 'b', 'b'])
collections.Counter({'a': 2, 'b': 3, 'c': 1})
collections.Counter(a=2, b=3, c=1)
```
All three return a dict-like object whose keys are the elements and whose values are their counts.
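A quick demonstration of that dict-like behavior (a minimal sketch):

```python
import collections

c = collections.Counter(['a', 'b', 'c', 'a', 'b', 'b'])
print(c['b'])            # → 3; counts are read like dict values
print(c['missing'])      # → 0; a missing key yields 0 instead of raising KeyError
print(c.most_common(2))  # → [('b', 3), ('a', 2)]
```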
Initializing a Counter from a dict seems of little use, since a dict's keys are already unique and its values are simply taken as the initial counts.