1. Problem
Problem 0004: Given any plain-text file in English, count how many times each word appears in it.
2. Result
#------1.txt-----------
There are moments in life when you miss only
one life and one chance to do
you want to do.is
isn't don't word_d common
#------Output------------
do: 2
word_d: 1
want: 1
to: 2
is: 1
you: 2
isn't: 1
don't: 1
...
- Convert all words to lowercase before counting
- isn't and word_d should each be counted as a single word
3. Implementation
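# -*- coding:utf-8 -*-
import re


def get_word_dict(file_path=None):
    """Count the words in a text file and return a {word: count} dict."""
    if file_path is None:
        print("Error: file_path is required")
        return None
    word_dict = {}
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            # Lowercase the line, then match runs of letters, apostrophes,
            # underscores and hyphens that end at a word boundary.
            words = re.findall(r"[a-z\'_-]+\b", line.lower())
            for word in words:
                if word not in word_dict:
                    word_dict[word] = 1
                else:
                    word_dict[word] += 1
    # print() already appends a newline, so no extra "\n" is needed.
    for word, count in word_dict.items():
        print("%s: %d" % (word, count))
    return word_dict


if __name__ == "__main__":
    get_word_dict("1.txt")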
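
The counting loop can also be written with collections.Counter from the standard library. The following is only a minimal sketch of that alternative, not the implementation used in this post: the function name count_words and the sample string are assumptions made for illustration, and the regular expression is copied from the code above.

```python
import re
from collections import Counter


def count_words(text):
    """Count lowercase words in a string (hypothetical helper)."""
    # Same pattern as the implementation: letters, apostrophes,
    # underscores and hyphens, ending at a word boundary.
    return Counter(re.findall(r"[a-z\'_-]+\b", text.lower()))


# Illustrative sample text; not read from 1.txt.
sample = "one life and one chance to do\nyou want to do.is\n"
counts = count_words(sample)
for word, count in counts.items():
    print("%s: %d" % (word, count))
```

A Counter returns 0 for words it has never seen, so no membership check is needed before incrementing.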
4. Problems Solved
<i>I. Words such as isn't were not recognized</i>
Add a \b to the regular expression so that each match ends at a word boundary.
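
To see how the pattern behaves, it can be tried on a few short strings. This is a small sketch rather than part of the original solution: the helper name find_words and the third sample string are assumptions chosen for illustration, while the pattern itself is the one from the implementation.

```python
import re

# Pattern from the implementation: letters, apostrophes, underscores
# and hyphens, with the match required to end at a word boundary.
WORD_PATTERN = re.compile(r"[a-z\'_-]+\b")


def find_words(line):
    """Return the lowercase words found in one line (hypothetical helper)."""
    return WORD_PATTERN.findall(line.lower())


print(find_words("you want to do.is"))
print(find_words("isn't don't word_d common"))
print(find_words("the dogs' bowl"))
```

In the last call the trailing apostrophe is not followed by a word character, so the \b forces the match to stop at dogs.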
<i>II. An encoding error occurred when reading the file</i>
Pass the encoding parameter to the open() function.
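
When encoding is omitted, open() uses a platform-dependent default, which on some systems is not UTF-8. The sketch below shows the fix in isolation, assuming the input file is UTF-8. The helper read_text, the temporary file name and the sample sentence are all made up for this example.

```python
import os
import tempfile


def read_text(file_path, encoding="utf-8"):
    """Read a whole text file with an explicit encoding (hypothetical helper)."""
    with open(file_path, "r", encoding=encoding) as file:
        return file.read()


# Write a small UTF-8 sample file containing a non-ASCII character,
# then read it back with the same explicit encoding.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as file:
        file.write("café isn't common\n")
    text = read_text(sample_path)

print(text)
```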