Chapter 8: Text Data

Chapter Overview

1. The str Object

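This section introduces the str object. A str object can be indexed with [], and the result differs between the string dtype and the object dtype: with the object dtype each element is indexed as it is (so a list or dict element is indexed directly), whereas with the string dtype each element is treated as text before indexing.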
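
As a minimal sketch of this difference (the sample Series is made up for illustration and is not part of the original notes), the same positional index can be applied to an object Series and to its string-typed copy:

```python
import pandas as pd

# Illustrative data: a plain string, a list, and a float stored as objects.
obj = pd.Series(['abc', ['x', 'y'], 0.5], dtype=object)

# object dtype: each element is indexed directly; 0.5 cannot be indexed, so it becomes NaN.
by_object = obj.str[1]

# string dtype: every element is converted to text first, then indexed.
by_string = obj.astype('string').str[1]

print(by_object.tolist())
print(by_string.tolist())
```

The list element yields its second item under the object dtype, but the second character of its text representation under the string dtype.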

2. Regular Expression Basics

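The contents of the table should be memorized. A regular expression describes a pattern that strings may match, so when writing one, make sure it describes exactly the pattern of the strings you want to match.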
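
A few small cases with Python's built-in re module may help illustrate what "matching a pattern" means. The sample strings are chosen only for illustration:

```python
import re

# A literal pattern matches every occurrence of that exact text.
literal = re.findall(r'Apple', 'Apple! This Is an Apple!')

# A character class matches any single listed character.
char_class = re.findall(r'[ac]', 'abc')

# The quantifier {2,4} matches between two and four repetitions, greedily.
digits = re.findall(r'\d{2,4}', '1 22 333 55555')

print(literal)     # ['Apple', 'Apple']
print(char_class)  # ['a', 'c']
print(digits)      # ['22', '333', '5555']
```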

3. Five Types of Text Operations

Splitting
Use the split function to split strings. Note that the separator you specify is discarded from the result.

import pandas as pd
s = pd.Series(['上海市黄浦区方浜中路249号', '上海市宝山区密山路5号'])
s.str.split('[市区路]')

Joining

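The join function concatenates the elements of each list of strings using the given separator. If a list contains any non-string element, the result for that row is a missing value.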

s = pd.Series([['a','b'], [1, 'a'], [['a', 'b'], 'c']])
s.str.join('-')

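cat concatenates two Series element by element. The join parameter controls how their indexes are aligned and can be left, right, outer, or inner; sep is the separator and na_rep is the text substituted for missing values.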

s1 = pd.Series(['a','b'])
s2 = pd.Series(['cat','dog'])
s2.index = [1, 2]
s1.str.cat(s2, sep='-', na_rep='?', join='outer')
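
To make the effect of the join parameter concrete, the following sketch reuses the two Series above and compares three alignment modes:

```python
import pandas as pd

s1 = pd.Series(['a', 'b'])                    # index 0, 1
s2 = pd.Series(['cat', 'dog'], index=[1, 2])  # index 1, 2

# outer: union of both indexes; missing sides are filled with na_rep.
outer = s1.str.cat(s2, sep='-', na_rep='?', join='outer')

# left: keep only the index of s1.
left = s1.str.cat(s2, sep='-', na_rep='?', join='left')

# right: keep only the index of s2.
right = s1.str.cat(s2, sep='-', na_rep='?', join='right')

print(outer.tolist())  # ['a-?', 'b-cat', '?-dog']
print(left.tolist())   # ['a-?', 'b-cat']
print(right.tolist())  # ['b-cat', '?-dog']
```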

Matching

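contains and match both support regular expressions: contains looks for the pattern anywhere in each string, whereas match requires the pattern to match from the beginning of the string. startswith and endswith do not support regular expressions and compare literal text only.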
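
A short sketch of the four functions on made-up sample strings. Note how startswith treats 'm|h' as literal text rather than as a pattern:

```python
import pandas as pd

s = pd.Series(['my cat', 'he is fat', 'railway station'])

# contains: regex searched anywhere in the string.
anywhere = s.str.contains(r'\s\wat')

# match: regex must match starting at the first character.
from_start = s.str.match(r'm|h')

# startswith / endswith: plain text comparison, no regex.
literal_prefix = s.str.startswith('m|h')
plain_prefix = s.str.startswith('my')
plain_suffix = s.str.endswith('t')

print(anywhere.tolist())        # [True, True, False]
print(from_start.tolist())      # [True, True, False]
print(literal_prefix.tolist())  # [False, False, False]
print(plain_prefix.tolist())    # [True, False, False]
print(plain_suffix.tolist())    # [True, True, False]
```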

Replacing

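When different parts of a match need different replacements, use sub-groups and pass a custom replacement function to replace. The function receives the match object, and group(k) returns the k-th sub-group, that is, the text captured by the k-th pair of parentheses.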

s = pd.Series(['上海市黄浦区方浜中路249号',
                '上海市宝山区密山路5号',
                '北京市昌平区北农路2号'])
pat = r'(\w+市)(\w+区)(\w+路)(\d+号)'
city = {'上海市': 'Shanghai', '北京市': 'Beijing'}
district = {'昌平区': 'CP District',
            '黄浦区': 'HP District',
            '宝山区': 'BS District'}
road = {'方浜中路': 'Mid Fangbin Road',
        '密山路': 'Mishan Road',
        '北农路': 'Beinong Road'}
def my_func(m):
    str_city = city[m.group(1)]
    str_district = district[m.group(2)]
    str_road = road[m.group(3)]
    str_no = 'No. ' + m.group(4)[:-1]
    return ' '.join([str_city,
                     str_district,
                     str_road,
                     str_no])
s.str.replace(pat, my_func, regex=True)  # a callable replacement requires regex=True

Sub-groups can also be given names by using the named-group syntax, as follows:

pat = '(?P<group_name>repr)'
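
As a rough sketch of how named sub-groups can be used inside a replacement function, the example below looks groups up by name instead of by position. The shortened addresses, the group names city and district, and the helper swap are all illustrative assumptions:

```python
import pandas as pd

s = pd.Series(['上海市黄浦区', '北京市昌平区'])

# Two named sub-groups: the names 'city' and 'district' are chosen for illustration.
pat = r'(?P<city>\w+市)(?P<district>\w+区)'
city = {'上海市': 'Shanghai', '北京市': 'Beijing'}

def swap(m):
    # m.group('name') returns the text captured by the named sub-group.
    return m.group('district') + ' / ' + city[m.group('city')]

result = s.str.replace(pat, swap, regex=True)
print(result.tolist())  # ['黄浦区 / Shanghai', '昌平区 / Beijing']
```

Named groups keep the function readable when the pattern later gains or loses a group, because positions no longer matter.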

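Extracting
extract returns only the first match in each string, whereas extractall returns every match, one row per match; findall collects all matches of each string into a list.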

pat = r'(?P<市名>\w+市)(?P<区名>\w+区)(?P<路名>\w+路)(?P<编号>\d+号)'
s.str.extract(pat)
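
The difference between the three functions is easiest to see on a string that contains more than one match. This is a minimal sketch on made-up data:

```python
import pandas as pd

s = pd.Series(['a1b2', 'c3'])
pat = r'([a-z])(\d)'

# extract: one row per input string, first match only.
first = s.str.extract(pat)

# extractall: one row per match, indexed by (original index, match number).
every = s.str.extractall(pat)

# findall: one list per input string, each match as a tuple of sub-groups.
as_lists = s.str.findall(pat)

print(first.shape)        # (2, 2)
print(every.shape)        # (3, 2)
print(as_lists.tolist())  # [[('a', '1'), ('b', '2')], [('c', '3')]]
```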

4. Common String Functions

Letter functions
These functions change the case of letters.
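
A brief sketch of the usual case-conversion functions on made-up sample strings:

```python
import pandas as pd

s = pd.Series(['lower', 'CAPITALS', 'this is a sentence', 'SwApCaSe'])

upper = s.str.upper()            # every letter upper case
lower = s.str.lower()            # every letter lower case
title = s.str.title()            # first letter of each word upper case
capitalized = s.str.capitalize() # first letter of the string upper case
swapped = s.str.swapcase()       # upper and lower case exchanged

print(title.tolist())        # ['Lower', 'Capitals', 'This Is A Sentence', 'Swapcase']
print(capitalized.tolist())  # ['Lower', 'Capitals', 'This is a sentence', 'Swapcase']
print(swapped.tolist())      # ['LOWER', 'capitals', 'THIS IS A SENTENCE', 'sWaPcAsE']
```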

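Numeric functions
pd.to_numeric converts strings to numeric values. The errors parameter accepts three options: raise (raise an exception on invalid input), coerce (turn invalid input into missing values), and ignore (return the input unchanged; deprecated in recent pandas releases). The downcast parameter selects the target type: integer, signed, unsigned, or float.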
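
A minimal sketch of errors='coerce' and downcast on made-up input. Coercing is also a convenient way to locate the entries that failed to convert:

```python
import pandas as pd

s = pd.Series(['1', '2.2', '2e', '??', '-2.1', '0'])

# Invalid entries become NaN instead of raising an exception.
numbers = pd.to_numeric(s, errors='coerce')

# The NaN positions point back at the strings that could not be converted.
bad = s[numbers.isna()]

# downcast='integer' picks the smallest signed integer type that fits.
small = pd.to_numeric(pd.Series(['1', '2', '3']), downcast='integer')

print(bad.tolist())  # ['2e', '??']
print(small.dtype)   # int8
```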

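Statistical functions
count returns the number of regular expression matches in each string, and len returns the length of each string.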

s = pd.Series(['cat rat fat at', 'get feed sheet heat'])
s.str.count('[rf]at|ee')  # inside [], | is a literal character, so [rf] is the intended class
s.str.len()

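Format functions
Format functions fall into two categories: stripping functions and padding functions. The first category has three functions, strip, rstrip, and lstrip, which remove whitespace from both sides, from the right side, and from the left side respectively. They are useful in data cleaning, especially when column names contain stray spaces.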
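
The sketch below shows the three stripping functions on a made-up index of column names, followed by two padding functions, pad and zfill, as examples of the second category:

```python
import pandas as pd

# Column names with stray spaces, as often seen after reading a raw file.
cols = pd.Index([' col1', 'col2 ', ' col3 '])

both = cols.str.strip()         # remove spaces on both sides
right_only = cols.str.rstrip()  # remove spaces on the right side
left_only = cols.str.lstrip()   # remove spaces on the left side

s = pd.Series(['7', '155'])
padded = s.str.pad(5, side='left', fillchar='*')  # fill on the left up to width 5
zero_filled = s.str.zfill(5)                      # fill with zeros on the left

print(both.tolist())         # ['col1', 'col2', 'col3']
print(padded.tolist())       # ['****7', '**155']
print(zero_filled.tolist())  # ['00007', '00155']
```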

5. Exercises

Ex1: House Information Dataset

We are given a house information dataset whose columns include floor, year, area, and price.

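Store the year column as integer years.
Replace the floor column with two columns, Level and Highest, holding the floor category (高层, 中层, 低层: high, middle, low) as string type and the total number of floors as integer type.
Compute the average price per square meter, avg_price, and store it in the table in the format ***元/平米 (yuan per square meter), where *** is an integer.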

# Store the year column as integer years
df.dropna(how='any', inplace=True)  # drop rows that contain missing values
df['year'] = pd.to_numeric(df['year'].str.extract(r'(?P<year>\d{4})').year, errors='coerce', downcast='integer')
df.head(3)

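# Replace floor with the two columns Level and Highest
floor = df.floor.str.extract(r'(?P<Level>[^0-9]+层).*共(?P<Highest>\d+)层')
# The task asks for Level to be stored with the string dtype
floor.Level = floor.Level.astype('string')
# Convert Highest to an integer type
floor.Highest = pd.to_numeric(floor.Highest, errors='coerce', downcast='integer')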
df = pd.concat([floor, df],axis=1)
df.drop(columns=['floor'], inplace=True)
df.head(3)

# Compute the average price per square meter
area = pd.to_numeric(df.area.str.extract(r'(?P<area>\d+\.?\d*)').area, errors='coerce', downcast='float')
price = pd.to_numeric(df.price.str.extract(r'(?P<price>\d+)').price, errors='coerce', downcast='float')
avg_price = price / area * 10000  # the factor 10000 treats price as being in units of ten thousand yuan
avg_price.name = 'avg_price'
avg_price = avg_price.astype('int').astype('string') + '元/平米'
df = pd.concat([df, avg_price], axis=1)
df.head(3)

Ex2: Game of Thrones Script Dataset

We are given a Game of Thrones script dataset whose columns include Season, Episode, and Sentence.

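Count the number of lines in each Episode.
Using a space as the word separator, find the five people with the highest average number of words per line.
If a person's line contains question marks, the next person to speak is regarded as the answerer. If the previous person's line contains $n$ question marks, the answerer is considered to have answered $n$ questions. Find the five people who answered the most questions.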

# Count the number of lines per episode
df.columns = df.columns.str.strip()
df.groupby(['Season', 'Episode'])['Sentence'].count().head()
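
The other two tasks could be approached along the following lines. This is only a sketch on a tiny made-up table: the speaker column name Name and all sample rows are assumptions, since the real columns are not shown here, and the toy frame reuses the name df purely for illustration.

```python
import pandas as pd

# Toy stand-in for the script data; the 'Name' column is a hypothetical name.
df = pd.DataFrame({
    'Name': ['a', 'b', 'a', 'c', 'b'],
    'Sentence': ['Who are you?', 'A friend', 'Why? Since when?',
                 'Long ago', 'Are you sure?'],
})

# Task 2: words per line (split on single spaces), averaged per speaker.
words = df['Sentence'].str.split(' ').str.len()
avg_words = words.groupby(df['Name']).mean().sort_values(ascending=False)

# Task 3: question marks in the previous line are credited to the current speaker.
asked = df['Sentence'].str.count(r'\?')
answered = (asked.shift(1, fill_value=0)
                 .groupby(df['Name']).sum()
                 .sort_values(ascending=False))

print(avg_words.head(5))
print(answered.head(5))
```

On the real data, head(5) on each sorted result would give the requested top five. Whether a speaker who follows their own question should count as an answerer is a judgment call that this sketch does not address.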