第06章数据加载、存储与文件格式

6.1 读写文本格式的数据

image.png

打开示例文件ex1.cv（有标题）

In [1]: !cat ex1.csv#！表示使用系统命令
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

使用read_csv

参数：

index_col : int, sequence or bool, optional Column to use as the row labels of the DataFrame. If a sequence is given, a MultiIndex is used. If you have a malformed file with delimiters at the end of each line, you might consider index_col=False to force pandas to not use the first column as the index (row names).

传入整数：表示将序号对应的列作为索引列；（行索引）
传入序列：表示将序列中的列作为索引列，产生多级索引
传入False/None:此时自动生成自然数索引列

parse_dates:
parse_dates : bool or list of int or names or list of lists or dict, default False
The behavior is as follows:

boolean. If True -> try parsing the index.
list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
dict, e.g. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’

将指定列解析为时间。

*pandas.read_table
参数header：默认为0即将第0行作为列名。通常设置header为None，表示不要将第一行解析为列名。
参数sep:每一行各个数据的分隔符

In [9]: df2=pd.read_table('ex1.csv',sep=',')

image.png

对于没有标题行的文件，read函数会为其添加默认行索引，第一行作为列索引

!cat 'examples/ex2.csv'
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

设置默认标题行

df=pd.read_csv('examples/ex2.csv',header=None)
df

image.png

自己定义列名

df=pd.read_csv('examples/ex2.csv',names=['one','two','three','four','message'])

image.png

指定message列为索引列

df=pd.read_csv('examples/ex2.csv',names=['one','two','three','four','message'],index_col='message')

image.png

多级索引

!cat examples/csv_mindex.csv

image.png

df=pd.read_csv('examples/csv_mindex.csv',index_col=['key1','key2'])
df

image.png

有些表格可能不是用固定的分隔符去分隔字段的
比如下面这个就是由不同数量空白符分隔的，此时就不能用之前的方法直接读取

In [10]: !cat 'examples/ex3.txt'
            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491

方法是传递一个正则表达式‘\s’作为sep的参数值

In [34]: df=pd.read_csv('examples/ex3.txt',sep='\s+')

In [35]: df
Out[35]:
            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491

使用skiprows参数跳过指定的行

In [36]: !cat 'examples/ex4.csv'
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
In [37]: df=pd.read_csv('examples/ex4.csv',skiprows=[0,2,3])

In [38]: df
Out[38]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

isnull方法可以检验NA及NULL

In [41]: df
Out[41]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo

In [42]: pd.isnull(df)
Out[42]:
   something      a      b      c      d  message
0      False  False  False  False  False     True
1      False  False  False   True  False    False
2      False  False  False  False  False    False

设置缺省值列表，read系列函数会将列表中的读取为缺省值

result = pd.read_csv('examples/ex5.csv', na_values=['NULL','WHAT'])

image.png

逐块读取文本文件

读取5行数据

In [15]:  result = pd.read_csv('examples/ex6.csv',nrows=5)

按块读取数据

In [33]: chunker=pd.read_csv('examples/ex6.csv',chunksize=1000)
In [35]: chunker
Out[35]: <pandas.io.parsers.TextFileReader at 0x2cf5e55c860>

这里的TextFileReader是一个迭代器，每次迭代取出1000行（chunksize）
每1000行统计key列各个字母出现的次数并加到一个Series对象中：


In [52]: for piece in chunker:
    ...:     se=se.add(piece['key'].value_counts(),fill_value=0)

In [55]: se=se.sort_values(ascending=False)

In [57]: se[:10]
Out[57]:
E    336.0
X    327.0
L    315.0
M    309.0
K    303.0
Q    301.0
P    299.0
O    299.0
J    298.0
F    295.0
dtype: float64

将数据写出到文本格式


In [58]: data = pd.read_csv('examples/ex5.csv')

In [59]: data.to_csv('examples/out.csv')

缺失值在输出结果中会被表示为空字符串。使用参数na_rep改变默认值

In [60]: data.to_csv('examples/out.csv',na_rep='没有值')

禁用行标和列标

data.to_csv(sys.stdout, index=False, header=False)

Series也有一个to_csv方法：

In [71]: se
Out[71]:
0    4.0
1    8.0
2    NaN
Name: d, dtype: float64

In [72]: se.to_csv(sys.stdout)
0,4.0
1,8.0
2,

得到日期索引函数

dates=pd.date_range('1/1/1997',periods=10)
In [84]: dates
Out[84]:
DatetimeIndex(['1997-01-01', '1997-01-02', '1997-01-03', '1997-01-04',
               '1997-01-05', '1997-01-06', '1997-01-07', '1997-01-08',
               '1997-01-09', '1997-01-10'],
              dtype='datetime64[ns]', freq='D')

处理分隔符格式

fp=open('examples/ex7.csv')
li=list(csv.reader(fp))
header,values=li[0],li[1:]

In [53]: di={k:v for k,v in zip(header,zip(*values))}

In [54]: di
Out[54]: {'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}

定义自定义csv格式类

In [78]: class my_dialect(csv.Dialect):
    ...:     lineterminator='\n'
    ...:     delimiter=';'
    ...:     quotechar='"'
    ...:     quoting=csv.QUOTE_MINIMAL
    ...:

In [79]: reader=csv.reader(fp,dialect=my_dialect)

或者直接改变格式，而无须定义类

reader = csv.reader(f, delimiter='|')

image.png

**
DataFrame对象转json：

In [116]: df.to_json()
Out[116]: '{"age":{"0":30.0,"1":null},"hobby":{"0":null,"1":"sports"},"name":{"0":"anchor","1":null},"sex":{"0":null,"1":"male"}}'

read_html函数：自动解析html和xml中的表格


In [143]: tables = pd.read_html('examples/fdic_failed_bank_list.html')

In [144]: len(tables)
Out[144]: 1

In [145]: failures=tables[0]

调用head方法看列表大概

In [146]: failures.head()
Out[146]:
                      Bank Name             City        ...                Closing Date       Updated Date
0                   Allied Bank         Mulberry        ...          September 23, 2016  November 17, 2016
1  The Woodbury Banking Company         Woodbury        ...             August 19, 2016  November 17, 2016
2        First CornerStone Bank  King of Prussia        ...                 May 6, 2016  September 6, 2016
3            Trust Company Bank          Memphis        ...              April 29, 2016  September 6, 2016
4    North Milwaukee State Bank        Milwaukee        ...              March 11, 2016      June 16, 2016

[5 rows x 7 columns]

获取时间

In [147]:  close_timestamps = pd.to_datetime(failures['Closing Date'])

利用lxml.objectify解析XML

from lxml import objectify

path = 'datasets/mta_perf/Performance_MNR.xml'
parsed = objectify.parse(open(path))
root = parsed.getroot()

观察xml的结构以便对它进行解析

image.png

解析:每条记录填充为一个字典并存入列表

data=[]
for elt in root.INDICATOR:
    el_data={}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag]=child.pyval
    data.append(el_data)

用data构造DataFrame：

In [176]: perf=pd.DataFrame(data)
     ...: perf.head()
     ...:
     ...:
Out[176]:
            AGENCY_NAME            CATEGORY    ...     YTD_ACTUAL YTD_TARGET
0  Metro-North Railroad  Service Indicators    ...           96.9         95
1  Metro-North Railroad  Service Indicators    ...             96         95
2  Metro-North Railroad  Service Indicators    ...           96.3         95
3  Metro-North Railroad  Service Indicators    ...           96.8         95
4  Metro-North Railroad  Service Indicators    ...           96.6         95

[5 rows x 12 columns]

6.2 二进制数据格式

HDF5

Hierarchical Data Format (HDF) 是一种用来存储和组织大量数据的文件格式集合（HDF4,HDF5）。

HDF发展于国家超级计算应用中心，由HDF组织所支持，一个非盈利公司。HDF组织旨在确保HDF技术的继续发展和HDF数据存储的继续可用性。
HDF5支持许多商业和非商业的软件平台，包括Java,MATLAB,Scliab,Octave,Mathematica,IDL,Python,R等。
HDF5是HDF的当前版本，其在设计上有显著不同，api来自主要遗留版本HDF4
----维基百科

HDF5库的基本使用

frame=pd.DataFrame({'a':np.random.randn(100)})
store=pd.HDFStore('my_store.h5')#创建一个名为“my_store”的hdf5库
store.put('obj1',frame,format='table')#增
store.select('obj2', where=['index >= 10 and index <= 15'])#查
pd.read_hdf('mydata.h5', 'obj3', where=['index < 5'])#使用库

**ps:HDF5不是数据库，是数据集

读取Microsoft Excel文件

//todo

6.4 数据库交互

由于DataFrame对象和关系型数据库中的数据格式很像，因而用pandas操作数据库也很见简单。
连接数据库（‘mydata.sqlite’）并创建一个空表（'test'）
import sqllite3
import sqlite3
query="""
CREATE TABLE test
(a VARCHAR(20),b VARCHAR(20),
c REAL, d INTEGER
);"""
con=sqlite3.connect('mydata.sqlite')
con.execute(query)
con.commit()

** 插入数据**

data = [('Atlanta', 'Georgia', 1.25, 6),
.....: ('Tallahassee', 'Florida', 2.6, 3),
.....: ('Sacramento', 'California', 1.7, 5)]
stmt="INSERT INTO test VALUES(?,?,?,?)"
con.executemany(stmt,data)

**读取数据**

cursor=con.execute('select * from test')
rows=cursor.fetchall()

**再将row's转换为DataFrame对象，同时添加列名**

但这样从数据库读取数据——》将数据转换为DataFrame对象过于繁杂
**可以利用SQLAlchemy直接读取数据为DataFrame对象：**

db=sqla.create_engine('sqlite:///mydata.sqlite')
pd.read_sql('select * from test',db)

第06章 数据加载、存储与文件格式

6.1 读写文本格式的数据

使用read_csv

逐块读取文本文件

将数据写出到文本格式

利用lxml.objectify解析XML

6.2 二进制数据格式

读取Microsoft Excel文件

6.4 数据库交互

第06章数据加载、存储与文件格式