6.1 读写文本格式的数据
打开示例文件ex1.cv(有标题)
In [1]: !cat ex1.csv#!表示使用系统命令
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
使用read_csv
参数:
index_col : int, sequence or bool, optional Column to use as the row labels of the DataFrame. If a sequence is given, a MultiIndex is used. If you have a malformed file with delimiters at the end of each line, you might consider index_col=False to force pandas to not use the first column as the index (row names).
传入整数:表示将序号对应的列作为索引列;(行索引)
传入序列:表示将序列中的列作为索引列,产生多级索引
传入False/None:此时自动生成自然数索引列
parse_dates:
parse_dates : bool or list of int or names or list of lists or dict, default False
The behavior is as follows:
boolean. If True -> try parsing the index.
list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
dict, e.g. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’
将指定列解析为时间。
*pandas.read_table
参数header:默认为0即将第0行作为列名。通常设置header为None,表示不要将第一行解析为列名。
参数sep:每一行各个数据的分隔符
In [9]: df2=pd.read_table('ex1.csv',sep=',')
对于没有标题行的文件,read函数会为其添加默认行索引,第一行作为列索引
!cat 'examples/ex2.csv'
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
设置默认标题行
df=pd.read_csv('examples/ex2.csv',header=None)
df
自己定义列名
df=pd.read_csv('examples/ex2.csv',names=['one','two','three','four','message'])
指定message列为索引列
df=pd.read_csv('examples/ex2.csv',names=['one','two','three','four','message'],index_col='message')
多级索引
!cat examples/csv_mindex.csv
df=pd.read_csv('examples/csv_mindex.csv',index_col=['key1','key2'])
df
有些表格可能不是用固定的分隔符去分隔字段的
比如下面这个就是由不同数量空白符分隔的,此时就不能用之前的方法直接读取
In [10]: !cat 'examples/ex3.txt'
A B C
aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382 1.100491
方法是传递一个正则表达式‘\s’作为sep的参数值
In [34]: df=pd.read_csv('examples/ex3.txt',sep='\s+')
In [35]: df
Out[35]:
A B C
aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382 1.100491
使用skiprows参数跳过指定的行
In [36]: !cat 'examples/ex4.csv'
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
In [37]: df=pd.read_csv('examples/ex4.csv',skiprows=[0,2,3])
In [38]: df
Out[38]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
isnull方法可以检验NA及NULL
In [41]: df
Out[41]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo
In [42]: pd.isnull(df)
Out[42]:
something a b c d message
0 False False False False False True
1 False False False True False False
2 False False False False False False
设置缺省值列表,read系列函数会将列表中的读取为缺省值
result = pd.read_csv('examples/ex5.csv', na_values=['NULL','WHAT'])
逐块读取文本文件
读取5行数据
In [15]: result = pd.read_csv('examples/ex6.csv',nrows=5)
按块读取数据
In [33]: chunker=pd.read_csv('examples/ex6.csv',chunksize=1000)
In [35]: chunker
Out[35]: <pandas.io.parsers.TextFileReader at 0x2cf5e55c860>
这里的TextFileReader是一个迭代器,每次迭代取出1000行(chunksize)
每1000行统计key列各个字母出现的次数并加到一个Series对象中:
In [52]: for piece in chunker:
...: se=se.add(piece['key'].value_counts(),fill_value=0)
In [55]: se=se.sort_values(ascending=False)
In [57]: se[:10]
Out[57]:
E 336.0
X 327.0
L 315.0
M 309.0
K 303.0
Q 301.0
P 299.0
O 299.0
J 298.0
F 295.0
dtype: float64
将数据写出到文本格式
In [58]: data = pd.read_csv('examples/ex5.csv')
In [59]: data.to_csv('examples/out.csv')
缺失值在输出结果中会被表示为空字符串。使用参数na_rep改变默认值
In [60]: data.to_csv('examples/out.csv',na_rep='没有值')
禁用行标和列标
data.to_csv(sys.stdout, index=False, header=False)
Series也有一个to_csv方法:
In [71]: se
Out[71]:
0 4.0
1 8.0
2 NaN
Name: d, dtype: float64
In [72]: se.to_csv(sys.stdout)
0,4.0
1,8.0
2,
得到日期索引函数
dates=pd.date_range('1/1/1997',periods=10)
In [84]: dates
Out[84]:
DatetimeIndex(['1997-01-01', '1997-01-02', '1997-01-03', '1997-01-04',
'1997-01-05', '1997-01-06', '1997-01-07', '1997-01-08',
'1997-01-09', '1997-01-10'],
dtype='datetime64[ns]', freq='D')
处理分隔符格式
fp=open('examples/ex7.csv')
li=list(csv.reader(fp))
header,values=li[0],li[1:]
In [53]: di={k:v for k,v in zip(header,zip(*values))}
In [54]: di
Out[54]: {'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
定义自定义csv格式类
In [78]: class my_dialect(csv.Dialect):
...: lineterminator='\n'
...: delimiter=';'
...: quotechar='"'
...: quoting=csv.QUOTE_MINIMAL
...:
In [79]: reader=csv.reader(fp,dialect=my_dialect)
或者直接改变格式,而无须定义类
reader = csv.reader(f, delimiter='|')
**
DataFrame对象转json:
In [116]: df.to_json()
Out[116]: '{"age":{"0":30.0,"1":null},"hobby":{"0":null,"1":"sports"},"name":{"0":"anchor","1":null},"sex":{"0":null,"1":"male"}}'
read_html函数:自动解析html和xml中的表格
In [143]: tables = pd.read_html('examples/fdic_failed_bank_list.html')
In [144]: len(tables)
Out[144]: 1
In [145]: failures=tables[0]
调用head方法看列表大概
In [146]: failures.head()
Out[146]:
Bank Name City ... Closing Date Updated Date
0 Allied Bank Mulberry ... September 23, 2016 November 17, 2016
1 The Woodbury Banking Company Woodbury ... August 19, 2016 November 17, 2016
2 First CornerStone Bank King of Prussia ... May 6, 2016 September 6, 2016
3 Trust Company Bank Memphis ... April 29, 2016 September 6, 2016
4 North Milwaukee State Bank Milwaukee ... March 11, 2016 June 16, 2016
[5 rows x 7 columns]
获取时间
In [147]: close_timestamps = pd.to_datetime(failures['Closing Date'])
利用lxml.objectify解析XML
from lxml import objectify
path = 'datasets/mta_perf/Performance_MNR.xml'
parsed = objectify.parse(open(path))
root = parsed.getroot()
观察xml的结构以便对它进行解析
解析:每条记录填充为一个字典并存入列表
data=[]
for elt in root.INDICATOR:
el_data={}
for child in elt.getchildren():
if child.tag in skip_fields:
continue
el_data[child.tag]=child.pyval
data.append(el_data)
用data构造DataFrame:
In [176]: perf=pd.DataFrame(data)
...: perf.head()
...:
...:
Out[176]:
AGENCY_NAME CATEGORY ... YTD_ACTUAL YTD_TARGET
0 Metro-North Railroad Service Indicators ... 96.9 95
1 Metro-North Railroad Service Indicators ... 96 95
2 Metro-North Railroad Service Indicators ... 96.3 95
3 Metro-North Railroad Service Indicators ... 96.8 95
4 Metro-North Railroad Service Indicators ... 96.6 95
[5 rows x 12 columns]
6.2 二进制数据格式
HDF5
Hierarchical Data Format (HDF) 是一种用来存储和组织大量数据的文件格式集合(HDF4,HDF5)。
HDF发展于国家超级计算应用中心,由HDF组织所支持,一个非盈利公司。HDF组织旨在确保HDF技术的继续发展和HDF数据存储的继续可用性。
HDF5支持许多商业和非商业的软件平台,包括Java,MATLAB,Scliab,Octave,Mathematica,IDL,Python,R等。
HDF5是HDF的当前版本,其在设计上有显著不同,api来自主要遗留版本HDF4
----维基百科
HDF5库的基本使用
frame=pd.DataFrame({'a':np.random.randn(100)})
store=pd.HDFStore('my_store.h5')#创建一个名为“my_store”的hdf5库
store.put('obj1',frame,format='table')#增
store.select('obj2', where=['index >= 10 and index <= 15'])#查
pd.read_hdf('mydata.h5', 'obj3', where=['index < 5'])#使用库
**ps:HDF5不是数据库,是数据集
读取Microsoft Excel文件
//todo
6.4 数据库交互
由于DataFrame对象和关系型数据库中的数据格式很像,因而用pandas操作数据库也很见简单。
连接数据库(‘mydata.sqlite’)并创建一个空表('test')
import sqllite3
import sqlite3
query="""
CREATE TABLE test
(a VARCHAR(20),b VARCHAR(20),
c REAL, d INTEGER
);"""
con=sqlite3.connect('mydata.sqlite')
con.execute(query)
con.commit()
** 插入数据**
data = [('Atlanta', 'Georgia', 1.25, 6),
.....: ('Tallahassee', 'Florida', 2.6, 3),
.....: ('Sacramento', 'California', 1.7, 5)]
stmt="INSERT INTO test VALUES(?,?,?,?)"
con.executemany(stmt,data)
**读取数据**
cursor=con.execute('select * from test')
rows=cursor.fetchall()
**再将row's转换为DataFrame对象,同时添加列名**
但这样从数据库读取数据——》将数据转换为DataFrame对象过于繁杂
**可以利用SQLAlchemy直接读取数据为DataFrame对象:**
db=sqla.create_engine('sqlite:///mydata.sqlite')
pd.read_sql('select * from test',db)