Python Data Science Notes 2

Study notes on the Python Data Science Handbook

Part 2: NumPy (2)




In [29]: from scipy import special

In [30]: x = np.random.randint(15, size = (5,5), dtype = 'int32')

In [31]: x
array([[ 4, 14,  8,  5,  7],
       [ 0,  8,  8, 14,  9],
       [ 9,  9, 10, 14,  1],
       [13, 10,  0, 12, 12],
       [ 7,  3,  2, 14,  2]], dtype=int32)

In [32]: special.erf(x)
array([[ 0.99999998,  1.        ,  1.        ,  1.        ,  1.        ],
       [ 0.        ,  1.        ,  1.        ,  1.        ,  1.        ],
       [ 1.        ,  1.        ,  1.        ,  1.        ,  0.84270079],
       [ 1.        ,  1.        ,  0.        ,  1.        ,  1.        ],
       [ 1.        ,  0.99997791,  0.99532227,  1.        ,  0.99532227]])

In [33]: x
array([[ 4, 14,  8,  5,  7],
       [ 0,  8,  8, 14,  9],
       [ 9,  9, 10, 14,  1],
       [13, 10,  0, 12, 12],
       [ 7,  3,  2, 14,  2]], dtype=int32)

Specifying output

In [24]:
x = np.arange(5)
y = np.empty(5)
np.multiply(x, 10, out=y)
print(y)
[  0.  10.  20.  30.  40.]

y = np.zeros(10)
np.power(2, x, out=y[::2])
print(y)
[  1.   0.   2.   0.   4.   0.   8.   0.  16.   0.]

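If we had written y[::2] = 2 ** x instead, NumPy would first create a temporary array to hold the value of the right-hand side and then copy it into the sub-array on the left. Passing out= writes the result directly into the target memory and skips that temporary, which saves time and memory on large arrays.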
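
A minimal sketch of the point above, assuming only NumPy. The name `result` is illustrative and not from the notes; the check relies on the fact that a ufunc called with out= returns that same output array.

```python
import numpy as np

x = np.arange(5)
y = np.zeros(10)

# Write 2 ** x straight into every other element of y.
# No temporary array is created for the right-hand side.
result = np.power(2, x, out=y[::2])

print(y)
# The returned array is the out= view itself, so it shares memory with y.
print(np.shares_memory(result, y))
```

For arrays this small the saving is negligible; it becomes noticeable only when the operands are large.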


In [36]: x = np.linspace(0, 10, 5)

In [37]: x
Out[37]: array([  0. ,   2.5,   5. ,   7.5,  10. ])

In [38]: np.add.reduce(x)
Out[38]: 25.0

In [39]: np.multiply.reduce(x)
Out[39]: 0.0

In [40]: np.add.accumulate(x)
Out[40]: array([  0. ,   2.5,   7.5,  15. ,  25. ])

Outer products

In [41]: x = np.arange(1, 5)

In [42]: x
Out[42]: array([1, 2, 3, 4])

In [43]: np.multiply.outer(x, x)
array([[ 1,  2,  3,  4],
       [ 2,  4,  6,  8],
       [ 3,  6,  9, 12],
       [ 4,  8, 12, 16]])

In [44]: x = np.arange(1, 10)

In [45]: x
Out[45]: array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [46]: %timeit x.sum()
1.11 µs ± 72.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [47]: %timeit sum(x)         # avoid Python's built-in sum() on arrays; it iterates element by element
1.3 µs ± 5.45 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [48]: x.min()
Out[48]: 1

In [49]: x.max()
Out[49]: 9


In [50]: Mat = np.random.random((3,4))

In [51]: Mat.sum(axis = 1)
Out[51]: array([ 2.54634383,  2.42121143,  1.28962794])

In [52]: Mat
array([[ 0.77880176,  0.57543626,  0.6840498 ,  0.508056  ],
       [ 0.75612961,  0.15132258,  0.65047932,  0.86327992],
       [ 0.25738888,  0.5731711 ,  0.03401482,  0.42505314]])

In [53]: Mat.sum(axis = 0)
Out[53]: array([ 1.79232025,  1.29992993,  1.36854395,  1.79638906])

In [54]: # axis = 0 collapses the rows, i.e. it sums down each column
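
To make the axis convention concrete, here is a small sketch on a deterministic array (the array `M` and the variable names are illustrative, not from the notes). The axis argument names the dimension that is collapsed, not the one that is kept.

```python
import numpy as np

M = np.arange(12).reshape(3, 4)   # 3 rows, 4 columns

col_sums = M.sum(axis=0)  # collapses the row axis: one value per column
row_sums = M.sum(axis=1)  # collapses the column axis: one value per row

print(col_sums.shape, col_sums)
print(row_sums.shape, row_sums)
```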



In [1]: import numpy as np

In [2]: a = np.array([1, 2, 3])

In [3]: b = 3

In [4]: a + b
Out[4]: array([4, 5, 6])


In [5]: M = np.ones((3, 3))

In [6]: M + a
array([[ 2.,  3.,  4.],
       [ 2.,  3.,  4.],
       [ 2.,  3.,  4.]])

In [7]: a = np.arange(3)

In [8]: b = np.arange(3)[:, np.newaxis]

In [9]: a
Out[9]: array([0, 1, 2])

In [10]: b
array([[0],
       [1],
       [2]])

In [11]: a + b
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4]])


How it works


  1. Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.
  2. Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
  3. Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
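
The three rules can be traced on small arrays. This is only a sketch; the arrays `M`, `N`, `col` and the flag `raised` are names chosen here for illustration.

```python
import numpy as np

M = np.ones((2, 3))
a = np.arange(3)

# Rule 1: a has shape (3,), which is padded on the left to (1, 3).
# Rule 2: (1, 3) is stretched along the first axis to match (2, 3).
print((M + a).shape)

col = np.arange(3).reshape(3, 1)
# Both operands are stretched: (3, 1) and (1, 3) combine to (3, 3).
print((col + a).shape)

# Rule 3: (3, 2) against (1, 3): the trailing sizes 2 and 3 disagree
# and neither is 1, so NumPy raises ValueError.
N = np.ones((3, 2))
try:
    N + a
    raised = False
except ValueError as exc:
    raised = True
    print("broadcast failed:", exc)
```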


Building a z = f(x, y) dataset

# x and y have 50 steps from 0 to 5
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 50)[:, np.newaxis]
z = np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

Boolean masking

The book uses a rainfall dataset here to show what Boolean masking can do.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: rainfall = pd.read_csv('data/Seattle2014.csv')['PRCP'].values

In [4]: inches = rainfall / 254.0

In [5]: inches.shape
Out[5]: (365,)



Earlier we noted that ufuncs are functions that operate on a whole array element by element; here we combine them with Boolean masking.

In [1]: import numpy as np

In [2]: rng = np.random.RandomState(0)

In [3]: x = rng.randint(10, size = (3, 4))

In [4]: x
array([[5, 0, 3, 3],
       [7, 9, 3, 5],
       [2, 4, 7, 6]])

In [5]: x < 6
array([[ True,  True,  True,  True],
       [False, False,  True,  True],
       [ True,  True, False, False]], dtype=bool)

The comparison ufunc above gives us a Boolean array; the author then shows what Boolean arrays are good for.

In [5]: x < 6
array([[ True,  True,  True,  True],
       [False, False,  True,  True],
       [ True,  True, False, False]], dtype=bool)

In [6]: np.count_nonzero(_)
Out[6]: 8

In [7]: np.sum(x < 6)
Out[7]: 8

In [8]: np.any(x > 8)
Out[8]: True

In [9]: np.all(x < 8, axis = 1)
Out[9]: array([ True, False,  True], dtype=bool)

In [10]: # Working together with boolean operators

In [11]: np.sum((x < 6) & (x >= 0))
Out[11]: 8

Finally, a Boolean array can also be used as a mask, which is very similar to logical arrays in MATLAB.

In [12]: x[x < 6]
Out[12]: array([5, 0, 3, 3, 3, 5, 2, 4])


# construct a mask of all rainy days
rainy = (inches > 0)

# construct a mask of all summer days (June 21st is the 172nd day)
days = np.arange(365)
summer = (days > 172) & (days < 262)

print("Median precip on rainy days in 2014 (inches):   ",
      np.median(inches[rainy]))
print("Median precip on summer days in 2014 (inches):  ",
      np.median(inches[summer]))
print("Maximum precip on summer days in 2014 (inches): ",
      np.max(inches[summer]))
print("Median precip on non-summer rainy days (inches):",
      np.median(inches[rainy & ~summer]))
Median precip on rainy days in 2014 (inches):    0.194881889764
Median precip on summer days in 2014 (inches):   0.0
Maximum precip on summer days in 2014 (inches):  0.850393700787
Median precip on non-summer rainy days (inches): 0.200787401575

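Finally, note the difference between and / or and & / |. The keywords and and or evaluate the truth value of an entire object, whereas & and | are bitwise operators that NumPy applies element by element, so masks must be combined with & and |.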
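
A short sketch of the difference, using a throwaway array `x` (illustrative, not the rainfall data). The flag `ambiguous` is a name introduced here only to record what happens.

```python
import numpy as np

x = np.arange(10)

# & works element by element. The parentheses are required because
# & binds more tightly than the comparison operators.
mask = (x > 4) & (x < 8)
print(x[mask])

# `and` asks for the truth value of each whole array, which is ambiguous
# for an array with more than one element, so NumPy raises ValueError.
try:
    (x > 4) and (x < 8)
    ambiguous = False
except ValueError:
    ambiguous = True
print(ambiguous)
```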

Fancy Indexing

Fancy indexing means using an array as the index into another array (much like the Boolean masks in the previous section).

In [14]: ind = np.array([[3, 7], [4, 5]])

In [15]: rand = np.random.RandomState(45)

In [16]: x= rand.randint(100, size = (10, 5))

In [17]: x
array([[75, 30,  3, 32, 95],
       [61, 85, 35, 68, 15],
       [65, 14, 53, 57, 72],
       [87, 46,  8, 53, 12],
       [34, 24, 12, 17, 68],
       [30, 56, 14, 36, 31],
       [86, 36, 57, 61, 79],
       [17,  6, 42, 11,  8],
       [49, 77, 75, 63, 42],
       [54, 16, 24, 95, 63]])

In [18]: x[ind]
array([[[87, 46,  8, 53, 12],
        [17,  6, 42, 11,  8]],

       [[34, 24, 12, 17, 68],
        [30, 56, 14, 36, 31]]])

In [19]: # The shape of the result reflects the shape of the index arrays
    ...: # rather than the shape of the array being indexed

In [20]: X = np.arange(12).reshape((3, 4))

In [21]: X
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [22]: row = np.array([0, 1, 2])

In [23]: col = np.array([2, 1, 3])

In [24]: X[row, col]
Out[24]: array([ 2,  5, 11])

In [25]: # We get the elements at positions (0, 2), (1, 1) and (2, 3)

In [34]: X.shape
Out[34]: (100, 2)

In [35]: import matplotlib.pyplot as plt

In [36]: import seaborn; seaborn.set()

In [37]: plt.scatter(X[:, 0], X[:, 1])
Out[37]: <matplotlib.collections.PathCollection at 0x7f0cc9c461d0>
<matplotlib.figure.Figure at 0x7f0cc9c6b5f8>


In [39]: indices = np.random.choice(X.shape[0], 20, replace = False)

In [40]: indices
array([15, 87, 73, 17, 44, 66, 89, 91,  8, 25, 19, 39, 85, 49, 26, 20, 58,
       41, 55, 24])

In [41]: selection = X[indices] # fancy indexing

In [42]: selection
array([[ -1.80623391e-01,  -2.15707232e+00],
       [ -8.04178492e-01,  -1.34828994e+00],
       [ -1.24272035e+00,  -2.42157557e+00],
       [  3.57111518e-01,   8.94495954e-02],
       [  2.15274973e+00,   3.24279140e+00],
       [ -4.18439156e-01,  -8.58736471e-01],
       [  6.08859877e-01,  -2.59284917e-01],
       [ -6.29633042e-01,   1.32258627e-01],
       [  1.11113414e+00,   1.77185490e+00],
       [  1.65522319e+00,   4.23558698e+00],
       [ -1.40629915e-01,  -1.62069848e-01],
       [  5.21162541e-01,   2.89756456e+00],
       [ -1.11282410e+00,  -1.82987036e+00],
       [ -5.71948987e-01,  -3.34258009e+00],
       [ -2.34528800e+00,  -3.77554207e+00],
       [ -2.58467915e-01,  -8.69598951e-01],
       [ -1.46270269e-01,  -1.27384266e-04],
       [ -7.79152780e-02,  -2.01423478e+00],
       [ -1.79097697e+00,  -1.08351482e+00],
       [ -1.31637907e+00,  -1.86128924e+00]])

Using fancy indexing to modify values

In [53]: x
Out[53]: array([ 0.,  0.,  2.,  3.,  4.,  0.])

In [54]: i
Out[54]: [2, 3, 3, 4, 4, 4]

In [55]: x[i] += 1

In [56]: x
Out[56]: array([ 0.,  0.,  3.,  4.,  5.,  0.])

In [57]: x = np.zeros(10)

In [58]: np.add.at(x, i, 1) # the proper way to accumulate over repeated indices

In [59]: x
Out[59]: array([ 0.,  0.,  1.,  2.,  3.,  0.,  0.,  0.,  0.,  0.])
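
The two behaviours can be put side by side. This is a sketch with fresh arrays named `buffered` and `unbuffered` (names chosen here); it reuses the same index list as above.

```python
import numpy as np

i = [2, 3, 3, 4, 4, 4]

# x[i] += 1 is shorthand for x[i] = x[i] + 1: the right-hand side is
# evaluated once, so a repeated index is still incremented only once.
buffered = np.zeros(6)
buffered[i] += 1

# np.add.at applies the addition once per occurrence of each index.
unbuffered = np.zeros(6)
np.add.at(unbuffered, i, 1)

print(buffered)
print(unbuffered)
```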

Binning Data

In [67]: np.random.seed(42)

In [68]: x = np.random.randn(100)

In [69]: np.size(x)
Out[69]: 100

In [70]: bins = np.linspace(-5, 5, 20)

In [71]: counts = np.zeros_like(bins)

In [72]: np.size(counts)
Out[72]: 20

In [73]: i = np.searchsorted(bins, x)

In [74]: i
array([11, 10, 11, 13, 10, 10, 13, 11,  9, 11,  9,  9, 10,  6,  7,  9,  8,
       11,  8,  7, 13, 10, 10,  7,  9, 10,  8, 11,  9,  9,  9, 14, 10,  8,
       12,  8, 10,  6,  7, 10, 11, 10, 10,  9,  7,  9,  9, 12, 11,  7, 11,
        9,  9, 11, 12, 12,  8,  9, 11, 12,  9, 10,  8,  8, 12, 13, 10, 12,
       11,  9, 11, 13, 10, 13,  5, 12, 10,  9, 10,  6, 10, 11, 13,  9,  8,
        9, 12, 11,  9, 11, 10, 12,  9,  9,  9,  7, 11, 10, 10, 10])

In [75]: x
array([ 0.49671415, -0.1382643 ,  0.64768854,  1.52302986, -0.23415337,
       -0.23413696,  1.57921282,  0.76743473, -0.46947439,  0.54256004,
       -0.46341769, -0.46572975,  0.24196227, -1.91328024, -1.72491783,
       -0.56228753, -1.01283112,  0.31424733, -0.90802408, -1.4123037 ,
        1.46564877, -0.2257763 ,  0.0675282 , -1.42474819, -0.54438272,
        0.11092259, -1.15099358,  0.37569802, -0.60063869, -0.29169375,
       -0.60170661,  1.85227818, -0.01349722, -1.05771093,  0.82254491,
       -1.22084365,  0.2088636 , -1.95967012, -1.32818605,  0.19686124,
        0.73846658,  0.17136828, -0.11564828, -0.3011037 , -1.47852199,
       -0.71984421, -0.46063877,  1.05712223,  0.34361829, -1.76304016,
        0.32408397, -0.38508228, -0.676922  ,  0.61167629,  1.03099952,
        0.93128012, -0.83921752, -0.30921238,  0.33126343,  0.97554513,
       -0.47917424, -0.18565898, -1.10633497, -1.19620662,  0.81252582,
        1.35624003, -0.07201012,  1.0035329 ,  0.36163603, -0.64511975,
        0.36139561,  1.53803657, -0.03582604,  1.56464366, -2.6197451 ,
        0.8219025 ,  0.08704707, -0.29900735,  0.09176078, -1.98756891,
       -0.21967189,  0.35711257,  1.47789404, -0.51827022, -0.8084936 ,
       -0.50175704,  0.91540212,  0.32875111, -0.5297602 ,  0.51326743,
        0.09707755,  0.96864499, -0.70205309, -0.32766215, -0.39210815,
       -1.46351495,  0.29612028,  0.26105527,  0.00511346, -0.23458713])

In [76]: np.add.at(counts, i, 1)

In [77]: counts
array([  0.,   0.,   0.,   0.,   0.,   1.,   3.,   7.,   9.,  23.,  22.,
        17.,  10.,   7.,   1.,   0.,   0.,   0.,   0.,   0.])
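
As a cross-check, the manual binning can be compared with np.histogram. This sketch assumes the same seed and bin edges as above and uses np.random.RandomState(42) in place of np.random.seed(42), which draws the same values. With 20 edges np.histogram returns 19 counts, so they line up with counts[1:] (counts[0] only collects values at or below the first edge, and there are none here).

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.randn(100)
bins = np.linspace(-5, 5, 20)

# Manual binning: find each value's bin, then add 1 per occurrence.
counts = np.zeros_like(bins)
np.add.at(counts, np.searchsorted(bins, x), 1)

# Library routine: 20 edges define 19 bins.
hist, edges = np.histogram(x, bins)

print(counts[1:])
print(hist)
```

The two agree here because no sample falls exactly on a bin edge; the routines treat edge values differently (searchsorted defaults to side='left', while np.histogram uses half-open bins).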



In [18]: x
Out[18]: array([14, 92, 58, 74, 22])

In [19]: i = np.argsort(x)

In [20]: x[i]
Out[20]: array([14, 22, 58, 74, 92])

Using the index array returned by argsort, we can build the sorted array with fancy indexing.

In [21]: x = np.arange(1,10)

In [22]: np.random.shuffle(x)

In [23]: x
Out[23]: array([2, 9, 4, 3, 8, 6, 7, 5, 1])

In [24]: np.partition(x, 5)
Out[24]: array([1, 2, 3, 4, 5, 6, 7, 9, 8])


Structured arrays

In [25]: name = ['Alice', 'Bob', 'Cathy', 'Doug']
    ...: age = [25, 45, 37, 19]
    ...: weight = [55.0, 85.5, 68.0, 61.5]

In [26]: x = np.zeros(4, dtype=int)

In [27]: # compound data type

In [28]: data = np.zeros(4, dtype={'names':('name', 'age', 'weight'), 'formats'
    ...: :('U10', 'i4', 'f8')})

In [29]: data.dtype
Out[29]: dtype([('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])

In [30]: data['name']=name;data['age']=age;data['weight']=weight

In [31]: data
array([('Alice', 25,  55. ), ('Bob', 45,  85.5), ('Cathy', 37,  68. ),
       ('Doug', 19,  61.5)],
      dtype=[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])

In [32]: data[data['age'] < 30]['name']
array(['Alice', 'Doug'],
      dtype='<U10')

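Besides structured arrays, NumPy also provides record arrays (np.recarray). The main difference is that the keys above can be accessed as attributes; the downside is that attribute access is slower than access by key.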
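
A minimal sketch of the record-array interface, rebuilding the same data as above. Viewing the structured array as np.recarray is one way to get attribute access; the name `rec` is introduced here for illustration.

```python
import numpy as np

data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
data['name'] = ['Alice', 'Bob', 'Cathy', 'Doug']
data['age'] = [25, 45, 37, 19]
data['weight'] = [55.0, 85.5, 68.0, 61.5]

# Same memory, but fields are now also reachable as attributes.
rec = data.view(np.recarray)

print(data['age'])   # key access on the structured array
print(rec.age)       # attribute access on the record array
```

Running %timeit on data['age'] and rec.age is an easy way to see the access overhead on your own machine.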

