Analyzing Two-Dimensional Data with NumPy and Pandas

1. Data description

  • UNIT
    Remote unit that collects turnstile information. Can collect from multiple banks of turnstiles. Large subway stations can have more than one unit.

  • DATEn
    Date in “yyyy-mm-dd” (2011-05-21) format.

  • TIMEn
    Time in “hh:mm:ss” (08:05:02) format.

  • ENTRIESn
    Raw reading of cumulative turnstile entries from the remote unit. Occasionally resets to 0.

  • EXITSn
    Raw reading of cumulative turnstile exits from the remote unit. Occasionally resets to 0.

  • ENTRIESn_hourly
    Difference in ENTRIES from the previous REGULAR reading.

  • EXITSn_hourly
    Difference in EXITS from the previous REGULAR reading.

  • datetime
    Date and time in “yyyy-mm-dd hh:mm:ss” format (2011-05-01 00:00:00). Can be parsed into a Pandas datetime object without modifications.

  • hour
    Hour of the timestamp from TIMEn. Truncated rather than rounded.

  • day_week
    Integer (0-6, Mon-Sun) corresponding to the day of the week.

  • weekday
    Indicator (0 or 1) if the date is a weekday (Mon-Fri).

  • station
    Subway station corresponding to the remote unit.

  • latitude
    Latitude of the subway station corresponding to the remote unit.

  • longitude
    Longitude of the subway station corresponding to the remote unit.

  • conds
    Categorical variable of the weather conditions (Clear, Cloudy etc.) for the time and location.

  • fog
    Indicator (0 or 1) if there was fog at the time and location.

  • precipi
    Precipitation in inches at the time and location.

  • pressurei
    Barometric pressure in inches Hg at the time and location.

  • rain
    Indicator (0 or 1) if rain occurred within the calendar day at the location.

  • tempi
    Temperature in ℉ at the time and location.

  • wspdi
    Wind speed in mph at the time and location.

  • meanprecipi
    Daily average of precipi for the location.

  • meanpressurei
    Daily average of pressurei for the location.

  • meantempi
    Daily average of tempi for the location.

  • meanwspdi
    Daily average of wspdi for the location.

  • weather_lat
    Latitude of the weather station the weather data is from.

  • weather_lon
    Longitude of the weather station the weather data is from.

Questions I thought of:

  • What variables are related to subway ridership?
    -- Which stations have the most riders?
    -- What are the ridership patterns over time?
    -- How does the weather affect ridership?

  • What patterns can I find in the weather?
    -- Is the temperature rising throughout the month?
    -- How does weather vary across the city?

3. Two-dimensional NumPy arrays

Ways to represent two-dimensional data:
Python: list of lists
NumPy: 2D array
Pandas: DataFrame

A NumPy 2D array, as opposed to an array of arrays:

  • is more memory efficient
  • accesses an element with a slightly different syntax: a[1, 3] rather than a[1][3]
  • mean(), std(), etc. operate on the entire array by default

import numpy as np


ridership = np.array([
    [   0,    0,    2,    5,    0],
    [1478, 3877, 3674, 2328, 2539],
    [1613, 4088, 3991, 6461, 2691],
    [1560, 3392, 3826, 4787, 2613],
    [1608, 4802, 3932, 4477, 2705],
    [1576, 3933, 3909, 4979, 2685],
    [  95,  229,  255,  496,  201],
    [   2,    0,    1,   27,    0],
    [1438, 3785, 3589, 4174, 2215],
    [1342, 4043, 4009, 4665, 3033]
])
print(ridership)
print(ridership[1, 3])       # single element: row 1, column 3
print(ridership[1:3, 3:5])   # sub-array: rows 1-2, columns 3-4
print(ridership[1, :])       # entire row 1

# Vectorized operations on rows or columns
print(ridership[0, :] + ridership[1, :])
print(ridership[:, 0] + ridership[:, 1])

# Vectorized operations on entire arrays
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
print(a + b)


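Exercise: write a function that
  • finds the station with the most riders on the first day and returns the mean riders per day for that station
  • also returns the mean ridership over all stations and days, for comparison

def mean_riders_for_max_station(ridership):
    overall_mean = ridership.mean()                  # mean over every element
    max_station = ridership[0, :].argmax()           # column index of the busiest station on day 0
    mean_for_max = ridership[:, max_station].mean()  # mean of that station's column

    return (overall_mean, mean_for_max)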
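
To make the indexing concrete, here is a minimal sketch on a small made-up array (the numbers are illustrative, not taken from the subway data). It walks through the same three steps as the function: pick the busiest column on the first row, take that column's mean, and compare it with the overall mean.

```python
import numpy as np

# Hypothetical data: rows are days, columns are stations.
ridership = np.array([
    [10, 40, 20],
    [30, 50, 60],
    [20, 90, 10],
])

# A 2D array takes both indices in one pair of brackets.
element = ridership[1, 2]

# Position of the largest value in the first row (day 0).
max_station = ridership[0, :].argmax()

# Mean over every element, and mean of the selected column only.
overall_mean = ridership.mean()
mean_for_max = ridership[:, max_station].mean()

print(element, max_station, overall_mean, mean_for_max)
```

Note that argmax() returns a position, not a value, which is why it can be used directly as a column index.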

4. NumPy axes

Mean of each row:

ridership.mean(axis=1)

Mean of each column:

ridership.mean(axis=0)

import numpy as np


a = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])
    
print(a.sum())         # sum of all elements
print(a.sum(axis=0))   # one sum per column
print(a.sum(axis=1))   # one sum per row
    

ridership = np.array([
    [   0,    0,    2,    5,    0],
    [1478, 3877, 3674, 2328, 2539],
    [1613, 4088, 3991, 6461, 2691],
    [1560, 3392, 3826, 4787, 2613],
    [1608, 4802, 3932, 4477, 2705],
    [1576, 3933, 3909, 4979, 2685],
    [  95,  229,  255,  496,  201],
    [   2,    0,    1,   27,    0],
    [1438, 3785, 3589, 4174, 2215],
    [1342, 4043, 4009, 4665, 3033]
])

def min_and_max_riders_per_day(ridership):
    # axis=0 collapses the rows (days), leaving one mean per station
    mean_ridership_for_station = ridership.mean(axis=0)

    max_daily_ridership = mean_ridership_for_station.max()
    min_daily_ridership = mean_ridership_for_station.min()

    return (max_daily_ridership, min_daily_ridership)
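
One way to remember the axis argument is that it names the dimension that gets collapsed. A small sketch on a made-up 2x3 array shows the resulting shapes:

```python
import numpy as np

a = np.array([
    [1, 2, 3],
    [4, 5, 6],
])

# axis=0 collapses the rows: one mean per column, shape (3,)
col_means = a.mean(axis=0)

# axis=1 collapses the columns: one mean per row, shape (2,)
row_means = a.mean(axis=1)

print(col_means, col_means.shape)
print(row_means, row_means.shape)
```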

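5. NumPy and Pandas data types

Each column of a Pandas DataFrame can have a different data type, whereas a regular NumPy array has a single dtype for all of its elements.
dataframe.mean() computes the mean of each column.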
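
A short sketch of mixed column types, using made-up values. The numeric_only=True argument is my addition: it tells mean() to skip the text column, which recent pandas versions would otherwise refuse to average.

```python
import pandas as pd

# Hypothetical frame with a text, an integer, and a float column.
df = pd.DataFrame({
    'station': ['R003', 'R004'],
    'entries': [10, 30],
    'temp': [52.0, 48.9],
})

# Each column carries its own dtype.
print(df.dtypes)

# Column-wise means of the numeric columns only.
means = df.mean(numeric_only=True)
print(means)
```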

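6. Accessing DataFrame elements

.loc['index label']  # access the row with that index label
.iloc[9]             # access a row by position
.iloc[1, 3]          # access a single element by row and column position
df['column name']    # access a column
df.values            # 2D NumPy array holding only the values of df (no column names or row index), handy for computing statistics over the whole DataFrame

import pandas as pd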

# Subway ridership for 5 stations on 10 different days
ridership_df = pd.DataFrame(
    data=[[   0,    0,    2,    5,    0],
          [1478, 3877, 3674, 2328, 2539],
          [1613, 4088, 3991, 6461, 2691],
          [1560, 3392, 3826, 4787, 2613],
          [1608, 4802, 3932, 4477, 2705],
          [1576, 3933, 3909, 4979, 2685],
          [  95,  229,  255,  496,  201],
          [   2,    0,    1,   27,    0],
          [1438, 3785, 3589, 4174, 2215],
          [1342, 4043, 4009, 4665, 3033]],
    index=['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
           '05-06-11', '05-07-11', '05-08-11', '05-09-11', '05-10-11'],
    columns=['R003', 'R004', 'R005', 'R006', 'R007']
)


# Accessing elements
print(ridership_df.iloc[0])
print(ridership_df.loc['05-05-11'])
print(ridership_df['R003'])
print(ridership_df.iloc[1, 3])

# Accessing multiple columns
print(ridership_df[['R003', 'R005']])

# Pandas axis
df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
print(df.sum())            # one sum per column
print(df.sum(axis=1))      # one sum per row
print(df.values.sum())     # sum of all values

def mean_riders_for_max_station(ridership):
    overall_mean = ridership.values.mean()
    max_station = ridership.iloc[0].idxmax()  # idxmax() returns the column name
    mean_for_max = ridership.loc[:, max_station].mean()

    return (overall_mean, mean_for_max)
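
The difference between label-based and position-based access is easier to see on a tiny frame. The sketch below uses made-up numbers and shows that idxmax() gives back a label (here a column name), which can then be passed to .loc.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[0, 5, 2],
          [14, 23, 36]],
    index=['05-01-11', '05-02-11'],
    columns=['R003', 'R004', 'R005'],
)

first_row = df.iloc[0]            # by position
same_row = df.loc['05-01-11']     # by index label

busiest = df.iloc[0].idxmax()     # label of the largest value in the first row
value = df.iloc[1, 2]             # row 1, column 2 by position
total = df.values.sum()           # statistic over the whole frame

print(busiest, value, total)
print(df.loc[:, busiest].mean())
```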

7. Loading data into a DataFrame

A DataFrame is a good representation of the contents of a CSV file, because each column can have its own data type.

df = pd.read_csv('filename.csv')
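
To try read_csv() without a file on disk, the text can be wrapped in io.StringIO. The CSV content below is made up for illustration; only the column names follow the data description above.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file.
csv_text = (
    "UNIT,DATEn,ENTRIESn_hourly,rain\n"
    "R003,05-01-11,0.0,0\n"
    "R003,05-02-11,217.0,1\n"
)

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.dtypes)   # column types are inferred separately for each column
print(df.head())
```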

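8. Calculating correlation

By default, Pandas' std() function computes the standard deviation with Bessel's correction. Call std(ddof=0) to turn the correction off.
When computing Pearson's r with the formula below (the mean of the product of the standardized variables), use ddof=0.

NumPy's corrcoef() function can be used to calculate the Pearson product-moment correlation coefficient, also known simply as the "correlation coefficient".

import pandas as pd

def correlation(x, y):
    # Standardize each variable, then average the products
    x_standard = (x - x.mean()) / x.std(ddof=0)
    y_standard = (y - y.mean()) / y.std(ddof=0)
    return (x_standard * y_standard).mean()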
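
As a sanity check, the hand-written function can be compared with np.corrcoef(), which returns a 2x2 correlation matrix whose off-diagonal entry is r. The two short Series are made-up values.

```python
import numpy as np
import pandas as pd

def correlation(x, y):
    # Mean of the product of the standardized variables (population std).
    x_standard = (x - x.mean()) / x.std(ddof=0)
    y_standard = (y - y.mean()) / y.std(ddof=0)
    return (x_standard * y_standard).mean()

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

r_manual = correlation(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]   # off-diagonal entry of the 2x2 matrix

print(r_manual, r_numpy)
```

With ddof=1 in both std() calls, the manual result would come out slightly smaller than NumPy's, because the final .mean() still divides by n.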

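9. Pandas axis names

axis=1 or axis='columns': the operation runs along the columns, giving one result per row
axis=0 or axis='index': the operation runs along the index, giving one result per column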
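
A quick sketch confirming that the axis names and the axis numbers are interchangeable:

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})

per_column = df.sum(axis='index')    # same as axis=0: one sum per column
per_row = df.sum(axis='columns')     # same as axis=1: one sum per row

print(per_column)
print(per_row)
```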

10. DataFrame vectorized operations

import pandas as pd

# Adding DataFrames with the same column names
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df2 = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60], 'c': [70, 80, 90]})
print(df1 + df2)

# Adding DataFrames with overlapping column names (non-matching columns become NaN)
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df2 = pd.DataFrame({'d': [10, 20, 30], 'c': [40, 50, 60], 'b': [70, 80, 90]})
print(df1 + df2)

# Adding DataFrames with overlapping row indexes (non-matching rows become NaN)
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]},
                    index=['row1', 'row2', 'row3'])
df2 = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60], 'c': [70, 80, 90]},
                    index=['row4', 'row3', 'row2'])
print(df1 + df2)


# Cumulative entries and exits for one station for a few hours.
entries_and_exits = pd.DataFrame({
    'ENTRIESn': [3144312, 3144335, 3144353, 3144424, 3144594,
                 3144808, 3144895, 3144905, 3144941, 3145094],
    'EXITSn': [1088151, 1088159, 1088177, 1088231, 1088275,
               1088317, 1088328, 1088331, 1088420, 1088753]
})

def get_hourly_entries_and_exits(entries_and_exits):
    '''
    Fill in this function to take a DataFrame with cumulative entries
    and exits (entries in the first column, exits in the second) and
    return a DataFrame with hourly entries and exits (entries in the
    first column, exits in the second).
    '''
    return entries_and_exits-entries_and_exits.shift(1)
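
Subtracting the shifted frame is the same thing pandas offers as DataFrame.diff(). A minimal sketch on made-up cumulative counts shows that both give identical results, with NaN in the first row because it has no previous reading.

```python
import pandas as pd

# Hypothetical cumulative readings.
cumulative = pd.DataFrame({
    'ENTRIESn': [100, 110, 150],
    'EXITSn': [50, 52, 60],
})

by_shift = cumulative - cumulative.shift(1)   # approach used above
by_diff = cumulative.diff()                   # built-in equivalent

print(by_shift)
print(by_diff)
```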

11. DataFrame applymap()

import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [10, 20, 30],
    'c': [5, 10, 15]
})
    
def add_one(x):
    return x + 1
        
print(df.applymap(add_one))
    
grades_df = pd.DataFrame(
    data={'exam1': [43, 81, 78, 75, 89, 70, 91, 65, 98, 87],
          'exam2': [24, 63, 56, 56, 67, 51, 79, 46, 72, 60]},
    index=['Andre', 'Barry', 'Chris', 'Dan', 'Emilio', 
           'Fred', 'Greta', 'Humbert', 'Ivan', 'James']
)
 
def convert_grade(x):
    if x >= 90:
        return 'A'
    elif x >= 80:
        return 'B'
    elif x >= 70:
        return 'C'
    elif x >= 60:
        return 'D'
    else:
        return 'F'

def convert_grades(grades):
    return grades.applymap(convert_grade)
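
A usage sketch on a few made-up scores. As far as I know, newer pandas releases (2.1 and later) deprecate applymap() in favor of an equivalent DataFrame.map(), so the example picks whichever method the installed version provides; that fallback is my own addition, not part of the course code.

```python
import pandas as pd

def convert_grade(x):
    if x >= 90:
        return 'A'
    elif x >= 80:
        return 'B'
    elif x >= 70:
        return 'C'
    elif x >= 60:
        return 'D'
    else:
        return 'F'

# Hypothetical scores.
grades = pd.DataFrame(
    {'exam1': [43, 81, 95], 'exam2': [65, 79, 72]},
    index=['Andre', 'Barry', 'Chris'],
)

# Use DataFrame.map where it exists, otherwise fall back to applymap.
elementwise = grades.map if hasattr(grades, 'map') else grades.applymap
letters = elementwise(convert_grade)

print(letters)
```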

12. DataFrame apply()

def standardize_column(column):
    return (column - column.mean())/column.std(ddof=0)
def standardize(df):
    return df.apply(standardize_column)

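NumPy's .std() and Pandas' .std() compute different kinds of standard deviation by default. NumPy computes the population standard deviation (ddof=0), whereas Pandas computes the sample standard deviation (ddof=1). If we know all of the scores, then we have the whole population, so to standardize with Pandas we need to set ddof to 0.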
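
The difference in defaults is easy to check on a small made-up list of scores whose population standard deviation happens to be exactly 2:

```python
import numpy as np
import pandas as pd

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, squared deviations sum to 32

np_std = np.std(scores)                   # NumPy default: ddof=0, divides by n
pd_std = pd.Series(scores).std()          # Pandas default: ddof=1, divides by n - 1
pd_pop = pd.Series(scores).std(ddof=0)    # Pandas with the correction turned off

print(np_std, pd_std, pd_pop)
```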

13. DataFrame apply() use case 2

Reducing each column to a single value

def column_second_largest(column):
    sorted_values = column.sort_values(ascending = False)
    return sorted_values.iloc[1]
    
def second_largest(df):
    '''
    Fill in this function to return the second-largest value of each 
    column of the input DataFrame.
    '''
    return df.apply(column_second_largest)
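
When the function passed to apply() returns a single value per column, the result is a Series indexed by column name rather than a DataFrame. A usage sketch with made-up numbers:

```python
import pandas as pd

def column_second_largest(column):
    # Sort descending and take the element at position 1.
    return column.sort_values(ascending=False).iloc[1]

df = pd.DataFrame({
    'a': [4, 5, 3, 1, 2],
    'b': [20, 10, 40, 50, 30],
    'c': [25, 20, 5, 15, 10],
})

result = df.apply(column_second_largest)

print(type(result).__name__)
print(result)
```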

14. Adding a DataFrame to a Series

import pandas as pd
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({
    0: [10, 20, 30, 40],
    1: [50, 60, 70, 80],
    2: [90, 100, 110, 120],
    3: [130, 140, 150, 160]
})

# Adding a Series to a square DataFrame
print(df + s)

s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({0: [10], 1: [20], 2: [30], 3: [40]})
# Adding a Series to a one-row DataFrame
print(df + s)

s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({0: [10, 20, 30, 40]})
# Adding a Series to a one-column DataFrame
print(df + s)
    

    
# Adding when DataFrame column names match Series index
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
df = pd.DataFrame({
    'a': [10, 20, 30, 40],
    'b': [50, 60, 70, 80],
    'c': [90, 100, 110, 120],
    'd': [130, 140, 150, 160]
})
    
print(df + s)
    
# Adding when DataFrame column names don't match Series index
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({
    'a': [10, 20, 30, 40],
    'b': [50, 60, 70, 80],
    'c': [90, 100, 110, 120],
    'd': [130, 140, 150, 160]
})
print(df + s)

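df.add(s) is equivalent to df + s
df.add(s, axis='columns')  # the default: match the Series index against the DataFrame's column names
df.add(s, axis='index')    # match the Series index against the DataFrame's row index

Adding a Series to a DataFrame adds each value of the Series to one entire column of the DataFrame. The Series and the DataFrame are matched up by comparing the Series index with the DataFrame's column names.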
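
The effect of the axis argument is clearest on a small square example (made-up values), where both directions are valid but give different results:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
df = pd.DataFrame({
    0: [10, 20, 30],
    1: [40, 50, 60],
    2: [70, 80, 90],
})

# Default: Series index is matched to column names, so 1 goes to column 0,
# 2 to column 1, and 3 to column 2.
by_columns = df.add(s, axis='columns')

# Series index is matched to row labels, so 1 goes to row 0, 2 to row 1, ...
by_index = df.add(s, axis='index')

print(by_columns)
print(by_index)
```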

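15. Standardizing each column again

def standardize(df):
    '''
    Standardize each column.
    '''
    return (df - df.mean()) / df.std(ddof=0)

def standardize_rows(df):
    '''
    Standardize each row.
    '''
    # The row means and standard deviations are Series indexed by row label,
    # so they must be aligned with axis='index'; a plain df - mean would try
    # to match the row labels against the column names.
    mean = df.mean(axis='columns')
    mean_difference = df.sub(mean, axis='index')
    std = df.std(axis='columns', ddof=0)
    return mean_difference.div(std, axis='index')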
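
Row-wise standardization needs the subtraction and division to align on the row index rather than on the column names. Here is a minimal sketch using sub() and div() with axis='index', on two made-up students whose row statistics are easy to verify by hand:

```python
import pandas as pd

def standardize_rows(df):
    # Row means and stds are indexed by row label, so align on the index.
    mean_diffs = df.sub(df.mean(axis='columns'), axis='index')
    return mean_diffs.div(df.std(axis='columns', ddof=0), axis='index')

# Andre: mean 50, std 10. Barry: mean 85, std 5.
df = pd.DataFrame(
    {'exam1': [40, 80], 'exam2': [60, 90]},
    index=['Andre', 'Barry'],
)

z = standardize_rows(df)
print(z)
```

Each row ends up with mean 0 and standard deviation 1, regardless of how the student's raw scores compare with anyone else's.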

16. Pandas groupby()

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

values = np.array([1, 3, 2, 4, 1, 6, 4])
example_df = pd.DataFrame({
    'value': values,
    'even': values % 2 == 0,
    'above_three': values > 3 
}, index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])


print(example_df)

# Examine groups
grouped_data = example_df.groupby('even')
print(grouped_data.groups)

# Group by multiple columns
grouped_data = example_df.groupby(['even', 'above_three'])
print(grouped_data.groups)

# Get sum of each group
grouped_data = example_df.groupby('even')
print(grouped_data.sum())

# Limit the result to one column, either after or before summing
grouped_data = example_df.groupby('even')
print(grouped_data.sum()['value'])
print(grouped_data['value'].sum())
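
The same pattern answers questions such as "which stations have the most riders?". A sketch on made-up rows, using column names from the data description; selecting the column before aggregating avoids computing statistics for columns that are not needed.

```python
import pandas as pd

# Hypothetical hourly readings for two units.
df = pd.DataFrame({
    'UNIT': ['R003', 'R004', 'R003', 'R004'],
    'ENTRIESn_hourly': [10.0, 30.0, 20.0, 50.0],
})

# One mean per unit; the group keys become the index of the result.
mean_by_unit = df.groupby('UNIT')['ENTRIESn_hourly'].mean()

print(mean_by_unit)
print(mean_by_unit.idxmax())   # unit with the highest mean
```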

17. Hourly entries and exits

def hourly(column):
    return column - column.shift(1)

def get_hourly_entries_and_exits(entries_and_exits):
    '''
    Fill in this function to take a DataFrame with cumulative entries
    and exits and return a DataFrame with hourly entries and exits.
    The hourly entries and exits should be calculated separately for
    each station (the 'UNIT' column).
    '''
    return entries_and_exits.groupby('UNIT')[['ENTRIESn', 'EXITSn']].apply(hourly)
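
An alternative sketch that avoids apply(): grouped DataFrames have their own diff() method, which takes the difference from the previous row within each group and keeps the original row index. The interleaved readings below are made up to show that the two units do not contaminate each other.

```python
import pandas as pd

# Hypothetical cumulative readings, rows of the two units interleaved.
df = pd.DataFrame({
    'UNIT': ['R003', 'R004', 'R003', 'R004'],
    'ENTRIESn': [100, 500, 130, 520],
    'EXITSn': [10, 70, 15, 90],
})

# Difference from the previous reading of the same unit.
hourly = df.groupby('UNIT')[['ENTRIESn', 'EXITSn']].diff()

print(hourly)
```

The first reading of each unit has no predecessor, so its row is NaN.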

18. Merging Pandas DataFrames

import pandas as pd

subway_df = pd.DataFrame({
    'UNIT': ['R003', 'R003', 'R003', 'R003', 'R003', 'R004', 'R004', 'R004',
             'R004', 'R004'],
    'DATEn': ['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
              '05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11'],
    'hour': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'ENTRIESn': [ 4388333,  4388348,  4389885,  4391507,  4393043, 14656120,
                 14656174, 14660126, 14664247, 14668301],
    'EXITSn': [ 2911002,  2911036,  2912127,  2913223,  2914284, 14451774,
               14451851, 14454734, 14457780, 14460818],
    'latitude': [ 40.689945,  40.689945,  40.689945,  40.689945,  40.689945,
                  40.69132 ,  40.69132 ,  40.69132 ,  40.69132 ,  40.69132 ],
    'longitude': [-73.872564, -73.872564, -73.872564, -73.872564, -73.872564,
                  -73.867135, -73.867135, -73.867135, -73.867135, -73.867135]
})

weather_df = pd.DataFrame({
    'DATEn': ['05-01-11', '05-01-11', '05-02-11', '05-02-11', '05-03-11',
              '05-03-11', '05-04-11', '05-04-11', '05-05-11', '05-05-11'],
    'hour': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'latitude': [ 40.689945,  40.69132 ,  40.689945,  40.69132 ,  40.689945,
                  40.69132 ,  40.689945,  40.69132 ,  40.689945,  40.69132 ],
    'longitude': [-73.872564, -73.867135, -73.872564, -73.867135, -73.872564,
                  -73.867135, -73.872564, -73.867135, -73.872564, -73.867135],
    'pressurei': [ 30.24,  30.24,  30.32,  30.32,  30.14,  30.14,  29.98,  29.98,
                   30.01,  30.01],
    'fog': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'rain': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'tempi': [ 52. ,  52. ,  48.9,  48.9,  54. ,  54. ,  57.2,  57.2,  48.9,  48.9],
    'wspdi': [  8.1,   8.1,   6.9,   6.9,   3.5,   3.5,  15. ,  15. ,  15. ,  15. ]
})

def combine_dfs(subway_df, weather_df):
    '''
    Fill in this function to take 2 DataFrames, one with subway data and one with weather data,
    and return a single dataframe with one row for each date, hour, and location. Only include
    times and locations that have both subway data and weather data available.
    '''
    return subway_df.merge(weather_df,
        on=['DATEn','hour','latitude','longitude'],
        how='inner')
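
The how argument decides what happens to rows without a match. A reduced sketch with made-up data (only two key columns, to keep it short) contrasts 'inner' with 'left':

```python
import pandas as pd

subway = pd.DataFrame({
    'DATEn': ['05-01-11', '05-02-11', '05-03-11'],
    'hour': [0, 0, 0],
    'ENTRIESn': [100, 130, 170],
})

weather = pd.DataFrame({
    'DATEn': ['05-01-11', '05-02-11', '05-04-11'],
    'hour': [0, 0, 0],
    'tempi': [52.0, 48.9, 57.2],
})

# 'inner' keeps only keys present in both frames.
inner = subway.merge(weather, on=['DATEn', 'hour'], how='inner')

# 'left' keeps every subway row and fills missing weather with NaN.
left = subway.merge(weather, on=['DATEn', 'hour'], how='left')

print(inner)
print(left)
```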