Pandas 1.x Cookbook, 2nd Edition · Chapter 07: Filtering Rows

Chapter 01: Pandas Foundations
Chapter 02: Essential DataFrame Operations
Chapter 03: Creating and Persisting DataFrames
Chapter 04: Beginning Data Analysis
Chapter 05: Exploratory Data Analysis
Chapter 06: Selecting Subsets of Data
Chapter 07: Filtering Rows
Chapter 08: Index Alignment


7.1 Calculating boolean statistics

Read the movie dataset and inspect the first few rows:

>>> import pandas as pd
>>> import numpy as np
>>> movie = pd.read_csv(
...     "data/movie.csv", index_col="movie_title"
... )
>>> movie[["duration"]].head()
                                            duration
movie_title
Avatar                                         178.0
Pirates of the Caribbean: At World's End       169.0
Spectre                                        148.0
The Dark Knight Rises                          164.0
Star Wars: Episode VII - The Force Awakens       NaN

Determine whether each movie runs longer than two hours:

>>> movie_2_hours = movie["duration"] > 120
>>> movie_2_hours.head(10)
movie_title
Avatar                                         True
Pirates of the Caribbean: At World's End       True
Spectre                                        True
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
John Carter                                    True
Spider-Man 3                                   True
Tangled                                       False
Avengers: Age of Ultron                        True
Harry Potter and the Half-Blood Prince         True
Name: duration, dtype: bool

Use this Series to count the movies longer than two hours:

>>> movie_2_hours.sum()
1039

The percentage of movies longer than two hours:

>>> movie_2_hours.mean() * 100
21.13506916192026

The previous step did not drop missing values, which makes the result misleading:

>>> movie["duration"].dropna().gt(120).mean() * 100
21.199755152009794
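The effect of the missing values can be reproduced on a toy Series (the numbers below are hypothetical, not from the movie dataset):

```python
import pandas as pd
import numpy as np

# Hypothetical duration column (minutes) with one missing value.
duration = pd.Series([90, 150, np.nan, 130])

# NaN > 120 evaluates to False, so the missing row still counts in
# the denominator and deflates the percentage.
naive = (duration > 120).mean()             # 2 True out of 4 rows
correct = duration.dropna().gt(120).mean()  # 2 True out of 3 rows
```

Here `naive` is 0.5 while `correct` is about 0.667, which is why dropping the NaNs first gives the honest proportion.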

Use the .describe method to output summary statistics:

>>> movie_2_hours.describe()
count      4916
unique        2
top       False
freq       3877
Name: duration, dtype: object

How it works…

Use the .value_counts method to get the proportions of False and True:

>>> movie_2_hours.value_counts(normalize=True)
False    0.788649
True     0.211351
Name: duration, dtype: float64

There's more…

You can also create a boolean Series by comparing two columns of a DataFrame:

>>> actors = movie[
...     ["actor_1_facebook_likes", "actor_2_facebook_likes"]
... ].dropna()
>>> (
...     actors["actor_1_facebook_likes"]
...     > actors["actor_2_facebook_likes"]
... ).mean()
0.9777687130328371

7.2 Constructing multiple boolean conditions

Read the data:

>>> movie = pd.read_csv(
...     "data/movie.csv", index_col="movie_title"
... )

Create variables to hold the boolean arrays:

>>> criteria1 = movie.imdb_score > 8
>>> criteria2 = movie.content_rating == "PG-13"
>>> criteria3 = (movie.title_year < 2000) | (
...     movie.title_year > 2009
... )

Combine all the filters into a single boolean array:

>>> criteria_final = criteria1 & criteria2 & criteria3
>>> criteria_final.head()
movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                       False
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
dtype: bool

There's more…

Python evaluates a compound comparison one step at a time:

>>> 5 < 10 and 3 > 4
False
>>> True and 3 > 4
False
>>> True and False
False
>>> False
False
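Note that `and` only works on plain Python booleans. A quick sketch on a toy Series (the years below are hypothetical) shows why pandas conditions use `&`/`|` with parentheses instead:

```python
import pandas as pd

# Toy title_year column, an assumption for illustration.
years = pd.Series([1995, 2005, 2015])

# Python's `and`/`or` cannot combine Series (their truth value is
# ambiguous), so pandas overloads `&` and `|` instead. Because `&`
# and `|` bind tighter than `<` and `>`, each comparison needs its
# own parentheses.
mask = (years < 2000) | (years > 2009)
```

Without the parentheses, `years < 2000 | years > 2009` would try to evaluate `2000 | years` first and fail.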

7.3 Filtering with boolean arrays

Read the data and set up the filter conditions:

>>> movie = pd.read_csv(
...     "data/movie.csv", index_col="movie_title"
... )
>>> crit_a1 = movie.imdb_score > 8
>>> crit_a2 = movie.content_rating == "PG-13"
>>> crit_a3 = (movie.title_year < 2000) | (
...     movie.title_year > 2009
... )
>>> final_crit_a = crit_a1 & crit_a2 & crit_a3

Create a second set of conditions:

>>> crit_b1 = movie.imdb_score < 5
>>> crit_b2 = movie.content_rating == "R"
>>> crit_b3 = (movie.title_year >= 2000) & (
...     movie.title_year <= 2010
... )
>>> final_crit_b = crit_b1 & crit_b2 & crit_b3

Combine the two sets into the final condition:

>>> final_crit_all = final_crit_a | final_crit_b
>>> final_crit_all.head()
movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                       False
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
dtype: bool

Filter the data with the final condition:

>>> movie[final_crit_all].head()
                            color  ... movie/likes
movie_title                        ...
The Dark Knight Rises       Color  ...      164000
The Avengers                Color  ...      123000
Captain America: Civil War  Color  ...       72000
Guardians of the Galaxy     Color  ...       96000
Interstellar                Color  ...      349000

.loc accepts the same filter:

>>> movie.loc[final_crit_all].head()
                            color  ... movie/likes
movie_title                        ...
The Dark Knight Rises       Color  ...      164000
The Avengers                Color  ...      123000
Captain America: Civil War  Color  ...       72000
Guardians of the Galaxy     Color  ...       96000
Interstellar                Color  ...      349000

You can also specify columns in .loc:

>>> cols = ["imdb_score", "content_rating", "title_year"]
>>> movie_filtered = movie.loc[final_crit_all, cols]
>>> movie_filtered.head(10)
              imdb_score content_rating  title_year
movie_title
The Dark ...         8.5        PG-13        2012.0
The Avengers         8.1        PG-13        2012.0
Captain A...         8.2        PG-13        2016.0
Guardians...         8.1        PG-13        2014.0
Interstellar         8.6        PG-13        2014.0
Inception            8.8        PG-13        2010.0
The Martian          8.1        PG-13        2015.0
Town & Co...         4.4            R        2001.0
Sex and t...         4.3            R        2010.0
Rollerball           3.0            R        2002.0

.iloc does not accept a boolean Series, but it does accept a NumPy boolean array:

>>> movie.iloc[final_crit_all]
Traceback (most recent call last):
  ...
ValueError: iLocation based boolean indexing cannot use an indexable as a mask    
>>> movie.iloc[final_crit_all.to_numpy()]
                            color  ... movie/likes
movie_title                        ...
The Dark Knight Rises       Color  ...      164000
The Avengers                Color  ...      123000
Captain America: Civil War  Color  ...       72000
Guardians of the Galaxy     Color  ...       96000
Interstellar                Color  ...      349000
...                           ...  ...         ...
The Young Unknowns          Color  ...           4
Bled                        Color  ...         128
Hoop Dreams                 Color  ...           0
Death Calls                 Color  ...          16
The Legend of God's Gun     Color  ...          13
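The same .loc/.iloc contrast can be seen on a toy frame (the data below is an assumption for illustration):

```python
import pandas as pd

# Toy frame, an assumption for illustration.
df = pd.DataFrame({"score": [8.5, 4.2, 9.0]}, index=["a", "b", "c"])
mask = df["score"] > 8          # a boolean Series

# .loc accepts the boolean Series directly; .iloc needs the
# underlying NumPy array, obtained with .to_numpy().
by_loc = df.loc[mask]
by_iloc = df.iloc[mask.to_numpy()]
```

Both selections return the same rows; only the accepted mask type differs.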

There's more…

All the conditions can be combined in a single expression:

>>> final_crit_a2 = (
...     (movie.imdb_score > 8)
...     & (movie.content_rating == "PG-13")
...     & (
...         (movie.title_year < 2000)
...         | (movie.title_year > 2009)
...     )
... )
>>> final_crit_a2.equals(final_crit_a)
True

7.4 Comparing row filtering and index filtering

Read the data and filter it:

>>> college = pd.read_csv("data/college.csv")
>>> college[college["STABBR"] == "TX"].head()
                            INSTNM  ...        GRAD_/_SUPP
3610  Abilene Christian University  ...              25985
3611       Alvin Community College  ...               6750
3612              Amarillo College  ...              10950
3613              Angelina College  ...  PrivacySuppressed
3614       Angelo State University  ...            21319.5

Repeat the step above, but set the STABBR column as the row index and select by label:

>>> college2 = college.set_index("STABBR")
>>> college2.loc["TX"].head()
                              INSTNM  ...        GRAD_/_SUPP
STABBR                                ...
TX      Abilene Christian University  ...              25985
TX           Alvin Community College  ...               6750
TX                  Amarillo College  ...              10950
TX                  Angelina College  ...  PrivacySuppressed
TX           Angelo State University  ...            21319.5

Compare the speed of the two approaches:

>>> %timeit college[college['STABBR'] == 'TX']
1.75 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit college2.loc['TX']
882 µs ± 69.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Index lookup is faster, but building the index also takes time:

>>> %timeit college2 = college.set_index('STABBR')
2.01 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

There's more…

Select rows for several states at once, either with a boolean condition or with the index:

>>> states = ["TX", "CA", "NY"]
>>> college[college["STABBR"].isin(states)]
           INSTNM         CITY  ... MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
192   Academy ...  San Fran...  ...        36000           35093
193   ITT Tech...  Rancho C...  ...        38800         25827.5
194   Academy ...      Oakland  ...          NaN     PrivacyS...
195   The Acad...  Huntingt...  ...        28400            9500
196   Avalon S...      Alameda  ...        21600            9860
...           ...          ...  ...          ...             ...
7528  WestMed ...       Merced  ...          NaN         15623.5
7529  Vantage ...      El Paso  ...          NaN            9500
7530  SAE Inst...   Emeryville  ...          NaN            9500
7533  Bay Area...     San Jose  ...          NaN     PrivacyS...
7534  Excel Le...  San Antonio  ...          NaN           12125
>>> college2.loc[states]
             INSTNM        CITY  ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
STABBR                           ...
TX      Abilene ...     Abilene  ...        40200            25985
TX      Alvin Co...       Alvin  ...        34500             6750
TX      Amarillo...    Amarillo  ...        31700            10950
TX      Angelina...      Lufkin  ...        26900      PrivacyS...
TX      Angelo S...  San Angelo  ...        37700          21319.5
...             ...         ...  ...          ...              ...
NY      Briarcli...   Patchogue  ...        38200          28720.5
NY      Jamestow...   Salamanca  ...          NaN            12050
NY      Pratt Ma...    New York  ...        40900            26691
NY      Saint Jo...   Patchogue  ...        52000          22143.5
NY      Franklin...    Brooklyn  ...        20000      PrivacyS...
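The .isin method used above builds one membership mask in a single call; a minimal sketch on toy data (an assumption for illustration):

```python
import pandas as pd

# Toy state column, an assumption for illustration.
df = pd.DataFrame({"STABBR": ["TX", "OH", "CA", "NY"]})
states = ["TX", "CA", "NY"]

# .isin tests membership in a list of values, replacing a chain of
# == comparisons joined with |.
mask = df["STABBR"].isin(states)
```

The equivalent `(df["STABBR"] == "TX") | (df["STABBR"] == "CA") | ...` quickly becomes unreadable as the list grows.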

7.5 Selecting with unique and sorted indexes

Read the dataset, set STABBR as the index, and check whether the index is monotonic:

>>> college = pd.read_csv("data/college.csv")
>>> college2 = college.set_index("STABBR")
>>> college2.index.is_monotonic
False

Sort the index and check whether it is monotonic again:

>>> college3 = college2.sort_index()
>>> college3.index.is_monotonic
True
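The same sort-then-check pattern can be sketched on a toy frame, here using the equivalent `is_monotonic_increasing` attribute (the data is an assumption for illustration):

```python
import pandas as pd

# Toy frame with an unsorted string index, an assumption.
df = pd.DataFrame({"v": [1, 2, 3]}, index=["NY", "CA", "TX"])
unsorted = df.index.is_monotonic_increasing   # False: NY > CA

# Sorting the index makes it monotonic, which lets pandas use
# binary search for label lookups.
df_sorted = df.sort_index()
sorted_ok = df_sorted.index.is_monotonic_increasing
```

A monotonic index is what unlocks the fast lookups timed in the next step.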

Time selecting TX from each of the three DataFrames:

>>> %timeit college[college['STABBR'] == 'TX']
1.75 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit college2.loc['TX']
1.09 ms ± 232 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit college3.loc['TX']
304 µs ± 17.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

A sorted index speeds up selection considerably. Now try a unique index:

>>> college_unique = college.set_index("INSTNM")
>>> college_unique.index.is_unique
True

Selecting with boolean indexing returns a DataFrame:

>>> college[college["INSTNM"] == "Stanford University"]
           INSTNM      CITY  ... MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
4217  Stanford...  Stanford  ...        86000           12782

Select with the row index; since the label is unique, this returns a Series:

>>> college_unique.loc["Stanford University"]
CITY                  Stanford
STABBR                      CA
HBCU                         0
MENONLY                      0
WOMENONLY                    0
                        ...
PCTPELL                 0.1556
PCTFLOAN                0.1256
UG25ABV                 0.0401
MD_EARN_WNE_P10          86000
GRAD_DEBT_MDN_SUPP       12782
Name: Stanford University, Length: 26, dtype: object
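The Series-vs-DataFrame distinction is easy to verify on a toy frame (the rows below are an assumption for illustration):

```python
import pandas as pd

# Toy frame with a unique string index, an assumption.
df = pd.DataFrame(
    {"CITY": ["Stanford", "Abilene"]},
    index=pd.Index(
        ["Stanford University", "Abilene Christian University"],
        name="INSTNM",
    ),
)

# With a unique index, .loc on a single label returns a one-row
# Series; boolean filtering always returns a DataFrame.
row = df.loc["Stanford University"]
```

If the label occurred more than once, the same `.loc` call would return a DataFrame instead.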

There's more…

Use the city name and state abbreviation together as the row index:

>>> college.index = (
...     college["CITY"] + ", " + college["STABBR"]
... )
>>> college = college.sort_index()
>>> college.head()
                   INSTNM      CITY  ... MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
ARTESIA, CA   Angeles ...   ARTESIA  ...          NaN           16850
Aberdeen, SD  Presenta...  Aberdeen  ...        35900           25000
Aberdeen, SD  Northern...  Aberdeen  ...        33600           24847
Aberdeen, WA  Grays Ha...  Aberdeen  ...        27000           11490
Abilene, TX   Hardin-S...   Abilene  ...        38700           25864

Select all schools in Miami, FL:

>>> college.loc["Miami, FL"].head()
                INSTNM   CITY  ... MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
Miami, FL  New Prof...  Miami  ...        18700            8682
Miami, FL  Manageme...  Miami  ...  PrivacyS...           12182
Miami, FL  Strayer ...  Miami  ...        49200         36173.5
Miami, FL  Keiser U...  Miami  ...        29700           26063
Miami, FL  George T...  Miami  ...        38600     PrivacyS...

Compare the speed of the two approaches:

>>> %%timeit
... crit1 = college["CITY"] == "Miami"
... crit2 = college["STABBR"] == "FL"
... college[crit1 & crit2]
3.05 ms ± 66.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit college.loc['Miami, FL']
369 µs ± 130 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

7.6 Translating SQL WHERE clauses

The SQL statement is as follows:

SELECT
    UNIQUE_ID,
    DEPARTMENT,
    GENDER,
    BASE_SALARY
FROM
    EMPLOYEE
WHERE
    DEPARTMENT IN ('Houston Police Department-HPD',
                   'Houston Fire Department (HFD)') AND
    GENDER = 'Female' AND
    BASE_SALARY BETWEEN 80000 AND 120000;

Reproduce the SQL statement above in Pandas. First, read the data:

>>> employee = pd.read_csv("data/employee.csv")

Inspect the dataset:

>>> employee.dtypes
UNIQUE_ID              int64
POSITION_TITLE        object
DEPARTMENT            object
BASE_SALARY          float64
RACE                  object
EMPLOYMENT_TYPE       object
GENDER                object
EMPLOYMENT_STATUS     object
HIRE_DATE             object
JOB_DATE              object
dtype: object
>>> employee.DEPARTMENT.value_counts().head()
Houston Police Department-HPD     638
Houston Fire Department (HFD)     384
Public Works & Engineering-PWE    343
Health & Human Services           110
Houston Airport System (HAS)      106
Name: DEPARTMENT, dtype: int64    
>>> employee.GENDER.value_counts()
Male      1397
Female     603
Name: GENDER, dtype: int64    
>>> employee.BASE_SALARY.describe()
count      1886.000000
mean      55767.931601
std       21693.706679
min       24960.000000
25%       40170.000000
50%       54461.000000
75%       66614.000000
max      275000.000000
Name: BASE_SALARY, dtype: float64

Create the filter conditions:

>>> depts = [
...     "Houston Police Department-HPD",
...     "Houston Fire Department (HFD)",
... ]
>>> criteria_dept = employee.DEPARTMENT.isin(depts)
>>> criteria_gender = employee.GENDER == "Female"
>>> criteria_sal = (employee.BASE_SALARY >= 80000) & (
...     employee.BASE_SALARY <= 120000
... )
>>> criteria_final = (
...     criteria_dept & criteria_gender & criteria_sal
... )

Use the filter to select the data:

>>> select_columns = [
...     "UNIQUE_ID",
...     "DEPARTMENT",
...     "GENDER",
...     "BASE_SALARY",
... ]
>>> employee.loc[criteria_final, select_columns].head()
     UNIQUE_ID   DEPARTMENT  GENDER  BASE_SALARY
61          61  Houston ...  Female      96668.0
136        136  Houston ...  Female      81239.0
367        367  Houston ...  Female      86534.0
474        474  Houston ...  Female      91181.0
513        513  Houston ...  Female      81239.0

There's more…

Like SQL, Pandas has a between method:

>>> criteria_sal = employee.BASE_SALARY.between(
...     80_000, 120_000
... )
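A small self-contained check of the between semantics (the salaries below are hypothetical):

```python
import pandas as pd

# Hypothetical salaries, an assumption for illustration.
salary = pd.Series([75000, 80000, 100000, 125000])

# .between is inclusive on both endpoints by default, matching
# SQL's BETWEEN semantics; both 80000 and 120000 would pass.
mask = salary.between(80000, 120000)
```

The `inclusive` parameter can relax either endpoint if strict bounds are needed.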

7.7 Improving the readability of boolean indexing with the query method

This recipe uses the DataFrame .query method.

Read the data:

>>> employee = pd.read_csv("data/employee.csv")
>>> depts = [
...     "Houston Police Department-HPD",
...     "Houston Fire Department (HFD)",
... ]
>>> select_columns = [
...     "UNIQUE_ID",
...     "DEPARTMENT",
...     "GENDER",
...     "BASE_SALARY",
... ]

Build the query string:

>>> qs = (
...     "DEPARTMENT in @depts "
...     " and GENDER == 'Female' "
...     " and 80000 <= BASE_SALARY <= 120000"
... )
>>> emp_filtered = employee.query(qs)
>>> emp_filtered[select_columns].head()
     UNIQUE_ID   DEPARTMENT  GENDER  BASE_SALARY
61          61  Houston ...  Female      96668.0
136        136  Houston ...  Female      81239.0
367        367  Houston ...  Female      86534.0
474        474  Houston ...  Female      91181.0
513        513  Houston ...  Female      81239.0

The @ syntax works with any local variable, including computed ones. For example, exclude the ten largest departments:

>>> top10_depts = (
...     employee.DEPARTMENT.value_counts()
...     .index[:10]
...     .tolist()
... )
>>> qs = "DEPARTMENT not in @top10_depts and GENDER == 'Female'"
>>> employee_filtered2 = employee.query(qs)
>>> employee_filtered2.head()
     UNIQUE_ID POSITION_TITLE  ...   HIRE_DATE    JOB_DATE
0            0  ASSISTAN...    ...  2006-06-12  2012-10-13
73          73  ADMINIST...    ...  2011-12-19  2013-11-23
96          96  ASSISTAN...    ...  2013-06-10  2013-06-10
117        117  SENIOR A...    ...  1998-03-20  2012-07-21
146        146  SENIOR S...    ...  2014-03-17  2014-03-17
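The query mechanics can be checked on a toy employee table (the departments and salaries below are assumptions, not the real dataset):

```python
import pandas as pd

# Toy employee table, an assumption for illustration.
df = pd.DataFrame({
    "DEPARTMENT": ["HPD", "HFD", "PWE"],
    "GENDER": ["Female", "Female", "Male"],
    "BASE_SALARY": [90000.0, 85000.0, 70000.0],
})
depts = ["HPD", "HFD"]

# `@` references local Python variables inside the query string,
# and chained comparisons read like the original SQL BETWEEN.
out = df.query(
    "DEPARTMENT in @depts"
    " and GENDER == 'Female'"
    " and 80000 <= BASE_SALARY <= 120000"
)
```

The result keeps only the first two rows, the same selection the boolean-mask version in recipe 7.6 produces.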

7.8 Preserving Series size with the .where method

Read the data with the movie title as the index, and drop the missing values from the actor_1_facebook_likes column:

>>> movie = pd.read_csv(
...     "data/movie.csv", index_col="movie_title"
... )
>>> fb_likes = movie["actor_1_facebook_likes"].dropna()
>>> fb_likes.head()
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      40000.0
Spectre                                       11000.0
The Dark Knight Rises                         27000.0
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64

Inspect it with the .describe method:

>>> fb_likes.describe()
count      4909.000000
mean       6494.488491
std       15106.986884
min           0.000000
25%         607.000000
50%         982.000000
75%       11000.000000
max      640000.000000
Name: actor_1_facebook_likes, dtype: float64

Plot a histogram to inspect the distribution:

>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots(figsize=(10, 8))
>>> fb_likes.hist(ax=ax)
>>> fig.savefig(
...     "c7-hist.png", dpi=300
... ) 

This plot hides the shape of the distribution; most values are below 20,000:

>>> criteria_high = fb_likes < 20_000
>>> criteria_high.mean().round(2)
0.91

By default, .where replaces the values that fail the condition with NaN:

>>> fb_likes.where(criteria_high).head()
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End          NaN
Spectre                                       11000.0
The Dark Knight Rises                             NaN
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64

The other parameter of .where controls the replacement value:

>>> fb_likes.where(criteria_high, other=20000).head()
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      20000.0
Spectre                                       11000.0
The Dark Knight Rises                         20000.0
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64

Chain another .where condition to set a floor as well:

>>> criteria_low = fb_likes > 300
>>> fb_likes_cap = fb_likes.where(
...     criteria_high, other=20_000
... ).where(criteria_low, 300)
>>> fb_likes_cap.head()
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      20000.0
Spectre                                       11000.0
The Dark Knight Rises                         20000.0
Star Wars: Episode VII - The Force Awakens      300.0
Name: actor_1_facebook_likes, dtype: float64

The Series before and after have the same size:

>>> len(fb_likes), len(fb_likes_cap)
(4909, 4909)
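The size-preserving behavior, and the contrast with boolean filtering, can be sketched on a toy Series (the values below are hypothetical):

```python
import pandas as pd

# Hypothetical likes column, an assumption for illustration.
likes = pd.Series([100, 25000, 500])
mask = likes < 20000

filtered = likes[mask]             # boolean filtering drops rows
kept = likes.where(mask)           # .where keeps every row; failing
                                   # rows become NaN
capped = likes.where(mask, 20000)  # or become the `other` value
```

`filtered` has two rows, while `kept` and `capped` keep all three, which is what makes `.where` safe for plotting and for aligning with the original index.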

Plot the histogram again:

>>> fig, ax = plt.subplots(figsize=(10, 8))
>>> fb_likes_cap.hist(ax=ax)
>>> fig.savefig(
...     "c7-hist2.png", dpi=300
... )

There's more…

Pandas has a .clip method for capping values at a lower and upper bound (the older .clip_lower and .clip_upper variants were removed in pandas 1.0):

>>> fb_likes_cap2 = fb_likes.clip(lower=300, upper=20000)
>>> fb_likes_cap2.equals(fb_likes_cap)
True
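A self-contained check of the clip behavior (the values below are hypothetical):

```python
import pandas as pd

# Hypothetical likes column, an assumption for illustration.
likes = pd.Series([100, 25000, 500, 450000])

# .clip caps both tails in a single call, equivalent to the two
# chained .where calls above: values below `lower` become `lower`,
# values above `upper` become `upper`.
capped = likes.clip(lower=300, upper=20000)
```

`.clip` is usually preferable to chained `.where` calls for capping, since it states the bounds directly.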

7.9 Masking DataFrame rows

Read the data and create the conditions:

>>> movie = pd.read_csv(
...     "data/movie.csv", index_col="movie_title"
... )
>>> c1 = movie["title_year"] >= 2010
>>> c2 = movie["title_year"].isna()
>>> criteria = c1 | c2

The .mask method blanks out the rows that satisfy the condition:

>>> movie.mask(criteria).head()
                                            color  ...
movie_title                                        ...
Avatar                                      Color  ...
Pirates of the Caribbean: At World's End    Color  ...
Spectre                                       NaN  ...
The Dark Knight Rises                         NaN  ...
Star Wars: Episode VII - The Force Awakens    NaN  ...

Note that rows three through five above are now all missing. Drop the rows that are entirely missing:

>>> movie_mask = movie.mask(criteria).dropna(how="all")
>>> movie_mask.head()
                                          color  ...
movie_title                                      ...
Avatar                                    Color  ...
Pirates of the Caribbean: At World's End  Color  ...
Spider-Man 3                              Color  ...
Harry Potter and the Half-Blood Prince    Color  ...
Superman Returns                          Color  ...

The .equals method shows that the two results are not identical:

>>> movie_boolean = movie[movie["title_year"] < 2010]
>>> movie_mask.equals(movie_boolean)
False

But their shapes are the same:

>>> movie_mask.shape == movie_boolean.shape
True

Check the data types of the two results:

>>> movie_mask.dtypes == movie_boolean.dtypes
color                       True
director_name               True
num_critic_for_reviews      True
duration                    True
director_facebook_likes     True
                           ...
title_year                  True
actor_2_facebook_likes      True
imdb_score                  True
aspect_ratio                True
movie_facebook_likes       False
Length: 27, dtype: bool

Pandas provides an assert_frame_equal function that can also compare DataFrames:

>>> from pandas.testing import assert_frame_equal
>>> assert_frame_equal(
...     movie_boolean, movie_mask, check_dtype=False
... )
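The dtype-tolerance of assert_frame_equal can be demonstrated on a toy pair of frames (an assumption for illustration):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

a = pd.DataFrame({"x": [1, 2]})      # int64
b = pd.DataFrame({"x": [1.0, 2.0]})  # float64, same values

# .equals requires matching dtypes, so it reports False here;
# assert_frame_equal with check_dtype=False ignores the dtype
# difference and raises AssertionError only on a real mismatch.
same_strict = a.equals(b)
assert_frame_equal(a, b, check_dtype=False)  # passes silently
```

This is exactly the situation above: masking converted some integer columns to float, so `.equals` fails while `assert_frame_equal(check_dtype=False)` passes.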

There's more…

Compare the speed of the two approaches:

>>> %timeit movie.mask(criteria).dropna(how='all')
11.2 ms ± 144 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit movie[movie['title_year'] < 2010]
1.07 ms ± 34.9 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

7.10 Selecting with booleans, integer location, and labels

Read the data and create the conditions:

>>> movie = pd.read_csv(
...     "data/movie.csv", index_col="movie_title"
... )
>>> c1 = movie["content_rating"] == "G"
>>> c2 = movie["imdb_score"] < 4
>>> criteria = c1 & c2

Filter the rows with .loc:

>>> movie_loc = movie.loc[criteria]
>>> movie_loc.head()
                                color  ... movie/likes
movie_title                            ...
The True Story of Puss'N Boots  Color  ...          90
Doogal                          Color  ...         346
Thomas and the Magic Railroad   Color  ...         663
Barney's Great Adventure        Color  ...         436
Justin Bieber: Never Say Never  Color  ...       62000

The two approaches are equivalent:

>>> movie_loc.equals(movie[criteria])
True

.iloc requires the condition to be converted to a NumPy array:

>>> movie_iloc = movie.iloc[criteria.to_numpy()]
>>> movie_iloc.equals(movie_loc)
True

Select the columns whose data type is int64:

>>> criteria_col = movie.dtypes == np.int64
>>> criteria_col.head()
color                      False
director_name              False
num_critic_for_reviews     False
duration                   False
director_facebook_likes    False
dtype: bool
>>> movie.loc[:, criteria_col].head()
              num_voted_users  cast_total_facebook_likes  movie_facebook_likes
movie_title
Avatar             886204             4834                      33000
Pirates o...       471220            48350                          0
Spectre            275868            11700                      85000
The Dark ...      1144337           106759                     164000
Star Wars...            8              143                          0

Because criteria_col is a Series, it must be converted to a NumPy array before it can be used with .iloc:

>>> movie.iloc[:, criteria_col.to_numpy()].head()
              num_voted_users  cast_total_facebook_likes  movie_facebook_likes
movie_title
Avatar             886204             4834                      33000
Pirates o...       471220            48350                          0
Spectre            275868            11700                      85000
The Dark ...      1144337           106759                     164000
Star Wars...            8              143                          0

Combine the row condition with a column list in .loc:

>>> cols = [
...     "content_rating",
...     "imdb_score",
...     "title_year",
...     "gross",
... ]
>>> movie.loc[criteria, cols].sort_values("imdb_score")
             content_rating  imdb_score  title_year       gross
movie_title
Justin Bi...            G           1.6      2011.0  73000942.0
Sunday Sc...            G           2.5      2008.0         NaN
Doogal                  G           2.8      2006.0   7382993.0
Barney's ...            G           2.8      1998.0  11144518.0
The True ...            G           2.9      2009.0         NaN
Thomas an...            G           3.6      2000.0  15911333.0

.iloc requires the integer positions of the columns:

>>> col_index = [movie.columns.get_loc(col) for col in cols]
>>> col_index
[20, 24, 22, 8]
>>> movie.iloc[criteria.to_numpy(), col_index].sort_values(
...     "imdb_score"
... )
             content_rating  imdb_score  title_year       gross
movie_title
Justin Bi...            G           1.6      2011.0  73000942.0
Sunday Sc...            G           2.5      2008.0         NaN
Doogal                  G           2.8      2006.0   7382993.0
Barney's ...            G           2.8      1998.0  11144518.0
The True ...            G           2.9      2009.0         NaN
Thomas an...            G           3.6      2000.0  15911333.0
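The label-to-position translation used above can be checked on a toy frame (an assumption for illustration):

```python
import pandas as pd

# Toy frame, an assumption for illustration.
df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

# Index.get_loc maps a column label to its integer position,
# which is the form .iloc expects.
positions = [df.columns.get_loc(col) for col in ["c", "a"]]

# df.iloc[:, positions] then selects those columns by position.
subset = df.iloc[:, positions]
```

In practice `.loc` with labels is usually clearer; `get_loc` is mainly useful when an API forces positional indexing.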

(This recipe overlaps considerably with earlier material.)


