A Preliminary Analysis of Mobike Bike-Sharing Data

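This post is a preliminary analysis of the ride data provided by Mobike.
The training set is a partial sample of rides from one area of Beijing over a period of time; the test set covers the same area over a later period.
The labeled data contains 3 million trip records, covering more than 300,000 users and 400,000 Mobike bikes. Each record includes the start time and location of the ride, the bike ID, the bike type, the user ID, and similar fields.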

First, import the data analysis packages.

import pandas as pd 
import seaborn as sns
import geohash  # provided by the python-geohash package
import matplotlib.pyplot as plt
from math import radians, cos, sin, asin, sqrt
%matplotlib inline

train = pd.read_csv("train.csv",sep = ',',parse_dates=['starttime'])
test = pd.read_csv("test.csv",sep = ',',parse_dates=['starttime'])

Take a look at the data

train.head()

print(train.shape)
print(test.shape)

train = train.sample(frac=0.3)  # work on a 30% random sample of the rows

GeoHash analysis

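GeoHash converts a two-dimensional latitude/longitude pair into a string. The figure below, for example, shows the GeoHash strings of nine regions of Beijing (WX4ER, WX4G2, WX4G3, and so on); each string stands for one rectangular region. In other words, every point (latitude/longitude coordinate) inside that rectangle shares the same GeoHash string. This protects privacy, because it reveals only an approximate area rather than an exact point, and it makes caching easier. Suppose users in the top-left region keep sending their location to request restaurant data: since their GeoHash string is always WX4ER, WX4ER can serve as the cache key and the restaurant information for that region as the value. Without GeoHash, the coordinates sent by users in the same region all differ, which makes caching hard.
The longer the string, the more precise the region. A 5-character code covers a cell of roughly 4.9 km × 4.9 km, while a 6-character code covers a finer cell of roughly 1.2 km × 0.6 km (sizes measured at the equator; cells get narrower east to west at higher latitudes).
Similar strings indicate nearby locations, so prefix matching on the strings can be used to look up nearby POI information. Two points on opposite sides of a cell boundary can still have different prefixes, which is why the code below also checks the neighboring cells.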
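To make the prefix property concrete, here is a minimal pure-Python sketch of GeoHash encoding. The function name `geohash_encode` and the sample coordinates are illustrative assumptions; they are not part of the Mobike data or of the `geohash` package used in this post. The sketch follows the standard scheme: bits alternate between longitude and latitude, starting with longitude, and every five bits become one base-32 character.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # GeoHash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=7):
    """Encode a latitude/longitude pair as a GeoHash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, value = 0, 0
    use_lon = True  # the first bit refines longitude, then the two axes alternate
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = value * 2 + 1
                lon_lo = mid
            else:
                value = value * 2
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = value * 2 + 1
                lat_lo = mid
            else:
                value = value * 2
                lat_hi = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # five bits make one base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


# Hypothetical point in central Beijing, for illustration only.
code7 = geohash_encode(39.9, 116.4, precision=7)
code5 = geohash_encode(39.9, 116.4, precision=5)
print(code7, code5, code7.startswith(code5))
```

Truncating a longer code gives the code of the enclosing coarser cell, which is why slicing a 7-character code with `s[:6]` or `s[:5]` is enough to move to a coarser grid.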

[Figure: nine GeoHash cells covering part of Beijing]

GeoHash code length and error

[Figure: table of error by GeoHash code length]

Decode the geohash information

def processData(df):
    # time features (requires starttime to be parsed as datetime)
    df['weekday'] = df['starttime'].dt.weekday  # Monday=0 ... Sunday=6
    df['hour'] = df['starttime'].dt.hour
    df['day'] = df['starttime'].dt.strftime('%Y-%m-%d')
    print('time processed successfully')
    
    # Geohash: decode returns a (latitude, longitude) tuple,
    # neighbors returns the list of adjacent cells
    df['start_lat_lng']=df['geohashed_start_loc'].apply(lambda s:geohash.decode(s))
    df['end_lat_lng']=df['geohashed_end_loc'].apply(lambda s:geohash.decode(s))
    df['start_neighbors']=df['geohashed_start_loc'].apply(lambda s:geohash.neighbors(s))

    # the same at precision 6 (first six characters)
    df['geohashed_start_loc_6'] = df['geohashed_start_loc'].apply(lambda s : s[:6])
    df['geohashed_end_loc_6'] = df['geohashed_end_loc'].apply(lambda s : s[:6])
    df['start_neighbors_6'] =  df['geohashed_start_loc_6'].apply(lambda s : geohash.neighbors(s))

    # the same at precision 5 (first five characters)
    df['geohashed_start_loc_5'] = df['geohashed_start_loc'].apply(lambda s : s[:5])
    df['geohashed_end_loc_5'] = df['geohashed_end_loc'].apply(lambda s : s[:5])
    df['start_neighbors_5'] =  df['geohashed_start_loc_5'].apply(lambda s : geohash.neighbors(s))

    print('geohash processed successfully')
    
    # check whether the destination is the start cell itself or one of its neighbors
    def inGeohash(start_geohash,end_geohash,names):
        # do not append to `names`: it is the list stored in the DataFrame,
        # and appending would silently add the start cell to the neighbors column
        if end_geohash == start_geohash or end_geohash in names:
            return 1
        else:
            return 0
    df['inside']=df.apply(lambda s:inGeohash(s['geohashed_start_loc'],s['geohashed_end_loc'],s['start_neighbors']),axis=1)
    df['inside_6']=df.apply(lambda s:inGeohash(s['geohashed_start_loc_6'],s['geohashed_end_loc_6'],s['start_neighbors_6']),axis=1)
    df['inside_5']=df.apply(lambda s:inGeohash(s['geohashed_start_loc_5'],s['geohashed_end_loc_5'],s['start_neighbors_5']),axis=1)
    print('geo_inside processed successfully')
    
    # great-circle distance between two lat/lng points (start -> end), in meters
    def haversine(lon1,lat1,lon2,lat2):
        lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
        # haversine formula
        dislon = lon2-lon1
        dislat = lat2-lat1
        a = sin(dislat/2)**2+cos(lat1)*cos(lat2)*sin(dislon/2)**2
        c = 2*asin(sqrt(a))
        r = 6371  # mean Earth radius in kilometers
        return c*r*1000  # convert kilometers to meters
    # decoded tuples are (lat, lng): index 1 is the longitude, index 0 the latitude
    df['start_end_distance'] = df.apply(lambda s: haversine(s['start_lat_lng'][1],s['start_lat_lng'][0],
                                                            s['end_lat_lng'][1],s['end_lat_lng'][0]),axis=1)
    print('distance processed successfully')
    return df

train =processData(train)
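The row-wise `df.apply` call to `haversine` inside `processData` can be slow on a table with hundreds of thousands of rows. As a hedged alternative, the sketch below computes the same formula on whole NumPy arrays at once. The function name `haversine_m`, the constant `EARTH_RADIUS_M`, and the sample coordinates are assumptions made for illustration; the post itself uses the scalar version above.

```python
import numpy as np

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters (same 6371 km as above)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters; accepts scalars or NumPy arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


# Hypothetical coordinates: one trip that ends where it starts,
# and one that ends 0.01 degrees of latitude further north.
start_lat = np.array([39.90, 39.90])
start_lon = np.array([116.40, 116.40])
end_lat = np.array([39.90, 39.91])
end_lon = np.array([116.40, 116.40])

dist = haversine_m(start_lat, start_lon, end_lat, end_lon)
print(dist)
```

If the decoded latitudes and longitudes are stored in separate numeric columns, a single call like this could replace the per-row `apply`.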

Look at the processed data

[Figure: the processed DataFrame]

Analyze the data by time period

def timeanalysis(df):
    # days
    print("Days covered by the dataset:")
    print(df['day'].unique())
    print("*"*60)

    # rides per weekday, Monday through Sunday
    g1 = df.groupby('weekday')
    print("Number of rides, Monday through Sunday")
    print(pd.DataFrame(g1['orderid'].count()))
    print("*"*60)
    
    # rides by hour, weekdays versus weekends (weekday 5 = Saturday, 6 = Sunday)
    # note: this adds the isweekend column to the caller's DataFrame; it is reused later
    df['isweekend'] = (df['weekday'] >= 5).astype(int)
    g2 = df.groupby(['isweekend','hour'])
    
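    # count the weekend days and the weekdays in the data
    g3 = df.groupby(['day','weekday'])
    w = 0  # number of weekend days
    c = 0  # number of weekdays
    for i,j in list(g3.groups.keys()):
        if j>=5:
            w +=1
        else:
            c +=1
    # average rides per hour per day: divide each count by the number of days of that type
    temp = pd.DataFrame(g2['orderid'].count()).reset_index()
    temp['orderid'] = temp['orderid'].astype(float)  # counts are integers; averages are not
    temp.loc[temp['isweekend']==0,'orderid'] = temp['orderid']/c
    temp.loc[temp['isweekend']==1,'orderid'] = temp['orderid']/w
    print("Average hourly rides per day, weekends versus weekdays")
    fig = plt.figure(figsize=(12,6))
    # recent seaborn versions require x, y and hue as keyword arguments
    sns.barplot(x='hour', y='orderid', hue='isweekend', data=temp)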

timeanalysis(train)


Average hourly rides per day, weekends versus weekdays

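The per-day averaging in `timeanalysis` can also be written without the explicit loop over days. The sketch below is one possible way to do it with `groupby` and `nunique`; the helper name `avg_hourly_rides`, the column names `rides` and `rides_per_day`, and the toy timestamps are all assumptions for illustration and do not come from the Mobike data.

```python
import pandas as pd


def avg_hourly_rides(df):
    """Average rides per hour per calendar day, split by weekend flag."""
    df = df.copy()  # leave the caller's DataFrame untouched
    df["day"] = df["starttime"].dt.normalize()
    df["hour"] = df["starttime"].dt.hour
    df["isweekend"] = (df["starttime"].dt.weekday >= 5).astype(int)
    # total rides per (isweekend, hour)
    counts = df.groupby(["isweekend", "hour"]).size().reset_index(name="rides")
    # number of distinct weekdays and weekend days present in the data
    n_days = df.groupby("isweekend")["day"].nunique()
    counts["rides_per_day"] = counts["rides"] / counts["isweekend"].map(n_days)
    return counts


# Hypothetical toy data: a Monday, a Tuesday and a Saturday.
toy = pd.DataFrame({"starttime": pd.to_datetime([
    "2017-05-15 08:10", "2017-05-15 08:40", "2017-05-16 08:05",
    "2017-05-20 08:30", "2017-05-20 14:00",
])})
result = avg_hourly_rides(toy)
print(result)
```

Dividing by the number of distinct days of each type keeps the weekday and weekend bars comparable even though the data contains more weekdays than weekend days.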

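Visual analysis of the data

# Descriptive statistics of trip distance
train['start_end_distance'].describe()

# distplot is deprecated in recent seaborn versions; histplot with a KDE curve is the closest replacement
sns.histplot(train['start_end_distance'], kde=True, stat='density')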

Remove the influence of extreme values

start_end_distance = train['start_end_distance']
# keep only trips shorter than 5,000 m
start_end_distance = start_end_distance.loc[start_end_distance<5000]
sns.histplot(start_end_distance, kde=True, stat='density')

# Does ride distance differ by hour of the day?
hour_group = train.groupby('hour')
hour_distance = hour_group['start_end_distance'].mean().reset_index()
sns.barplot(x='hour',y='start_end_distance',data=hour_distance)

The hour of the day has little effect on ride distance.

# Number of trips in each hour
hour_id_num = hour_group['orderid'].count().reset_index()
sns.barplot(x='hour',y='orderid',data=hour_id_num)

The morning and evening rush hours clearly have the most rides.

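# isweekend was added to train by timeanalysis above
isw_hour_group = train.groupby(['isweekend','hour'])
isw_hour_id_num = isw_hour_group['orderid'].count().reset_index()
fig = plt.figure(figsize=(10,6))
sns.barplot(x='hour',y='orderid',hue='isweekend',data=isw_hour_id_num)

plt.title("Total hourly rides, weekends versus weekdays")

Weekdays show a morning and an evening peak, while on weekends orders are spread fairly evenly across the whole day.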

Analysis of start locations and destinations

How many users and bikes depart from or arrive at each location per day

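def analysis_1(data,target):
    # per day and per location: number of orders, distinct users, distinct bikes
    g1 = data.groupby(['day',target])
    group_data = g1.agg({'orderid':'count','userid':'nunique','bikeid':'nunique'}).reset_index()
    for each in ['orderid','userid','bikeid']:
        # histplot replaces the deprecated distplot
        sns.histplot(group_data[each], kde=True, stat='density')
        plt.show()
    return group_data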

group_data = analysis_1(train,'geohashed_start_loc')


group_data_6 = analysis_1(train,'geohashed_start_loc_6')


Analysis of start-destination pairs

start_end = train.groupby(['day','geohashed_start_loc','geohashed_end_loc'])
# orders, users and bikes for each start -> parking point pair, plus the mean distance
start_end.agg({'orderid':'count','userid':'nunique','bikeid':'nunique',
               'start_end_distance':'mean'}).reset_index().sort_values(by='orderid',ascending=False)

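Number of trips whose start point and parking point fall in different cells:

# trips whose start and end cells differ at geohash precision 5
train.loc[train['geohashed_start_loc_5']!=train['geohashed_end_loc_5']].shape[0]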

225562

# trips whose start and end cells differ at geohash precision 6
train.loc[train['geohashed_start_loc_6']!=train['geohashed_end_loc_6']].shape[0]

772933
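Raw counts are easier to compare across grid sizes as shares of all trips. The sketch below wraps the prefix comparison in a small helper; the name `cross_cell_share` and the toy geohash strings are assumptions for illustration only, and the printed shares describe the toy data, not the Mobike sample.

```python
import pandas as pd


def cross_cell_share(df, precision,
                     start_col="geohashed_start_loc",
                     end_col="geohashed_end_loc"):
    """Share of trips whose start and end fall in different cells at `precision`."""
    start = df[start_col].str[:precision]
    end = df[end_col].str[:precision]
    return float((start != end).mean())


# Hypothetical toy trips, all starting in the same 7-character cell.
toy = pd.DataFrame({
    "geohashed_start_loc": ["wx4g0ec", "wx4g0ec", "wx4g0ec", "wx4g0ec"],
    "geohashed_end_loc":   ["wx4g0ec", "wx4g0ed", "wx4g0ff", "wx4ern0"],
})
shares = {p: cross_cell_share(toy, p) for p in (5, 6, 7)}
print(shares)
```

The share can only stay equal or shrink as the precision drops, because two codes that agree on their first six characters also agree on their first five.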

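That is as far as this visual analysis of the Mobike data goes for now. The main steps were:

Decoded the geohash regions and computed the distance between the start and end coordinates
Visualized ride data by time period: weekdays and weekends show different patterns, while the hour of the day has little effect on ride distance
Analyzed start locations to identify high-frequency locations
Analyzed start-destination pairs to identify high-frequency routes
Checked, under the g6 and g5 grids, how many rides leave their starting geohash cell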

Last edited: 2020.12.03 16:35:30
