利用 Python 分析网站访问日志文件

场景介绍

有一个前台的 Web 应用，框架会记录访问日志，并定期归档，存储在特定的目录，目录格式如下：
/onlinelogs/应用名/环境名/年/月/日/小时/，例如 /onlinelogs/<app_name>/Prod/2018/08/07/01。
在该目录下：

访问日志文件可能有多个，文件名以 access-log 开头
已压缩为 .gz 文件，并且只读，例如：

访问日志文件
访问日志文件中，有部分行是记录 HTTP 请求的，格式如下所示：
- 从中可以看出，请求的目标资源，响应码，客户端信息

222.67.225.134 - - [04/Aug/2018:01:16:44 +0000] "GET /?ref=as_cn_ags_resource_tb&ck-tparam-anchor=123067 HTTP/1.1" 200 7798 "https://gs.amazon.cn/resources.html/ref=as_cn_ags_hnav1_re_class" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.2 Safari/605.1.15"

222.67.225.134 - - [04/Aug/2018:01:16:54 +0000] "GET /tndetails?tnid=3be3f34dee8a4bf08baa072a478fc882 HTTP/1.1" 200 9152 "https://gs.amazon.cn/sba/?ref=as_cn_ags_resource_tb&ck-tparam-anchor=123067&tnm=Offline" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.2 Safari/605.1.15"

222.67.225.134 - - [04/Aug/2018:01:17:37 +0000] "GET /paymentinfo?oid=10763561&rtxref=164f9f8e2b5e4d7790d02d1220eae435 HTTP/1.1" 200 7138 "https://gs.amazon.cn/sba/paymentinfo?oid=10763561&rtxref=164f9f8e2b5e4d7790d02d1220eae435" "Mozilla/5.0 (Linux; Android 8.1.0; DE106 Build/OPM1.171019.026; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/62.0.3202.84 Mobile Safari/537.36 AliApp(DingTalk/4.5.3) com.alibaba.android.rimet/0 Channel/10006872 language/zh-CN"

222.67.225.134 - - [04/Aug/2018:01:17:39 +0000] "GET /paymentinfo?oid=10763561&rtxref=164f9f8e2b5e4d7790d02d1220eae435 HTTP/1.1" 200 7138 "https://gs.amazon.cn/sba/paymentinfo?oid=10763561&rtxref=164f9f8e2b5e4d7790d02d1220eae435" "Mozilla/5.0 (Linux; Android 8.1.0; DE106 Build/OPM1.171019.026; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/62.0.3202.84 Mobile Safari/537.36 AliApp(DingTalk/4.5.3) com.alibaba.android.rimet/0 Channel/10006872 language/zh-CN"

222.67.225.134 - - [04/Aug/2018:01:17:40 +0000] "GET /paymentinfo?oid=10763561&rtxref=164f9f8e2b5e4d7790d02d1220eae435 HTTP/1.1" 200 7138 "https://gs.amazon.cn/sba/paymentinfo?oid=10763561&rtxref=164f9f8e2b5e4d7790d02d1220eae435" "Mozilla/5.0 (Linux; Android 8.1.0; DE106 Build/OPM1.171019.026; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/62.0.3202.84 Mobile Safari/537.36 AliApp(DingTalk/4.5.3) com.alibaba.android.rimet/0 Channel/10006872 language/zh-CN"

140.243.121.197 - - [04/Aug/2018:01:17:41 +0000] "GET /?ref=as_cn_ags_resource_tb HTTP/1.1" 302 - "https://gs.amazon.cn/resources.html/ref=as_cn_ags_hnav1_re_class" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"

分析目标

记录不同时间段的访问量
统计 PC 和 Mobile 端访问量
统计不同页面的访问量

基本思想：

由于原日志文件已压缩，并且只读，所以需要创建一个临时目录 /tmp/logs/unziped_logs 来解压缩日志文件
利用正则表达式 ^access\-log.*gz+$ 过滤日志文件
利用正则表达式 ^.*GET (.*) HTTP.*$ 过滤日志文件中的 HTTP 请求
通过日志行是否有 Mobile 来判断客户端

代码如下：（部分内容屏蔽）

#!/usr/bin/python3

import os
import os.path
import re
import shutil
import gzip
from collections import defaultdict

# define the ziped and unziped log file directorys
source_logs_dir = '/onlinelogs/<app_name>/Prod'
unziped_logs_dir = '/tmp/logs/unziped_logs'

# clear the unziped log file directory if exists
if os.path.exists(unziped_logs_dir):
    shutil.rmtree(unziped_logs_dir)

# create the unziped log file directory
os.mkdir(unziped_logs_dir)

# regex used to match target log file name
log_file_name_regex = re.compile(r'^access\-log.*gz+$')

# regex used to match HTTP request
http_request_regex = re.compile(r'^.*GET.*gs.amazon.cn')

# regex used to match request page
request_page_regex = re.compile(r'^.*GET (.*) HTTP.*$')
# request_page_regex = re.compile(r'^.*GET (.*)\?.*$')

# a dictionary to store the HTTP request count of each day
day_count = defaultdict(int)

# a dictionary to store the count of each device (PC or Mobile)
device_count = defaultdict(int)
device_count['PC'] = 0
device_count['Mobile'] = 0

# a dictionary to store the count of each request page
request_page_count = defaultdict(int)

for root, dirs, files in os.walk(source_logs_dir):
    for name in files:
        # find the target log files
        if log_file_name_regex.search(name):
            # parst the day
            day = root[-13:-3]

            # copy the target log files
            shutil.copyfile(os.path.join(root, name), os.path.join(unziped_logs_dir, name))

            # unzip the log files
            unziped_log_file = gzip.open(os.path.join(unziped_logs_dir, name), 'rb')

            http_request_count = 0
            pc_count = 0
            mobile_count = 0
            for line in unziped_log_file:
                if(http_request_regex.search(line)):
                    # parse the request page
                    regex_obj = request_page_regex.search(line)
                    request_page = regex_obj.group(1)
                    # remove params of the request page
                    if('?' in request_page):
                        request_page = request_page[:request_page.find('?')]

                    http_request_count = http_request_count + 1

                    if('Mobile' in line):
                        mobile_count = mobile_count + 1
                    else:
                        pc_count = pc_count + 1

                    # update the count of each request page
                    if(request_page in request_page_count):
                                        request_page_count[request_page] = request_page_count[request_page] + 1
                                else:
                                        request_page_count[request_page] = 1

            # update the HTTP request count of each day
            if(day in day_count):
                day_count[day] = day_count[day] + http_request_count
            else:
                day_count[day] = http_request_count

            # update the count of each device (PC or Mobile)
            device_count['PC'] = device_count['PC'] + pc_count
            device_count['Mobile'] = device_count['Mobile'] + mobile_count

            # remvoe the original zip log files
            os.remove(os.path.join(unziped_logs_dir, name))

# print the HTTP request count of each day
total = 0
print 'HTTP request count of each day'
for day, count in sorted(day_count.items()):
    print day, ':', count
    total = total + count
print 'Total = ', total

print '###############################'

total = 0
print 'count of each device (PC or Mobile)'
# print the count of each device (PC or Mobile)
for device, count in sorted(device_count.items()):
        print device, ':', count
    total = total + count
print 'Total = ', total

print '###############################'

total = 0
print 'count of each request page'
# print the count of each request page
for request_page, count in sorted(request_page_count.items()):
        print request_page, ':', count
    total = total + count
print 'Total = ', total

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 219,539评论 6赞 508
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 93,594评论 3赞 396
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 165,871评论 0赞 356
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 58,963评论 1赞 295
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 67,984评论 6赞 393
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 51,763评论 1赞 307
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 40,468评论 3赞 420
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 39,357评论 0赞 276
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 45,850评论 1赞 317
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 38,002评论 3赞 338
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 40,144评论 1赞 351
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 35,823评论 5赞 346
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 41,483评论 3赞 331
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 32,026评论 0赞 22
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 33,150评论 1赞 272
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 48,415评论 3赞 373
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 45,092评论 2赞 355

利用 Python 分析网站访问日志文件

场景介绍

分析目标

推荐阅读更多精彩内容