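awk is a tool for analyzing and processing text files.
0x00 Usage
awk 'pattern {action}' [filenames]
Here `pattern` is typically a regular expression, enclosed in slashes, that selects the data we want to extract.
`action` is the operation performed when matching data is found.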
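0x01 Examples
awk works as follows: it reads one record at a time, where records are separated by the newline character `\n`, then splits the record into fields using the field separator (by default, runs of spaces or tabs). `$0` refers to the whole record, `$1` to the first field, and `$n` to the n-th field.
For example, run `last -n 5`: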
last -n 5
root pts/4 172.20.3.158 Mon Aug 1 11:20 still logged in
root pts/3 172.20.3.158 Mon Aug 1 10:58 still logged in
root pts/2 172.20.3.158 Mon Aug 1 10:57 still logged in
root pts/1 172.20.3.158 Mon Aug 1 10:57 still logged in
root pts/0 172.20.3.158 Mon Aug 1 10:57 still logged in
wtmp begins Mon Apr 25 17:46:29 2016
Splitting this output on the default separator and printing the first and second fields gives:
last -n 5 | awk '{print $1,$2}'
root pts/4
root pts/3
root pts/2
root pts/1
root pts/0
wtmp begins
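The field variables can also be tried on fixed input, so the result does not depend on who is logged in. A minimal sketch: the sample line is made up for illustration, and `$NF` (the last field) relies on the built-in `NF` variable.

```shell
# Sample record (hypothetical), fields separated by runs of spaces.
line='root   pts/4   172.20.3.158   Mon Aug 1'

# $1 = first field, $2 = second field, $NF = last field, NF = field count.
fields=$(printf '%s\n' "$line" | awk '{print $1, $2, $NF, NF}')
echo "$fields"
```

Runs of several spaces count as a single separator here, which is why the record has six fields rather than more.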
Next, we set the field separator explicitly, which is usually done with the `-F` option, and print the selected fields separated by a tab (`\t`).
cat /etc/passwd | awk -F ':' '{print $1"\t"$7}'
at /bin/bash
bin /bin/bash
daemon /bin/bash
ftp /bin/bash
ftpsecure /bin/false
games /bin/bash
gdm /bin/false
lp /bin/bash
mail /bin/false
man /bin/bash
messagebus /bin/false
news /bin/bash
nobody /bin/bash
nscd /sbin/nologin
ntp /bin/false
openslp /sbin/nologin
polkitd /sbin/nologin
postfix /bin/false
pulse /sbin/nologin
root /bin/zsh
rpc /sbin/nologin
rtkit /bin/false
scard /usr/sbin/nologin
sshd /bin/false
statd /sbin/nologin
usbmux /sbin/nologin
uucp /bin/bash
vnc /sbin/nologin
wwwrun /bin/false
edward /bin/zsh
ftp-edward /bin/bash
lighthttpd /bin/bash
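Next, we use BEGIN, a main (per-record) block, and END to control the program's execution flow. awk first runs the BEGIN block, then reads the input one record at a time (records are separated by `\n`), splits each record into fields, and runs the main block on it; after every record has been processed, it runs the END block.
Now let's modify the program above so that it first prints a `name,shell` header and finally prints the message `Action Finished`: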
cat /etc/passwd | awk -F ':' 'BEGIN {print "name,shell"} {print $1","$7} END {print "Action Finished"}'
name,shell
at,/bin/bash
bin,/bin/bash
daemon,/bin/bash
ftp,/bin/bash
ftpsecure,/bin/false
games,/bin/bash
gdm,/bin/false
lp,/bin/bash
mail,/bin/false
man,/bin/bash
messagebus,/bin/false
news,/bin/bash
nobody,/bin/bash
nscd,/sbin/nologin
ntp,/bin/false
openslp,/sbin/nologin
polkitd,/sbin/nologin
postfix,/bin/false
pulse,/sbin/nologin
root,/bin/zsh
rpc,/sbin/nologin
rtkit,/bin/false
scard,/usr/sbin/nologin
sshd,/bin/false
statd,/sbin/nologin
usbmux,/sbin/nologin
uucp,/bin/bash
vnc,/sbin/nologin
wwwrun,/bin/false
edward,/bin/zsh
ftp-edward,/bin/bash
lighthttpd,/bin/bash
Action Finished
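The same BEGIN / main / END flow can be checked against a small inline sample instead of the real /etc/passwd. A sketch with two made-up passwd-style records:

```shell
# Two hypothetical passwd-style records.
sample='alice:x:1000:100:Alice:/home/alice:/bin/bash
bob:x:1001:100:Bob:/home/bob:/bin/zsh'

report=$(printf '%s\n' "$sample" | awk -F ':' '
BEGIN { print "name,shell" }          # runs once, before any input is read
      { print $1 "," $7 }             # runs once per record
END   { print "Action Finished" }     # runs once, after the last record
')
echo "$report"
```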
So how do we get the shell of the root account from /etc/passwd?
awk -F ':' '/root/{print $7}' /etc/passwd
/bin/zsh
Here the text between the slashes `//` is the pattern: if the current line matches the regular expression `root`, the action is applied to that line.
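Note that `/root/` matches the string anywhere in the line, so an account whose comment or home directory contains `root` would match as well. A sketch of a stricter variant that compares only the first field, using made-up sample records:

```shell
# Hypothetical records: the second one mentions "root" only in its comment field.
sample='root:x:0:0:root:/root:/bin/zsh
backup:x:34:34:root backup job:/var/backups:/bin/false'

# Regex pattern: matches both lines.
loose=$(printf '%s\n' "$sample" | awk -F ':' '/root/ {print $7}')

# Field comparison: matches only the record whose login name is exactly root.
strict=$(printf '%s\n' "$sample" | awk -F ':' '$1 == "root" {print $7}')

echo "loose:  $loose"
echo "strict: $strict"
```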
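0x02 Built-in Variables
awk has many built-in variables that hold environment information; these variables can be changed.
ARGC number of command-line arguments
ARGV array of command-line arguments
ENVIRON associative array of the system environment variables
FILENAME name of the file awk is currently reading
FNR number of records read so far in the current file
FS input field separator, equivalent to the -F command-line option
NF number of fields in the current record
NR total number of records read so far
OFS output field separator
ORS output record separator
RS input record separator
Let's try them out: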
awk -F ':' 'BEGIN {print "ARGC:" ARGC " ARGV:" ARGV[0] "," ARGV[1] " Filename:" FILENAME " Total:" FNR} {print "currLine:" NR " currColumns:" NF " content:" $0}' /etc/passwd
ARGC:2 ARGV:awk,/etc/passwd Filename: Total:0
currLine:1 currColumns:7 content:at:x:25:25:Batch jobs daemon:/var/spool/atjobs:/bin/bash
currLine:2 currColumns:7 content:bin:x:1:1:bin:/bin:/bin/bash
currLine:3 currColumns:7 content:daemon:x:2:2:Daemon:/sbin:/bin/bash
currLine:4 currColumns:7 content:ftp:x:40:49:FTP account:/srv/ftp:/bin/bash
currLine:5 currColumns:7 content:ftpsecure:x:488:65534:Secure FTP User:/var/lib/empty:/bin/false
currLine:6 currColumns:7 content:games:x:12:100:Games account:/var/games:/bin/bash
currLine:7 currColumns:7 content:gdm:x:486:485:Gnome Display Manager daemon:/var/lib/gdm:/bin/false
currLine:8 currColumns:7 content:lp:x:4:7:Printing daemon:/var/spool/lpd:/bin/bash
currLine:9 currColumns:7 content:mail:x:8:12:Mailer daemon:/var/spool/clientmqueue:/bin/false
currLine:10 currColumns:7 content:man:x:13:62:Manual pages viewer:/var/cache/man:/bin/bash
currLine:11 currColumns:7 content:messagebus:x:499:499:User for D-Bus:/var/run/dbus:/bin/false
currLine:12 currColumns:7 content:news:x:9:13:News system:/etc/news:/bin/bash
currLine:13 currColumns:7 content:nobody:x:65534:65533:nobody:/var/lib/nobody:/bin/bash
currLine:14 currColumns:7 content:nscd:x:496:495:User for nscd:/run/nscd:/sbin/nologin
currLine:15 currColumns:7 content:ntp:x:74:492:NTP daemon:/var/lib/ntp:/bin/false
currLine:16 currColumns:7 content:openslp:x:494:2:openslp daemon:/var/lib/empty:/sbin/nologin
currLine:17 currColumns:7 content:polkitd:x:497:496:User for polkitd:/var/lib/polkit:/sbin/nologin
currLine:18 currColumns:7 content:postfix:x:51:51:Postfix Daemon:/var/spool/postfix:/bin/false
currLine:19 currColumns:7 content:pulse:x:490:489:PulseAudio daemon:/var/lib/pulseaudio:/sbin/nologin
currLine:20 currColumns:7 content:root:x:0:0:root:/root:/bin/zsh
currLine:21 currColumns:7 content:rpc:x:495:65534:user for rpcbind:/var/lib/empty:/sbin/nologin
currLine:22 currColumns:7 content:rtkit:x:491:490:RealtimeKit:/proc:/bin/false
currLine:23 currColumns:7 content:scard:x:487:487:Smart Card Reader:/var/run/pcscd:/usr/sbin/nologin
currLine:24 currColumns:7 content:sshd:x:498:498:SSH daemon:/var/lib/sshd:/bin/false
currLine:25 currColumns:7 content:statd:x:489:65534:NFS statd daemon:/var/lib/nfs:/sbin/nologin
currLine:26 currColumns:7 content:usbmux:x:493:65534:usbmuxd daemon:/var/lib/usbmuxd:/sbin/nologin
currLine:27 currColumns:7 content:uucp:x:10:14:Unix-to-Unix CoPy system:/etc/uucp:/bin/bash
currLine:28 currColumns:7 content:vnc:x:492:491:user for VNC:/var/lib/empty:/sbin/nologin
currLine:29 currColumns:7 content:wwwrun:x:30:8:WWW daemon apache:/var/lib/wwwrun:/bin/false
currLine:30 currColumns:7 content:edward:x:1000:100:Edward:/home/edward:/bin/zsh
currLine:31 currColumns:7 content:ftp-edward:x:1001:100::/home/ftp-edward:/bin/bash
currLine:32 currColumns:7 content:lighthttpd:x:1004:1000::/home/lighthttpd:/bin/bash
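As the output shows, no input file has been opened yet inside BEGIN, so FILENAME is empty and FNR is 0. (FS and RS, by contrast, are already set at that point: `-F ':'` takes effect before BEGIN runs, and RS defaults to a newline.) We therefore move the file name and record count into an END block: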
awk -F ':' 'BEGIN {
    print "ARGC:" ARGC
    print "ARGV:"
    for (i = 0; i < ARGC; i++)
        print ARGV[i]
} {
    print "currLine:" NR " currColumns:" NF " content:" $0
} END {
    print "Filename:" FILENAME " Total:" FNR
}' /etc/passwd
ARGC:2
ARGV:
awk
/etc/passwd
currLine:1 currColumns:7 content:at:x:25:25:Batch jobs daemon:/var/spool/atjobs:/bin/bash
currLine:2 currColumns:7 content:bin:x:1:1:bin:/bin:/bin/bash
currLine:3 currColumns:7 content:daemon:x:2:2:Daemon:/sbin:/bin/bash
currLine:4 currColumns:7 content:ftp:x:40:49:FTP account:/srv/ftp:/bin/bash
currLine:5 currColumns:7 content:ftpsecure:x:488:65534:Secure FTP User:/var/lib/empty:/bin/false
currLine:6 currColumns:7 content:games:x:12:100:Games account:/var/games:/bin/bash
currLine:7 currColumns:7 content:gdm:x:486:485:Gnome Display Manager daemon:/var/lib/gdm:/bin/false
currLine:8 currColumns:7 content:lp:x:4:7:Printing daemon:/var/spool/lpd:/bin/bash
currLine:9 currColumns:7 content:mail:x:8:12:Mailer daemon:/var/spool/clientmqueue:/bin/false
currLine:10 currColumns:7 content:man:x:13:62:Manual pages viewer:/var/cache/man:/bin/bash
currLine:11 currColumns:7 content:messagebus:x:499:499:User for D-Bus:/var/run/dbus:/bin/false
currLine:12 currColumns:7 content:news:x:9:13:News system:/etc/news:/bin/bash
currLine:13 currColumns:7 content:nobody:x:65534:65533:nobody:/var/lib/nobody:/bin/bash
currLine:14 currColumns:7 content:nscd:x:496:495:User for nscd:/run/nscd:/sbin/nologin
currLine:15 currColumns:7 content:ntp:x:74:492:NTP daemon:/var/lib/ntp:/bin/false
currLine:16 currColumns:7 content:openslp:x:494:2:openslp daemon:/var/lib/empty:/sbin/nologin
currLine:17 currColumns:7 content:polkitd:x:497:496:User for polkitd:/var/lib/polkit:/sbin/nologin
currLine:18 currColumns:7 content:postfix:x:51:51:Postfix Daemon:/var/spool/postfix:/bin/false
currLine:19 currColumns:7 content:pulse:x:490:489:PulseAudio daemon:/var/lib/pulseaudio:/sbin/nologin
currLine:20 currColumns:7 content:root:x:0:0:root:/root:/bin/zsh
currLine:21 currColumns:7 content:rpc:x:495:65534:user for rpcbind:/var/lib/empty:/sbin/nologin
currLine:22 currColumns:7 content:rtkit:x:491:490:RealtimeKit:/proc:/bin/false
currLine:23 currColumns:7 content:scard:x:487:487:Smart Card Reader:/var/run/pcscd:/usr/sbin/nologin
currLine:24 currColumns:7 content:sshd:x:498:498:SSH daemon:/var/lib/sshd:/bin/false
currLine:25 currColumns:7 content:statd:x:489:65534:NFS statd daemon:/var/lib/nfs:/sbin/nologin
currLine:26 currColumns:7 content:usbmux:x:493:65534:usbmuxd daemon:/var/lib/usbmuxd:/sbin/nologin
currLine:27 currColumns:7 content:uucp:x:10:14:Unix-to-Unix CoPy system:/etc/uucp:/bin/bash
currLine:28 currColumns:7 content:vnc:x:492:491:user for VNC:/var/lib/empty:/sbin/nologin
currLine:29 currColumns:7 content:wwwrun:x:30:8:WWW daemon apache:/var/lib/wwwrun:/bin/false
currLine:30 currColumns:7 content:edward:x:1000:100:Edward:/home/edward:/bin/zsh
currLine:31 currColumns:7 content:ftp-edward:x:1001:100::/home/ftp-edward:/bin/bash
currLine:32 currColumns:7 content:lighthttpd:x:1004:1000::/home/lighthttpd:/bin/bash
Filename:/etc/passwd Total:32
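`OFS` and `ORS` from the table above are not exercised by the examples so far. A small sketch on made-up input: `OFS` is inserted wherever `print` has a comma between its arguments, and `ORS` is written after every `print`.

```shell
# Hypothetical colon-separated records.
sample='alice:x:1000
bob:x:1001'

# FS splits the input; OFS joins the comma-separated print arguments.
joined=$(printf '%s\n' "$sample" | awk 'BEGIN {FS = ":"; OFS = " -> "} {print $1, $3}')
echo "$joined"

# ORS replaces the newline normally written after each print.
oneline=$(printf '%s\n' "$sample" | awk -F ':' 'BEGIN {ORS = ";"} {print $1}')
echo "$oneline"
```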
Likewise, we can use the `printf` function to format the output, which makes the code easier to read.
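For instance, a sketch that prints the login name and shell in aligned columns. The sample records are made up; `%-10s` left-justifies a string in a 10-character column.

```shell
# Hypothetical passwd-style records.
sample='root:x:0:0:root:/root:/bin/zsh
edward:x:1000:100:Edward:/home/edward:/bin/zsh'

# printf does not append a newline automatically, so the format ends with \n.
table=$(printf '%s\n' "$sample" | awk -F ':' '{printf("%-10s|%s\n", $1, $7)}')
echo "$table"
```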
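0x03 awk Programming
## Variables and Assignment
Besides the built-in variables, awk also lets you define your own variables.
Below we count the users in /etc/passwd. We explicitly initialize count to 1 to show assignment; an uninitialized variable starts at 0 (or the empty string). Note that starting from 1 makes the final total one higher than the number of records: 33 for the 32 lines shown.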
awk 'BEGIN {
    count = 1;
    print count;
}
{
    count++;
    print $0;
}
END {
    print "user count is " count;
}
' /etc/passwd
1
at:x:25:25:Batch jobs daemon:/var/spool/atjobs:/bin/bash
bin:x:1:1:bin:/bin:/bin/bash
daemon:x:2:2:Daemon:/sbin:/bin/bash
ftp:x:40:49:FTP account:/srv/ftp:/bin/bash
ftpsecure:x:488:65534:Secure FTP User:/var/lib/empty:/bin/false
games:x:12:100:Games account:/var/games:/bin/bash
gdm:x:486:485:Gnome Display Manager daemon:/var/lib/gdm:/bin/false
lp:x:4:7:Printing daemon:/var/spool/lpd:/bin/bash
mail:x:8:12:Mailer daemon:/var/spool/clientmqueue:/bin/false
man:x:13:62:Manual pages viewer:/var/cache/man:/bin/bash
messagebus:x:499:499:User for D-Bus:/var/run/dbus:/bin/false
news:x:9:13:News system:/etc/news:/bin/bash
nobody:x:65534:65533:nobody:/var/lib/nobody:/bin/bash
nscd:x:496:495:User for nscd:/run/nscd:/sbin/nologin
ntp:x:74:492:NTP daemon:/var/lib/ntp:/bin/false
openslp:x:494:2:openslp daemon:/var/lib/empty:/sbin/nologin
polkitd:x:497:496:User for polkitd:/var/lib/polkit:/sbin/nologin
postfix:x:51:51:Postfix Daemon:/var/spool/postfix:/bin/false
pulse:x:490:489:PulseAudio daemon:/var/lib/pulseaudio:/sbin/nologin
root:x:0:0:root:/root:/bin/zsh
rpc:x:495:65534:user for rpcbind:/var/lib/empty:/sbin/nologin
rtkit:x:491:490:RealtimeKit:/proc:/bin/false
scard:x:487:487:Smart Card Reader:/var/run/pcscd:/usr/sbin/nologin
sshd:x:498:498:SSH daemon:/var/lib/sshd:/bin/false
statd:x:489:65534:NFS statd daemon:/var/lib/nfs:/sbin/nologin
usbmux:x:493:65534:usbmuxd daemon:/var/lib/usbmuxd:/sbin/nologin
uucp:x:10:14:Unix-to-Unix CoPy system:/etc/uucp:/bin/bash
vnc:x:492:491:user for VNC:/var/lib/empty:/sbin/nologin
wwwrun:x:30:8:WWW daemon apache:/var/lib/wwwrun:/bin/false
edward:x:1000:100:Edward:/home/edward:/bin/zsh
ftp-edward:x:1001:100::/home/ftp-edward:/bin/bash
lighthttpd:x:1004:1000::/home/lighthttpd:/bin/bash
user count is 33
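Because `count` starts at 1 in the listing above, the reported total is one more than the number of records. A sketch on made-up three-line input showing that an uninitialized counter starts at 0, and that the built-in `NR` gives the same total inside `END`:

```shell
# Three hypothetical records.
sample='a:x:1
b:x:2
c:x:3'

# count is never initialized, so the first count++ takes it from 0 to 1.
result=$(printf '%s\n' "$sample" | awk '{count++} END {print count, NR}')
echo "$result"
```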
Next, we sum the number of bytes used by the files in a directory.
ls -l | awk 'BEGIN {
    size = 0;
    printf("[start]Initial Size is %s\n", size);
} {
    print $5;
    size = size + $5;
}
END {
    printf("[end]Final Size is %s\n", size);
}'
[start]Initial Size is 0
2713
0
472
244
58464
0
0
0
26
0
0
31729
0
49548
46
49548
47650
11
[end]Final Size is 240451
To display the total in megabytes instead:
ls -l | awk 'BEGIN {
    size = 0;
    printf("[start]Initial Size is %s\n", size);
} {
    print $5;
    size = size + $5;
}
END {
    printf("[end]Final Size is %sM\n", size/1024/1024);
}'
[start]Initial Size is 0
2713
0
472
244
58464
0
0
0
26
0
0
31729
0
49548
46
49548
47650
11
[end]Final Size is 0.229312M
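`ls -l` usually begins with a `total ...` line that has no fifth field, and `%s` prints the number with awk's default precision. A sketch that works on captured, made-up `ls -l`-style lines, skips records with fewer than five fields, and rounds to two decimals with `%.2f`:

```shell
# Hypothetical ls -l output; sizes are in field 5.
listing='total 1536
-rw-r--r-- 1 root root 1048576 Aug  1 10:57 a.bin
-rw-r--r-- 1 root root  524288 Aug  1 10:58 b.bin'

# NF >= 5 skips the "total" line; size is implicitly 0 before the first add.
# LC_ALL=C keeps the decimal point independent of the user locale.
mib=$(printf '%s\n' "$listing" | LC_ALL=C awk 'NF >= 5 {size += $5} END {printf("%.2fM\n", size/1024/1024)}')
echo "$mib"
```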
## Conditional Statements
if (expression) {
    statement;
    statement;
    ... ...
}

if (expression) {
    statement1;
} else {
    statement2;
}

if (expression) {
    statement1;
} else if (expression1) {
    statement2;
} else {
    statement3;
}
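As a concrete instance of these forms, a sketch that classifies made-up passwd-style records by UID (field 3). The thresholds 0 and 1000 are assumptions chosen for illustration only.

```shell
# Hypothetical records: name:password:uid:gid
sample='root:x:0:0
daemon:x:2:2
edward:x:1000:100'

kinds=$(printf '%s\n' "$sample" | awk -F ':' '{
    if ($3 == 0) {
        kind = "superuser"
    } else if ($3 < 1000) {
        kind = "system"
    } else {
        kind = "regular"
    }
    print $1, kind
}')
echo "$kinds"
```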
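## Loop Statements
Loops follow much the same C-like syntax: awk supports `while`, `do ... while`, and `for` loops, plus `for (key in array)` for iterating over array keys.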
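A sketch of the two most common loop forms, `for` and `while`, walking over the fields of one made-up record:

```shell
# One hypothetical record with three fields.
line='a b c'

# for loop: print each field prefixed with its index.
forward=$(printf '%s\n' "$line" | awk '{for (i = 1; i <= NF; i++) print i "=" $i}')
echo "$forward"

# while loop: print the fields in reverse order.
backward=$(printf '%s\n' "$line" | awk '{i = NF; while (i > 0) {print $i; i--}}')
echo "$backward"
```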
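## Arrays
Array subscripts in awk can be numbers or strings, so a subscript is usually called a key. Keys and values are stored internally in a hash table. Because a hash table does not keep its entries in order, the elements may not appear in the order you expect when you print an array. Like variables, arrays are created automatically on first use, and awk likewise works out whether a stored value is a number or a string. Arrays are generally used to **collect information** from records, for example to **compute sums**, **count words**, or **track how many times a pattern is matched**.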
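A sketch of the counting use case: tally how many made-up accounts use each shell. Because `for (key in array)` visits keys in an unspecified order, the output is piped through `sort` to make it stable.

```shell
# Hypothetical passwd-style records; the shell is field 7.
sample='at:x:25:25::/var/spool/atjobs:/bin/bash
gdm:x:486:485::/var/lib/gdm:/bin/false
root:x:0:0:root:/root:/bin/zsh
bin:x:1:1:bin:/bin:/bin/bash'

# shells[$7] is created on first use and starts at 0, so ++ needs no setup.
counts=$(printf '%s\n' "$sample" | awk -F ':' '
    { shells[$7]++ }
    END { for (s in shells) print s, shells[s] }
' | LC_ALL=C sort)
echo "$counts"
```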