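I am reposting this article because it solved my problem. Background: I had migrated a virtual machine to a different IP address.
Many thanks to the original author.
Original post: https://www.cnblogs.com/-xiaoyu-/p/11399287.html
There are several possible causes. One is configuration, for example proxy user settings that name the wrong user (root in this example):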
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
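My guess (very likely correct):
I had previously deployed Flink, and Flink checkpoints continuously, that is, it keeps writing snapshots. Migrating the server was therefore bound to corrupt Flink's checkpoint files. Because Flink stores them in Hadoop, the corresponding HDFS files were corrupted as well, which left Hadoop in a bad state.
The reposted content follows: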
Command hdfs dfsadmin -safemode get: show the safe mode status
Command hdfs dfsadmin -safemode enter: enter safe mode
Command hdfs dfsadmin -safemode leave: leave safe mode
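A minimal sketch for using the status check in a script. It assumes the command prints a line containing "Safe mode is ON" or "Safe mode is OFF"; the function name safemode_state is made up for illustration.

```shell
# Classify the text printed by `hdfs dfsadmin -safemode get`.
# Reads the command output on stdin and prints ON, OFF, or UNKNOWN.
# Assumption: the output contains "Safe mode is ON" or "Safe mode is OFF".
safemode_state() {
    # Read all of stdin once so it can be matched against both patterns.
    out=$(cat)
    case "$out" in
        *"Safe mode is ON"*)  echo ON ;;
        *"Safe mode is OFF"*) echo OFF ;;
        *)                    echo UNKNOWN ;;
    esac
}

# Usage on a cluster node (needs a reachable NameNode):
#   hdfs dfsadmin -safemode get | safemode_state
```

Anything other than the two expected messages is reported as UNKNOWN rather than guessed, so a failed command is not mistaken for "safe mode off".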
Step 1: check the Hadoop file system with hadoop fsck / (on recent versions, hdfs fsck / is the preferred form)
[root@node03 export]# hadoop fsck /
....................................................................................................
.............Status: CORRUPT #Hadoop status: unhealthy
Total size: 273821489 B
Total dirs: 403
Total files: 213
Total symlinks: 0
Total blocks (validated): 201 (avg. block size 1362295 B)
********************************
UNDER MIN REPL'D BLOCKS: 2 (0.99502486 %)
dfs.namenode.replication.min: 1
CORRUPT FILES: 2 #two files are corrupt
MISSING BLOCKS: 2 #two blocks are missing
MISSING SIZE: 6174 B
CORRUPT BLOCKS: 2
********************************
Minimally replicated blocks: 199 (99.004974 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 2.8208954
Corrupt blocks: 2
Missing replicas: 0 (0.0 %)
Number of data-nodes: 3
Number of racks: 1
FSCK ended at Fri Aug 23 10:43:11 CST 2019 in 12 milliseconds
These lines mean the Hadoop cluster is unhealthy and some files have been lost:
.............Status: CORRUPT #Hadoop status: unhealthy
CORRUPT FILES: 2 #two files are corrupt
MISSING BLOCKS: 2 #two blocks are missing
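To check these fields without reading the whole report, the summary can be parsed from a saved copy of the fsck output. This is only a sketch: the helper names fsck_status and fsck_count are hypothetical, and the patterns assume the "Status:" and "LABEL: number" layout shown above.

```shell
# Print the overall status word (HEALTHY or CORRUPT) from a saved fsck report.
# $1 = path to the report file.
fsck_status() {
    # The status line may be preceded by progress dots, e.g. "....Status: CORRUPT".
    sed -n 's/.*Status: \([A-Z]*\).*/\1/p' "$1" | head -n 1
}

# Print the number that follows a summary label such as "CORRUPT FILES".
# $1 = path to the report file, $2 = label. Prints 0 if the label is absent.
fsck_count() {
    n=$(sed -n "s/^ *$2:[[:space:]]*\([0-9][0-9]*\).*/\1/p" "$1" | head -n 1)
    echo "${n:-0}"
}

# Usage, after saving a report with: hadoop fsck / > /tmp/fsck.txt
#   fsck_status /tmp/fsck.txt
#   fsck_count  /tmp/fsck.txt "CORRUPT FILES"
```

A healthy report omits the CORRUPT FILES and MISSING BLOCKS lines entirely, which is why a missing label is treated as 0.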
Step 2: write the per-file status report to a file
hadoop fsck / -files -blocks -locations -racks >/export/missingFile.txt writes the check results to /export/missingFile.txt
[root@node03 export]# hadoop fsck / -files -blocks -locations -racks >/export/missingFile.txt
/flink-checkpoint/11748bc079799f330078967fbf018a48/chk-74/_metadata 452 bytes, 1 block(s): OK
0. BP-2135962035-192.168.52.100-1562110398602:blk_1073742825_2005 len=452 Live_repl=1 [/default-rack/192.168.52.110:50010]
/flink-checkpoint/11748bc079799f330078967fbf018a48/shared <dir>
/flink-checkpoint/11748bc079799f330078967fbf018a48/taskowned <dir>
/flink-checkpoint/42d81db182771fe71932120fa8933612 <dir>
/flink-checkpoint/42d81db182771fe71932120fa8933612/chk-950 <dir>
/flink-checkpoint/42d81db182771fe71932120fa8933612/chk-950/_metadata 337 bytes, 1 block(s): OK
0. BP-2135962035-192.168.52.100-1562110398602:blk_1073745657_4837 len=337 Live_repl=1 [/default-rack/192.168.52.120:50010]
/flink-checkpoint/42d81db182771fe71932120fa8933612/chk-950/f59c63a0-a35d-4d4b-8e73-72c2aa1dd383 5657 bytes, 1 block(s): OK
0. BP-2135962035-192.168.52.100-1562110398602:blk_1073745656_4836 len=5657 Live_repl=1 [/default-rack/192.168.52.100:50010]
/flink-checkpoint/42d81db182771fe71932120fa8933612/shared <dir>
/flink-checkpoint/42d81db182771fe71932120fa8933612/taskowned <dir>
/flink-checkpoint/50aebc9e7aac85fd33bff905972a6e01 <dir>
/flink-checkpoint/50aebc9e7aac85fd33bff905972a6e01/chk-9 <dir>
/flink-checkpoint/50aebc9e7aac85fd33bff905972a6e01/chk-9/_metadata 451 bytes, 1 block(s): OK
0. BP-2135962035-192.168.52.100-1562110398602:blk_1073742843_2023 len=451 Live_repl=1 [/default-rack/192.168.52.100:50010]
/flink-checkpoint/50aebc9e7aac85fd33bff905972a6e01/chk-9/c58c8c49-8782-41b4-a3df-2fa7ff1d1eba 5663 bytes, 1 block(s): OK
0. BP-2135962035-192.168.52.100-1562110398602:blk_1073742842_2022 len=5663 Live_repl=1 [/default-rack/192.168.52.120:50010]
/flink-checkpoint/50aebc9e7aac85fd33bff905972a6e01/shared <dir>
/flink-checkpoint/50aebc9e7aac85fd33bff905972a6e01/taskowned <dir>
/flink-checkpoint/626ea65de810a2ec3b1799b605a6a995 <dir>
/flink-checkpoint/626ea65de810a2ec3b1799b605a6a995/chk-175 <dir>
/flink-checkpoint/626ea65de810a2ec3b1799b605a6a995/chk-175/19195239-a205-4462-921d-09e0483a4080 5663 bytes, 1 block(s):
/flink-checkpoint/626ea65de810a2ec3b1799b605a6a995/chk-175/19195239-a205-4462-921d-09e0483a4080: CORRUPT blockpool BP-2135962035-192.168.52.100-1562110398602 block blk_1073743749
MISSING 1 blocks of total size 5663 B
0. BP-2135962035-192.168.52.100-1562110398602:blk_1073743749_2929 len=5663 MISSING!
/flink-checkpoint/626ea65de810a2ec3b1799b605a6a995/chk-175/_metadata 511 bytes, 1 block(s):
/flink-checkpoint/626ea65de810a2ec3b1799b605a6a995/chk-175/_metadata: CORRUPT blockpool BP-2135962035-192.168.52.100-1562110398602 block blk_1073743750
MISSING 1 blocks of total size 511 B
0. BP-2135962035-192.168.52.100-1562110398602:blk_1073743750_2930 len=511 MISSING!
Healthy files end with OK; entries marked MISSING! are the lost files.
/flink-checkpoint/626ea65de810a2ec3b1799b605a6a995/chk-175/19195239-a205-4462-921d-09e0483a4080: CORRUPT blockpool BP-2135962035-192.168.52.100-1562110398602 block blk_1073743749
MISSING 1 blocks of total size 5663 B
/flink-checkpoint/626ea65de810a2ec3b1799b605a6a995/chk-175/_metadata: CORRUPT blockpool BP-2135962035-192.168.52.100-1562110398602 block blk_1073743750
MISSING 1 blocks of total size 511 B
Using these paths, you can locate the corresponding files in the Hadoop web UI file browser.
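In a long report the corrupt entries are easy to miss, so it can help to extract just their paths. The sketch below assumes each corrupt file has a line of the form "<path>: CORRUPT blockpool ..." as in the excerpt above; corrupt_paths is a name invented here. The built-in hdfs fsck / -list-corruptfileblocks is an alternative that needs no saved report.

```shell
# List the HDFS paths that a saved `fsck -files -blocks` report marks as corrupt.
# $1 = path to the report file (e.g. /export/missingFile.txt).
corrupt_paths() {
    # Matching lines look like "<path>: CORRUPT blockpool <pool> block <id>".
    # Keep only the path and drop duplicates (one line per corrupt block).
    sed -n 's/^\(\/.*\): CORRUPT blockpool .*/\1/p' "$1" | sort -u
}

# Usage:
#   corrupt_paths /export/missingFile.txt
```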
Step 3: repair the two missing or corrupt files
[root@node03 conf]# hdfs debug recoverLease -path /flink-checkpoint/626ea65de810a2ec3b1799b605a6a995/chk-175/19195239-a205-4462-921d-09e0483a4080 -retries 10
[root@node03 conf]# hdfs debug recoverLease -path /flink-checkpoint/626ea65de810a2ec3b1799b605a6a995/chk-175/_metadata -retries 10
[root@node03 conf]# hdfs debug recoverLease -path /flink-checkpoint/626ea65de810a2ec3b1799b605a6a995/chk-175/19195239-a205-4462-921d-09e0483a4080 -retries 10
recoverLease SUCCEEDED on /flink-checkpoint/626ea65de810a2ec3b1799b605a6a995/chk-175/19195239-a205-4462-921d-09e0483a4080
[root@node03 conf]# hdfs debug recoverLease -path /flink-checkpoint/626ea65de810a2ec3b1799b605a6a995/chk-175/_metadata -retries 10
recoverLease SUCCEEDED on /flink-checkpoint/626ea65de810a2ec3b1799b605a6a995/chk-175/_metadata
[root@node03 conf]#
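With more than a couple of files, the same command can be run in a loop. This is a sketch only: recover_leases and the HDFS_CMD override are hypothetical names, added so the loop can be dry-run with echo before touching the cluster.

```shell
# Run `hdfs debug recoverLease` for every path read from stdin (one per line).
# HDFS_CMD is a hypothetical override; set it to `echo` for a dry run.
recover_leases() {
    while IFS= read -r path; do
        # Skip blank lines.
        [ -n "$path" ] || continue
        # Redirect stdin so the command cannot swallow the remaining paths.
        ${HDFS_CMD:-hdfs} debug recoverLease -path "$path" -retries 10 </dev/null
    done
}

# Dry run: print the commands instead of running them.
#   printf '%s\n' /some/hdfs/path | HDFS_CMD=echo recover_leases
```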
Running hadoop fsck / again now shows:
...........Status: HEALTHY
Total size: 273815315 B
Total dirs: 403
Total files: 211
Total symlinks: 0
Total blocks (validated): 199 (avg. block size 1375956 B)
Minimally replicated blocks: 199 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 2.8492463
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 3
Number of racks: 1
FSCK ended at Fri Aug 23 11:15:01 CST 2019 in 11 milliseconds
...........Status: HEALTHY means the cluster is healthy.
Restarting Hadoop now no longer leaves it stuck in safe mode, and hiveserver2 starts normally again.
Step 4: if something goes wrong
If the repair fails, or it reports success but the cluster status still looks like this:
.............Status: CORRUPT #Hadoop status: unhealthy
Total size: 273821489 B
Total dirs: 403
Total files: 213
Total symlinks: 0
Total blocks (validated): 201 (avg. block size 1362295 B)
********************************
UNDER MIN REPL'D BLOCKS: 2 (0.99502486 %)
dfs.namenode.replication.min: 1
CORRUPT FILES: 2 #two files are corrupt
MISSING BLOCKS: 2 #two blocks are missing
MISSING SIZE: 6174 B
CORRUPT BLOCKS: 2
********************************
Minimally replicated blocks: 199 (99.004974 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 2.8208954
Corrupt blocks: 2
Missing replicas: 0 (0.0 %)
Number of data-nodes: 3
Number of racks: 1
FSCK ended at Fri Aug 23 10:43:11 CST 2019 in 12 milliseconds
1. If the corrupt files are not important
First: back up the corrupt files you found.
Then: run hadoop fsck / -delete to delete the corrupt files:
[root@node03 export]# hadoop fsck / -delete
Alternatively:
First leave safe mode:
hdfs dfsadmin -safemode leave
Then delete the files with hdfs:
hdfs dfs -rm -r /flink/flink-checkpoints/f33ee2464b69383f3a06112ee36cda90
If the command does not succeed the first time, try it a few more times. Only do this if the missing or corrupt files are not important!
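Since deletion cannot be undone, one cautious option is to print the delete commands first and review them before running anything. The helper below is a sketch with a made-up name (print_rm_commands); it only prints text and never calls hdfs itself.

```shell
# Print (not run) one `hdfs dfs -rm` command per path read from stdin,
# so the list can be reviewed before anything is deleted.
# Assumption: paths contain no single quotes.
print_rm_commands() {
    while IFS= read -r path; do
        # Skip blank lines.
        [ -n "$path" ] || continue
        # Single-quote the path so the printed command is safe to paste.
        printf "hdfs dfs -rm '%s'\n" "$path"
    done
}

# Usage:
#   printf '%s\n' /some/hdfs/path | print_rm_commands
```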
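2. If the corrupt files are important and must not be lost
You can first run hdfs dfsadmin -safemode leave to force the cluster out of safe mode:
[root@node03 export]# hdfs dfsadmin -safemode leave
This does not fix the underlying problem; it only lets the cluster work for the time being!
You will also have to run this command every time the Hadoop cluster starts, until the problem is fully resolved.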
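If this has to be repeated after every start, it could be wrapped so that the cluster is only forced out of safe mode when it is actually in it. This is a sketch under assumptions: leave_safemode_if_on and HDFS_CMD are hypothetical names, and the status message is assumed to contain "Safe mode is ON".

```shell
# Leave safe mode only if the NameNode reports that it is currently on.
# HDFS_CMD is a hypothetical override used for testing; it defaults to `hdfs`.
# Assumption: `dfsadmin -safemode get` prints a line containing "Safe mode is ON".
leave_safemode_if_on() {
    state=$(${HDFS_CMD:-hdfs} dfsadmin -safemode get)
    case "$state" in
        *"Safe mode is ON"*) ${HDFS_CMD:-hdfs} dfsadmin -safemode leave ;;
        *)                   echo "safe mode is not on, nothing to do" ;;
    esac
}

# Usage after starting the cluster:
#   leave_safemode_if_on
```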
If your problem is none of the above, see this post:
https://www.cnblogs.com/-xiaoyu-/p/12158984.html