Last time I covered a common Hive command, MSCK REPAIR TABLE. This time let's look at Hive's ANALYZE TABLE command, and in an upcoming post I'll cover Impala's COMPUTE STATS command. All of these commands collect statistics about tables, which are used to speed up queries.
What does Hive's ANALYZE TABLE command do?
The official Hive documentation describes it like this:
Statistics such as the number of rows of a table or partition and the histograms of a particular interesting column are important in many ways. One of the key use cases of statistics is query optimization. Statistics serve as the input to the cost functions of the optimizer so that it can compare different plans and choose among them. Statistics may sometimes meet the purpose of the users' queries. Users can quickly get the answers for some of their queries by only querying stored statistics rather than firing long-running execution plans. Some examples are getting the quantile of the users' age distribution, the top 10 apps that are used by people, and the number of distinct sessions.
In short, its main purpose is query optimization: making queries run faster.
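For example, once statistics are in place and hive.compute.query.using.stats is enabled (a real Hive setting, though it is off by default in many versions), a simple aggregate like COUNT(*) can be answered straight from the stored numRows instead of launching a scan job. A rough sketch, using the table analyzed later in this post:
# answer simple aggregates (count/min/max) from stored statistics
SET hive.compute.query.using.stats=true;
# with up-to-date stats this is served from the metastore, not a full scan
SELECT COUNT(*) FROM dw_wy_srvcc_customized_drilldown_table_daily;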
Hive's main statistics
1. Table and Partition Statistics (for data written with INSERT OVERWRITE these can also be gathered automatically; see the note after this list):
- Number of rows
- Number of files
- Size in bytes
2. Column Statistics: added in Hive 0.10.0
3. Top K Statistics
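Table- and partition-level statistics are collected automatically during INSERT OVERWRITE jobs when hive.stats.autogather is enabled (a real Hive setting, true by default; in older versions it does not cover column statistics, which still need an explicit ANALYZE):
# collect table/partition stats automatically on INSERT OVERWRITE
SET hive.stats.autogather=true;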
Syntax
analyze table t [partition p] compute statistics for [columns c,...];
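Spelled out, the common variants look like this (standard Hive syntax; NOSCAN skips scanning the data and records only file counts and sizes):
# table/partition-level stats for a single partition
ANALYZE TABLE t PARTITION(p='v') COMPUTE STATISTICS;
# file-level stats only (numFiles, totalSize), without reading rows
ANALYZE TABLE t PARTITION(p='v') COMPUTE STATISTICS NOSCAN;
# column-level stats, Hive 0.10.0 and later
ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS c1, c2;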
Example 1: analyzing table- and partition-level statistics
# Run the statement
ANALYZE TABLE dw_wy_srvcc_customized_drilldown_table_daily PARTITION(date_time='20190228') COMPUTE STATISTICS;
# Check the result
DESCRIBE extended dw_wy_srvcc_customized_drilldown_table_daily PARTITION(date_time='20190228');
The result includes numRows=12552, totalSize=1564884, rawDataSize=1552332; the full output is:
Partition(values:[20190228], dbName:coordinator_test, tableName:dw_wy_srvcc_customized_drilldown_table_daily, createTime:1551344149, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:level, type:bigint, comment:null), FieldSchema(name:pi_name, type:string, comment:null), FieldSchema(name:dimension_type, type:string, comment:null), FieldSchema(name:demension_item, type:string, comment:null), FieldSchema(name:numerator_value, type:double, comment:null), FieldSchema(name:denominator_value, type:double, comment:null), FieldSchema(name:pi_value, type:double, comment:null), FieldSchema(name:degrad_weight, type:double, comment:null), FieldSchema(name:date_time, type:string, comment:null)], location:hdfs://dlbdn3:8020/user/hive/warehouse/coordinator_test.db/dw_wy_srvcc_customized_drilldown_table_daily/date_time=20190228, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), parameters:{numFiles=20, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1551408396, numRows=12552, totalSize=1564884, rawDataSize=1552332})
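As an aside, if you give only the partition key without a value, Hive analyzes every partition of the table in one statement (standard syntax, though it scans all partitions, so expect it to take a while on large tables):
ANALYZE TABLE dw_wy_srvcc_customized_drilldown_table_daily PARTITION(date_time) COMPUTE STATISTICS;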
Example 2: analyzing column-level statistics
# Run the statement
analyze table dw_wy_srvcc_customized_drilldown_table_daily compute statistics for columns level;
# Check the result
describe formatted dw_wy_srvcc_customized_drilldown_table_daily level;
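If you want statistics for every column rather than just level, the column list can be omitted (supported in reasonably recent Hive releases). DESCRIBE FORMATTED on a column then shows min, max, number of nulls, and the distinct-value count, which is what the optimizer uses for selectivity estimates:
# analyze all columns at once
analyze table dw_wy_srvcc_customized_drilldown_table_daily compute statistics for columns;
# inspect another column's stats, e.g. pi_value
describe formatted dw_wy_srvcc_customized_drilldown_table_daily pi_value;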