需求:对于xxx项目入库到hive数仓的小时数据,用户希望准实时了解这部分入库数据的hdfs大小
方案1:直接du
缺点:高峰期如果用户du的path不合理,或者频率过高,可能会导致namenode 出现hung,造成集群响应慢
方案2:通过metastore的parameters.totalSize拿到表或者分区的大小
也可以直接查 metastore 数据库,数据源和上面使用desc查询一样:
select
dbs.DB_ID,
dbs.NAME,
tbls.TBL_NAME,
tbls.TBL_ID,
partition_params.PARAM_KEY,
sum(partition_params.PARAM_VALUE) as totalSize
from
DBS As dbs
left join TBLS as tbls
on tbls.DB_ID = dbs.DB_ID
left JOIN partitions AS partitions
on tbls.TBL_ID = partitions.TBL_ID
left JOIN partition_params
on partition_params.part_ID = partitions.part_ID
where
dbs.NAME in ('pods','yfzx','pdwd')
AND tbls.TBL_NAME in ('x','y','z','u')
AND partition_params.PARAM_KEY in ('totalSize')
group by dbs.DB_ID,dbs.NAME,tbls.TBL_NAME,partition_params.PARAM_KEY,tbls.TBL_ID;