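Problem statement
Data is written to Iceberg continuously, and small-file compaction and snapshot expiration run as well. The snapshot and manifest files do get cleaned up, but there is no sign that the metadata (*.metadata.json) files are ever removed.
The data directory holds only 6.3 MB in 21 files, yet the metadata directory holds 33.1 GB in 8,715 files. Expiring every snapshot older than five minutes before the latest snapshot had no visible effect.
Solution? Still open; it will be covered in a later update.
How the affected table was created
The table was created from the SQL client on top of a HiveCatalog. See lesson 11 for the CREATE TABLE statement.
The same problem showed up at the end of lesson 11. It gets its own article here because it matters.
Problems after Iceberg small-file compaction (current state)
File sizes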
[root@hadoop103 ~]# hadoop fs -du -h /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/
6.3 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/data
33.1 G /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata
File counts
[root@hadoop101 ~]# hadoop fs -du -h /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/data|wc
21 61 2940
[root@hadoop101 ~]# hadoop fs -du -h /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata|wc
8715 26144 1246221
metadata directory
-rw-r--r-- 2 root supergroup 8118751 2022-01-26 11:19 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08690-b9a3c862-443e-4f6b-a1fc-c17fe3e517dc.metadata.json
-rw-r--r-- 2 root supergroup 8119685 2022-01-26 11:20 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08691-34894f4a-d881-4b8f-b228-7adba992a08f.metadata.json
-rw-r--r-- 2 root supergroup 8120615 2022-01-26 11:21 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08692-1ce25766-4ca5-473e-945f-3fd848cae5e3.metadata.json
-rw-r--r-- 2 root supergroup 8121549 2022-01-26 11:22 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08693-4bd481a5-f32b-4f15-aad7-4cd3a5af6b39.metadata.json
-rw-r--r-- 2 root supergroup 8122483 2022-01-26 11:23 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08694-4f3554aa-4db7-443d-bbb9-ac0871ec02da.metadata.json
-rw-r--r-- 2 root supergroup 8123417 2022-01-26 11:24 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08695-e8bf9bda-44e7-4624-83a2-d64db09f5660.metadata.json
-rw-r--r-- 2 root supergroup 8124351 2022-01-26 11:25 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08696-2b95f1d4-6843-41e6-9e16-77bbe1875b7f.metadata.json
-rw-r--r-- 2 root supergroup 8125285 2022-01-26 11:26 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08697-f11c1b8f-f987-4589-8159-521c65328163.metadata.json
-rw-r--r-- 2 root supergroup 8126219 2022-01-26 11:27 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08698-fb8b744a-db03-4b80-8612-15de1d6278cc.metadata.json
-rw-r--r-- 2 root supergroup 8127153 2022-01-26 11:28 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08699-a6b6683d-d9f1-45a1-a09b-b242a8284b96.metadata.json
-rw-r--r-- 2 root supergroup 8128087 2022-01-26 11:29 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08700-cad78b24-8cd7-464f-95fe-296e96bfd648.metadata.json
-rw-r--r-- 2 root supergroup 8129021 2022-01-26 11:30 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08701-0f702902-b2ae-4029-b8cd-97b5df0474ff.metadata.json
-rw-r--r-- 2 root supergroup 8129955 2022-01-26 11:31 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08702-91dbcc1f-9d40-4662-874e-8f1091c0a52f.metadata.json
-rw-r--r-- 2 root supergroup 8130889 2022-01-26 11:32 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08703-2c78ad8f-69ff-408f-afec-8d707ff944e8.metadata.json
-rw-r--r-- 2 root supergroup 8131823 2022-01-26 11:33 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08704-84085a27-b185-468f-9c23-2984a9330762.metadata.json
-rw-r--r-- 2 root supergroup 8132757 2022-01-26 11:34 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08705-edc7f661-0ed2-4e46-82a0-a2006dd01ad5.metadata.json
-rw-r--r-- 2 root supergroup 8133691 2022-01-26 11:35 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08706-9c3378aa-21cb-48bf-be52-70b25ea59308.metadata.json
-rw-r--r-- 2 root supergroup 8343948 2022-01-27 11:52 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08707-afd79c3c-e280-45c4-9797-2fa9a4fa27f4.metadata.json
-rw-r--r-- 2 root supergroup 8344913 2022-01-27 14:16 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08708-75efd8f6-ba3f-47dc-8b89-b3177c477a62.metadata.json
-rw-r--r-- 2 root supergroup 8345875 2022-01-27 14:38 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08709-78209251-777c-4a4f-9292-64cf3f2190ae.metadata.json
-rw-r--r-- 2 root supergroup 23219 2022-01-27 15:17 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08710-d69a0a2b-959e-488d-8443-471986f49e32.metadata.json
-rw-r--r-- 2 root supergroup 5777 2022-01-27 14:38 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/6c6d7719-74a9-4817-914a-b0df5eb8f6ba-m0.avro
-rw-r--r-- 2 root supergroup 6441 2022-01-27 14:38 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/6c6d7719-74a9-4817-914a-b0df5eb8f6ba-m1.avro
-rw-r--r-- 2 root supergroup 5771 2022-01-27 14:38 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/6c6d7719-74a9-4817-914a-b0df5eb8f6ba-m2.avro
-rw-r--r-- 2 root supergroup 3844 2022-01-27 14:38 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/snap-7762404597294868190-1-6c6d7719-74a9-4817-914a-b0df5eb8f6ba.avro
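The byte sizes in this listing hint at why the directory reached 33.1 GB: each new metadata.json is slightly larger than the previous one, and none of the older versions is deleted. The sketch below is a rough back-of-the-envelope check, not a measurement. It assumes the file size grew roughly linearly from near zero up to the 8,133,691 bytes of version 08706, and it rounds the number of versions to 8,710. The object and method names are made up for illustration.

```scala
object MetadataSizeEstimate {
  // Growth between two consecutive versions in the listing (08691 minus 08690).
  def growthPerCommit: Long = 8119685L - 8118751L

  // Sum of an arithmetic series that rises from about 0 to lastBytes
  // over `versions` files (assumption: linear growth).
  def linearTotalBytes(versions: Long, lastBytes: Long): Long =
    versions * lastBytes / 2

  def toGiB(bytes: Long): Double = bytes.toDouble / (1L << 30)

  def main(args: Array[String]): Unit = {
    val total = linearTotalBytes(8710L, 8133691L)
    println(s"growth per commit: $growthPerCommit bytes")
    println(s"estimated total: ${toGiB(total)} GiB")
  }
}
```

Under these assumptions the estimate lands at roughly 33 GiB, which is close to the 33.1 G reported by hadoop fs -du. That is consistent with every historical metadata.json still being on disk.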
Sizes in human-readable form
7.7 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08684-d4af58ae-4967-48a6-ac40-9308a075fe00.metadata.json
7.7 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08685-89f09f2f-6cdf-43d8-acc2-79496dcaf18d.metadata.json
7.7 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08686-9be5033f-2592-4696-9c2f-5d1d408910c6.metadata.json
7.7 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08687-f111331a-599f-4068-9590-e57c76e46c31.metadata.json
7.7 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08688-18779a1c-fd2d-43c2-9c62-4d1efb4caed2.metadata.json
7.7 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08689-a1bfd5ea-23a1-431b-8208-a82f2561952e.metadata.json
7.7 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08690-b9a3c862-443e-4f6b-a1fc-c17fe3e517dc.metadata.json
7.7 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08691-34894f4a-d881-4b8f-b228-7adba992a08f.metadata.json
7.7 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08692-1ce25766-4ca5-473e-945f-3fd848cae5e3.metadata.json
7.7 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08693-4bd481a5-f32b-4f15-aad7-4cd3a5af6b39.metadata.json
7.7 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08694-4f3554aa-4db7-443d-bbb9-ac0871ec02da.metadata.json
7.7 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08695-e8bf9bda-44e7-4624-83a2-d64db09f5660.metadata.json
7.7 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08696-2b95f1d4-6843-41e6-9e16-77bbe1875b7f.metadata.json
7.7 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08697-f11c1b8f-f987-4589-8159-521c65328163.metadata.json
7.7 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08698-fb8b744a-db03-4b80-8612-15de1d6278cc.metadata.json
7.8 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08699-a6b6683d-d9f1-45a1-a09b-b242a8284b96.metadata.json
7.8 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08700-cad78b24-8cd7-464f-95fe-296e96bfd648.metadata.json
7.8 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08701-0f702902-b2ae-4029-b8cd-97b5df0474ff.metadata.json
7.8 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08702-91dbcc1f-9d40-4662-874e-8f1091c0a52f.metadata.json
7.8 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08703-2c78ad8f-69ff-408f-afec-8d707ff944e8.metadata.json
7.8 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08704-84085a27-b185-468f-9c23-2984a9330762.metadata.json
7.8 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08705-edc7f661-0ed2-4e46-82a0-a2006dd01ad5.metadata.json
7.8 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08706-9c3378aa-21cb-48bf-be52-70b25ea59308.metadata.json
8.0 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08707-afd79c3c-e280-45c4-9797-2fa9a4fa27f4.metadata.json
8.0 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08708-75efd8f6-ba3f-47dc-8b89-b3177c477a62.metadata.json
8.0 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08709-78209251-777c-4a4f-9292-64cf3f2190ae.metadata.json
22.7 K /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08710-d69a0a2b-959e-488d-8443-471986f49e32.metadata.json
5.6 K /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/6c6d7719-74a9-4817-914a-b0df5eb8f6ba-m0.avro
6.3 K /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/6c6d7719-74a9-4817-914a-b0df5eb8f6ba-m1.avro
5.6 K /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/6c6d7719-74a9-4817-914a-b0df5eb8f6ba-m2.avro
3.8 K /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/snap-7762404597294868190-1-6c6d7719-74a9-4817-914a-b0df5eb8f6ba.avro
data directory:
[root@hadoop101 ~]# hadoop fs -du -h /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/data
169.1 K /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/data/00000-0-3c21e5b1-54e8-42b1-8bdc-a0b8f1514ee1-00001.parquet
169.0 K /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/data/00000-0-3c21e5b1-54e8-42b1-8bdc-a0b8f1514ee1-00002.parquet
169.1 K /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/data/00000-0-3c21e5b1-54e8-42b1-8bdc-a0b8f1514ee1-00003.parquet
3.1 M /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/data/00000-0-cdcc5019-0c59-41e4-80c6-1d4185455065-00001.parquet
508 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/data/00000-0-dd8bc29f-831a-4904-830e-2ef56e4a4743-08707.parquet
169.0 K /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/data/00001-0-139af0f5-d3ee-4f35-bd2e-73ce2aaf4792-00001.parquet
169.1 K /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/data/00001-0-139af0f5-d3ee-4f35-bd2e-73ce2aaf4792-00002.parquet
169.1 K /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/data/00001-0-139af0f5-d3ee-4f35-bd2e-73ce2aaf4792-00003.parquet
552 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/data/00001-0-e9e8a782-fa82-4c4d-9786-c05b8aab251a-08707.parquet
5.9 K /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/data/00002-0-a0f46641-b14d-4f8b-a16e-4c768bcba775-00109.parquet
169.1 K /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/data/00002-0-fe001b68-3753-44a7-adb4-63d43c8b3226-00001.parquet
164.7 K /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/data/00002-0-fe001b68-3753-44a7-adb4-63d43c8b3226-00002.parquet
169.2 K /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/data/00002-0-fe001b68-3753-44a7-adb4-63d43c8b3226-00003.parquet
169.0 K /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/data/00002-0-fe001b68-3753-44a7-adb4-63d43c8b3226-00004.parquet
169.2 K /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/data/00003-0-1d71db79-abf1-4088-9282-bc907e45e262-00001.parquet
169.0 K /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/data/00003-0-1d71db79-abf1-4088-9282-bc907e45e262-00002.parquet
168.9 K /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/data/00003-0-1d71db79-abf1-4088-9282-bc907e45e262-00003.parquet
168.9 K /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/data/00003-0-1d71db79-abf1-4088-9282-bc907e45e262-00004.parquet
527.5 K /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/data/00004-0-fea6f5d5-759f-4769-9ced-b3ecca214e36-00001.parquet
169.0 K /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/data/00004-0-fea6f5d5-759f-4769-9ced-b3ecca214e36-00002.parquet
168.8 K /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/data/00004-0-fea6f5d5-759f-4769-9ced-b3ecca214e36-00003.parquet
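Code that expires all snapshots older than five minutes before the latest snapshot
Run the compaction and cleanup code
Expire all snapshots older than five minutes before the latest snapshot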
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.hadoop.conf.Configuration
import org.apache.iceberg.catalog.{Namespace, TableIdentifier}
import org.apache.iceberg.flink.actions.Actions
import org.apache.iceberg.flink.{CatalogLoader, TableLoader}
import org.apache.log4j.{Level, Logger}
import org.slf4j.LoggerFactory
import java.util
import java.util.concurrent.TimeUnit

object FlinkDataStreamSmallFileCompactTest {
  private var logger: org.slf4j.Logger = _

  def main(args: Array[String]): Unit = {
    logger = LoggerFactory.getLogger(this.getClass.getSimpleName)
    Logger.getLogger("org.apache").setLevel(Level.INFO)
    Logger.getLogger("hive.metastore").setLevel(Level.WARN)
    Logger.getLogger("akka").setLevel(Level.WARN)

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    System.setProperty("HADOOP_USER_NAME", "root")

    // Hive catalog properties
    val map = new util.HashMap[String, String]()
    map.put("type", "iceberg")
    map.put("catalog-type", "hive")
    map.put("property-version", "2")
    // The key is "warehouse"; the original "/warehouse" was a typo.
    map.put("warehouse", "/user/hive/warehouse")
    // map.put("datanucleus.schema.autoCreateTables", "true")
    map.put("uri", "thrift://hadoop101:9083")

    // Pass the properties built above. The original passed an empty map,
    // so the catalog was configured only by hive-site.xml on the classpath.
    val iceberg_catalog = CatalogLoader.hive(
      "hive_catalog6", // catalog name
      new Configuration(),
      map
    )

    // Alternative table used in other tests: behavior_with_date_log_ib
    val identifier = TableIdentifier.of(
      Namespace.of("iceberg_db6"), // database name
      "behavior_log_ib6")          // table name

    val loader = TableLoader.fromCatalog(iceberg_catalog, identifier)
    loader.open()
    val table = loader.loadTable()

    // Compact small files
    Actions.forTable(env, table)
      .rewriteDataFiles
      .maxParallelism(5)
      .targetSizeInBytes(128 * 1024 * 1024)
      .execute

    // Expire snapshots older than five minutes before the current snapshot.
    // Check for null before reading the timestamp; the original dereferenced
    // the snapshot first, which fails on a table without snapshots.
    val snapshot = table.currentSnapshot
    if (snapshot != null) {
      val old = snapshot.timestampMillis - TimeUnit.MINUTES.toMillis(5)
      table.expireSnapshots
        .expireOlderThan(old)
        .commit()
      println(s" behavior_log_ib6 table cleanup finished!")
    }
  }
}
Cleanup log:
Finding: nothing was cleaned up
22/02/10 19:48:51 INFO conf.HiveConf: Found configuration file file:/E:/workspace/jt_workspace/iceberg-learning/flink-iceberg-learning/target/classes/hive-site.xml
22/02/10 19:48:51 WARN conf.HiveConf: HiveConf of name hive.metastore.event.db.notification.api.auth does not exist
22/02/10 19:48:51 INFO security.JniBasedUnixGroupsMapping: Error getting groups for root: Unknown error.
22/02/10 19:48:51 WARN security.UserGroupInformation: No groups available for user root
22/02/10 19:48:51 INFO iceberg.BaseMetastoreTableOperations: Refreshing table metadata from new version: hdfs://ns/user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_log_ib6/metadata/08710-d69a0a2b-959e-488d-8443-471986f49e32.metadata.json
22/02/10 19:48:56 INFO iceberg.BaseMetastoreCatalog: Table loaded by catalog: hive_catalog6.iceberg_db6.behavior_log_ib6
22/02/10 19:48:56 INFO iceberg.BaseTableScan: Scanning table hive_catalog6.iceberg_db6.behavior_log_ib6 snapshot 7762404597294868190 created at 2022-01-27 14:38:10.105 with filter true
22/02/10 19:48:56 INFO iceberg.RemoveSnapshots: Expiring snapshots older than: Thu Jan 27 14:33:10 CST 2022 (1643265190105)
22/02/10 19:48:56 INFO iceberg.BaseMetastoreTableOperations: Nothing to commit.
22/02/10 19:48:56 INFO iceberg.RemoveSnapshots: Committed snapshot changes
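The "Nothing to commit" line can be reproduced with plain arithmetic. The sketch below recomputes the cutoff from the snapshot creation time printed in the log and then filters a list of snapshot timestamps the way an "older than" rule would. Two things are assumptions here: that "CST" in the log means China Standard Time (UTC+8), and that the current snapshot was the only one left in the table when the job ran. The object and method names are made up for illustration and are not Iceberg APIs.

```scala
import java.time.{ZoneId, ZonedDateTime}
import java.util.concurrent.TimeUnit

object ExpireCutoffCheck {
  // Same arithmetic as the job: snapshot timestamp minus N minutes.
  def cutoffMillis(snapshotMillis: Long, minutes: Long): Long =
    snapshotMillis - TimeUnit.MINUTES.toMillis(minutes)

  // Simplified model of "expire older than": keep only timestamps before the cutoff.
  def expirable(snapshotTimestamps: Seq[Long], cutoff: Long): Seq[Long] =
    snapshotTimestamps.filter(_ < cutoff)

  // Creation time of snapshot 7762404597294868190 as printed in the log,
  // interpreted as UTC+8 (assumption).
  def currentSnapshotMillis: Long =
    ZonedDateTime
      .of(2022, 1, 27, 14, 38, 10, 105 * 1000000, ZoneId.of("Asia/Shanghai"))
      .toInstant
      .toEpochMilli

  def main(args: Array[String]): Unit = {
    val current = currentSnapshotMillis
    val cutoff = cutoffMillis(current, 5)
    println(s"cutoff = $cutoff")
    // Assumption: the current snapshot is the only snapshot in the table.
    println(s"expirable = ${expirable(Seq(current), cutoff)}")
  }
}
```

The computed cutoff matches the value 1643265190105 in the RemoveSnapshots log line. Because the cutoff is derived from the current snapshot itself, that snapshot can never fall before it, so with a single remaining snapshot there is nothing to expire.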
Deletion log from another table:
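Summary
Characteristics of Iceberg file compaction and snapshot expiration:
Compaction: generates new files
Snapshot expiration: deletes the snap and manifest files, but the metadata (*.metadata.json) files are neither merged nor cleaned up, so the old metadata stays on disk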
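One direction that may be worth testing is Iceberg's own retention setting for metadata.json versions. The sketch below sets the table properties write.metadata.delete-after-commit.enabled and write.metadata.previous-versions-max through the Table API. It has not been run against the table in this article; the object name, the method name, and the keep parameter are made up for illustration, and whether the properties take effect depends on the Iceberg version in use.

```scala
import org.apache.iceberg.Table

object MetadataRetention {
  // Ask Iceberg to delete the oldest tracked metadata.json after each commit
  // and to track at most `keep` previous versions.
  def limitMetadataVersions(table: Table, keep: Int): Unit =
    table.updateProperties()
      .set("write.metadata.delete-after-commit.enabled", "true")
      .set("write.metadata.previous-versions-max", keep.toString)
      .commit()
}
```

With the table loaded as in the job above, the call would look like MetadataRetention.limitMetadataVersions(table, 10). This setting only affects versions that are still tracked in the table's metadata log, so metadata files that are already untracked would probably still have to be removed separately.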