Data Lake Iceberg in Practice: Configuring Spark 3 on YARN with Hadoop 2.7 to Run Iceberg

   数栈君   Posted 2023-03-31 16:10

Preface
Spark version: spark-3.2.0-bin-hadoop2.7
Hadoop version: hadoop2.7.2
1. Installing Spark 3.2 on Hadoop 2.7 fails with java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig
Unpack the Spark tarball and run spark-shell --master yarn directly.
Note: HADOOP_HOME and HADOOP_CONF_DIR were already set, so after unpacking, spark-shell locates HADOOP_HOME on its own.

[root@hadoop101 spark]# spark-shell --master yarn
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/02/14 21:00:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig
at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:55)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.createTimelineClient(YarnClientImpl.java:181)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:168)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:175)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:220)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:581)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106)
... 55 elided
Caused by: java.lang.ClassNotFoundException: com.sun.jersey.api.client.config.ClientConfig
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 68 more
<console>:14: error: not found: value spark
import spark.implicits._
^
<console>:14: error: not found: value spark
import spark.sql
^
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.0
      /_/

Using Scala version 2.12.15 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information


2. Problem analysis
Jersey version bundled with Spark 3:

[root@hadoop103 spark-3.2.0-bin-hadoop2.7]# ls jars/jersey-*
jars/jersey-client-2.34.jar jars/jersey-common-2.34.jar jars/jersey-container-servlet-2.34.jar jars/jersey-container-servlet-core-2.34.jar jars/jersey-hk2-2.34.jar jars/jersey-server-2.34.jar
[root@hadoop103 spark-3.2.0-bin-hadoop2.7]#
Jersey version bundled with Hadoop:

[root@hadoop103 spark-3.2.0-bin-hadoop2.7]# ls /opt/module/hadoop/share/hadoop/yarn/lib/jersey-*
/opt/module/hadoop/share/hadoop/yarn/lib/jersey-client-1.9.jar /opt/module/hadoop/share/hadoop/yarn/lib/jersey-guice-1.9.jar /opt/module/hadoop/share/hadoop/yarn/lib/jersey-server-1.9.jar
/opt/module/hadoop/share/hadoop/yarn/lib/jersey-core-1.9.jar /opt/module/hadoop/share/hadoop/yarn/lib/jersey-json-1.9.jar
[root@hadoop103 spark-3.2.0-bin-hadoop2.7]#

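Many posts online suggest copying jersey-core-1.9.jar, jersey-client-1.9.jar, and jersey-guice-1.9.jar into $SPARK_HOME/jars.
I tried this and it did not work reliably (I tested both spark-shell --master yarn and spark-sql --master yarn; oddly, sometimes one of the two would start).

The classes in the 1.9 and 2.34 jars do not conflict with each other: Jersey 1.x lives under the com.sun.jersey package, while Jersey 2.x lives under org.glassfish.jersey. That is also why the error appears in the first place: Spark 3 ships only Jersey 2.x, which has no com.sun.jersey.api.client.config.ClientConfig, the class that Hadoop 2.7's TimelineClient tries to load.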
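To confirm which jar actually provides the missing class, it can help to search the jar listings directly. The sketch below is only an illustration: `class_to_entry` and `find_class_in_jars` are hypothetical helper names that do not come from the original post, and the sketch assumes `unzip` is installed on the node.

```shell
# Hypothetical helpers (not from the original post); assumes `unzip` is available.

# Convert a fully qualified class name to its entry path inside a jar,
# e.g. com.foo.Bar -> com/foo/Bar.class
class_to_entry() {
  printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Print every jar in directory $1 whose listing contains class $2.
# The body runs in a subshell so its variables do not leak into the caller.
find_class_in_jars() (
  entry=$(class_to_entry "$2")
  for jar in "$1"/*.jar; do
    [ -f "$jar" ] || continue          # skip the unexpanded glob when no jars exist
    if unzip -l "$jar" 2>/dev/null | grep -qF "$entry"; then
      printf '%s\n' "$jar"
    fi
  done
)

# Example calls (paths taken from the listings above):
#   find_class_in_jars "$SPARK_HOME/jars" com.sun.jersey.api.client.config.ClientConfig
#   find_class_in_jars /opt/module/hadoop/share/hadoop/yarn/lib com.sun.jersey.api.client.config.ClientConfig
```

If the analysis above holds, the first call should print nothing for a stock Spark 3.2 install, while the second should point at one of Hadoop's Jersey 1.9 jars.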


3. Solution
Add the following property to yarn-site.xml, then restart YARN (the change does not take effect without a restart).

<property>
<name>yarn.timeline-service.enabled</name>
<value>false</value>
</property>
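A commonly cited alternative, which the original post did not try, is to switch the timeline client off for a single application instead of cluster-wide. Spark copies any `spark.hadoop.*` property into the Hadoop configuration it builds, so the same key can be passed on the command line. The sketch below assumes the YARN client honors that setting on this Hadoop 2.7.2 setup; `run_without_timeline` and the `SPARK_BIN` variable are hypothetical names introduced only for illustration.

```shell
# Hypothetical wrapper (not from the original post).
# SPARK_BIN selects the launcher; it defaults to spark-shell.
run_without_timeline() {
  # spark.hadoop.* properties are forwarded into the Hadoop/YARN configuration,
  # so this sets yarn.timeline-service.enabled=false for this application only.
  "${SPARK_BIN:-spark-shell}" --master yarn \
    --conf spark.hadoop.yarn.timeline-service.enabled=false "$@"
}

# Example calls:
#   run_without_timeline
#   SPARK_BIN=spark-sql; run_without_timeline
```

If this works in your environment, yarn-site.xml can stay unchanged and no YARN restart is needed.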

4. Starting Spark 3 on YARN with Iceberg 0.13
[root@hadoop101 spark]# bin/spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.0 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=/tmp/iceberg/warehouse \
  --master yarn
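Once spark-sql starts, a quick smoke test against the `local` catalog can confirm that Iceberg tables are usable. The sketch below writes a small SQL script modeled on the Iceberg quickstart; the table name `local.db.smoke_test`, the helper `write_smoke_sql`, and the file path are hypothetical and not part of the original post.

```shell
# Hypothetical helper (not from the original post): write a three-statement
# smoke test for the `local` Hadoop catalog to the file given as $1.
write_smoke_sql() {
  cat > "$1" <<'SQL'
CREATE TABLE IF NOT EXISTS local.db.smoke_test (id bigint, data string) USING iceberg;
INSERT INTO local.db.smoke_test VALUES (1, 'a'), (2, 'b'), (3, 'c');
SELECT count(1) AS cnt, data FROM local.db.smoke_test GROUP BY data;
SQL
}

# Example (run from the Spark directory, reusing the same --packages and --conf
# flags as the launch command, plus -f to execute the script):
#   write_smoke_sql /tmp/iceberg_smoke.sql
#   bin/spark-sql --master yarn <same --packages/--conf flags> -f /tmp/iceberg_smoke.sql
```

The script only checks that a table can be created, written, and read through the catalog; it says nothing about performance or Hive catalog integration.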
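Summary
With that, the environment for running Iceberg with Spark 3 on a YARN cluster is finally ready.

Drawback: yarn.timeline-service.enabled has been turned off. The goal for now is simply to get things running.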


