Add the Cloudera Maven repository
Add the CDH Maven repository to Spark's pom.xml [1], and add a profile for Hadoop 2.6.0-cdh5.16.1:
<repository>
  <id>cloudera</id>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  <name>Cloudera Repositories</name>
  <snapshots>
    <enabled>true</enabled>
  </snapshots>
</repository>
<pluginRepository>
  <id>cloudera</id>
  <name>Cloudera Repositories</name>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</pluginRepository>
<profile>
  <id>hadoop-2.6</id>
  <properties>
    <hadoop.version>2.6.0-cdh5.16.1</hadoop.version>
  </properties>
</profile>
The exact locations for these additions can be seen in this commit:
https://github.com/yangrong688/spark/commit/13c322ee32daaae7d4505fa676396be5254ecddf
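Optionally, you can sanity-check that the profile resolves before kicking off the full build. build/mvn is the Maven wrapper shipped in the Spark source tree, and -pl core merely limits the query to one module:

./build/mvn help:evaluate -Dexpression=hadoop.version -Phadoop-2.6 -pl core 2>/dev/null | grep -v -e INFO -e WARNING
# expected output: 2.6.0-cdh5.16.1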
Modify make-distribution.sh (the file lives in ./dev)
Comment out the SPARK_HIVE detection block and hard-code SPARK_HIVE=1:
#SPARK_HIVE=$("$MVN" help:evaluate -Dexpression=project.activeProfiles -pl sql/hive $@ 2>/dev/null\
# | grep -v "INFO"\
# | grep -v "WARNING"\
# | fgrep --count "<id>hive</id>";\
# # Reset exit status to 0, otherwise the script stops here if the last grep finds nothing\
# # because we use "set -o pipefail"
# echo -n)
SPARK_HIVE=1
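Hard-coding SPARK_HIVE=1 makes the script treat the build as Hive-enabled without asking Maven for the active profiles. A quick sanity check that the edit took:

grep -n "SPARK_HIVE" ./dev/make-distribution.sh
# the only live (uncommented) assignment should be SPARK_HIVE=1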
Run the Maven build
./dev/make-distribution.sh --name 2.6.0-cdh5.16.1 --tgz -Phadoop-2.6 -Pyarn -Phive-1.2 -Phive-thriftserver -DskipTests
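When the build eventually succeeds, the script leaves a tarball named after the --name argument (here spark-<version>-bin-2.6.0-cdh5.16.1.tgz) in the source root. Unpack it into the install location assumed by the configuration below, adjusting the path to your environment:

# exact file name depends on the Spark version being built
tar -xzf spark-*-bin-2.6.0-cdh5.16.1.tgz -C /usr/local/datacenter/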
The first build, however, stops with a compile error in the yarn module.
Fix: edit the source file resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
around line 295. setRolledLogsIncludePattern and setRolledLogsExcludePattern were only added in Hadoop 2.6.4, so compiling against any older Hadoop version fails here.
Replacing the direct method calls with reflection (original block shown commented out below) solves the problem:
// sparkConf.get(ROLLED_LOG_INCLUDE_PATTERN).foreach { includePattern =>
//   try {
//     val logAggregationContext = Records.newRecord(classOf[LogAggregationContext])
//     logAggregationContext.setRolledLogsIncludePattern(includePattern)
//     sparkConf.get(ROLLED_LOG_EXCLUDE_PATTERN).foreach { excludePattern =>
//       logAggregationContext.setRolledLogsExcludePattern(excludePattern)
//     }
//     appContext.setLogAggregationContext(logAggregationContext)
//   } catch {
//     case NonFatal(e) =>
//       logWarning(s"Ignoring ${ROLLED_LOG_INCLUDE_PATTERN.key} because the version of YARN " +
//         "does not support it", e)
//   }
// }
// appContext.setUnmanagedAM(isClientUnmanagedAMEnabled)
//
// sparkConf.get(APPLICATION_PRIORITY).foreach { appPriority =>
//   appContext.setPriority(Priority.newInstance(appPriority))
// }
// appContext
// }
sparkConf.get(ROLLED_LOG_INCLUDE_PATTERN).foreach { includePattern =>
  try {
    val logAggregationContext = Records.newRecord(classOf[LogAggregationContext])
    // These two methods were added in Hadoop 2.6.4, so we still need to use reflection to
    // avoid compile error when building against Hadoop 2.6.0 ~ 2.6.3.
    val setRolledLogsIncludePatternMethod =
      logAggregationContext.getClass.getMethod("setRolledLogsIncludePattern", classOf[String])
    setRolledLogsIncludePatternMethod.invoke(logAggregationContext, includePattern)
    sparkConf.get(ROLLED_LOG_EXCLUDE_PATTERN).foreach { excludePattern =>
      val setRolledLogsExcludePatternMethod =
        logAggregationContext.getClass.getMethod("setRolledLogsExcludePattern", classOf[String])
      setRolledLogsExcludePatternMethod.invoke(logAggregationContext, excludePattern)
    }
    appContext.setLogAggregationContext(logAggregationContext)
  } catch {
    case NonFatal(e) =>
      logWarning(s"Ignoring ${ROLLED_LOG_INCLUDE_PATTERN.key} because the version of YARN " +
        "does not support it", e)
  }
}
appContext
}
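To check whether your Hadoop jars actually ship these two setters (i.e. whether the reflection calls will succeed at runtime rather than throw NoSuchMethodException), javap can list them. The jar path below is a placeholder; point it at your CDH hadoop-yarn-api jar:

javap -classpath /path/to/hadoop-yarn-api-2.6.0-cdh5.16.1.jar \
  org.apache.hadoop.yarn.api.records.LogAggregationContext | grep RolledLogs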
Configuration files
Copy core-site.xml, hive-site.xml and hdfs-site.xml into Spark's conf directory.
Edit hive-site.xml and add the following properties.
Note: hive.server2.thrift.bind.host specifies the host the Thrift server starts on, and hive.server2.thrift.port the port it listens on. Port 10001 is chosen to avoid a conflict with Hive's own hive.server2.thrift.port, which defaults to 10000.
<property>
  <name>hive.server2.thrift.min.worker.threads</name>
  <value>5</value>
</property>
<property>
  <name>hive.server2.thrift.max.worker.threads</name>
  <value>500</value>
</property>
<property>
  <name>hive.server2.thrift.port</name>
  <value>10001</value>
</property>
<property>
  <name>hive.server2.thrift.bind.host</name>
  <value>IP of the local Thrift server (not effective in cluster mode)</value>
</property>
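Before the thriftserver is started later on, it is worth confirming that nothing else is already listening on the chosen port (HiveServer2 itself defaults to 10000, but another service may hold 10001):

ss -ltn | grep 10001    # or: netstat -ltn | grep 10001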
Configure spark-env.sh
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/usr/local/datacenter/spark-1-bin-2.6.0-cdh5.16.1/jars/jersey-client-2.30.jar
export DATA_HOME=/usr/local/datacenter
export JAVA_HOME=/usr/java/jdk1.8.0_71
export SCALA_HOME=$DATA_HOME/scala2.11
export HADOOP_HOME=$DATA_HOME/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HIVE_HOME=$DATA_HOME/hive
export HIVE_CONF_DIR=$HIVE_HOME/conf
export SPARK_HOME=/usr/local/datacenter/spark-1-bin-2.6.0-cdh5.16.1
export SPARK_LIBRARY_PATH=$HADOOP_HOME/lib/native
export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
export SPARK_YARN_USER_ENV="JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH,LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:/usr/local/bin
Configure spark-defaults.conf
spark.eventLog.enabled true
spark.eventLog.dir hdfs://mfwCluster/user/spark3/logs
spark.history.fs.logDirectory hdfs://mfwCluster/user/spark3/logs
spark.history.retainedApplications 500
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.uri hdfs://mfwCluster/user/spark3/share/lib/spark2.tgz
spark.yarn.jars hdfs://mfwCluster/user/spark3/share/lib/spark2_jars/*
spark.sql.parquet.binaryAsString true
spark.mesos.coarse true
spark.driver.memory 5g
spark.executor.memory 4g
spark.executor.cores 8
spark.default.parallelism 400
spark.debug.maxToStringFields 300
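Note that spark.eventLog.dir must already exist in HDFS, otherwise every driver fails at startup; create it up front (same path as configured above):

hdfs dfs -mkdir -p /user/spark3/logs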
Upload the dependencies
hdfs dfs -mkdir -p /user/spark3/share/lib/spark2_jars
hdfs dfs -put $SPARK_HOME/jars/* /user/spark3/share/lib/spark2_jars/
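A simple way to confirm the upload is complete is to compare jar counts on both sides:

ls $SPARK_HOME/jars | wc -l
hdfs dfs -ls /user/spark3/share/lib/spark2_jars | wc -l    # one higher: hdfs dfs -ls prints a "Found N items" header line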
Verify that Spark is installed correctly
/usr/local/datacenter/spark2/bin/spark-submit --master yarn --executor-memory 20g --executor-cores 5 --class org.apache.spark.examples.JavaSparkPi spark-examples_2.12-3.0.1.jar 1
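The example writes a line such as "Pi is roughly 3.14..." to the driver's stdout. In client mode it appears on the console; in cluster mode fetch it from the aggregated YARN logs instead, substituting your application id:

yarn logs -applicationId <application_id> | grep "Pi is roughly"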
Start the thriftserver in cluster mode
spark-1-bin-2.6.0-cdh5.16.1/sbin/start-thriftserver.sh --master yarn --executor-memory 8g --executor-cores 4 --num-executors 15 --queue hive
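The thriftserver registers with YARN under the application name "Thrift JDBC/ODBC Server"; a quick way to check that it is up, and to find the driver host needed for beeline below:

yarn application -list | grep "Thrift JDBC"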
Connect to the thriftserver with beeline
spark-1-bin-2.6.0-cdh5.16.1/bin/beeline
!connect jdbc:hive2://<spark driver IP in cluster mode | thriftserver IP in local mode>:10001/default;auth=noSasl
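The same connection can also be exercised non-interactively, which makes a convenient smoke test (driver IP placeholder as above):

$SPARK_HOME/bin/beeline -u "jdbc:hive2://<driver-ip>:10001/default;auth=noSasl" -e "show databases;"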
Problem 1
Caused by: java.lang.ClassNotFoundException: com.sun.jersey.api.client.config.ClientConfig
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 12 more
Cause: Hadoop 2.x depends on jersey 1.9, while Spark 2.x depends on jersey 2.22. Jersey was heavily refactored between these major versions, and the package paths changed:
jersey 1.9: com.sun.jersey
jersey 2.22: org.glassfish.jersey
Remove the higher-version jersey-client and jersey-core jars and add the following 1.x versions:
jersey-client-1.19.jar
jersey-core-1.19.1.jar
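A minimal sketch of the swap, assuming the replacement jars are pulled from Maven Central (the 2.x version numbers shipped with your build may differ):

cd $SPARK_HOME/jars
rm -f jersey-client-2.*.jar
wget https://repo1.maven.org/maven2/com/sun/jersey/jersey-client/1.19/jersey-client-1.19.jar
wget https://repo1.maven.org/maven2/com/sun/jersey/jersey-core/1.19.1/jersey-core-1.19.1.jar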
Problem 2
21/09/28 11:17:29 WARN server.HttpChannel: /api/v1/applications/application_1578876640023_8997328/allexecutors
java.lang.NullPointerException
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:366)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:319)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
at org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:873)
at org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623)
at org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95)
at org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
at org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:182)
at org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
at org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
at org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
at org.sparkproject.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
at org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
at org.sparkproject.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
at org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
at org.sparkproject.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
at org.sparkproject.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
at org.sparkproject.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753)
at org.sparkproject.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
at org.sparkproject.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.sparkproject.jetty.server.Server.handle(Server.java:505)
Fix: the error arises because Application#getProperties() belongs to the JAX-RS 2 API (javax.ws.rs-api-2.0.1.jar) rather than JAX-RS 1 (jsr311-api.jar), so a jar version conflict on the classpath is the likely cause.
Inspecting $SPARK_HOME/jars and deleting jsr305-3.0.0.jar resolved the problem.
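To see which JAX-RS-related jars a given build actually ships before deleting anything, list them first (jar names vary by Spark version; this is purely a diagnostic aid):

ls $SPARK_HOME/jars | grep -Ei "jsr|ws.rs"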