1. Using only the Spark shell
Just download any Spark release; no configuration is required, and you can use the Spark shell directly (for convenience, I added Spark's bin directory to the global PATH).
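For example, a quick sanity check typed at the scala> prompt (a minimal sketch; the README path below is from my installation and will differ on your machine):
// start the shell with: spark-shell
val textFile = spark.read.textFile("/home/yay/software/spark-2.4.4-bin-hadoop2.7/README.md")
textFile.count()  // number of lines in the file
textFile.first()  // first line of the file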
2. Writing a standalone application in Scala
2.1 Installing SBT
A Download address: https://sbt-downloads.cdnedge.bluemix.net/releases/v1.3.4/sbt-1.3.4.tgz
Then extract it with tar -zxvf.
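For reference, the two steps might look like this (a sketch; pick whatever download directory you like):
wget https://sbt-downloads.cdnedge.bluemix.net/releases/v1.3.4/sbt-1.3.4.tgz
tar -zxvf sbt-1.3.4.tgz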
B Switching sbt to the Huawei Cloud mirror
sbt downloads dependencies extremely slowly by default, so switch to the Huawei Cloud mirror (the configuration file path is ~/.sbt/repositories):
[repositories]
local
huaweicloud-maven: https://repo.huaweicloud.com/repository/maven/
maven-central: https://repo1.maven.org/maven2/
huaweicloud-ivy: https://repo.huaweicloud.com/repository/ivy/, [organization]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)[revision]/[type]s/[artifact](-[classifier]).[ext]
C Making all projects use the global repository configuration, ignoring each project's own repository settings:
-Dsbt.override.build.repos=true
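One way to apply this flag to every sbt invocation is via the SBT_OPTS environment variable (a sketch; you can also pass the flag directly on the sbt command line):
export SBT_OPTS="-Dsbt.override.build.repos=true"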
Run a command such as sbt about to verify that the installation and mirror work.
2.2 Example
2.2.1 Directory structure and file contents
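The project uses the standard sbt layout (a minimal sketch of the tree; I assume the source file is named after the SimpleApp object):
./build.sbt
./src/main/scala/SimpleApp.scala
SimpleApp.scala contains: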
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val logFile = "/home/yay/software/spark-2.4.4-bin-hadoop2.7/README.md"
    // Get or create the SparkSession; the master URL is supplied later by spark-submit
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    // Read the file as a Dataset[String]; cache it because it is scanned twice below
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}
The build.sbt file contains (%% makes sbt append the Scala binary version, so the dependency resolves to spark-sql_2.11):
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
2.2.2 Compiling
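From the project root, build the jar (sbt's default naming convention determines the output path):
sbt package
This produces target/scala-2.11/simple-project_2.11-1.0.jar, derived from the name, scalaVersion, and version fields in build.sbt.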
2.2.3 Running the program with the spark-submit script
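For example, to run the job locally on four cores (a sketch; replace the master URL when submitting to a cluster):
spark-submit --class "SimpleApp" --master "local[4]" target/scala-2.11/simple-project_2.11-1.0.jar
If everything works, the output includes a line like: Lines with a: ..., Lines with b: ...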
Notes on issues: