Let's start with an example:
import org.apache.spark.rdd.RDD

/*
 Search is a class with a single String constructor parameter, query.
 It has three methods:
 1) isMatch: returns whether the string s contains the substring query
 2) getMatchRdd1: filters the RDD using the isMatch method
 3) getMatchRdd2: filters the RDD with an inline anonymous function
*/
class Search(query: String) {
  def isMatch(s: String): Boolean = {
    s.contains(query)
  }

  def getMatchRdd1(rdd: RDD[String]): RDD[String] = {
    rdd.filter(isMatch)
  }

  def getMatchRdd2(rdd: RDD[String]): RDD[String] = {
    rdd.filter(_.contains(query))
  }
}
import org.apache.spark.{SparkConf, SparkContext}

object SerializableDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("SerializableDemo")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(Array("hello", "world", "hello", "spark"))
    val search = new Search("h")
    val matchRdd = search.getMatchRdd2(rdd)
    matchRdd.collect().foreach(println)
    sc.stop()
  }
}
Running it produces:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:388)
    at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:387)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.filter(RDD.scala:387)
    at adamlee.spark.Search.getMatchRdd2(SerializableDemo.scala:34)
    at adamlee.spark.SerializableDemo$.main(SerializableDemo.scala:16)
    at adamlee.spark.SerializableDemo.main(SerializableDemo.scala)
Caused by: java.io.NotSerializableException: adamlee.spark.Search
Serialization stack:
    - object not serializable (class: adamlee.spark.Search, value: adamlee.spark.Search@4cafa9aa)
    - field (class: adamlee.spark.Search$$anonfun$1, name: $outer, type: class adamlee.spark.Search)
    - object (class adamlee.spark.Search$$anonfun$1, <function1>)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
    ... 12 more
The error says the Task could not be serialized, and the Caused by line narrows it down: object not serializable, i.e. an instance of the Search class could not be serialized.
The reason: the search object is created on the Driver. When collect() runs, it triggers the computation and the Driver must ship the filter task to the Executors. That is inter-process communication, and only serialized byte streams can travel over the network. The closure `_.contains(query)` reads query, which is a field of Search, so the closure holds a reference to the whole Search instance (the $outer field in the serialization stack above). Since Search does not extend the Serializable trait, the instance, and with it the task, cannot be serialized.
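This capture of the enclosing instance can be reproduced without Spark at all. The sketch below (hypothetical names `Holder` and `canSerialize`; plain JDK serialization standing in for Spark's task serializer) shows that a lambda reading a field of a non-serializable class drags the whole instance along with it:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Like Search: not Serializable, and `query` becomes a private field.
class Holder(query: String) {
  // The lambda body reads this.query, so the closure captures `this`.
  def predicate: String => Boolean = s => s.contains(query)
}

object CaptureDemo {
  // Try to serialize an object the same way Spark serializes a task.
  def canSerialize(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    // Fails: the closure holds a reference to the Holder instance.
    println(canSerialize(new Holder("h").predicate))
    // Succeeds: a plain String is serializable on its own.
    println(canSerialize("hello"))
  }
}
```

Spark's ClosureCleaner performs essentially this check up front, which is why the exception is raised on the Driver before any task is sent.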
Now let's define a new class, Search1, that extends Serializable:
class Search1(query: String) extends Serializable {
  def isMatch(s: String): Boolean = {
    s.contains(query)
  }

  def getMatchRdd1(rdd: RDD[String]): RDD[String] = {
    rdd.filter(isMatch)
  }

  def getMatchRdd2(rdd: RDD[String]): RDD[String] = {
    rdd.filter(_.contains(query))
  }
}

object SerializableDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("SerializableDemo")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(Array("hello", "world", "hello", "spark"))
    val search1 = new Search1("h")
    val matchRdd = search1.getMatchRdd2(rdd)
    matchRdd.collect().foreach(println)
    sc.stop()
  }
}
Running it now prints:
hello
hello
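Extending Serializable is not the only fix. Another common pattern is to copy the field into a local val inside the method, so the closure captures only that value rather than `this`. A standalone sketch (the `Search2` name is hypothetical, and JDK serialization again stands in for Spark's task serializer):

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Search2 does NOT extend Serializable, yet its closure is safe to ship:
class Search2(query: String) {
  def predicate: String => Boolean = {
    val q = query        // copy the field into a local val
    s => s.contains(q)   // the closure captures only q (a String), not this
  }
}

object LocalCopyDemo {
  def main(args: Array[String]): Unit = {
    val p = new Search2("h").predicate
    // No exception here: the closure carries only a serializable String.
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(p)
    println(p("hello")) // the predicate still works as expected
  }
}
```

In Spark terms, writing `val q = query; rdd.filter(_.contains(q))` inside the method would let the job run even though the enclosing class is not serializable, and it avoids shipping the whole object to every Executor.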