(6) Spark SQL: Reading and Writing Local External Data Sources

https://spark-packages.org/ hosts many third-party data source packages; once you load a package into Spark, you can use it directly.

The CSV format has been built in since Spark 2.0; before 2.0 it was a third-party data source.
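For example, on a pre-2.0 Spark you could pull Databricks' spark-csv package from spark-packages.org at shell startup with --packages. A minimal sketch (the coordinates shown are the Scala 2.10 build and the input path is hypothetical):

[hadoop@hadoop000 bin]$ ./spark-shell --master local[2] --packages com.databricks:spark-csv_2.10:1.5.0
//Inside the shell, refer to the package by its fully-qualified format name
scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("file:///home/hadoop/data/people.csv")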
I. Reading local external data sources
1. Reading a JSON file directly

[hadoop@hadoop000 bin]$ ./spark-shell --master local[2] --jars ~/software/mysql-connector-java-5.1.27.jar 
scala> spark.read.load("file:///home/hadoop/app/spark-2.3.1-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json").show

This fails at runtime with:

Caused by: java.lang.RuntimeException: file:/home/hadoop/app/spark-2.3.1-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [49, 57, 125, 10]
  at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:476)
  at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:445)
  at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:421)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:519)
  ... 32 more

Looking at the source code of the load method:

 /**
   * Loads input in as a `DataFrame`, for data sources that require a path (e.g. data backed by
   * a local or distributed file system).
   *
   * @since 1.4.0
   */
  def load(path: String): DataFrame = {
    option("path", path).load(Seq.empty: _*) // force invocation of `load(...varargs...)`
  }
---------------------------------------------------------
/**
   * Loads input in as a `DataFrame`, for data sources that support multiple paths.
   * Only works if the source is a HadoopFsRelationProvider.
   *
   * @since 1.6.0
   */
  @scala.annotation.varargs
  def load(paths: String*): DataFrame = {
    if (source.toLowerCase(Locale.ROOT) == DDLUtils.HIVE_PROVIDER) {
      throw new AnalysisException("Hive data source can only be used with tables, you can not " +
        "read files of Hive data source directly.")
    }

    val cls = DataSource.lookupDataSource(source, sparkSession.sessionState.conf)
    if (classOf[DataSourceV2].isAssignableFrom(cls)) {
      val ds = cls.newInstance()
      val options = new DataSourceOptions((extraOptions ++
        DataSourceV2Utils.extractSessionConfigs(
          ds = ds.asInstanceOf[DataSourceV2],
          conf = sparkSession.sessionState.conf)).asJava)

      // Streaming also uses the data source V2 API. So it may be that the data source implements
      // v2, but has no v2 implementation for batch reads. In that case, we fall back to loading
      // the dataframe as a v1 source.
      val reader = (ds, userSpecifiedSchema) match {
        case (ds: ReadSupportWithSchema, Some(schema)) =>
          ds.createReader(schema, options)

        case (ds: ReadSupport, None) =>
          ds.createReader(options)

        case (ds: ReadSupportWithSchema, None) =>
          throw new AnalysisException(s"A schema needs to be specified when using $ds.")

        case (ds: ReadSupport, Some(schema)) =>
          val reader = ds.createReader(options)
          if (reader.readSchema() != schema) {
            throw new AnalysisException(s"$ds does not allow user-specified schemas.")
          }
          reader

        case _ => null // fall back to v1
      }

      if (reader == null) {
        loadV1Source(paths: _*)
      } else {
        Dataset.ofRows(sparkSession, DataSourceV2Relation(reader))
      }
    } else {
      loadV1Source(paths: _*)
    }
  }

  private def loadV1Source(paths: String*) = {
    // Code path for data source v1.
    sparkSession.baseRelationToDataFrame(
      DataSource.apply(
        sparkSession,
        paths = paths,
        userSpecifiedSchema = userSpecifiedSchema,
        className = source,
        options = extraOptions.toMap).resolveRelation())
  }
------------------------------------------------------
private var source: String = sparkSession.sessionState.conf.defaultDataSourceName
-------------------------------------------------------
def defaultDataSourceName: String = getConf(DEFAULT_DATA_SOURCE_NAME)
--------------------------------------------------------
// This is used to set the default data source
  val DEFAULT_DATA_SOURCE_NAME = buildConf("spark.sql.sources.default")
    .doc("The default data source to use in input/output.")
    .stringConf
    .createWithDefault("parquet")

As the source shows, if no format is specified, load reads Parquet files by default (spark.sql.sources.default defaults to parquet).
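If you prefer, the default data source can also be changed at the session level through spark.sql.sources.default instead of calling format() every time. A minimal sketch:

//Switch the default data source to json for this session,
//after which a plain load(...) of the JSON file above works
scala> spark.conf.set("spark.sql.sources.default", "json")
scala> spark.read.load("file:///home/hadoop/app/spark-2.3.1-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json").show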

scala> val users = spark.read.load("file:///home/hadoop/app/spark-2.3.1-bin-2.6.0-cdh5.7.0/examples/src/main/resources/users.parquet")
scala> users.show()
+------+--------------+----------------+                                        
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+

To read files in other formats, you must specify the file format via format(), for example:

//In a Windows IDEA environment
val df1 = spark.read.format("json").option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ").load("hdfs://192.168.137.141:9000/data/people.json")
df1.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")必须带上,不然报错

Exception in thread "main" java.lang.IllegalArgumentException: Illegal pattern component: XXX
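As an aside, DataFrameReader also provides format-specific shortcut methods; the call below is equivalent to the format("json").load(...) call above (a sketch reusing the same path and option):

//Equivalent shortcut: read.json(...) does the same as read.format("json").load(...)
val df1 = spark.read
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
  .json("hdfs://192.168.137.141:9000/data/people.json")
df1.show()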

2. Reading CSV files

//Source file contents:
[hadoop@hadoop001 ~]$ hadoop fs -text /data/people.csv
name;age;job
Jorge;30;Developer
Bob;32;Developer
//In a Windows IDEA environment
val df2 = spark.read.format("csv")
      .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
      .option("sep",";")
      .option("header","true")     //use first line of all files as header
      .option("inferSchema","true")
      .load("hdfs://192.168.137.141:9000/data/people.csv")
df2.show()
df2.printSchema()
//Output:
+-----+---+---------+
| name|age|      job|
+-----+---+---------+
|Jorge| 30|Developer|
|  Bob| 32|Developer|
+-----+---+---------+

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- job: string (nullable = true)
-----------------------------------------------------------
//如果不指定option("sep",";")
+------------------+
|      name;age;job|
+------------------+
|Jorge;30;Developer|
|  Bob;32;Developer|
+------------------+
//如果不指定option("header","true")
+-----+---+---------+
|  _c0|_c1|      _c2|
+-----+---+---------+
| name|age|      job|
|Jorge| 30|Developer|
|  Bob| 32|Developer|
+-----+---+---------+

When reading a CSV file you can also supply a custom schema:

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val peopleschema = StructType(Array(
  StructField("hlwname", StringType, true),
  StructField("hlwage", IntegerType, true),
  StructField("hlwjob", StringType, true)))
val df2 = spark.read.format("csv")
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
  .option("sep", ";")
  .option("header", "true")
  .schema(peopleschema)
  .load("hdfs://192.168.137.141:9000/data/people.csv")
//Print for testing
df2.show()
df2.printSchema()
Output:
+-------+------+---------+
|hlwname|hlwage|   hlwjob|
+-------+------+---------+
|  Jorge|    30|Developer|
|    Bob|    32|Developer|
+-------+------+---------+

root
 |-- hlwname: string (nullable = true)
 |-- hlwage: integer (nullable = true)
 |-- hlwjob: string (nullable = true)
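Since Spark 2.3 the schema can also be supplied as a DDL-formatted string, which avoids building the StructType by hand. A sketch equivalent to the code above:

//DDL-string schema, equivalent to the StructType above (Spark 2.3+)
val df3 = spark.read.format("csv")
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
  .option("sep", ";")
  .option("header", "true")
  .schema("hlwname STRING, hlwage INT, hlwjob STRING")
  .load("hdfs://192.168.137.141:9000/data/people.csv")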

II. Writing the data out in a different format

//Write the users.parquet data read above out in JSON format
scala> users.select("name","favorite_color").write.format("json").save("file:///home/hadoop/tmp/parquet2json/")
[hadoop@hadoop000 ~]$ cd /home/hadoop/tmp/parquet2json
[hadoop@hadoop000 parquet2json]$ ll
total 4
-rw-r--r--. 1 hadoop hadoop 56 Sep 24 10:15 part-00000-dfbd9ba5-598f-4e0c-8e81-df85120333db-c000.json
-rw-r--r--. 1 hadoop hadoop  0 Sep 24 10:15 _SUCCESS
[hadoop@hadoop000 parquet2json]$ cat part-00000-dfbd9ba5-598f-4e0c-8e81-df85120333db-c000.json 
{"name":"Alyssa"}
{"name":"Ben","favorite_color":"red"}
//Write the people.json data read above out in CSV format
df1.write.format("csv")
     .mode("overwrite")
     .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
     .save("hdfs://192.168.137.141:9000/data/formatconverttest/")
------------------------------------------
[hadoop@hadoop001 ~]$ hadoop fs -text /data/formatconverttest/part-00000-6fd65eff-d0d3-43e5-9549-2b11bc3ca9de-c000.csv
,Michael
30,Andy
19,Justin
//Note that without .option("header","true") the written CSV loses the name,age header row
//If .option("sep",";") is not specified, comma is used as the default delimiter (see the sketch below)
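A sketch of the same write with both options set, so the header row and the original ";" delimiter are preserved:

df1.write.format("csv")
  .mode("overwrite")
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
  .option("header", "true")   //keep the name,age header row
  .option("sep", ";")         //use ';' instead of the default ','
  .save("hdfs://192.168.137.141:9000/data/formatconverttest/")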

The point of this exercise is learning format conversion. In production, incoming data usually arrives in row-oriented formats such as text or JSON; it is generally converted to columnar formats such as ORC or Parquet, and with compression the files can shrink to roughly 10% of their original size, greatly reducing I/O and the volume of data processed and improving performance.
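A minimal sketch of such a conversion, turning the row-oriented JSON file above into snappy-compressed Parquet (the output path is hypothetical):

//Read JSON and write it back as snappy-compressed Parquet
spark.read.format("json")
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
  .load("hdfs://192.168.137.141:9000/data/people.json")
  .write
  .format("parquet")
  .option("compression", "snappy")
  .save("hdfs://192.168.137.141:9000/data/people_parquet/")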
If you now run the earlier save again with the same path (the parquet2json directory above), it fails:

scala> users.select("name","favorite_color").write.format("json").save("file:///home/hadoop/tmp/parquet2json/")
org.apache.spark.sql.AnalysisException: path file:/home/hadoop/tmp/parquet2json already exists.;
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:109)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
.........................................................

This can be resolved by setting the save mode.

The default mode is errorIfExists.

scala> users.select("name","favorite_color").write.format("json").mode("overwrite").save("file:///home/hadoop/tmp/parquet2json/")
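Besides the string form, the mode can also be given with the SaveMode enum. A sketch listing the four modes:

import org.apache.spark.sql.SaveMode

users.select("name", "favorite_color").write
  .format("json")
  .mode(SaveMode.Overwrite)        //replace existing data at the path
  //.mode(SaveMode.Append)         //append to existing data
  //.mode(SaveMode.ErrorIfExists)  //the default: fail if the path exists
  //.mode(SaveMode.Ignore)         //silently skip the write if the path exists
  .save("file:///home/hadoop/tmp/parquet2json/")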