1. Creating an RDD from a collection
val data = Array(1, 2, 3, 4, 5)      // an ordinary Scala collection on the driver
val distData = sc.parallelize(data)  // distribute it across the cluster as an RDD
distData.collect                     // gather all elements back to the driver
Spark will run one task for each partition of the cluster.
One partition corresponds to one task.
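Since each partition becomes one task, you can raise the parallelism by passing a partition count as parallelize's optional second argument. A minimal sketch (distData4 is an illustrative name; sc is the SparkContext provided by spark-shell):

// Spread the collection over 4 partitions, so Spark schedules 4 tasks.
val distData4 = sc.parallelize(Array(1, 2, 3, 4, 5), 4)
distData4.getNumPartitions  // returns 4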
2. Creating an RDD from an external file
scala> val distFile = sc.textFile("data.txt")
- If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
For a local filesystem path, every worker node must have the same directory and file at that path.
- All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
On HDFS, this lets you read every file under a directory, or select files with a wildcard, as sketched below.
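A sketch of these variants, reusing the /my/directory paths from the quote above (the paths are illustrative, not real data):

scala> val wholeDir = sc.textFile("/my/directory")        // every file in the directory
scala> val txtFiles = sc.textFile("/my/directory/*.txt")  // wildcard match on .txt files only
scala> val gzFiles  = sc.textFile("/my/directory/*.gz")   // gzip files are decompressed transparently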
- The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
You can set this argument to choose the partition count, but it can never be smaller than the number of blocks.
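For example, to request more partitions than the default one-per-block (data.txt as in the snippet above):

scala> val distFile = sc.textFile("data.txt", 10)  // ask for at least 10 partitions
scala> distFile.getNumPartitions                   // >= 10; never fewer than the block count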