learn spark

Codlife

2016.09.07 13:46:31

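Source: Spark source code
1: Default number of tasks for Spark input data
Answer: it depends on the case:
RDD:
hadoopFile: computes the input splits; a parallelism parameter is passed in
sc.parallelize(): the default is spark.default.parallelism. When that property is not set, the fallback depends on the scheduler backend:
Local mode: number of cores on the local machine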


Mesos fine grained mode: 8


Others: total number of cores on all executor nodes or 2, whichever is larger
This also covers YARN, because YarnSchedulerBackend extends CoarseGrainedSchedulerBackend.
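
The fallback rules above can be summarized in a small pure-Python model. This is only a sketch of the decision logic, not Spark's code: the function name `default_parallelism`, the mode strings, and the `configured` parameter (standing in for an explicit spark.default.parallelism setting) are all made up for illustration.

```python
from typing import Optional


def default_parallelism(mode: str, total_cores: int,
                        configured: Optional[int] = None) -> int:
    """Model of how the default parallelism is chosen (illustrative only).

    mode        -- hypothetical label: "local", "mesos-fine-grained", or
                   anything else for the coarse-grained backends (e.g. YARN)
    total_cores -- cores on the local machine (local mode) or the total
                   across all executors (coarse-grained backends)
    configured  -- an explicit spark.default.parallelism value, if any
    """
    # An explicit setting always wins over the backend-specific fallback.
    if configured is not None:
        return configured
    if mode == "local":
        return total_cores
    if mode == "mesos-fine-grained":
        return 8
    # Coarse-grained backends: total executor cores, but never less than 2.
    return max(total_cores, 2)


local_default = default_parallelism("local", 4)
mesos_default = default_parallelism("mesos-fine-grained", 32)
yarn_single_core = default_parallelism("yarn", 1)
yarn_overridden = default_parallelism("yarn", 16, configured=200)
```

Under this model a single-core YARN cluster still gets 2 as the default, and an explicit setting overrides every backend-specific value.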


Dataset, which is used heavily in Spark 2.0:
ExecutedCommandExec

2: Use groupByKey with caution; it can cause an OOM

Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any
key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
It’s recommended to use PairRDDFunctions.aggregateByKey
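
To make the memory difference concrete, here is a minimal pure-Python sketch, not Spark code. The helpers `group_by_key` and `aggregate_by_key` are hypothetical stand-ins: the second only borrows the argument shape (zero value, per-partition function, merge function) of aggregateByKey, and the nested lists play the role of partitions.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect every value for each key into a list (the groupByKey pattern).

    All values of a key are held in memory at once, which is what
    becomes a problem when one key has very many values.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def aggregate_by_key(partitions, zero_value, seq_func, comb_func):
    """Keep one accumulator per key instead of a list of values."""
    partials = []
    for partition in partitions:
        acc = {}
        for key, value in partition:
            # Fold each value into the running accumulator for its key.
            acc[key] = seq_func(acc.get(key, zero_value), value)
        partials.append(acc)
    merged = {}
    for acc in partials:
        for key, partial in acc.items():
            # Combine the per-partition accumulators for the same key.
            merged[key] = comb_func(merged[key], partial) if key in merged else partial
    return merged


partitions = [[("a", 1), ("b", 2), ("a", 3)], [("a", 4), ("b", 5)]]
all_pairs = [pair for partition in partitions for pair in partition]

# Group first, then sum: needs the full value list for each key.
sums_via_group = {k: sum(vs) for k, vs in group_by_key(all_pairs).items()}

# Aggregate directly: needs only one running total per key.
sums_via_aggregate = aggregate_by_key(
    partitions, 0, lambda acc, v: acc + v, lambda a, b: a + b)
```

Both paths yield the same per-key sums, but the aggregating version never materializes a key's full list of values, so a very hot key costs one accumulator rather than one list entry per record.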

Last edited: 2017.12.03 19:03:28

"[No tips needed; sharing is what matters most!](https://github.com/codlife)"

还没有人赞赏，支持一下

CodlifeSoftware engineer at Microsoft<br>Spark Contrib...

总资产2共写了4.3W字获得121个赞共97个粉丝

learn spark

推荐阅读更多精彩内容