RDD Persistence

From the official documentation

One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations

That is, one of Spark's most important features is keeping a dataset in memory (caching it) so that several operations can share it.

When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x)

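When you persist an RDD, each node keeps in memory the partitions of that RDD that it computed itself (not every partition of the RDD),
and reuses them in later actions on that dataset
(or on datasets derived from it).
Those later actions therefore run much faster, often by more than 10x.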
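
The reuse described above can be illustrated with a toy model in plain Python. This is only a sketch and not Spark's implementation: the `ToyRDD` class, its `compute_calls` counter, and its `count`/`sum` actions are hypothetical names made up for this illustration. It counts how many partitions are computed with and without persistence.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, partitions, fn):
        self.partitions = partitions  # list of lists of input values
        self.fn = fn                  # transformation, applied lazily
        self.persisted = False
        self.cache = {}               # partition index -> computed values
        self.compute_calls = 0        # number of partition computations

    def persist(self):
        # Only marks the dataset; nothing is computed until an action runs.
        self.persisted = True
        return self

    def _compute(self, index):
        # Reuse a stored partition if this dataset was marked as persisted.
        if self.persisted and index in self.cache:
            return self.cache[index]
        self.compute_calls += 1
        values = [self.fn(x) for x in self.partitions[index]]
        if self.persisted:
            self.cache[index] = values
        return values

    def count(self):
        # Action: number of elements across all partitions.
        return sum(len(self._compute(i)) for i in range(len(self.partitions)))

    def sum(self):
        # Action: sum of all elements across all partitions.
        return sum(sum(self._compute(i)) for i in range(len(self.partitions)))


data = [[1, 2], [3, 4], [5, 6]]

# Without persistence, every action recomputes every partition.
plain = ToyRDD(data, lambda x: x * x)
plain.count()
plain.sum()

# With persistence, the second action reuses the stored partitions.
cached = ToyRDD(data, lambda x: x * x).persist()
cached.count()
cached.sum()

print(plain.compute_calls, cached.compute_calls)  # prints "6 3"
```

In this toy model the two actions on the unpersisted dataset compute 6 partitions in total, while the persisted one computes only 3. The second action reads the stored results instead of recomputing them.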

Caching is a key tool for iterative algorithms and fast interactive use

Caching is therefore a key tool for iterative algorithms and for fast interactive use.

You can mark an RDD to be persisted using the persist() or cache() methods on it

You can mark an RDD for persistence by calling its persist() or cache() method.

cache()

2.1 Source code

Last edited: 2021.11.22 12:09:30
