Flink Checkpoint Mechanism Explained: A Code Walkthrough

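Flink's checkpoint mechanism is the foundation of its fault tolerance. It saves the state of a running streaming job so that, when a failure occurs, the job can be restored from the saved state. For example, when Flink reads from Kafka, it saves the Kafka offsets it has consumed; if the job fails, consumption can resume from the last saved offset.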

Flink's checkpointing can be understood from the following aspects.

Barrier

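A barrier is a lightweight piece of data that is inserted into the original data stream according to a certain rule (on a schedule). It does not slow down processing of the original data and does not change the order of the original records. Barriers split the data into segments, somewhat like the batches in Spark Streaming's micro-batch model.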

[Figure: Barrier.png]

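Barriers flow through the Tasks together with the data. When an Operator receives a barrier, it treats all data before that barrier as already processed, and this is when a checkpoint is triggered.
If an Operator has several concurrent input streams, then on receiving the barrier of a checkpoint it has to wait until the barriers for the same checkpoint have arrived from all other inputs before it proceeds. The steps of this barrier alignment are as follows.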

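As soon as an operator receives barrier n from one upstream, it can no longer process the data that follows the barrier on that stream until it has received barrier n from all other upstreams.
In the meantime, the streams that have already delivered barrier n are put on hold. Data read from them is not processed but is cached in the input buffer.
Once barrier n arrives from the last upstream, the operator emits barrier n downstream.
After that, the operator resumes processing data from all upstreams, draining the data in the input buffer before it reads new data from the upstream streams.
The whole process is shown in the figure below.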

[Figure: barrier.png]
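
The alignment steps above can be made concrete with a small model. This is only a sketch under simplifying assumptions: the class names BarrierAlignmentSketch, Element and Aligner are made up for illustration, records are plain strings, and a string such as "BARRIER-1" in the output stands for "checkpoint taken and barrier forwarded". It is not Flink's BarrierBuffer.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

/** Hypothetical, simplified model of barrier alignment; not Flink's BarrierBuffer. */
final class BarrierAlignmentSketch {

    /** One element on an input channel: either a data record or a barrier. */
    static final class Element {
        final int channel;
        final String value;    // null when this element is a barrier
        final long barrierId;  // -1 when this element is a data record

        private Element(int channel, String value, long barrierId) {
            this.channel = channel;
            this.value = value;
            this.barrierId = barrierId;
        }

        static Element data(int channel, String value) {
            return new Element(channel, value, -1L);
        }

        static Element barrier(int channel, long id) {
            return new Element(channel, null, id);
        }

        boolean isBarrier() {
            return value == null;
        }
    }

    /** Aligns barriers across a fixed number of input channels. */
    static final class Aligner {
        private final int numChannels;
        private final Set<Integer> blocked = new HashSet<>();
        private Queue<Element> buffered = new ArrayDeque<>();
        final List<String> output = new ArrayList<>();

        Aligner(int numChannels) {
            this.numChannels = numChannels;
        }

        void onElement(Element e) {
            if (blocked.contains(e.channel)) {
                // This channel already delivered the barrier: cache, do not process.
                buffered.add(e);
                return;
            }
            if (!e.isBarrier()) {
                output.add(e.value);
                return;
            }
            blocked.add(e.channel);
            if (blocked.size() == numChannels) {
                // Last barrier arrived: forward it, unblock, then drain the cache first.
                output.add("BARRIER-" + e.barrierId);
                blocked.clear();
                Queue<Element> replay = buffered;
                buffered = new ArrayDeque<>();
                for (Element cached : replay) {
                    onElement(cached);
                }
            }
        }
    }

    public static void main(String[] args) {
        Aligner aligner = new Aligner(2);
        aligner.onElement(Element.data(0, "a"));
        aligner.onElement(Element.barrier(0, 1L));
        aligner.onElement(Element.data(0, "b")); // cached: channel 0 is blocked
        aligner.onElement(Element.data(1, "c"));
        aligner.onElement(Element.barrier(1, 1L));
        System.out.println(aligner.output);
    }
}
```

In the usage in main, record "b" arrives on channel 0 after that channel's barrier, so it is held back until channel 1 delivers its barrier and only then processed.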

How barriers are generated

Flink checkpoints are initiated by the JobMaster, which periodically triggers the Source Tasks to generate barriers.

The JobMaster triggers checkpoints periodically.

This happens in the startCheckpointScheduler() method of the CheckpointCoordinator class:

    public void startCheckpointScheduler() {
        synchronized (lock) {
            if (shutdown) {
                throw new IllegalArgumentException("Checkpoint coordinator is shut down");
            }

            // make sure all prior timers are cancelled
            stopCheckpointScheduler();
            // trigger checkpoints at the fixed period baseInterval
            periodicScheduling = true;
            long initialDelay = ThreadLocalRandom.current().nextLong(
                minPauseBetweenCheckpointsNanos / 1_000_000L, baseInterval + 1L);
            currentPeriodicTrigger = timer.scheduleAtFixedRate(
                    new ScheduledTrigger(), initialDelay, baseInterval, TimeUnit.MILLISECONDS);
        }
    }
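
The scheduling pattern in this method can be tried out with nothing but the JDK. The sketch below is an illustration under assumptions: PeriodicTriggerSketch and its method names are invented here, both intervals are given in milliseconds, and the base interval is assumed to be positive. It mirrors the idea of the excerpt (cancel any earlier timer, pick a random initial delay between the minimum pause and the base interval, then fire at a fixed rate) rather than reproducing CheckpointCoordinator.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

/** Hypothetical periodic trigger; a simplified stand-in for the scheduler shown above. */
final class PeriodicTriggerSketch {
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    private final long baseIntervalMillis;
    private final long minPauseMillis;
    private ScheduledFuture<?> currentTrigger;

    PeriodicTriggerSketch(long baseIntervalMillis, long minPauseMillis) {
        this.minPauseMillis = minPauseMillis;
        // Keep the interval at least as long as the minimum pause so the random range is valid.
        this.baseIntervalMillis = Math.max(baseIntervalMillis, minPauseMillis);
    }

    /** Random delay in [minPauseMillis, baseIntervalMillis], so jobs do not all fire at once. */
    long randomInitialDelayMillis() {
        return ThreadLocalRandom.current().nextLong(minPauseMillis, baseIntervalMillis + 1L);
    }

    synchronized void start(Runnable trigger) {
        stop(); // make sure a previously scheduled timer is cancelled
        currentTrigger = timer.scheduleAtFixedRate(
                trigger, randomInitialDelayMillis(), baseIntervalMillis, TimeUnit.MILLISECONDS);
    }

    synchronized void stop() {
        if (currentTrigger != null) {
            currentTrigger.cancel(false);
            currentTrigger = null;
        }
    }

    void shutdown() {
        stop();
        timer.shutdownNow();
    }

    public static void main(String[] args) throws Exception {
        PeriodicTriggerSketch scheduler = new PeriodicTriggerSketch(50L, 10L);
        scheduler.start(() -> System.out.println("trigger checkpoint"));
        Thread.sleep(300L);
        scheduler.shutdown();
    }
}
```

The random initial delay is a design choice worth noting: it spreads the first trigger over the interval instead of firing immediately after the scheduler starts.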

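Following ScheduledTrigger leads to triggerCheckpoint, the method with which the JobMaster triggers a checkpoint. After a series of preparatory steps, this method generates a unique checkpoint ID, creates a pending checkpoint, and calls the RPC method execution.triggerCheckpoint(checkpointID, timestamp, checkpointOptions); to make the Task side generate barriers and perform the checkpoint.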

checkpointID = checkpointIdCounter.getAndIncrement();

final PendingCheckpoint checkpoint = new PendingCheckpoint(
                job,
                checkpointID,
                timestamp,
                ackTasks,
                props,
                checkpointStorageLocation,
                executor);

// send the messages to the tasks that trigger their checkpoint
for (Execution execution: executions) {
    execution.triggerCheckpoint(checkpointID, timestamp, checkpointOptions);
}

    public void triggerCheckpoint(long checkpointId, long timestamp, CheckpointOptions checkpointOptions) {
        final LogicalSlot slot = assignedResource;

        if (slot != null) {
            final TaskManagerGateway taskManagerGateway = slot.getTaskManagerGateway();
            // invoke the TaskManager RPC taskManagerGateway.triggerCheckpoint
            taskManagerGateway.triggerCheckpoint(attemptId, getVertex().getJobId(), checkpointId, timestamp, checkpointOptions);
        } else {
            LOG.debug("The execution has no slot assigned. This indicates that the execution is " +
                "no longer running.");
        }
    }

The Source generates a barrier after receiving the instruction

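After the TaskManager receives the RPC that triggers the checkpoint, it triggers generation of the checkpoint barrier in the Source Task. triggerCheckpointBarrier creates a separate thread dedicated to generating the barrier.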

public void triggerCheckpointBarrier(
            final long checkpointID,
            long checkpointTimestamp,
            final CheckpointOptions checkpointOptions) {

    Runnable runnable = new Runnable() {
                @Override
                public void run() {
                    // set safety net from the task's context for checkpointing thread
                    LOG.debug("Creating FileSystem stream leak safety net for {}", Thread.currentThread().getName());
                    FileSystemSafetyNet.setSafetyNetCloseableRegistryForThread(safetyNetCloseableRegistry);

                    try {
                        boolean success = invokable.triggerCheckpoint(checkpointMetaData, checkpointOptions);
                        if (!success) {
                            checkpointResponder.declineCheckpoint(
                                    getJobID(), getExecutionId(), checkpointID,
                                    new CheckpointDeclineTaskNotReadyException(taskName));
                        }
                    }

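In a Source Task, the method that actually does the work is performCheckpoint in StreamTask. It does two things: first it generates a barrier carrying the checkpoint ID and sends it to all downstream tasks, then it saves this Task's own state. This is how barriers start flowing through the whole streaming job.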

    private boolean performCheckpoint(
            CheckpointMetaData checkpointMetaData,
            CheckpointOptions checkpointOptions,
            CheckpointMetrics checkpointMetrics) throws Exception {

        LOG.debug("Starting checkpoint ({}) {} on task {}",
            checkpointMetaData.getCheckpointId(), checkpointOptions.getCheckpointType(), getName());

        synchronized (lock) {
            if (isRunning) {
                // we can do a checkpoint

                // All of the following steps happen as an atomic step from the perspective of barriers and
                // records/watermarks/timers/callbacks.
                // We generally try to emit the checkpoint barrier as soon as possible to not affect downstream
                // checkpoint alignments

                // Step (1): Prepare the checkpoint, allow operators to do some pre-barrier work.
                //           The pre-barrier work should be nothing or minimal in the common case.
                operatorChain.prepareSnapshotPreBarrier(checkpointMetaData.getCheckpointId());

                // Step (2): Send the checkpoint barrier downstream
                operatorChain.broadcastCheckpointBarrier(
                        checkpointMetaData.getCheckpointId(),
                        checkpointMetaData.getTimestamp(),
                        checkpointOptions);

                // Step (3): Take the state snapshot. This should be largely asynchronous, to not
                //           impact progress of the streaming topology
                checkpointState(checkpointMetaData, checkpointOptions, checkpointMetrics);
                return true;
            }
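
Why the barrier and the snapshot are handled under one lock is easier to see in a toy version. The sketch below is an assumption-laden simplification: SourceCheckpointSketch is a made-up class, the only state is a count of emitted records, and the snapshot is taken synchronously. It shows that when record emission and performCheckpoint share a lock, the snapshot reflects exactly the records that precede the barrier.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified source task; not Flink's StreamTask. */
final class SourceCheckpointSketch {
    private final Object lock = new Object();
    private final List<String> downstream = new ArrayList<>();
    private final Map<Long, Long> snapshots = new HashMap<>();
    private long emittedCount = 0; // the task's only state in this sketch

    void emitRecord(String record) {
        synchronized (lock) {
            downstream.add(record);
            emittedCount++;
        }
    }

    /** Emits the barrier first, then snapshots the state, atomically w.r.t. records. */
    void performCheckpoint(long checkpointId) {
        synchronized (lock) {
            downstream.add("BARRIER-" + checkpointId);  // send the barrier downstream
            snapshots.put(checkpointId, emittedCount);  // take the state snapshot
        }
    }

    List<String> downstream() {
        synchronized (lock) {
            return new ArrayList<>(downstream);
        }
    }

    Long snapshotOf(long checkpointId) {
        synchronized (lock) {
            return snapshots.get(checkpointId);
        }
    }

    public static void main(String[] args) {
        SourceCheckpointSketch source = new SourceCheckpointSketch();
        source.emitRecord("a");
        source.emitRecord("b");
        source.performCheckpoint(1L);
        source.emitRecord("c");
        System.out.println(source.downstream() + " snapshot=" + source.snapshotOf(1L));
    }
}
```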

Barrier propagation

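As described above, a Source Task periodically inserts barriers into the original data on the JobMaster's instruction and passes them to the downstream Operators.
A non-Source Task does not trigger checkpoints periodically while processing data; instead it triggers a checkpoint each time it encounters a barrier.
In the code, this is driven by getNextNonBlocked in BarrierBuffer.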

    public BufferOrEvent getNextNonBlocked() throws Exception {
        while (true) {
            // process buffered BufferOrEvents before grabbing new ones
            Optional<BufferOrEvent> next;
            if (currentBuffered == null) {
                next = inputGate.getNextBufferOrEvent();
            }
            else {
                next = Optional.ofNullable(currentBuffered.getNext());
                if (!next.isPresent()) {
                    completeBufferedSequence();
                    return getNextNonBlocked();
                }
            }

            if (!next.isPresent()) {
                if (!endOfStream) {
                    // end of input stream. stream continues with the buffered data
                    endOfStream = true;
                    releaseBlocksAndResetBarriers();
                    return getNextNonBlocked();
                }
                else {
                    // final end of both input and buffered data
                    return null;
                }
            }

            BufferOrEvent bufferOrEvent = next.get();
            if (isBlocked(bufferOrEvent.getChannelIndex())) {
                // if the channel is blocked we, we just store the BufferOrEvent
                bufferBlocker.add(bufferOrEvent);
                checkSizeLimit();
            }
            else if (bufferOrEvent.isBuffer()) {
                return bufferOrEvent;
            }
            else if (bufferOrEvent.getEvent().getClass() == CheckpointBarrier.class) {
                if (!endOfStream) {
                    // process barriers only if there is a chance of the checkpoint completing
                    processBarrier((CheckpointBarrier) bufferOrEvent.getEvent(), bufferOrEvent.getChannelIndex());
                }
            }
            else if (bufferOrEvent.getEvent().getClass() == CancelCheckpointMarker.class) {
                processCancellationBarrier((CancelCheckpointMarker) bufferOrEvent.getEvent());
            }
            else {
                if (bufferOrEvent.getEvent().getClass() == EndOfPartitionEvent.class) {
                    processEndOfPartition();
                }
                return bufferOrEvent;
            }
        }
    }

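If the current stream has not ended, the barrier is handled in processBarrier, which decides whether barrier alignment is needed based on whether the Task has multiple inputs. Once the checkpoint can proceed, notifyCheckpoint is called to trigger it; that call ends up in triggerCheckpointOnBarrier, and the rest of the process is the same as for a Source Task.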

Operator checkpoint

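As described above, when a Task performs a checkpoint it first generates and broadcasts the barrier for that checkpoint and then checkpoints the Task itself. The executeCheckpointing method calls each operator's snapshotState to save state, and each operator implements snapshotState according to its own needs.
For example, in the Kafka consumer operator that Flink provides, snapshotState in FlinkKafkaConsumerBase saves the offset consumed so far for each partition of the current topic.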

    public final void snapshotState(FunctionSnapshotContext context) throws Exception {
        if (!running) {
            LOG.debug("snapshotState() called on closed source");
        } else {
            unionOffsetStates.clear();

            final AbstractFetcher<?, ?> fetcher = this.kafkaFetcher;
            if (fetcher == null) {
                // the fetcher has not yet been initialized, which means we need to return the
                // originally restored offsets or the assigned partitions
                for (Map.Entry<KafkaTopicPartition, Long> subscribedPartition : subscribedPartitionsToStartOffsets.entrySet()) {
                    unionOffsetStates.add(Tuple2.of(subscribedPartition.getKey(), subscribedPartition.getValue()));
                }

                if (offsetCommitMode == OffsetCommitMode.ON_CHECKPOINTS) {
                    // the map cannot be asynchronously updated, because only one checkpoint call can happen
                    // on this function at a time: either snapshotState() or notifyCheckpointComplete()
                    pendingOffsetsToCommit.put(context.getCheckpointId(), restoredState);
                }
            } else {
                HashMap<KafkaTopicPartition, Long> currentOffsets = fetcher.snapshotCurrentState();

                if (offsetCommitMode == OffsetCommitMode.ON_CHECKPOINTS) {
                    // the map cannot be asynchronously updated, because only one checkpoint call can happen
                    // on this function at a time: either snapshotState() or notifyCheckpointComplete()
                    pendingOffsetsToCommit.put(context.getCheckpointId(), currentOffsets);
                }

                for (Map.Entry<KafkaTopicPartition, Long> kafkaTopicPartitionLongEntry : currentOffsets.entrySet()) {
                    unionOffsetStates.add(
                            Tuple2.of(kafkaTopicPartitionLongEntry.getKey(), kafkaTopicPartitionLongEntry.getValue()));
                }
            }

            if (offsetCommitMode == OffsetCommitMode.ON_CHECKPOINTS) {
                // truncate the map of pending offsets to commit, to prevent infinite growth
                while (pendingOffsetsToCommit.size() > MAX_NUM_PENDING_CHECKPOINTS) {
                    pendingOffsetsToCommit.remove(0);
                }
            }
        }
    }
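
The bookkeeping around pendingOffsetsToCommit can be modelled in a few lines. The following is a sketch, not the connector's code: OffsetSnapshotSketch, advance and committedOffsets are hypothetical names, partitions are plain strings, and "committing" just stores the offsets in a field. The behaviour of notifyCheckpointComplete here (commit the offsets recorded for that checkpoint and forget older entries) is an assumption inferred from the comments in the excerpt.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of snapshotting offsets per checkpoint; not the Kafka connector. */
final class OffsetSnapshotSketch {
    static final int MAX_NUM_PENDING_CHECKPOINTS = 100;

    private final Map<String, Long> currentOffsets = new HashMap<>();
    // Insertion-ordered, so the oldest checkpoint is always first.
    private final LinkedHashMap<Long, Map<String, Long>> pendingOffsetsToCommit = new LinkedHashMap<>();
    private Map<String, Long> committed = new HashMap<>();

    void advance(String partition, long offset) {
        currentOffsets.put(partition, offset);
    }

    /** Copies the current offsets and remembers them under the checkpoint ID. */
    Map<String, Long> snapshotState(long checkpointId) {
        Map<String, Long> copy = new HashMap<>(currentOffsets);
        pendingOffsetsToCommit.put(checkpointId, copy);
        // Truncate the oldest entries to prevent unbounded growth.
        Iterator<Long> oldest = pendingOffsetsToCommit.keySet().iterator();
        while (pendingOffsetsToCommit.size() > MAX_NUM_PENDING_CHECKPOINTS) {
            oldest.next();
            oldest.remove();
        }
        return copy;
    }

    /** Commits the offsets of the completed checkpoint and drops it and all older ones. */
    void notifyCheckpointComplete(long checkpointId) {
        Map<String, Long> offsets = pendingOffsetsToCommit.get(checkpointId);
        if (offsets == null) {
            return; // unknown or already truncated checkpoint
        }
        committed = offsets;
        Iterator<Long> it = pendingOffsetsToCommit.keySet().iterator();
        while (it.hasNext()) {
            long id = it.next();
            it.remove();
            if (id == checkpointId) {
                break;
            }
        }
    }

    Map<String, Long> committedOffsets() {
        return new HashMap<>(committed);
    }

    int pendingCount() {
        return pendingOffsetsToCommit.size();
    }

    public static void main(String[] args) {
        OffsetSnapshotSketch consumer = new OffsetSnapshotSketch();
        consumer.advance("topic-0", 5L);
        consumer.snapshotState(1L);
        consumer.advance("topic-0", 9L);
        consumer.snapshotState(2L);
        consumer.notifyCheckpointComplete(2L);
        System.out.println(consumer.committedOffsets());
    }
}
```

The snapshot is a copy, so offsets that advance after snapshotState returns do not leak into a checkpoint that was already taken.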

From pending checkpoint to completed

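When an operator finishes its checkpoint, it reports completion to the JobMaster. On receiving the report, the JobMaster moves that operator from not-yet-acknowledged to acknowledged; once every operator has reported completion, the JobMaster changes the checkpoint from pending to completed.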

Task side:

executeCheckpointing -> asyncCheckpointRunnable -> reportCompletedSnapshotStates -> taskStateManager.reportTaskStateSnapshots -> TaskStateManagerImpl, which eventually calls

    public void reportTaskStateSnapshots(
        @Nonnull CheckpointMetaData checkpointMetaData,
        @Nonnull CheckpointMetrics checkpointMetrics,
        @Nullable TaskStateSnapshot acknowledgedState,
        @Nullable TaskStateSnapshot localState) {

        long checkpointId = checkpointMetaData.getCheckpointId();

        localStateStore.storeLocalState(checkpointId, localState);

        checkpointResponder.acknowledgeCheckpoint(
            jobId,
            executionAttemptID,
            checkpointId,
            checkpointMetrics,
            acknowledgedState);
    }

The JobMaster is then notified through an actor (RPC) message.

JobMaster side:

acknowledgeCheckpoint -> checkpointCoordinator.receiveAcknowledgeMessage(ackMessage) -> checkpoint.acknowledgeTask -> completePendingCheckpoint(checkpoint);

    private void completePendingCheckpoint(PendingCheckpoint pendingCheckpoint) throws CheckpointException {
        final long checkpointId = pendingCheckpoint.getCheckpointId();
        final CompletedCheckpoint completedCheckpoint;

        // As a first step to complete the checkpoint, we register its state with the registry
        Map<OperatorID, OperatorState> operatorStates = pendingCheckpoint.getOperatorStates();
        sharedStateRegistry.registerAll(operatorStates.values());

        try {
            try {
                completedCheckpoint = pendingCheckpoint.finalizeCheckpoint();
            }
            catch (Exception e1) {
                // abort the current pending checkpoint if we fails to finalize the pending checkpoint.
                if (!pendingCheckpoint.isDiscarded()) {
                    pendingCheckpoint.abortError(e1);
                }

                throw new CheckpointException("Could not finalize the pending checkpoint " + checkpointId + '.', e1);
            }

            // the pending checkpoint must be discarded after the finalization
            Preconditions.checkState(pendingCheckpoint.isDiscarded() && completedCheckpoint != null);

            try {
                completedCheckpointStore.addCheckpoint(completedCheckpoint);
            } catch (Exception exception) {
                // we failed to store the completed checkpoint. Let's clean up
                executor.execute(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            completedCheckpoint.discardOnFailedStoring();
                        } catch (Throwable t) {
                            LOG.warn("Could not properly discard completed checkpoint {}.", completedCheckpoint.getCheckpointID(), t);
                        }
                    }
                });

                throw new CheckpointException("Could not complete the pending checkpoint " + checkpointId + '.', exception);
            }
        } finally {
            pendingCheckpoints.remove(checkpointId);

            triggerQueuedRequests();
        }

        rememberRecentCheckpointId(checkpointId);

        // drop those pending checkpoints that are at prior to the completed one
        dropSubsumedCheckpoints(checkpointId);

        // record the time when this was completed, to calculate
        // the 'min delay between checkpoints'
        lastCheckpointCompletionNanos = System.nanoTime();

        LOG.info("Completed checkpoint {} for job {} ({} bytes in {} ms).", checkpointId, job,
            completedCheckpoint.getStateSize(), completedCheckpoint.getDuration());

        if (LOG.isDebugEnabled()) {
            StringBuilder builder = new StringBuilder();
            builder.append("Checkpoint state: ");
            for (OperatorState state : completedCheckpoint.getOperatorStates().values()) {
                builder.append(state);
                builder.append(", ");
            }
            // Remove last two chars ", "
            builder.setLength(builder.length() - 2);

            LOG.debug(builder.toString());
        }

        // send the "notify complete" call to all vertices
        final long timestamp = completedCheckpoint.getTimestamp();

        for (ExecutionVertex ev : tasksToCommitTo) {
            Execution ee = ev.getCurrentExecutionAttempt();
            if (ee != null) {
                ee.notifyCheckpointComplete(checkpointId, timestamp);
            }
        }
    }
}
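
The pending-to-completed transition boils down to tracking which tasks still owe an acknowledgement. The sketch below is a simplified model under assumptions: AckTrackingSketch, Pending and Coordinator are hypothetical names, tasks are identified by strings, and no state is actually stored. It shows a checkpoint completing on the last acknowledgement and older pending checkpoints being dropped as subsumed.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical model of acknowledgement tracking; not Flink's CheckpointCoordinator. */
final class AckTrackingSketch {

    /** A checkpoint that is still waiting for acknowledgements. */
    static final class Pending {
        private final Set<String> notYetAcknowledged;

        Pending(Set<String> tasks) {
            this.notYetAcknowledged = new HashSet<>(tasks);
        }

        /** Returns false for unknown tasks and duplicate acknowledgements. */
        boolean acknowledge(String task) {
            return notYetAcknowledged.remove(task);
        }

        boolean isFullyAcknowledged() {
            return notYetAcknowledged.isEmpty();
        }
    }

    static final class Coordinator {
        private final Set<String> tasks;
        private final TreeMap<Long, Pending> pending = new TreeMap<>();
        private final List<Long> completed = new ArrayList<>();
        private long nextCheckpointId = 1L;

        Coordinator(Set<String> tasks) {
            this.tasks = new HashSet<>(tasks);
        }

        /** Creates a new pending checkpoint with a unique, increasing ID. */
        long trigger() {
            long id = nextCheckpointId++;
            pending.put(id, new Pending(tasks));
            return id;
        }

        /** Returns true if this acknowledgement completed the checkpoint. */
        boolean receiveAcknowledge(long checkpointId, String task) {
            Pending checkpoint = pending.get(checkpointId);
            if (checkpoint == null || !checkpoint.acknowledge(task)) {
                return false; // late, unknown, or duplicate message
            }
            if (!checkpoint.isFullyAcknowledged()) {
                return false;
            }
            pending.remove(checkpointId);
            completed.add(checkpointId);
            // Drop pending checkpoints that are older than the completed one.
            pending.headMap(checkpointId).clear();
            return true;
        }

        List<Long> completed() {
            return new ArrayList<>(completed);
        }

        int pendingCount() {
            return pending.size();
        }
    }

    public static void main(String[] args) {
        Set<String> tasks = new HashSet<>();
        tasks.add("source");
        tasks.add("sink");
        Coordinator coordinator = new Coordinator(tasks);
        long id = coordinator.trigger();
        coordinator.receiveAcknowledge(id, "source");
        boolean done = coordinator.receiveAcknowledge(id, "sink");
        System.out.println("completed=" + done + " ids=" + coordinator.completed());
    }
}
```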
