This series of notes records some of the tricky points you may run into while reading and trying to understand the Ceph code; the coverage may jump around quite a bit. If anything is described incorrectly, or you have any questions, discussion is welcome.
First, the official documentation:
docs.ceph.com/docs/master/dev/peering/
It covers essentially all of the key points. In particular, remember the Golden Rule: while peering is in progress no IO requests can be served, and of course no recovery can happen either. Once these concepts are in place, walking through the peering state machine alongside the code is enough to understand the broad picture.
- Note that there is a clear boundary between peering and recovery: the outcome of peering is that everyone agrees on an authoritative history, but recovery has not started yet. The boundary sits at activate, which pushes the agreed-upon result (pg log, last_update, etc.) to the replicas in actingbackfill and then kicks off recovery.
The following records a few key points that are essential for understanding peering.
interval
current interval or past interval
a sequence of OSD map epochs during which the acting set and up set for particular PG do not change
That is the description from the official documentation. If you think of osdmap epochs as time, an interval describes a span of time during which neither the up set nor the acting set of the PG changed.
So is the interval concept essential? Has it been there from the beginning?
No. Intervals were introduced to optimize the computation of the prior_set, in the following commit:
SHA-1: 1c64db014c681f2746a43faceb4775ed897a3269
* osd: remember past intervals instead of recalculating each time
This _vastly_ improves the speed of build_prior (and thus activate_map).
There is no need to recalculate this information each time as it is fully
dependent on _old_ OSDMaps, not current cluster state.
The PG's past_intervals was also introduced by this patch. Later, the interval structure was split out of PG and moved into osd_types.h.
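To make the interval concept concrete, here is a trimmed, illustrative sketch of the per-interval record that past_intervals keeps. It is modeled loosely on pg_interval_t in osd_types.h; the exact field set varies across Ceph releases, so treat it as a reading aid rather than the real definition.
```
#include <cstdint>
#include <vector>

using epoch_t = uint32_t;

// One entry of past_intervals: the up/acting layout that held for a
// contiguous range of osdmap epochs.
struct interval_sketch {
  epoch_t first = 0;                 // first osdmap epoch of the interval
  epoch_t last = 0;                  // last osdmap epoch of the interval
  std::vector<int32_t> up;           // up set, unchanged throughout [first, last]
  std::vector<int32_t> acting;       // acting set, unchanged throughout [first, last]
  int32_t primary = -1;              // acting primary during the interval
  bool maybe_went_rw = false;        // could this interval have served writes?
};
```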
last_epoch_started
This is another very important concept, abbreviated les. This epoch is the milestone at which peering has truly completed. The official definition:
```
last epoch start
the last epoch at which all nodes in the acting set for a particular placement group agreed on an authoritative history. At this point, peering is deemed to have been successful.
```
(The glossary entry should read "last epoch started".) For a more detailed explanation see:
last_epoch_started
The following passage from it can be read as part of the rationale behind the key function find_best_info:
info.history.last_epoch_started records a lower bound on the most recent interval in which the pg as a whole
went active and accepted writes. On a particular osd, it is also an upper bound on the activation epoch of intervals in which
writes in the local pg log occurred (we update it before accepting writes). Because all committed writes are committed by all
acting set osds, any non-divergent writes ensure that history.last_epoch_started was recorded by all acting set members in the
interval. Once peering has queried one osd from each interval back to some seen history.last_epoch_started, it follows that no
interval after the max history.last_epoch_started can have reported writes as committed (since we record it before recording
client writes in an interval). Thus, the minimum last_update across all infos with info.last_epoch_started >=
MAX(history.last_epoch_started) must be an upper bound on writes reported as committed to the client.
What les is used for:
1. It acts as a checkpoint in the peering history, reducing the range of osdmap epochs (and hence the number of past intervals) that build_prior has to examine.
2. It is the epoch at which the PG went Active and started serving IO, and it is used to rule out some peers when choosing the auth log shard: during peering the auth must be chosen from the peers with the largest les, i.e. the witnesses of the most recent history; anything older cannot be authoritative.
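As an illustration of role 2, here is a hedged sketch of les-based filtering during auth log shard selection. The real logic lives in PG::find_best_info and handles many more corner cases (eversion comparison, log tails, EC shards); the names below (find_best, info_sketch, incomplete) are made up for the example.
```
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using epoch_t = uint32_t;

struct info_sketch {
  epoch_t last_epoch_started = 0;
  uint64_t last_update = 0;   // stands in for eversion_t
  bool incomplete = false;    // backfill not finished (last_backfill != MAX)
};

// Only peers that witnessed the newest last_epoch_started (and are not still
// backfilling) are eligible; among those, prefer the longest log.
int find_best(const std::vector<info_sketch>& infos) {
  epoch_t max_les = 0;
  for (const auto& i : infos)
    max_les = std::max(max_les, i.last_epoch_started);

  int best = -1;
  for (std::size_t n = 0; n < infos.size(); ++n) {
    const info_sketch& i = infos[n];
    if (i.incomplete || i.last_epoch_started < max_les)
      continue;                       // missed the newest activation, or still backfilling
    if (best < 0 || i.last_update > infos[best].last_update)
      best = static_cast<int>(n);
  }
  return best;                        // -1 means no usable candidate
}
```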
maybe_went_rw
This flag indicates whether an interval may have served read/write operations, and allows some intervals to be filtered out. It is explained in the up_thru discussion of the peering documentation cited above.
- So how does it differ from les?
les cuts off the intervals at the old end of the history, while maybe_went_rw can filter out intervals between les and the current epoch.
Note the word maybe: up_thru is recorded in the osdmap and reflects the state of the whole OSD, not of an individual PG. So even when maybe_went_rw is true for an interval, that does not prove the interval actually completed peering (peering only counts as completed once activate succeeds).
Let's look at how maybe_went_rw is set:
```
if (num_acting &&
    i.primary != -1 &&
    num_acting >= old_pg_pool.min_size &&
    (*could_have_gone_active)(old_acting_shards)) {
  if (out)
    *out << "generate_past_intervals " << i
         << ": not rw,"
         << " up_thru " << lastmap->get_up_thru(i.primary)
         << " up_from " << lastmap->get_up_from(i.primary)
         << " last_epoch_clean " << last_epoch_clean
         << std::endl;
  if (lastmap->get_up_thru(i.primary) >= i.first &&
      lastmap->get_up_from(i.primary) <= i.first) {
    i.maybe_went_rw = true;
    if (out)
      *out << "generate_past_intervals " << i
           << " : primary up " << lastmap->get_up_from(i.primary)
           << "-" << lastmap->get_up_thru(i.primary)
           << " includes interval"
           << std::endl;
  } else if (last_epoch_clean >= i.first &&
             last_epoch_clean <= i.last) {
    // If the last_epoch_clean is included in this interval, then
    // the pg must have been rw (for recovery to have completed).
    // This is important because we won't know the _real_
    // first_epoch because we stop at last_epoch_clean, and we
    // don't want the oldest interval to randomly have
    // maybe_went_rw false depending on the relative up_thru vs
    // last_epoch_clean timing.
    i.maybe_went_rw = true;
    if (out)
      *out << "generate_past_intervals " << i
           << " : includes last_epoch_clean " << last_epoch_clean
           << " and presumed to have been rw"
           << std::endl;
  } else {
    i.maybe_went_rw = false;
    if (out)
      *out << "generate_past_intervals " << i
           << " : primary up " << lastmap->get_up_from(i.primary)
           << "-" << lastmap->get_up_thru(i.primary)
           << " does not include interval"
           << std::endl;
  }
} else {
  i.maybe_went_rw = false;
  if (out)
    *out << "generate_past_intervals " << i << " : acting set is too small" << std::endl;
}
```
The up_thru based branch is easy to understand: once up_thru has been committed the PG proceeds to activate, and once activate succeeds it can serve IO requests, so it very much "maybe went rw". But what is the point of the last_epoch_clean branch below it? Digging through the commit history shows it was added by a commit from 2011-10-23:
SHA-1: 12b3b2d5af01be253980875b386b892b57f951bc
* osd: fix generate_past_intervals maybe_went_rw on oldest interval
We stop working backwards when we hit last_epoch_clean, which means for the
oldest interval first_epoch may not be the _real_ first_epoch. (We can't
continue working backward because we may have thrown out those maps
entirely.)
However, if the last_epoch_clean epoch is contained within that interval,
we know that the OSD did in fact go rw because it had to have completed
recovery (and thus peering) to set last_clean_epoch in the first place.
This fixes cases where two different nodes have slightly different
past intervals, generate different prior probe sets as a result, and
flip/flop on the acting set choice. (It may have eventually resolved when
the wrongly excluded node's notify races and arrives in time to be
considered, but that's still clearly no good.)
This does leave the start epoch for that oldest interval incorrect. That
doesn't currently matter except that it's confusing, but I'm not sure how
to mark it properly, or if it's worth the effort.
Signed-off-by: Sage Weil <sage@newdream.net>
This says that the computation of past_intervals stops once it reaches last_epoch_clean. Why stop there? Because the OSD does not keep osdmaps older than last_epoch_clean. As a result, the interval containing last_epoch_clean may be truncated (its recorded start is "incorrect"): its first epoch becomes last_epoch_clean, while the up_thru epoch may fall either before or after last_epoch_clean. None of that matters, though: since last_epoch_clean happened inside that interval, the interval obviously went rw, so maybe_went_rw is set to true there as well.
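To see why maybe_went_rw matters, here is a rough sketch of how past intervals could be pruned while collecting the OSDs that must be probed. The real logic lives in the prior-set construction (build_prior / PG::PriorSet) and also tracks down OSDs, up_thru, and the les cutoff; the names below are illustrative only.
```
#include <cstdint>
#include <set>
#include <vector>

using epoch_t = uint32_t;

struct interval_rec {
  epoch_t first = 0, last = 0;           // osdmap epoch range of the interval
  std::vector<int32_t> acting;           // acting set during the interval
  bool maybe_went_rw = false;            // could writes have happened?
};

// Collect the OSDs we must hear from before the history can be trusted.
std::set<int32_t> build_probe_set(const std::vector<interval_rec>& past_intervals,
                                  const std::vector<int32_t>& current_acting) {
  std::set<int32_t> probe(current_acting.begin(), current_acting.end());
  for (const interval_rec& i : past_intervals) {
    if (!i.maybe_went_rw)
      continue;                          // no writes possible there: nothing to learn
    for (int32_t osd : i.acting)
      if (osd >= 0)
        probe.insert(osd);               // may hold writes from that interval
  }
  return probe;
}
```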
pg_info_t
pg info is the PG's metadata. Among its fields, last_update, last_backfill, log_tail, and the les mentioned above are all relevant to peering; last_complete is mainly relevant to recovery.
last_update
last_update can be viewed as the head pointer of the pg log; it points at the log head.
- Note that last_update only guarantees that the corresponding pg log entries exist; the data those entries refer to is not necessarily present.
last_backfill
Like last_complete, last_backfill is mainly used for recovery: it records how far full (backfill) recovery has progressed through the object space.
For peering, a backfill peer is an exception: that replica has fallen so far behind that it needs special treatment.
During peering, a peer that has not finished backfill cannot serve as the auth log shard.
- Why not?
Because a backfill peer's pg log is hollow: the sub ops the primary sends to a backfill peer may carry no data at all. If a backfill peer were chosen as the auth log shard, some modifications recorded in the pg log might end up present on no PG replica whatsoever. So when the Golden Rule says IO is recorded by every member of the acting set, that does not include backfill peers; backfill is precisely the exception.
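A minimal sketch of what "not finished backfill" means in terms of last_backfill, in the spirit of pg_info_t::is_incomplete(); the types below are simplified stand-ins for hobject_t.
```
#include <cstdint>
#include <limits>

// Stand-in for the hobject_t position that last_backfill records.
struct backfill_pos {
  uint64_t hash;                                  // position in the backfill ordering
  bool is_max() const {
    return hash == std::numeric_limits<uint64_t>::max();
  }
};

struct peer_info_sketch {
  backfill_pos last_backfill{0};                  // objects up to here are fully present
  // Still backfilling until last_backfill has reached MAX.
  bool is_incomplete() const { return !last_backfill.is_max(); }
};
```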
log_tail
log_tail is fairly straightforward: it is mainly used to decide whether incremental (log-based) recovery is possible.
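A hedged sketch of that decision, loosely following the per-peer checks in PG::activate(): if a peer's last_update has fallen behind the primary's log_tail, the missing history is no longer in the log, so backfill is needed. Versions are simplified to plain integers here; the function and enum names are made up for the example.
```
#include <cstdint>

enum class recovery_kind { up_to_date, log_based, backfill };

// primary_log_tail / primary_last_update bound the primary's pg log window;
// peer_last_update is how far the peer's log reaches.
recovery_kind choose_recovery(uint64_t primary_log_tail,
                              uint64_t primary_last_update,
                              uint64_t peer_last_update) {
  if (peer_last_update == primary_last_update)
    return recovery_kind::up_to_date;   // already in sync
  if (peer_last_update < primary_log_tail)
    return recovery_kind::backfill;     // gap not covered by the log: full object scan
  return recovery_kind::log_based;      // replay the missing tail of the log
}
```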
pg_history_t
This struct contains three categories of members.
The first category describes important historical events of the PG:
```
epoch_t epoch_created;       // epoch in which PG was created
epoch_t last_epoch_started;  // lower bound on last epoch started (anywhere, not necessarily locally)
epoch_t last_epoch_clean;    // lower bound on last epoch the PG was completely clean.
epoch_t last_epoch_split;    // as parent
```
These historical events are shared by all replicas of the PG, so they are merged whenever two infos are combined.
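A rough sketch of how those shared fields might be merged, in the spirit of pg_history_t::merge(): each field simply advances to the newest epoch either side has seen (names and field set simplified for illustration).
```
#include <cstdint>

using epoch_t = uint32_t;

struct history_sketch {
  epoch_t epoch_created = 0;
  epoch_t last_epoch_started = 0;
  epoch_t last_epoch_clean = 0;
  epoch_t last_epoch_split = 0;

  // Fold another replica's view into ours; returns true if anything advanced.
  bool merge(const history_sketch& other) {
    bool modified = false;
    auto take_max = [&modified](epoch_t& mine, epoch_t theirs) {
      if (theirs > mine) { mine = theirs; modified = true; }
    };
    take_max(epoch_created, other.epoch_created);
    take_max(last_epoch_started, other.last_epoch_started);
    take_max(last_epoch_clean, other.last_epoch_clean);
    take_max(last_epoch_split, other.last_epoch_split);
    return modified;
  }
};
```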
The second category is per-replica and describes some historical points of that particular replica:
```
epoch_t same_up_since;       // same acting set since
epoch_t same_interval_since; // same acting AND up set since
epoch_t same_primary_since;  // same primary at least back through this epoch.
```
Of these, same_interval_since is used the most; as the name suggests, it marks the start of the most recent interval this replica has gone through.
The third category is scrub related and has no direct bearing on the peering process.