Block layer introduction part 2: the request layer

Source: https://lwn.net/Articles/738449/
The Linux block layer provides an upstream interface to filesystems and block-special devices allowing them to access a multitude of storage backends in a uniform manner. It also provides downstream interfaces to device drivers and driver-support frameworks that allow those drivers and frameworks to receive requests in a manner most suitable to each. Some drivers do not benefit from preliminary handling and just use the thin "bio layer" that we met previously. Other drivers benefit from some preprocessing that might detect batches of consecutive requests, may reorder requests based on various criteria, and which presents the requests as one or more well-defined streams. To service these drivers, there exists a section of the block layer that I refer to as the request layer.

There are currently two parallel parts to the request layer, sometimes described as "single-queue" and "multi-queue". The multi-queue part is relatively new (years, not decades) and may, one day, completely replace the single-queue version. That has not yet happened and both are in active use in current kernels. Having two different queuing approaches to examine provides opportunities to learn from the contrasts, so we will spend a little time looking at each of these two parts and how they present requests to the underlying frameworks or drivers. First, though, we will look at what is common between them, which is best seen by examining two data structures, <tt>struct request_queue</tt> and <tt>struct request</tt>.

The <tt>request_queue</tt> and the <tt>request</tt>

The <tt>request_queue</tt> and <tt>request</tt> structures are closely related to the <tt>gendisk</tt> and <tt>bio</tt> structures used in the bio layer: one represents a particular device, while the other represents an I/O request. While <tt>request</tt> structures are only used for devices that make use of the request layer, the <tt>request_queue</tt> structure is allocated for every gendisk. Some of the fields, such as parts of the <tt>struct queue_limits</tt> substructure, apply equally well to all block devices and arguably should live in the <tt>gendisk</tt>. Other fields only apply to the queue management provided by the request layer, and really only need to be allocated for devices using this layer. The current arrangement is simply a historical accident that isn't worth fixing and is, in any case, externally visible through the contents of the <tt>/sys/block/*/queue/</tt> sysfs directories.
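
To make the relationship concrete, here is a simplified, purely illustrative sketch of how these structures refer to one another; the field names match the kernel sources, but almost all members are omitted and this is not the real layout.

    /*
     * Simplified, illustrative sketch only; most fields are omitted.
     */
    struct queue_limits {
        unsigned int max_sectors;          /* largest I/O the device accepts */
        unsigned int logical_block_size;   /* smallest addressable unit */
        /* ... many more limits ... */
    };

    struct request_queue {
        struct queue_limits limits;        /* applies to every block device */
        struct elevator_queue *elevator;   /* only used by the request layer */
        /* ... queue-management state ... */
    };

    struct gendisk {
        struct request_queue *queue;       /* allocated for every gendisk */
        /* ... name, partitions, ... */
    };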

The request structure represents a single I/O request that will ultimately be delivered to the underlying device. It contains a list of one or more bios that must represent contiguous I/O operations, various fields to track its overall status (such as timestamps and the originating CPU), and several anchors so it can be included in larger data structures. It has a <tt>struct list_head queuelist</tt> so it can appear in a simple queue, a <tt>struct hlist_node hash</tt> so it can appear in a hash table (used to see if any outstanding requests are adjacent to a new bio), and a <tt>struct rb_node rb_node</tt> used to include the request in a red-black tree. When a <tt>request</tt> structure is allocated, some extra space is allocated at the end to be used by the underlying driver to store any additional per-request information. This space is sometimes used to store the command header that is sent to the underlying devices, such as a SCSI command descriptor block, but drivers are free to use it however they wish.
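
A heavily trimmed sketch of these anchors follows (illustrative only; the real definition lives in <tt>include/linux/blkdev.h</tt> and has many more fields).

    /*
     * Illustrative, heavily trimmed sketch of struct request.
     */
    struct request {
        struct list_head queuelist;   /* link in a simple queue */
        struct hlist_node hash;       /* link in the "adjacent request" hash table */
        struct rb_node rb_node;       /* link in a sorted red-black tree */

        struct bio *bio;              /* first bio in the request */
        struct bio *biotail;          /* last bio, for quick back-merging */

        /* timestamps, originating CPU, flags, ... */

        /*
         * The extra per-request driver space is allocated immediately after
         * this structure; in the multi-queue case blk_mq_rq_to_pdu()
         * returns a pointer to it.
         */
    };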

These requests are created by the appropriate <tt>make_request_fn()</tt> function (<tt>blk_queue_bio()</tt> for single-queue or <tt>blk_mq_make_request()</tt> for multi-queue) and handed to an I/O scheduler or "elevator", a name derived from the elevator algorithm that was once a cornerstone of disk I/O scheduling and is now, at most, a minor detail. We will take a quick look at the single-queue approach, and then see how that compares to the newer multi-queue mechanism.

Scheduling requests for a single queue

Traditionally, most storage devices were made up of a set of spinning circular platters with magnetic coating and a single head (or set of heads, one per platter) that moved along the radius of the spinning disk to read or change the magnetic polarization at any location on any platter. Such a device can only process a single request at a time, and has a substantial cost in moving from one location on the platters to another. The single-queue implementation started out aimed at driving this sort of device and, while it has broadened in scope over the years, its structure still reflects the needs of rotating storage devices.

The three key tasks for a single-queue scheduler are:

  • To collect multiple bios representing contiguous operations into a smaller number of requests that are large enough to make best use of the hardware but not so large that they exceed any limitations of the device.
  • To queue these requests in an order that minimizes seek time while not delaying important requests unduly. Providing an optimal solution to this problem is the source of all the complexity. As it is impossible to know, in general, how important each request is or how much seek time will be wasted, we rely on heuristics to choose a good order, and heuristics are never perfect.
  • To make these requests available to the underlying driver so it can pluck them off the queue when it is ready and to provide a mechanism for notification when those requests are complete.

The last task is fairly straightforward. A driver registers a <tt>request_fn()</tt> using <tt>blk_init_queue_node()</tt> and this function is called whenever there are new requests on the queue that can usefully be processed. The driver is responsible for collecting a request using <tt>blk_peek_request()</tt> and processing it. When that request finishes (or earlier if the device can handle multiple requests in parallel), the driver should collect another request and submit it, without waiting for further calls to the <tt>request_fn()</tt>. Completion of each request is reported by calling <tt>blk_finish_request()</tt>.
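
As a rough sketch of this interface (assuming the legacy single-queue API of that era; the <tt>mydev_*</tt> names are hypothetical, and a real driver would allocate a gendisk, handle errors, and complete requests asynchronously), a minimal driver skeleton might look something like this:

    #include <linux/init.h>
    #include <linux/blkdev.h>
    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(mydev_lock);        /* the queue lock */

    /* Called by the request layer whenever there is work on the queue. */
    static void mydev_request_fn(struct request_queue *q)
    {
        struct request *req;

        while ((req = blk_peek_request(q)) != NULL) {
            blk_start_request(req);            /* take the request off the queue */
            /*
             * A real driver would program the hardware here and complete the
             * request later, e.g. from its interrupt handler, with
             * blk_end_request_all(req, BLK_STS_OK).  This sketch completes
             * it immediately (the queue lock is held inside request_fn()).
             */
            __blk_end_request_all(req, BLK_STS_OK);
        }
    }

    static int __init mydev_init(void)
    {
        struct request_queue *q;

        q = blk_init_queue(mydev_request_fn, &mydev_lock);
        if (!q)
            return -ENOMEM;
        /* ... allocate a gendisk, attach q to it, add_disk() ... */
        return 0;
    }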

The first task is partly handled by common code that performs a couple of quick checks in the queue to see if a request can easily be found to accept a new bio. On success, the scheduler will be asked to approve the merge; on failure, the scheduler will be given a chance to look for a merge itself. If no merge is possible, a new request is allocated and given to the scheduler. The scheduler may, at a later time, merge this request with another if either request ever grows large enough that they become contiguous.

The middle task is potentially more complex. Putting these requests in a suitable order depends a lot on how the word "suitable" is interpreted. The three different single-queue schedulers, noop, deadline, and cfq, use significantly different interpretations:

  • "noop" provides minimal sorting of requests, never allowing a read to be moved ahead of a write or vice-versa, but otherwise allowing one request to overtake another if, in line with the elevator algorithm, the I/O head is likely to reach the new one before the old one. Apart from this simple sorting, "noop" provides first-in-first-out queuing.

  • "deadline" collects batches of either read or write requests that were all submitted at close to the same time. Within a batch, requests are sorted and each batch is queued to the device either when it reaches the maximum size or when the expiry time is reached. This algorithm tries to put an upper limit on the delay imposed on any request while retaining the possibility of forming large batches.

  • "cfq" stands for "Complete Fairness Queueing"; this scheduler is substantially more complex than the others. It aims to provide fairness between different processes and, when configured with control groups, between different groups of processes as well. It maintains multiple queues internally; one for each process to hold the synchronous requests from the process (typically reads), and one for each priority level for the asynchronous requests (typically writes) with that priority. These queues each get a turn at submitting requests based on priority calculations. Each queue gets a time slice in which to submit a limited number of requests. When a synchronous queue has fewer requests than the limit, the device is allowed to go idle even though other queues might contain requests. Synchronous requests are often followed by other requests at a similar location, so keeping the device idle to wait for that next synchronous request can improve throughput.

    This description barely scratches the surface of cfq. There is in-kernel documentation that provides more details and lists all the parameters that can be tuned to match different circumstances.

As already hinted, some devices can accept multiple requests at once, accepting new requests before earlier requests complete. This usually involves "tagging", adding a tag to each request so that completion notifications can be connected back to the appropriate request. The single-queue request layer contains support for keeping track of tags to whatever depth the device supports.

A device may internally support tagged commands by truly running some requests in parallel, such as by accessing an internal cache, by having multiple different components that each can handle a single request, or by providing its own internal queueing that works with more knowledge of the device internals than are available to the request layer. As the level of parallelism and the sophistication of the internal scheduling increases, the need for Linux to provide optimal scheduling for the request stream decreases. There will always be a need for some scheduling within Linux, to implement concepts such as control groups that the device hardware cannot know about, but when the device can do much of the scheduling itself, it makes sense for Linux to take a more hands-off approach. This is part of the motivation for the multi-queue side of the request layer.

Multiple queues and multiple CPUs

Another motivation for multi-queue support is that, as we get more and more processing cores in our systems, the locking overhead required to place requests from all cores into a single queue increases. The plugging infrastructure can help to some extent, but not as much as we would like. If we allocate a large number of queues (one per NUMA node, or even one per CPU), then the locking overhead for putting a request onto the queue is substantially reduced. If the hardware allows multiple requests to be submitted in parallel from different CPUs, this is a clear net win. If the hardware only supports a single submission at a time, the multiple per-CPU queues still need to be merged. If they are merged in larger batches than the plugging functionality creates, this is again a net win. When there is no increase in batch size, careful coding should be able to ensure that, at least, there is no net loss.

As noted earlier, the cfq scheduler already contains multiple queues internally. These serve quite a different purpose than the queues provided by multi-queue. They associate requests with processes and priority levels whereas the multi-queue queues are tied to the details of the hardware. The multi-queue request layer maintains two groups of hardware-related queues: the software staging queues and the hardware dispatch queues.

Software staging queues (<tt>struct blk_mq_ctx</tt>) are allocated based on the CPU hardware, one per CPU or one per NUMA node. Requests are added to these queues, controlled by a spinlock that should mostly be uncontended, whenever the block I/O plug is unplugged (<tt>blk_mq_flush_plug_list()</tt>). These queues may optionally be managed by a separate multi-queue scheduler, of which there are three: bfq, kyber, and mq-deadline.

The hardware dispatch queues are allocated based on the target hardware, so there may be just one, or there may be as many as 2048 (or however many interrupt sources the platform supports). Though the request layer does allocate a data structure (<tt>struct blk_mq_hw_ctx</tt>) for each hardware queue (or "hardware context"), and tracks the mapping from CPU to queue, the queue itself is the responsibility of the underlying driver. The request layer will, from time to time, pass requests together with a hardware context to the underlying driver. What happens next is up to the driver, but the expectation is that the driver will get the request to the hardware as quickly as possible, generally passing requests on in the order that they are received.

Another important difference between the multi-queue layer and the single-queue layer is that, for multi-queue, the <tt>request</tt> structures are all preallocated. Each <tt>request</tt> structure has an associated tag number that is unique among requests for that device, and that travels with the request all the way into the hardware — where supported — and back again. Allocating the tag number early achieves smoother transit for the request through lower layers once the request layer has decided to release it.
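
That preallocation is visible in the driver-side API: before creating the queue, the driver fills in a <tt>struct blk_mq_tag_set</tt> describing how many hardware queues it has, how many tags (and hence requests) to preallocate, and how much per-request driver space to reserve. A minimal sketch follows, with hypothetical <tt>mydev_*</tt> names and error handling reduced to the bare minimum.

    #include <linux/blk-mq.h>

    /* Hypothetical per-request driver data, carved out of each preallocated request. */
    struct mydev_cmd {
        unsigned int opcode;
    };

    static struct blk_mq_ops mydev_mq_ops;      /* queue_rq() etc.; see the later sketch */
    static struct blk_mq_tag_set mydev_tag_set;

    static int mydev_setup_queue(void)
    {
        struct request_queue *q;
        int ret;

        mydev_tag_set.ops          = &mydev_mq_ops;
        mydev_tag_set.nr_hw_queues = 1;                         /* one hardware dispatch queue */
        mydev_tag_set.queue_depth  = 64;                        /* 64 preallocated requests/tags */
        mydev_tag_set.cmd_size     = sizeof(struct mydev_cmd);  /* per-request driver space */
        mydev_tag_set.numa_node    = NUMA_NO_NODE;
        mydev_tag_set.flags        = BLK_MQ_F_SHOULD_MERGE;     /* allow bio merging */

        ret = blk_mq_alloc_tag_set(&mydev_tag_set);     /* preallocates requests and tags */
        if (ret)
            return ret;

        q = blk_mq_init_queue(&mydev_tag_set);
        if (IS_ERR(q)) {
            blk_mq_free_tag_set(&mydev_tag_set);
            return PTR_ERR(q);
        }
        /* ... attach q to a gendisk and add_disk() ... */
        return 0;
    }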

Rather than providing a single <tt>request_fn()</tt>, the underlying driver must provide a <tt>struct blk_mq_ops</tt> that lists up to eleven functions. Of these, the function of most interest here is <tt>queue_rq()</tt>. Other functions support timeouts, polling for completion, request initialization, and the like.

Once the configured scheduler decides that a request is ready and that it no longer wants to keep it on a queue to allow for re-ordering or extension, it will call the <tt>queue_rq()</tt> function. This design places the responsibility for handing off requests on the request layer itself, in contrast to single-queue where it was the responsibility of the driver to collect them. <tt>queue_rq()</tt> is given a hardware context and will normally place the request on some internal FIFO, or possibly handle it directly. The <tt>queue_rq()</tt> function can refuse to accept a request by returning <tt>BLK_STS_RESOURCE</tt>, which causes the request to remain on the staging queue. Any return value other than <tt>BLK_STS_RESOURCE</tt> or <tt>BLK_STS_OK</tt> is treated as an I/O error.
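
Continuing the tag-set sketch above, a minimal <tt>queue_rq()</tt> might look roughly like this (assuming the 4.14-era interface; <tt>mydev_queue_rq()</tt>, <tt>mydev_hw_has_room()</tt>, and <tt>struct mydev_cmd</tt> are hypothetical, and a real driver does much more here):

    #include <linux/blk-mq.h>

    struct mydev_cmd;      /* per-request driver data, defined in the earlier sketch */

    /* Hypothetical helper: ask the hardware whether it can take another command. */
    static bool mydev_hw_has_room(unsigned int hwq)
    {
        return true;
    }

    static blk_status_t mydev_queue_rq(struct blk_mq_hw_ctx *hctx,
                                       const struct blk_mq_queue_data *bd)
    {
        struct request *req = bd->rq;
        struct mydev_cmd *cmd = blk_mq_rq_to_pdu(req);  /* the cmd_size area */

        if (!mydev_hw_has_room(hctx->queue_num))
            return BLK_STS_RESOURCE;    /* request stays on the staging queue */

        blk_mq_start_request(req);      /* mark the request as issued */
        /*
         * Build the hardware command in *cmd and hand it, along with the
         * request's tag (req->tag), to the device.  On the completion
         * interrupt the driver calls blk_mq_end_request(req, BLK_STS_OK).
         */
        return BLK_STS_OK;
    }

    static struct blk_mq_ops mydev_mq_ops = {
        .queue_rq = mydev_queue_rq,
        /* .timeout, .poll, .init_request, ... as needed */
    };

Returning <tt>BLK_STS_RESOURCE</tt> when the hardware is busy is what leaves the request on the staging queue for a later retry, as described above.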

Multi-queue scheduling

A multi-queue driver does not need to have a scheduler configured, in which case an approach similar to the "noop" single-queue scheduler is used. Consecutive bios are grouped into a single request, while non-consecutive bios each end up in their own request. These are queued in a simple FIFO order in the software staging queue, though when there are multiple submission queues, the default scheduler tries to submit new requests directly and only uses the staging queue after a <tt>BLK_STS_RESOURCE</tt> response. The software queues are then passed to the driver by calls to <tt>blk_mq_run_hw_queue()</tt> or <tt>blk_mq_delay_run_hw_queue()</tt>, which typically happens when the block I/O plug is unplugged.

The pluggable multi-queue schedulers have 22 entry points, two of which are <tt>insert_requests()</tt>, which should add a list of requests to the software staging queue, and <tt>dispatch_request()</tt>, which should choose a request to be passed to the given hardware queue. If <tt>insert_requests()</tt> is not provided, the requests are simply appended to the list. If <tt>dispatch_request()</tt> is not provided, requests will be collected from any staging queue that is configured to feed to this hardware queue, and these are passed down in arbitrary order. This "collect everything" step is the main point of cross-CPU locking and can hurt performance, so it is best if a device with a single hardware queue accepts all the requests it is given.
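
To make those two hooks concrete, here is a toy FIFO scheduler sketched against the 4.14-era <tt>elevator_mq_ops</tt> interface. It uses a single global list rather than the per-CPU staging queues, roughly the shape mq-deadline takes; the <tt>toy_*</tt> names are hypothetical and most of the boilerplate is omitted.

    #include <linux/module.h>
    #include <linux/blk-mq.h>
    #include <linux/elevator.h>
    #include <linux/list.h>
    #include <linux/spinlock.h>

    /*
     * Scheduler-private data; a real scheduler allocates this in its
     * init_sched() function (omitted here) and stores it in
     * q->elevator->elevator_data.
     */
    struct toy_data {
        spinlock_t lock;
        struct list_head fifo;    /* one global FIFO shared by all CPUs */
    };

    static void toy_insert_requests(struct blk_mq_hw_ctx *hctx,
                                    struct list_head *list, bool at_head)
    {
        struct toy_data *td = hctx->queue->elevator->elevator_data;

        spin_lock(&td->lock);
        list_splice_tail_init(list, &td->fifo);   /* this toy ignores at_head */
        spin_unlock(&td->lock);
    }

    static struct request *toy_dispatch_request(struct blk_mq_hw_ctx *hctx)
    {
        struct toy_data *td = hctx->queue->elevator->elevator_data;
        struct request *rq = NULL;

        spin_lock(&td->lock);
        if (!list_empty(&td->fifo)) {
            rq = list_first_entry(&td->fifo, struct request, queuelist);
            list_del_init(&rq->queuelist);
        }
        spin_unlock(&td->lock);
        return rq;                /* NULL means "nothing to dispatch" */
    }

    static struct elevator_type toy_sched = {
        .ops.mq = {
            .insert_requests  = toy_insert_requests,
            .dispatch_request = toy_dispatch_request,
        },
        .uses_mq        = true,
        .elevator_name  = "toy",
        .elevator_owner = THIS_MODULE,
    };

A complete scheduler would also provide <tt>init_sched()</tt> and <tt>exit_sched()</tt> to allocate and free this data, and would register itself with <tt>elv_register()</tt>; because every CPU funnels into the one list and lock, a design like this pays the same cross-CPU locking cost that mq-deadline and bfq do.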

The mq-deadline scheduler provides much the same functionality as the single-queue deadline scheduler. It provides an <tt>insert_requests()</tt> function that ignores the multiple staging queues and adds each request to one of two global, time-ordered queues — one for reads and one for writes. If a new request can be merged with an existing request, that is done; otherwise it is added to the end of the relevant queue. A <tt>dispatch_request()</tt> function is provided that returns the first request from one or other of these queues, chosen based on age, on batch size, and on not starving writes for too long.

The bfq scheduler is, to some extent, modeled after cfq, with the acronym standing for Budget Fair Queueing. There is some in-kernel documentation; it was covered in these pages a few years ago, and again more recently when it was adapted to multi-queue. bfq, much like mq-deadline, does not use multiple per-CPU staging queues. While it has multiple queues, they are accessed by all CPUs using a single spinlock.

Unlike mq-deadline and bfq, the kyber I/O scheduler, briefly discussed here earlier this year, does make use of the per-CPU (or per-NUMA-node) staging queues. It doesn't provide an <tt>insert_requests()</tt> function but makes use of the default behavior. The <tt>dispatch_request()</tt> function maintains various internal queues on a per-hardware-context basis. If these queues are empty it will collect requests from all staging queues that map to the given hardware context, and will distribute them internally. When they are not empty, it delivers requests from the internal queues as appropriate. The intricacies of this strategy, how it distributes requests, and in what order they are processed, do not appear to be documented in the kernel.

End of the line for single-queue?

Having two different queuing systems, two sets of schedulers, and two driver interfaces is clearly not ideal. Can we expect to see the end of the single-queue code soon? How much better is multi-queue really?

Unfortunately these are questions that require lots of testing on hardware and this writer is mostly a software guy. From a software perspective, it is clear that, when used with hardware that supports parallel submission to multiple queues, multi-queue should bring significant benefits. When used with single-queue hardware, the multi-queue system should be able to, at least, achieve parity with single queue. It would not be reasonable to expect that parity to be achieved overnight though, as anything new must be expected to be imperfect.

An example of this imperfection is a set of patches that was recently accepted to be included in Linux 4.15. The patches make changes to the mq-deadline scheduler to address some performance issues that Red Hat's internal storage testing discovered. It is reasonable to expect that other players in the storage field are performing their own tests, and are likely to find regressions when switching to multi-queue. It is also reasonable to expect these regressions to be fixed over the coming months. 2017 might not be the year of the multi-queue-only Linux, but that year is not likely to be far away.
