If you're looking to write fast code in Rust, good news! Rust makes it really easy to write really fast code. The focus on zero-cost abstractions, the lack of implicit boxing and the lifetime system that lets memory be managed statically mean that even naïve code is often faster than the equivalent in most other languages out there, and certainly faster than naïve code in any equivalently-safe language. Maybe, though, like most programmers you've spent your whole programming career safely insulated from having to think about any of this, and now you want to dig a little deeper and find out the real reason that Python script you rewrote in Rust runs 100x faster and uses a tenth of the memory. After all, they both do the same thing and run on the same CPU, right?
So, here's an optimization guide, aimed at those who know how to program but maybe don't know how their code maps to the real ones and zeroes being bandied around on the bare metal. I'll try to weave practical tips about optimizing Rust code together with explanations of why some code is faster than other code, and we'll end with a case study from the Rust standard library.
This post assumes decent familiarity with programming, a beginner's familiarity with Rust and almost no familiarity with CPU architecture.
Ok, I'll start with a few disclaimers before I get into the meat. Firstly, unless you're running into performance problems in real-life usage, optimize your code for readability before you optimize it for runtime performance. Compilers and humans alike are better at understanding boring, straightforward code, so if you write that first you'll be likely to have "good enough" performance with the added benefit of being maintainable. If you write nice, refactorable code it'll be easy to change if you later realize that it's being wasteful.
That's not to say that performance doesn't matter. You shouldn't only optimize slow programs. A long-running program using high amounts of CPU but not causing visible slowness is just as bad as a program that takes 30s instead of 3s to process some data, it's just that the former is wasting battery and the latter is wasting time. Weigh up the time it would take to optimize the code against the benefit you would get from the optimization.
The reason maintainability is so important is that a lot of optimizations are "try it and see" - if you can't make sweeping changes to your code without breaking it, you're going to have a really bad time. Now, speaking of "try it and see"...
There are lots of performance-tracing tools out there. A famous suite of tools for C and other systems languages is `valgrind`, which is extremely powerful but can be scary to get started with. If you just want a quick overview of what your program is doing from a performance standpoint, check out this article on analyzing Rust with `perf`, a fantastic and easy-to-use performance tool for Linux. Unless there's a glaring flaw, like pervasive `clone`ing or a blatantly sub-optimal algorithm, profiling with `perf` will likely give better results than simply optimizing stuff that "looks slow".

Another tool that's good for helping you avoid gotchas of all kinds (not just performance) is `clippy` - but you already knew that, because you're using it on all your code to begin with, right?
Your config-parsing code can be as slow as you like and it's unlikely to matter. Don't just optimize the slowest function in your program, optimize the one that takes up the most of your runtime. Those may be the same function, but if you get a 2 millisecond improvement on a function that's called 1000 times, that's better than a 1 second improvement on a function that's called once.
Now, as with every article on performance, this is where I add the requisite disclaimer: use better algorithms first. Don't invent new algorithms unless you do that for a living, but if you're running into performance problems, they're more likely to be due to a poor algorithm than to a poor implementation. Most programmers test their code on small datasets, so if you have O(n²) complexity the cost won't show up until you've tried it on a larger dataset. Even if you don't know what your algorithm's complexity is, there is at least one tip I can give to improve the complexity of the code as written.
Every use of `collect` creates a new intermediate container, and unless you're collecting into a variable-length container that can be stored on the stack, that's one or more allocations too. Use and abuse Rust's delicious new `impl Trait` syntax to allow separation of code into separate functions without sacrificing speed. `itertools` is a great tool for doing more of your work lazily, and if there's one virtue that humanity is lacking, it's laziness.
This is especially true for reading data from a file or over a network - using iterators allows you to do the work in constant space and a single O(n) pass, whereas using `collect` forces you to allocate the whole file in memory and then iterate over it again, using O(n) space and making at least two passes over the data.
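To make that concrete, here's a minimal sketch (the function names are mine, not from any real codebase) of splitting a lazy pipeline across functions with `impl Trait` - the stages compose without ever building an intermediate `Vec`:

```rust
// A sketch: returning `impl Iterator` lets you split a pipeline into named
// stages without collecting into intermediate containers.
fn parse_numbers<'a>(lines: impl Iterator<Item = &'a str> + 'a) -> impl Iterator<Item = i64> + 'a {
    // `filter_map` skips lines that fail to parse; nothing is buffered.
    lines.filter_map(|line| line.trim().parse().ok())
}

fn sum_positive<'a>(lines: impl Iterator<Item = &'a str> + 'a) -> i64 {
    // The whole chain runs lazily in a single pass over the input.
    parse_numbers(lines).filter(|&n| n > 0).sum()
}
```

The same shape works for `BufRead::lines` over a file: each stage stays a separate, testable function, but the composed pipeline still does its work in one pass.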
One trick to help you do this is collecting values into a tuple. For example, if you need to average a collection, sum the length and the total into a `(length, total)` tuple instead of iterating twice (unless you have a concrete iterable type like `Vec` for which getting the length is free). This also allows you to avoid allocating an intermediate collection in the case of an iterator that generates its data.
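A minimal sketch of the tuple trick (the function name is illustrative): one `fold`, one pass, no second iteration and no intermediate collection:

```rust
// Fold the stream into a (count, total) pair instead of iterating twice.
fn average(values: impl Iterator<Item = f64>) -> Option<f64> {
    let (count, total) = values.fold((0u64, 0.0f64), |(count, total), v| (count + 1, total + v));
    if count == 0 {
        None
    } else {
        Some(total / count as f64)
    }
}
```

This works even when the iterator generates its values on the fly and so can't be walked twice.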
Avoiding `collect`, for you C programmers out there, is essentially the Rust version of loop jamming.
So maybe the complexity of your algorithm can't be improved, and to make any improvements you need to get into the down-to-the-metal stuff. This is where the difference comes in between languages like Rust and C and languages like Python and Ruby. It's entirely possible that you know this already, but it's worth going over to make sure we're all on the same page.
There are two parts to any computation, the stuff it does and the stuff it does it on. The instructions, and the data.
Instructions are stored in the instruction cache - a chunk of really, really fast memory that's directly readable by the CPU. Each instruction can put data into and/or take data from the CPU's registers, a small number of small pieces of memory, either 32 or 64 bits wide depending on your computer's word size (note: floats). Only a small amount of data can be in registers at any one time, however, and data in registers cannot be accessed via pointer, so sometimes the CPU must access the computer's RAM. Since RAM is slow, the CPU tries to read in bulk and store the result in a series of increasingly small, increasingly fast caches. If it tries to access data that isn't in the smallest cache, it has to read the slightly larger cache, continuing up until it reaches RAM (note: swap space). The upshot is: you want to keep your data as small as possible, and you want data that is accessed together to be close together, so that the CPU loads as much of it at once as possible.
Note: floats - The CPU has a separate set of registers purely for acting on floating point numbers too, but that doesn't really affect how you code so I'm only mentioning it for completeness.
Note: swap space - After RAM you have another layer of dynamic memory called swap space. This is part of your hard drive which acts as RAM in an emergency. In an ideal world you should have this disabled but unfortunately you probably have programs on your computer that don't or can't respect the amount of RAM you have. Memory allocations failing is still a problem we haven't solved as a community.
Anything that's not stored inside your program's memory is bad for performance. The very worst place for data to be is on a different computer. A less awful - but still very awful - place for data to be is on your hard drive. Better still is in your RAM but, as mentioned before, RAM is slow. Almost the best possible place for your data is in CPU cache. You may have heard some folklore that allocating is bad, and this is the main reason why. Accessing two different locations one after another on the stack is fine, since they're likely to be on the same cache line. Accessing two different locations one after another on the heap is significantly slower, since it's much less likely that they're directly next to each other. We'll go into exactly why this is in a moment.
If you have to allocate, because you need variable-size containers, shared ownership or owned trait objects (note: trait objects), try to put data that will be accessed in sequence in order in RAM, so that when the CPU reads one element it necessarily reads the next few elements too, meaning it doesn't need to stall waiting for RAM in order to operate on them.
As a rule of thumb for whether something has to allocate: if you can tell me the amount of space the value will use up without running the program, it's stored on the stack. If you don't know something's size until runtime, it's allocated on the heap.
This means that `String`, `Vec`, `HashMap` and `Box<Trait>`/`Box<[T]>` all allocate, but any user-defined struct does not (it may contain something that does allocate, but it doesn't require any extra allocation to construct if you already have the allocated value to put inside of it). `Box<T>` where `T` has a statically-known size also allocates, but boxing statically-sized types is very rare. The only use-case I can think of is sending huge objects between threads.
This is why you may have heard some complaints about Haskell's use of a linked list of characters to represent a string. I'm not going to try to beat gankro's wonderful rant on the subject, but suffice to say that this is more-or-less the worst possible data structure to choose when each individual element is small and the number of elements is large, because it needs to store a pointer for every element, instead of a single integer for the entire array.
Not only that, but without some truly crazy compiler optimizations this means that each element may not be in the same cache line as the one before it and, worse, may even be out of order (so you need to load cache line A to read element 1, cache line B to read element 2, and then reload cache line A to read element 3, to give a contrived example). This is why the performance-savvy in languages such as Haskell and Lisp know to use `Vec`-style constructs when possible.
Back in the world of Rust, this means that you should avoid indirection-heavy representations like `Vec<Vec<_>>` for a matrix, since with that representation each sub-vec will likely be in a different location. Use a data structure that uses a flat `Vec` as a backing array and builds on top of it, unless you really do need the inner `Vec`s to be both jagged (each sub-vec is of a different size) and growable (you can change the size of each sub-vec independently at runtime). You probably don't need either, let alone both. If you need them to be a uniform size, store a `Vec` and a dimension tuple; if you need them to be jagged but don't need them to be growable, store a list of indices and return slices of the flat `Vec` using those. For an example of why this is good, let's dive into some basic math. Let's assume a matrix backed by a flat vector and a number of columns (the number of rows can be inferred from the number of columns and the data length):
```rust
// This isn't how Vec is defined in the standard library, but it's a
// simplified version with the same memory layout.
struct Vec<T> {
    pointer: *mut T,
    capacity: usize,
    len: usize,
}

struct Matrix<T> {
    data: Vec<T>,
    num_columns: usize,
}
```
So a matrix with N rows and M columns needs `N * M * size_of::<T>()` space for the elements, plus `size_of::<*mut T>() + 3 * size_of::<usize>()` for the "metadata" (the vector's pointer and the `capacity`, `len` and `num_columns` fields). To keep the numbers small, let's assume a machine where pointers and `usize` are 4 bytes each and cache lines are 64 bytes (64-byte lines being what you'll find on most modern x86 chips). If we had a 4x4 matrix of `f32` (also 4 bytes in size) this would mean:

- Metadata size = 4 + 3 * 4 = 16 bytes (maximum 2 cache misses)
- Data size = 4 * 4 * 4 = 64 bytes (maximum 2 cache misses)
Since the metadata and the data could be in separate parts of memory, we have to count their cache misses separately. Both the metadata and the data could cross a cache-line boundary and so require two cache misses to load. This means that the whole matrix would miss the cache 4 times in the worst case. A `Vec<Vec<f32>>` representation, on the other hand, would mean:

- Matrix metadata size = 4 + 2 * 4 = 12 bytes (maximum 2 cache misses)
- Inner vector metadata size = 4 * (4 + 2 * 4) = 48 bytes (maximum 2 cache misses)
- Data size = 4 * 4 = 16 bytes per row (maximum 2 cache misses per row, 8 cache misses total)
This means that the `Vec<Vec<f32>>` representation could miss the cache up to 12 times to read the whole matrix, much worse than the flat representation's 4.
Even better, if you statically know the matrix's size you can use statically-sized arrays like `[[T; N]; N]`. These are even cheaper than flat vectors, although you obviously can't use them for data whose size varies at runtime. The 4x4 matrix in the previous example would be `[[f32; 4]; 4]` and take up 64 bytes, meaning that it would only take 2 cache misses to load in the worst case.
Note: trait objects - See below for why you probably don't need owned trait objects, plus a way to get better cache locality for variable-size containers.
Now, the absolute best place for your data: registers. The more work you can do without non-local writes, the more that `rustc` and LLVM can assume about your data's access patterns. This is good because it means that data can be mapped to the CPU's physical registers, which are the fastest memory on your entire computer. Even better, if you make your data suitable for registers then anything that happens to that data can be aggressively optimized. Writes and reads through pointers have certain ordering restrictions on them that prevent optimization, but there are no such restrictions on register-allocated data.
It's worth noting that since Rust restricts pointers more than C does, the ordering restrictions on pointers could be relaxed. This hasn't been implemented in LLVM yet since most of the optimization work is based on leveraging the rules of C and C-family languages. Even if they did implement relaxations on the reordering rules, however, storing data in registers will still be easier to optimize.
Essentially, if you can make your calculations pure, then do so. Writing to pointers is worse than writing to local variables. As much as possible, you should try to constrain mutable writes to the data that you have ownership over, rather than mutable references. So a mutable loop counter is fine, but passing a mutable reference to a loop counter through multiple layers of functions is not (unless they end up getting inlined, of course). This is really just an extension of one of my first points: clean, boring code is easier to optimize than spaghetti.
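As a toy illustration of the principle (the names are mine): accumulate into a local variable and write through the mutable reference once at the end, rather than writing through the pointer on every iteration:

```rust
// Sketch: `total` is plain owned data, so the compiler is free to keep it in
// a register for the whole loop. Writing `*out += x` inside the loop instead
// would funnel every iteration through a pointer, which is harder to
// reorder and optimize.
fn sum_into(data: &[u64], out: &mut u64) {
    let mut total = 0u64;
    for &x in data {
        total += x;
    }
    // One non-local write, at the very end.
    *out += total;
}
```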
If you read the Accidentally Quadratic blog you'll know how many times the problem was using linear search to deduplicate data. A `HashSet` answers "does this contain that item?" in expected O(1) time - no matter how big the set gets, the check takes roughly the same amount of time. With a `Vec`, checking membership is O(n), meaning that it can take 50x longer to check membership in a 50-element vector than in a 1-element vector. Even better, if you're only using deduplication as an optimization rather than for correctness (i.e. "I want to process as little data as possible so I'll only process each unique element once") you can use a `HashSet` with bounded memory usage in order to prevent the high memory usage that a `HashSet` can otherwise cause. This is a valueless cache with a discard policy of ignoring new values after a certain load is reached. You could also use a valueless cache with a different discard policy, like LRU (least-recently used) or something even more complex. A simple `HashSet` can take you very far, however, and you should only resort to other forms of cache if you are running into unbounded memory use.
This is a smaller part of a wider point, which is an extension of the point about algorithms above: choose the right data structure. Think of the high-level operations you want to perform on the data structure (checking membership, insertion, deletion, appending, sorting, etc.) and then choose a structure with the best complexity over those operations. If you need to get the lowest value, use a priority queue/heap. If you need the lowest value and you need the elements to be unique, use a `BTreeMap`/`BTreeSet`. As mentioned above, if you need fast membership querying, use a `HashSet`.
The standard library's `HashSet` allows you to choose the hashing algorithm too, so choose a fast hashing implementation like FNV, as provided by the `rust-fnv` crate. One good trick is that if your data is the same size as a `u64` (for example, pointers), you can simply transmute the data to a `u64` and use that as the hash. Don't do this blindly, though, because it opens you up to denial-of-service attacks (which the standard library's default hasher is designed to prevent) and may give the `HashSet` an uneven load, which can degrade it towards the linear behaviour of a `Vec`. As with everything, you should try it and run benchmarks to see if it makes a difference.
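Here's a sketch of the "value as its own hash" trick using only the standard library's `Hasher`/`BuildHasherDefault` machinery (the `IdentityHasher` name is my own invention). Remember the caveat above: this is not DoS-resistant:

```rust
use std::collections::HashSet;
use std::hash::{BuildHasherDefault, Hasher};

// A hasher that just returns the u64 it was fed. Only suitable for keys that
// hash themselves via a single `write_u64` call, like u64 itself.
#[derive(Default)]
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, _bytes: &[u8]) {
        // Deliberately unsupported: byte-slice keys would need real hashing.
        unimplemented!("IdentityHasher only supports u64 keys");
    }
    fn write_u64(&mut self, n: u64) {
        self.0 = n;
    }
}

type FastSet = HashSet<u64, BuildHasherDefault<IdentityHasher>>;
```

Benchmark it against the default before committing to it; if your keys cluster badly, an identity hash can make things worse, not better.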
The canonical way to create trait objects is `Box<Trait>`, but the majority of code can get away with `&mut Trait`, which saves an allocation. If you absolutely need ownership then use a `Box`, but most use-cases can use an `&Trait` or `&mut Trait`. If possible, though, the best thing to do in these cases is to avoid using a trait object altogether. `impl Trait` is the obvious way to avoid them, but that doesn't allow dynamic dispatch since it's basically type inference in a fancy hat. A good trick is, if you want to allow a variable but finite number of implementors of a trait - because you want to choose between them or iterate over them - use either a tuple or some kind of heterogeneous list:
```rust
struct HList<Head, Tail>(Head, Tail);
struct HNil;
```
Then you can implement whatever trait you were going to turn into a trait object for this datatype and allow users to pass it in instead. For example, in `combine` there's a `choice` function that takes an array that could contain trait objects, but it also supplies an `.or()` combinator (and a `choice!` macro that expands to recursive `.or` calls) that returns an `Or<A, B>` which in turn implements `Parser`. This means that dispatch is static and the objects are all stored in order in memory (because it's just a set of recursive structs). Using traits and generics in this way means that you only need dynamic dispatch in the rarest of cases - essentially only when you need a collection of datatypes with heterogeneous behaviour whose set of elements isn't known until runtime.
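To see how a heterogeneous list dispatches statically, here's a minimal sketch using the `HList`/`HNil` definition above, with a made-up `Describe` trait standing in for something like `Parser`:

```rust
// A stand-in trait; in combine this role is played by `Parser`.
trait Describe {
    fn describe(&self) -> String;
}

struct HList<Head, Tail>(Head, Tail);
struct HNil;

impl Describe for HNil {
    fn describe(&self) -> String {
        String::new()
    }
}

impl<Head: Describe, Tail: Describe> Describe for HList<Head, Tail> {
    fn describe(&self) -> String {
        // The recursion is resolved at compile time: each level of the list
        // is a distinct concrete type, so every call here is statically
        // dispatched and the elements sit contiguously in memory.
        format!("{};{}", self.0.describe(), self.1.describe())
    }
}

impl Describe for u32 {
    fn describe(&self) -> String {
        format!("u32:{}", self)
    }
}

impl Describe for &str {
    fn describe(&self) -> String {
        format!("str:{}", self)
    }
}
```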
The more you can avoid dynamic dispatch, the better. It makes your code harder to reason about both by humans and by the compiler, and it causes each call to these functions to go through an indirection that messes with the CPU's instruction cache.
Fixed-length datatypes are trivially storable on the stack, but for dynamically-sized data it's not so simple. However, `smallvec`, `smallstring` and `tendril` are all variable-length datatypes that allow you to store small numbers of elements on the stack (note: shameless plug).
Due to the law of small numbers, you are very likely to have more of these small strings than larger ones. This is good because it reduces allocation, but it's great if you're storing them in a `Vec` or `HashMap`, since you will have less indirection and therefore better cache use. A good rule of thumb is to never have more than one layer of pointers to dereference before you reach your value (NASA enforces this rule in their C code, albeit for reliability rather than performance).
Libraries like `smallvec` are great for cache locality, since an array of `SmallVec<[T; 4]>` will have exactly the same cache locality as an array of plain `T` - as long as the length of each `SmallVec` is at most 4, the elements just get stored in order. Going back to the cache-miss calculations from earlier:
```rust
// This is a gross oversimplification of how this type is implemented in the
// crate, but it's enough to explain how it works.
enum SmallVec<T> {
    Small([T; 4]),
    Big(Vec<T>),
}

type Matrix<T> = SmallVec<SmallVec<T>>;
```
As long as there are at most 4 elements in the `SmallVec`, the size of each instance is the size of the data plus the size of the tag, which is:

```rust
let size_of_data = size_of::<T>() * 4;
let size_of_tag = max(size_of::<u8>(), align_of::<T>());
size_of_data + size_of_tag
```
The obvious question is why the size of the tag isn't just `size_of::<u8>()`. This is because if `T` were more than 1 byte in size, a 1-byte tag would leave all of the elements unaligned by 1 byte, which is bad. CPUs work much more slowly on unaligned data, but unless you write a compiler you will never have to think about that. The size of a type and its alignment don't have to be the same. For structs, for example, the alignment is typically the largest alignment of any of the members. For primitive types like pointers, integers and floats the alignment is the same as the size. The alignment and size of an `f32` are both 4. The alignment of a `SmallVec<f32>` is the largest alignment of its members, which is the same as the alignment of `[f32; 4]`, which is the same as the alignment of `f32`: 4.
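You can check these claims with `std::mem`. I'm using `#[repr(C)]` here because the default Rust representation doesn't guarantee field ordering, although the sizes and alignments come out the same in practice:

```rust
use std::mem::{align_of, size_of};

// `#[repr(C)]` pins down the layout so the numbers below are guaranteed.
#[repr(C)]
struct Mixed {
    a: u8,
    b: u64,
}

fn layout_facts() -> (usize, usize, usize, usize) {
    (
        align_of::<f32>(),     // primitive: alignment == size == 4
        size_of::<[f32; 4]>(), // arrays are packed: 4 * 4 = 16
        align_of::<Mixed>(),   // struct alignment = largest member alignment = 8
        size_of::<Mixed>(),    // size is padded up to a multiple of alignment = 16
    )
}
```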
If we had a 4x4 matrix of `f32`, the size of the matrix would be:

- Inner SmallVec size = 4 * 4 + 4 = 20 bytes
- Matrix size = 4 * (4 * 4 + 4) + 4 = 84 bytes (maximum 3 cache misses)
We don't need to calculate the inner and outer cache misses separately because they are guaranteed to be next to each other in memory.
From a cache standpoint this is as good as the flat vector representation, but there's nothing stopping you from accidentally making the inner vectors different lengths and breaking the invariant that an array's rows should be the same length.
I want to make something clear: you will never do these calculations in the process of optimizing your program. They are merely some mathematical justification for the voodoo folklore that "allocation is bad", since that is often countered with "`malloc` is fast". Both statements are true - the actual process of allocating and deallocating memory is fast, but data structures that allocate are worse for use-cases that require maximum speed.
Note: shameless plug - I wrote `smallstring`, and although it's similar to `tendril`, it's not a strict subset of it. It's smaller and more flexible with regard to the number of bytes it can store on the stack before requiring an allocation.
Duff's device is fun, but array-length-generic unrolled loops are unlikely to be faster than the equivalent optimized naïve code nowadays, since any optimizing compiler worth its bits will do this kind of optimization without you having to mangle your code and piss off future-you.
Having said that, if you know that an array is likely to be a multiple of `N` in length, try making it a `&[[T; N]]` and operating on a `[T; N]` in each iteration. This allows the compiler to batch instructions better, including implicit vectorization, and if and when Rust supports explicit vectorization it will allow you to safely specify that you want these operations done in parallel (see below for an explanation of vectorization). You might even be able to convert a flat array into this form with an assert followed by a transmute, if you're writing an unsafe abstraction (the first rule of Rust is that you never use `unsafe`; the second rule of Rust is that you never use `unsafe` in application code).
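If you can't change the array's type, `chunks_exact` gets you most of the way there safely - each iteration hands the optimizer a slice of statically-known length, with no transmute required. A sketch (the function is illustrative):

```rust
// Process a slice in fixed-size batches of 4; the leftover elements (fewer
// than 4) come from `remainder()` afterwards.
fn sum_lanes(data: &[f32]) -> [f32; 4] {
    let mut acc = [0.0f32; 4];
    let mut chunks = data.chunks_exact(4);
    for chunk in &mut chunks {
        // Every `chunk` is documented to have length exactly 4, which puts
        // the optimizer in a good position to drop bounds checks and
        // vectorize the four adds.
        for i in 0..4 {
            acc[i] += chunk[i];
        }
    }
    for (i, &x) in chunks.remainder().iter().enumerate() {
        acc[i] += x;
    }
    acc
}
```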
You can also use more classical loop unrolling if it allows you to reduce the strength of your operations. This means that if you have to calculate some value for each iteration of the loop and calculating this value takes longer than the body itself, manually unroll the body so you can calculate it less. Example: you can implement an integer logarithm function like so:
```rust
fn log_base(mut n: usize, base: usize) -> usize {
    let mut out = 1;
    loop {
        if n < base { return out; }
        out += 1;
        n /= base;
    }
}
```
However, `n /= base; out += 1;` is slower to compute than `n < base` (note: optimizing division). To take advantage of this fact, you can unroll the loop like so:
```rust
fn log_base_unrolled(mut n: usize, base: usize) -> usize {
    const UNROLL_COUNT: usize = 4;
    // Use a fixed-size array because it still gets enregistered, but we can
    // ensure that we don't get the array count and the `out` skip value out
    // of sync.
    //
    // A good solution would be to use a macro to generate this array and the
    // `if` chain below. Tip: you can write macros inside functions and they
    // are scoped to that function definition.
    let premultiplied_base: [_; UNROLL_COUNT] = [
        base,
        base * base,
        base * base * base,
        base * base * base * base,
    ];

    let mut out = 1;

    loop {
        if n < premultiplied_base[0] { return out; }
        if n < premultiplied_base[1] { return out + 1; }
        if n < premultiplied_base[2] { return out + 2; }
        if n < premultiplied_base[3] { return out + 3; }

        n /= premultiplied_base[UNROLL_COUNT - 1];
        out += UNROLL_COUNT;
    }
}
```
This is inelegant and could be better written with iterators, but from checking benchmarks it seems that this form is significantly (about 3x) faster than a version that replaces the hand-written chain of `if` statements with a `for` loop over `[n1, n2, n3, n4].iter().enumerate()`, and 3x faster than the naïve implementation that doesn't unroll the loop at all (note: benchmarks). Besides, that's what the abstraction boundary is for - to put a pretty dress on this disfigured wretch so that it's acceptable to be seen in public. This is by far the best kind of optimization: one that can be constrained to a single function whose callers don't have to know anything about its implementation.
A small note about the iterator benchmarks above: you should always use iterators for variable-length data, but for fixed-length data it seems to mess with enregistering and so you should use indexing where possible. The compiler will tell you if you index out-of-bounds (although by default it just throws a warning and not an error, I don't know why).
The fastest method by far converts to an `f64`, calls `.log(base)` on that, and then converts back. It doesn't work for large numbers, however, because of loss of precision. An `f128` type would probably fix that. This is probably a good time to note that although adding and multiplying integers is faster than doing the same for floats, for code that divides by a variable (i.e. not division by a constant, since that can be optimized by the compiler into a faster form, see below) or does something more complex like trigonometry, you should probably use floats. The compiler can't make that change for you - it doesn't preserve semantics, since you can lose precision - but you can check for areas where this would be an improvement and make the change manually.
Note: optimizing division - If this function is called with a constant value for `base` then the program will never do a division here, assuming const-folding and inlining take effect. There's a clever transformation the compiler can do to convert an integer division by a constant into a combination of a multiplication and a shift, taking advantage of overflow. As with all mechanical optimizations, you don't want to write this one yourself. Despite this, the `n /= base` line is still significantly slower than the comparison. Comparisons are really, really fast compared to other calculations.
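For the curious, here's roughly what that transformation looks like for `n / 10` on a `u32`, using the classic magic constant from Hacker's Delight - shown purely for illustration, since the compiler does this for you:

```rust
// floor(n / 10) == floor(n * 0xCCCC_CCCD / 2^35) for every u32 value.
// 0xCCCC_CCCD is ceil(2^35 / 10); the widening multiply plus shift replaces
// the (much slower) division instruction.
fn div10(n: u32) -> u32 {
    ((n as u64 * 0xCCCC_CCCD) >> 35) as u32
}
```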
Note: benchmarks - The benchmarking functions I used are here:
```rust
#[bench]
fn log_base_bench(b: &mut Bencher) {
    let input = test::black_box(5000000120510250);
    b.iter(|| log_base(input, 10));
}

#[bench]
fn log_base_unrolled_bench(b: &mut Bencher) {
    let input = test::black_box(5000000120510250);
    b.iter(|| log_base_unrolled(input, 10));
}
```
`test::black_box` is a magic function that prevents `rustc` and LLVM computing those function calls at compile time and converting them into constants, which they usually would (actually, it's not magic, it's just some inline assembly that doesn't do anything, since neither `rustc` nor LLVM will try to optimize anything that's been accessed by inline assembly).
This gives the following results:

```
test log_base_bench          ... bench:  14 ns/iter (+/- 0)
test log_base_unrolled_bench ... bench:   5 ns/iter (+/- 0)
```
If you want to reduce the number of implicit asserts that get compiled into the code, then instead of this:
```rust
fn do_something_with_array(array: &[u8]) -> u8 {
    array[0] + array[1] + array[2] + array[3] + array[4] + array[5]
}
```
Do this:
```rust
fn do_something_with_array(array: &[u8]) -> u8 {
    // The highest index used below is 5, so we need at least 6 elements.
    assert!(array.len() >= 6);
    array[0] + array[1] + array[2] + array[3] + array[4] + array[5]
}
```
This allows LLVM to realize that the later bounds checks are unreachable and elide them. It's useful for any code that may assert multiple different qualities about the same data, but it's especially useful for indexing, since we know that if `array[n]` succeeds then `array[n - 1]` will succeed too. This is similar to the point about fixed-length arrays in the previous section.
Essentially, try to consolidate checks into a single assert!
. This means that
the later checks become statically unreachable.
Normally, Rust can only inline functions that are either defined in-crate or, in the case of functions in other libraries, have `#[inline]` specified. LTO (link-time optimization) allows the compiler to inline cross-crate, at the cost of a compile-time penalty. I am of the opinion that compile times only matter for debug builds, so that's a tradeoff I'm willing to make. As with everything else here, profile and check that the tradeoff is worthwhile.
`#[inline]` feels good as a performance hint, but the truth is that optimizing compilers are really good at working out when a function would benefit from being inlined, and Rust isn't constrained to the slower standardized C calling convention - it can use `fastcc`, making function calls extremely cheap. You're more likely to bloat the size of your executable. That takes up more space on your hard drive, but who cares? If you have even a single bundled asset like an image or audio file, it will likely dwarf the size of your executable.
The real issue is that inlining can make your program no longer fit in the CPU's instruction cache. The CPU only has to go out to slower memory for its instructions when it calls into code that isn't already cached. The larger your binary is, the more likely this is to happen, and the more functions are inlined, the larger your binary. It's not the end of the world to have a large binary, but unless you're really certain that something will be improved by manually marking it as inline, and you have benchmarks to back that up, it's just as likely that you'll slow a program down with careless inlining as speed it up.
So now that I've scared you off inlining, let's talk about when it actually helps. Small functions that are called often are a good target for inlining - `Iterator::next`, for example, or `Deref::deref`. These are likely to be automatically inlined when called internally, but marking them as `#[inline]` will ensure this behaviour, as well as allow users of your library to inline them too, even if they don't use LTO.
The other class of functions that are good targets for forced inlining are ones that you know are often called with constant parameters. We'll go into this later on, but `{integer}::from_str_radix` is an excellent example: most uses of this function have a constant as the second parameter, and judicious use of `#[inline]` lets the compiler remove the branching on that parameter entirely. I tried benchmarking the difference between `#[inline(never)]` and `#[inline(always)]` on my reimplementation of `from_str_radix` (spoilers), and the difference was close to 3ns - outside the error bars, but extremely minor.
Also, the compiler really does get it wrong sometimes, and can miss inlining opportunities that would improve code speed. However, only add the `#[inline]` pragma if you can prove with benchmarks that it improves the speed; applying these pragmas is a bit of a dark art, and your effort is probably better spent elsewhere.
If you want to reduce the size of your code, you can try using `panic = "abort"`. This removes the "landing pads" that allow Rust to show a nice stack trace after a panic, and causes any panic to end the program instantly. I have legitimately seen non-trivial speedups on benchmarks for the `ion` shell after adding this option to the release build, and I can only attribute it to making more code fit in the instruction cache. I haven't tried it with many other programs, and it would probably only affect medium to large projects, but it's as easy as adding one line to your `Cargo.toml`, so try it out on your codebase and see if it improves your code's speed.
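That one line looks like this in your `Cargo.toml` (`panic` is a standard Cargo profile setting):

```toml
[profile.release]
panic = "abort"
```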
There's an absolutely amazing library for Haskell called Haxl, that automatically tracks the data dependencies of your network requests and batches them and runs them asynchronously as long as they don't overlap. It's something that shows the power of computational abstractions like monads and it's not something that has a, ahem, parallel in any other language, as far as I know. At least, not for IO. We've had this exact ability in the CPU for a long, long time. The CPU tracks the data dependencies of computations and will parallelize them wherever possible.
The reason data dependencies matter is that the CPU doesn't just execute one instruction at a time. As long as two instructions don't share a register they can safely be run simultaneously, so the CPU does so. This is essentially free parallelism without the need for locks, work queues or anything that affects your architecture at all, so you would be crazy not to take advantage of it.
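As a concrete sketch of exploiting this (the function here is my own illustration, not from the source), summing with several independent accumulators breaks the single dependency chain of a naive fold, so the CPU can issue the additions in parallel:

```rust
// Four independent accumulators: each addition depends only on that
// accumulator's previous value, so the CPU can run them simultaneously,
// whereas a single running sum forces every addition to wait for the last.
fn sum_ilp(data: &[u64]) -> u64 {
    let mut sums = [0u64; 4];
    let mut chunks = data.chunks_exact(4);
    for chunk in &mut chunks {
        sums[0] = sums[0].wrapping_add(chunk[0]);
        sums[1] = sums[1].wrapping_add(chunk[1]);
        sums[2] = sums[2].wrapping_add(chunk[2]);
        sums[3] = sums[3].wrapping_add(chunk[3]);
    }
    // Handle the leftover elements that didn't fill a chunk of 4.
    let tail: u64 = chunks.remainder().iter().sum();
    sums.iter().sum::<u64>().wrapping_add(tail)
}

fn main() {
    let data: Vec<u64> = (1..=100).collect();
    println!("{}", sum_ilp(&data)); // 5050
}
```

Whether this beats the naive loop in practice depends on the CPU and on what the compiler already does to the naive version, so benchmark before committing to it.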
Just like loop unrolling, parallelizable computation also lends itself well to autovectorization, which is the process where the compiler realizes that you're doing the same thing to multiple different values and converts it to a special instruction that, well, does the same thing to multiple different values.
The compiler can translate the following numerical code:

```
(a1 + a2) + (b1 + b2) + (c1 + c2) + (d1 + d2)
```

into a vector instruction that executes all four subexpressions about as fast as a single addition, plus an instruction to sum the parts:

```
%intermediate-value = add-vectors [%a1 %b1 %c1 %d1] [%a2 %b2 %c2 %d2]
sum-parts %intermediate-value
```
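In Rust terms, here's a sketch of a loop shape that LLVM can typically autovectorize. Whether it actually does depends on the target and optimization level, so treat this as an illustration rather than a guarantee:

```rust
// Elementwise addition over slices. Iterating with `zip` lets the
// compiler prove all accesses are in bounds, which removes bounds
// checks inside the loop; that is what unlocks vectorization.
fn add_slices(a: &[f32], b: &[f32], out: &mut [f32]) {
    for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [10.0, 20.0, 30.0, 40.0];
    let mut out = [0.0f32; 4];
    add_slices(&a, &b, &mut out);
    println!("{:?}", out); // [11.0, 22.0, 33.0, 44.0]
}
```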
Let's write a version of `usize::from_str_radix` that's about 30% faster than the one in the standard library.
```rust
// We're redefining these here since they're private in the stdlib
#[derive(Debug, Clone, PartialEq, Eq)]
struct ParseIntError {
    kind: IntErrorKind,
}

#[derive(Debug, Clone, PartialEq, Eq)]
enum IntErrorKind {
    Empty,
    InvalidDigit,
    Overflow,
    Underflow,
}

#[inline]
fn from_str_radix(input: &str, radix: usize) -> Result<usize, ParseIntError> {
    #[inline]
    fn to_digit_ascii(ascii: u8, radix: usize) -> Result<usize, ParseIntError> {
        let decimal_digit = ascii.wrapping_sub(b'0');

        let out = if radix > 10 && decimal_digit > 9 {
            (ascii | 32).wrapping_sub(b'a') + 10
        } else {
            decimal_digit
        } as usize;

        // A valid digit must be strictly less than the radix
        if out >= radix {
            Err(ParseIntError { kind: IntErrorKind::InvalidDigit })
        } else {
            Ok(out)
        }
    }

    if radix > 36 {
        panic!("from_str_radix: radix is too high (maximum 36)");
    }

    let bytes = input.as_bytes();

    if bytes.is_empty() {
        return Err(ParseIntError { kind: IntErrorKind::Empty });
    }

    let bytes = match bytes[0] {
        b'+' => &bytes[1..],
        b'-' => return Err(ParseIntError { kind: IntErrorKind::Underflow }),
        _ => bytes,
    };

    // A bare sign with no digits is invalid
    if bytes.is_empty() {
        return Err(ParseIntError { kind: IntErrorKind::InvalidDigit });
    }

    let mut mul = radix;
    let index = bytes.len() - 1;

    let mut output = to_digit_ascii(bytes[index], radix)?;

    for &byte in bytes[..index].iter().rev() {
        let digit = to_digit_ascii(byte, radix)?;
        let next_output = output.wrapping_add(digit * mul);

        if output > next_output {
            return Err(ParseIntError { kind: IntErrorKind::Overflow });
        }

        // May wrap on the final iteration, where the result is never used
        mul = mul.wrapping_mul(radix);
        output = next_output;
    }

    Ok(output)
}
```
I explicitly use the `wrapping_*` functions not for optimization purposes (overflow checks are compiled out in release builds anyway), but because wrapping is required for correct behaviour. You'll notice some optimizations here:
- We start at the end and work backwards, keeping a "mul" counter. Since we have a finite number of supported radixes (radices?) and an upper bound on the size of the output number, we could almost certainly generate a list of all possible values for `mul` and use indexing instead of multiplication here, but it's a lot of work for a speed boost that would almost certainly be minimal. Originally I wrote a version of this that works forwards and multiplies `output` by `radix` each loop, but the backwards method is 10% faster. This seems to be due to better instruction-level parallelism: the multiplications can be parallelized, only the addition relies on the previous iteration's value for `output`, and addition is much faster. Additionally, we waste a cycle doing `0 * radix` in the other version. The forward method is still 20% faster than the standard library version. Any fold or reducing operation can be improved in this way by making more of the operations parallelizable.
- We rely on overflow to test `A < x < B` comparisons. This is only useful here because we already need the `x - A` value, so we're saving an extra comparison and a bitwise and. In most code, `A < x && x < B` is as cheap or cheaper than `x - A < B` with overflow, since even though it's one instruction extra, the comparisons can be done in parallel, plus comparisons and bitwise `&` are incredibly cheap (technically `&&` should result in a branch and therefore be slower, and `rustc` does in fact emit a branch for that code, but after LLVM's optimizations it gets converted to a bitwise and). This is where most of our gains come from: the standard library's implementation doesn't unify the codepaths for upper- and lowercase letters like we do, and does a match on the entire unicode codepoint range.
- We don't do `output + digit * mul` when `output == 0` and `mul == 1`. This seems to be consistently 1ns faster, but it's possible that this doesn't make a difference and the 1ns speedup I'm seeing is pure luck. I reran the benchmarks both with and without the change multiple times and saw a consistent difference, but that doesn't rule out luck. This is the problem with microbenchmarks: when the difference becomes small enough, you can't tell whether you're really making things faster.
- We use Rust's safe iterator interface, which is as fast as the C idiom of storing a "start" and "end" pointer so you can check whether the loop is finished with a simple `==`. If you ever hear someone say "Rust's safety guarantees are useless because you need to drop down to unsafe to get any real speed" (I've seen this almost verbatim on Hacker News before), you can show them this.
- I tried to unroll the loop in this function, but the logic became too convoluted and would have taken up more instructions. Usually you would only use unrolling if you can reduce the strength of operations (in the previous unrolling example, replacing division with comparison), which this does not, but it's worth trying optimizations just to see if you end up with faster code. Most code relies on running the same thing hundreds or thousands of times over; if you're already optimizing an oft-used function, it's worth trying out different techniques and rerunning the benchmarks, since you might stumble across a small improvement that leads to huge benefits overall.
- We rely on const-folding for the comparison with `radix` to be optimized away, so we mark the function as `#[inline]` in order to allow external crates to apply the same optimization.
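As a small standalone illustration of the overflow-based range check described above (this example is mine, not from the standard library):

```rust
// For unsigned arithmetic, `x.wrapping_sub(A) < B - A` is equivalent to
// `A <= x && x < B`: values below A wrap around to huge numbers that
// fail the single comparison.
fn is_ascii_digit_sub(byte: u8) -> bool {
    byte.wrapping_sub(b'0') < 10
}

// The straightforward two-comparison version, for comparison.
fn is_ascii_digit_cmp(byte: u8) -> bool {
    b'0' <= byte && byte <= b'9'
}

fn main() {
    // The two versions agree on every possible byte value.
    for b in 0..=255u8 {
        assert_eq!(is_ascii_digit_sub(b), is_ascii_digit_cmp(b));
    }
    println!("equivalent for all byte values");
}
```

As the list item notes, this only wins when you already need the `x - A` value for something else; otherwise the two plain comparisons are just as cheap.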
The method I used for this is the basic method you should use for any optimization work: write a representative benchmark and then progressively tweak and rerun benchmarks. Doing this for pure functions is much easier, so one of the first things you should do to optimize any function that's called in a tight loop is to make it pure. This avoids indirect writes and reads (reads and writes to places in memory that are likely to be outside cache lines) and makes benchmarking much, much easier. If you use test-driven development for reliability, this is the equivalent for performance.
Extending this to work on signed integer types is an exercise for the reader. Tip: unlike C, Rust's `wrapping_*` operations guarantee two's complement behaviour for signed integers.
Here are the functions I used to benchmark the code:
```rust
// Benchmarks use the nightly-only `test` crate
#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

#[bench]
fn bench_from_str(b: &mut Bencher) {
    b.iter(|| {
        let input = black_box("1235112512");
        assert_eq!(from_str_radix(input, 10), Ok(1235112512));
        let input = black_box("FFaf125A");
        assert_eq!(from_str_radix(input, 16), Ok(0xFFaf125A));
    });
}

#[bench]
fn bench_from_str_native(b: &mut Bencher) {
    b.iter(|| {
        let input = black_box("1235112512");
        assert_eq!(usize::from_str_radix(input, 10), Ok(1235112512));
        let input = black_box("FFaf125A");
        assert_eq!(usize::from_str_radix(input, 16), Ok(0xFFaf125A));
    });
}

#[bench]
fn bench_from_str_nonconstradix(b: &mut Bencher) {
    b.iter(|| {
        let input = black_box("1235112512");
        let radix = black_box(10);
        assert_eq!(from_str_radix(input, radix), Ok(1235112512));
        let input = black_box("FFaf125A");
        let radix = black_box(16);
        assert_eq!(from_str_radix(input, radix), Ok(0xFFaf125A));
    });
}

#[bench]
fn bench_from_str_native_nonconstradix(b: &mut Bencher) {
    b.iter(|| {
        let input = black_box("1235112512");
        let radix = black_box(10);
        assert_eq!(usize::from_str_radix(input, radix), Ok(1235112512));
        let input = black_box("FFaf125A");
        let radix = black_box(16);
        assert_eq!(usize::from_str_radix(input, radix), Ok(0xFFaf125A));
    });
}

#[bench]
fn bench_from_str_1char(b: &mut Bencher) {
    b.iter(|| {
        let input = black_box("1");
        assert_eq!(from_str_radix(input, 10), Ok(1));
        let input = black_box("F");
        assert_eq!(from_str_radix(input, 16), Ok(0xF));
    });
}

#[bench]
fn bench_from_str_native_1char(b: &mut Bencher) {
    b.iter(|| {
        let input = black_box("1");
        assert_eq!(usize::from_str_radix(input, 10), Ok(1));
        let input = black_box("F");
        assert_eq!(usize::from_str_radix(input, 16), Ok(0xF));
    });
}
```
Results:

```
test bench_from_str                      ... bench:  22 ns/iter (+/- 7)
test bench_from_str_native               ... bench:  36 ns/iter (+/- 0)
test bench_from_str_nonconstradix        ... bench:  26 ns/iter (+/- 0)
test bench_from_str_native_nonconstradix ... bench:  39 ns/iter (+/- 0)
test bench_from_str_1char                ... bench:   5 ns/iter (+/- 0)
test bench_from_str_native_1char         ... bench:  13 ns/iter (+/- 0)
```
Something I've noticed with benchmarks below 1ms is that it can take some time to "spin up" the CPU. Occasionally the first benchmark in a set will take 20-30ns longer than the ones after it. Duplicating the benchmark verbatim and taking the number with the lowest variance avoids this issue. I think it's due to the CPU needing to gather information in order to do proper branch prediction. Ideally you'd just not do micro-benchmarks, but some functions do legitimately call for it. Don't trust a benchmark, especially a microbenchmark, until you've rerun it multiple times.
I reran this particular benchmark at least 10 times in total (not including the benchmarks I ran while editing the code) to ensure that the numbers were stable. Although the averages are extremely stable (the native one was sometimes slightly slower; the 36ns value above is what I see most of the time), the variances are mostly 0-3ns with spikes of 13-26ns. I don't have a good explanation for this, so expect a follow-up post with tips on writing better benchmarks.
This is a perfect example of why low-level optimization is important, since this is exactly the kind of function that could be used hundreds of thousands of times in parsers of textual data and a 10ns speedup here could lead to meaningful improvements over the life of the program. It's also an example of why you should avoid low-level optimization when possible. The original stdlib implementation of this function isn't the most idiomatic code, but it's significantly more readable than this. Having said that, though, it's a testament to Rust's commitment to zero-cost abstractions that you can write mostly-idiomatic, safe code and have it perform as well as equivalent C/C++ code that would require use of unsafe pointer arithmetic.
If I had to sum the trick to optimization up in a pithy QotW-ready snippet it would be this:
The fastest code is code that doesn't run at all, the second-fastest code is code that never stops running.
Ideally, you want to do less work, and if you're doing the minimum amount of work you want to reduce the amount of time the CPU spends waiting around.
So why is Rust faster than Python? Because Python has dynamic typing, meaning that it doesn't know the size of anything until runtime. It essentially uses a variation on the `smallvec` method of allocating small values on the stack and then "upgrading" them to the heap at a certain size. This means that although Python can be good for simple numeric calculations, anything larger than numbers can be very cache-unfriendly. Additionally, since it's not compiled, it can't make use of instruction-level parallelism: every instruction in Python's bytecode is essentially executed as a compare-and-branch (which instruction is this?), a load (read the current values, which are pointers and not stored in registers), and only then the "real" instruction.
PyPy fixes the second problem by converting bytecode into machine code, which then means that it only has to execute the read and the "real" instruction. This is much, much faster than the regular Python interpreter, since compare-and-branch is one of the slower instructions out there (see the note on branching at the end of this post), and it allows other optimizations to be performed on the code that aren't possible with the higher-level bytecode representation.
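To make that dispatch overhead concrete, here's a toy stack-machine interpreter of my own devising (not Python's actual loop): every operation pays for a `match` (the compare-and-branch) and stack loads before any real work happens.

```rust
// A toy bytecode for a stack machine.
enum Op {
    Push(i64),
    Add,
    Mul,
}

fn run(program: &[Op]) -> i64 {
    let mut stack = Vec::new();
    for op in program {
        // Dispatch: a compare-and-branch on the opcode, paid per instruction.
        match op {
            Op::Push(v) => stack.push(*v),
            Op::Add => {
                // Loads from memory (the value stack), then the "real" add.
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a + b);
            }
            Op::Mul => {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a * b);
            }
        }
    }
    stack.pop().unwrap()
}

fn main() {
    // (2 + 3) * 4
    let program = [Op::Push(2), Op::Push(3), Op::Add, Op::Push(4), Op::Mul];
    println!("{}", run(&program)); // 20
}
```

A JIT like PyPy's removes the per-operation dispatch by emitting the adds and multiplies directly as machine code.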
If you want more Rustic performance tips with more numbers, I would consider BurntSushi's excellent post-mortem of ripgrep required reading for anyone wanting to write fast software (it was the thing that originally sent me down this deep, deep rabbit hole). For more general systems-language-y points, check out Andrei Alexandrescu's talk "Fastware", from which the `from_str_radix` and `log_base` code was adapted. A lot of the points in this article are expansions upon points behind one of those two links.
I hope that whether you're a soft-shell Rustacean or a grizzled veteran, this has given you a better sense of when some code smells from a performance perspective. Go make things go vroom.
Note on branching: I didn't go into this in the article proper because I can't think of any practical advice stemming from it apart from "if you're writing an interpreted language, use a JIT", which, if you care about performance, will be your second step after using bytecode instead of an AST-walking interpreter. I've tried to write Rust code that converts conditionals into mathematical operations, but I've yet to find anything that gives better than parity with the original, optimized code. Compilers are really good at this stuff.