Wednesday, 18 January 2012

Iteration

++health;

A typical container iteration loop tends to be written like this:

for (auto i = v.begin(); i != v.end(); ++i)
    i->f();

You see this kind of code all over the place. You can't get any more efficient than that, can you? Let's look at the generated code (Visual Studio 2010, x86, optimised, security tests disabled, and rewritten for readability):

mov  edi,dword ptr [v]
  mov  esi,dword ptr [edi]
  cmp  esi,dword ptr [edi+4]
  je   finish
  nop
repeat:
  call T::f
  add  esi,4
  cmp  esi,dword ptr [edi+4]
  jne  repeat
finish:

So what do we have here? First we load our container's address into edi and then read a dword from it. This will be the 'begin iterator'. Then we compare it with a dword read from edi+4 (the end iterator) and terminate if they're equal. Then we go into our loop, calling our member function, adding 4 to esi (incrementing the iterator) and then doing the same comparison to repeat the loop.

Let's do the same thing using an algorithm:

std::for_each(v.begin(), v.end(), [](T& t){
    t.f();
});

This is the code produced by that algorithm:

mov  eax,dword ptr [v]
  mov  esi,dword ptr [eax]
  mov  edi,dword ptr [eax+4]
  cmp  esi,edi
  je   finish
repeat:
  call T::f
  add  esi,4
  cmp  esi,edi
  jne  repeat
finish:

It's similar, but subtly different. First is that v's address is begin stored in eax instead of edi. No big deal; compilers' register allocation algorithms often results in irrelevant differences like this. The significant difference occurs when it does the end-of-loop test. The end iterator is being read into a register, and it is that register which is being compared against for our end-of-loop condition. Compare this to the previous code where the end iterator is being read from memory every time round the loop. Comparing two registers is faster than comparing a register with a memory location.

So the simple 'for' case has a performance penalty compared to the algorithm case! How can this be? Surely the compiler has to do more work to inline the algorithm and the lambda? It does have to do that work, but the problem is not one of inlining. The problem is aliasing.

Aliasing is the issue of multiple pointers/references pointing to the same memory. In the case of the 'for' loop, we are calling end() each time round the loop, which the compiler translates into a read from memory. Why doesn't the compiler cache the result in a register? Well, it can't, not without violating the rules of C++, because it doesn't know that the end iterator won't change each time round the loop. That is, it doesn't know that the container isn't being an aliased by some code executing within the call to T::f(), so it's forced to re-evaluate the function - hence the reread from memory. It's entirely possible that, if it could see enough of the loop body to know everything it touches, it could cache the data that it knows won't change, but it can't do that in general.

The algorithm produces the better code because end() is only called once to produce the argument to the std::for_each function. Once the function executes, it only operates on those iterators, not the container.

It's simple enough to change the original for loop to avoid the problem:

for (auto i = v.begin(), e = v.end(); i != e; ++i)
    i->f();

You'll find that this now produces the same quality of code as the algorithm.

Is this kind of optimisation really significant, in the grant scheme of things? Well, probably not in most cases, but it depends on what the loop is doing, how your data is organised and what the rest of your program is doing. It's a prime candidate for false sharing to kick in, where the extra unnecessary read on a large multiprocessed algorithm could become a significant problem.

Also, as C++ programmers, we've already conditioned ourselves to use pre-increment on loop iteration instead of a post-increment because of the performance concerns of creating temporary iterators, so it's not much of a stretch to apply this optimisation too.

And all of this will be done for us when the range-based for loop syntax becomes widespread.

1 comment: