Discussion: Apparent optimiser bug breaks compiled code (2.1.0-alpha)
Alexander Gall
2014-09-30 07:34:31 UTC
I'm currently working on an implementation of a pure Lua Bloom filter for
the Snabb Switch project (https://github.com/SnabbCo/snabbswitch, my code
isn't public yet, though). I just spent a week puzzling over extremely
weird anomalies of my code that only occur when it is compiled. I could
finally narrow it down to the attached simple standalone program, which is
a variant of the Murmur3 hash function. In particular, it's a simplified
version of MurmurHash3_x64_128 as per the reference implementation at
https://code.google.com/p/smhasher/source/browse/trunk/MurmurHash3.cpp,
applicable to one-byte values.

The code was distilled down to the bare minimum that exhibits a similar
anomaly to my Bloom filter code, namely that it behaves differently when it
is compiled than when it is not. What it does is simply hash the one-byte
value 0x01 15 times in a loop and compare the lowest 32 bits of the 128-bit
hash value to the expected number 0x8374fe16 (the code is not endian-safe,
so if you run it on a big-endian machine, that number will differ).

The manner in which the hash value is stored and accessed from the loop may
look weird, but it should work correctly when compiled nevertheless, I
believe.
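
Roughly, the pattern boils down to storing the hash through a 64-bit
pointer and reading it back through a separate 32-bit pointer (a simplified
sketch, not the attached murmur.lua itself):

  local ffi = require("ffi")

  -- 128-bit hash value kept as two 64-bit words
  local hash = ffi.new("uint64_t[2]")
  -- a second, 32-bit view onto the same memory
  local hash32 = ffi.cast("uint32_t *", hash)

  hash[0] = 0x8374fe16ULL   -- the hash function stores through the 64-bit pointer
  local low32 = hash32[0]   -- the check reads back through the 32-bit pointer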

When executed with the compiler turned off, all is fine

$ luajit -v
LuaJIT 2.1.0-alpha -- Copyright (C) 2005-2014 Mike Pall. http://luajit.org/
$ luajit -joff murmur.lua
All fine

But with JIT turned on

$ luajit -jon murmur.lua
Expected 8374fe16, got 8374fe16 12 true deadbeef
Expected 8374fe16, got 8374fe16 13 true deadbeef
Expected 8374fe16, got 8374fe16 14 true deadbeef
Expected 8374fe16, got 8374fe16 15 true deadbeef

Note that I use hotloop=10. Changing this will delay the effect until the
number of iterations exceeds the value of that parameter. This is clearly
because the code is interpreted up to that point and thus works as expected.
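
For reference, the parameter can be set on the command line,

$ luajit -Ohotloop=10 murmur.lua

or from within the script itself:

  require("jit.opt").start("hotloop=10")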

What seems to be happening is that as soon as the loop is compiled, the
assignment of the variable foo as well as the expression in the if
statement operate on the value that was set before the hash is calculated.
However, that value is always overwritten by the hash function.

The effect disappears if any of the following optimizations are turned off:
fold, cse, fwd. This is why I suspect a bug in the JIT optimizer code.
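
E.g., with fold disabled the check passes again (likewise for -O-cse and
-O-fwd):

$ luajit -O-fold murmur.lua
All fine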

I checked the current HEAD of the v2.1 branch as well as random older
versions of that branch. All of them exhibit this problem.

Any insights would be highly appreciated.
--
Alex
Mike Pall
2014-09-30 12:16:27 UTC
Post by Alexander Gall
The manner in which the hash value is stored and accessed from the loop may
look weird, but it should work correctly when compiled nevertheless, I
believe.
Nope, you're violating the strict aliasing rules. You're storing
via a 64 bit pointer and loading via another 32 bit pointer. The
compiler is free to optimize out your 32 bit check.

Use a union and consistently dereference from the _same_ variable
that holds the union. That's the only allowed exception from the
strict aliasing rules. E.g.: uu.u64[0] = x; y = uu.u32[0]
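
Spelled out with the FFI, that looks roughly like this (declaration and
field names are just an illustration):

  local ffi = require("ffi")
  local hash_t = ffi.typeof("union { uint64_t u64[2]; uint32_t u32[4]; }")

  local uu = hash_t()                 -- one union cdata object
  uu.u64[0] = 0x123456789abcdef0ULL   -- store through the 64 bit view
  local y = uu.u32[0]                 -- load through the 32 bit view of the same union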


Also, you should read up on the Lua BitOp semantics. You never
need a 'band(expr, max64)'. Either pass the expression on to the
next bit operation, which performs the coercion anyway, or use
tobit() if it's not a bit operation.
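
I.e., something along these lines (sketch with made-up values):

  local bit = require("bit")
  local bxor, tobit = bit.bxor, bit.tobit

  local h1 = 0x0123456789abcdefULL
  local k1 = 0xfedcba9876543210ULL

  k1 = bxor(k1, h1)        -- no band(..., max64) needed, the bit op coerces anyway
  local low32 = tobit(k1)  -- tobit() only where the value leaves the bit operations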

Also, since you're using 64 bit cdata numbers: they already wrap
around on overflow and they don't have the range problem with
multiplication, either. And you should initialize h1, h2, k1, k2
with 0ULL and not 0, which avoids coercion issues, too.
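
I.e. (sketch):

  -- initialize all state as 64 bit cdata, not plain numbers
  local h1, h2, k1, k2 = 0ULL, 0ULL, 0ULL, 0ULL
  local c1 = 0x87c37b91114253d5ULL
  k1 = (k1 + 0x01ULL) * c1   -- uint64_t arithmetic wraps on overflow, no FP range issues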

--Mike
Alexander Gall
2014-09-30 12:34:27 UTC
The aliasing rule has indeed escaped me so far. This probably explains the
other issues I've been having but did not understand.

Thanks!
--
Alex
Post by Mike Pall
Post by Alexander Gall
The manner in which the hash value is stored and accessed from the loop may
look weird, but it should work correctly when compiled nevertheless, I
believe.
Nope, you're violating the strict aliasing rules. You're storing
via a 64 bit pointer and loading via another 32 bit pointer. The
compiler is free to optimize out your 32 bit check.
Use a union and consistently dereference from the _same_ variable
that holds the union. That's the only allowed exception from the
strict aliasing rules. E.g.: uu.u64[0] = x; y = uu.u32[0]
Also, you should read up on the Lua BitOp semantics. You never
need a 'band(expr, max64)'. Either pass the expression on to the
next bit operation, which performs the coercion anyway, or use
tobit() if it's not a bit operation.
Also, since you're using 64 bit cdata numbers: they already wrap
around on overflow and they don't have the range problem with
multiplication, either. And you should initialize h1, h2, k1, k2
with 0ULL and not 0, which avoids coercion issues, too.
--Mike
Alexander Gall
2014-10-07 11:41:34 UTC
Mike Pall
2014-10-08 20:14:02 UTC
Post by Alexander Gall
In the current state, the calculation of the hash of the input
data starts to fail when the code is optimized, but strangely enough
I've found a problem with fused loads of constants under high
register pressure.

Thank you for the test case! Fixed in the git repository.
Post by Alexander Gall
A remark about the Bloom filter code: it contains loops with a low
number of iterations (4 in this example), which is fixed and known
when the filter is created. The code uses an automated "loop unroller"
to, well, unroll these loops. This hack has sped up processing
considerably (avoidance of trace aborts due to loop unrolling by the
compiler and improved optimization to eliminate GC overhead). I'd be
interested to learn whether that's actually a good thing to do or has
any drawbacks (like causing the effect I'm seeing here ;)
Well ... high register pressure, lots of spills etc.

But the code isn't optimal, anyway. Too many abstractions, lots of
tiny details (e.g. that multiply by 16 goes via FP). You should
always have a close look at the generated IR and assembler code,
if you're doing performance tuning at that level.
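
E.g. (file name illustrative):

$ luajit -jv bloom.lua      # one line per started/completed/aborted trace
$ luajit -jdump bloom.lua   # bytecode, IR and machine code for each trace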

--Mike
Alexander Gall
2014-10-09 09:41:46 UTC
Post by Mike Pall
Post by Alexander Gall
In the current state, the calculation of the hash of the input
data starts to fail when the code is optimized, but strangely enough
I've found a problem with fused loads of constants under high
register pressure.
Thank you for the test case! Fixed in the git repository.
Thanks for the quick fix.
Post by Mike Pall
Post by Alexander Gall
A remark about the Bloom filter code: it contains loops with a low
number of iterations (4 in this example), which is fixed and known
when the filter is created. The code uses an automated "loop unroller"
to, well, unroll these loops. This hack has sped up processing
considerably (avoidance of trace aborts due to loop unrolling by the
compiler and improved optimization to eliminate GC overhead). I'd be
interested to learn whether that's actually a good thing to do or has
any drawbacks (like causing the effect I'm seeing here ;)
Well ... high register pressure, lots of spills etc.
Understood. I've actually encountered trace aborts due to exhausted spill
slots when I was playing around with this kind of thing.
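
The unroller itself is nothing fancy; something in the spirit of the
following (a much simplified, hypothetical illustration; the real code
differs in the details and isn't public yet):

  -- generate 'return function(h, key, seeds, hash) ... end' with the body
  -- repeated n times, '@' standing in for the iteration index
  local function unroll(n, body)
     local src = {"return function(h, key, seeds, hash)"}
     for i = 1, n do
        src[#src+1] = (body:gsub("@", tostring(i)))
     end
     src[#src+1] = "end"
     return assert(loadstring(table.concat(src, "\n")))()
  end

  -- unrolled equivalent of: for i = 1, 4 do h[i] = hash(key, seeds[i]) end
  local set4 = unroll(4, "h[@] = hash(key, seeds[@])")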
Post by Mike Pall
But the code isn't optimal, anyway. Too many abstractions, lots of
tiny details (e.g. that multiply by 16 goes via FP). You should
always have a close look at the generated IR and assembler code,
if you're doing performance tuning at that level.
I certainly haven't quite developed an eye for these tiny details yet
(and this is my first project in Lua as well). All of your hints are highly
appreciated.

In fact, the main concern I currently have for my application is GC overhead.
I need to forward packets up to a rate of 10Gbps (or several 100K or a few
M packets per second) and I fear that GC can cause undesirable jitter and
packet drops. This is backed more by a gut feeling than hard evidence, though.
I'd be interested in your opinion on this.
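
E.g., one thing I might try is bounding the GC work done per packet batch
instead of relying on the automatic cycles (untested sketch):

  collectgarbage("stop")         -- no automatic collection cycles
  local function after_batch()
     collectgarbage("step", 10)  -- a small, bounded amount of GC work per batch
  end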

I found that I can maintain a fairly high level of abstraction while still
getting enough overall performance. I suppose this coding style tends to generate
more garbage, which makes it more dependent on optimizations by the compiler.
So far, that has worked quite well, but I'm prepared to change things at a
fundamental level as I learn more about the system.
--
Alex