Discussion:
int64 sum, python vs luajit interpreter
lex pops
2014-08-22 20:37:58 UTC
Permalink
Hello,

Why does the luajit 'interpreter' run slower than the python
interpreter in this example?

==== sum.lua ====
local ffi = require('ffi')

local function sum_int()
local sum = 0LL
for i=1,10000000 do
sum = sum + i
end
return sum
end

print(sum_int())

==== sum.py ====
def f():
sum = 0
for x in xrange(10000001):
sum += x
return sum

print(f())
====

$ time ./luajit -joff /tmp/sum.lua
50000005000000LL

real 0m0.923s
user 0m0.920s

$ time python /tmp/sum.py
50000005000000

real 0m0.354s
user 0m0.350s

====

Since python uses boxed integers, I expected luajit to meet or beat
python's time. Turning on the jit or using numbers instead of int64s
makes luajit faster by a wide margin.

lex
Florian Weimer
2014-08-22 20:45:40 UTC
Permalink
Post by lex pops
Since python uses boxed integers, I expected luajit to meet or beat
python's time.
IIRC, Python caches recently freed boxed integers for immediate
reduce. This is one of the cases where reference counting can be a
huge win because the interpreter immediately notices that the integer
object is unreachable and available for future reuse.

For LuaJIT, the interpreted code is essentially a garbage collector
benchmark.
Tudor Bosman
2014-08-22 20:51:38 UTC
Permalink
Also, ffi int64_t are boxed as well.

-Tudor.
Post by Florian Weimer
Post by lex pops
Since python uses boxed integers, I expected luajit to meet or beat
python's time.
IIRC, Python caches recently freed boxed integers for immediate
reduce. This is one of the cases where reference counting can be a
huge win because the interpreter immediately notices that the integer
object is unreachable and available for future reuse.
For LuaJIT, the interpreted code is essentially a garbage collector
benchmark.
lex pops
2014-08-22 21:54:26 UTC
Permalink
Post by Florian Weimer
Post by lex pops
Since python uses boxed integers, I expected luajit to meet or beat
python's time.
IIRC, Python caches recently freed boxed integers for immediate
reduce. This is one of the cases where reference counting can be a
huge win because the interpreter immediately notices that the integer
object is unreachable and available for future reuse.
For LuaJIT, the interpreted code is essentially a garbage collector
benchmark.
Interesting. I suspected something related to allocations. Is the lua
code the gc benchmark or rather the allocator benchmark? Running with
collectgarbage('stop') makes it even slower.

There is no way to avoid the allocation for the result of the 'sum +
i' expression, I suppose?

lex
Karel Tuma
2014-08-23 15:02:41 UTC
Permalink
Post by lex pops
Post by Florian Weimer
Post by lex pops
Since python uses boxed integers, I expected luajit to meet or beat
python's time.
IIRC, Python caches recently freed boxed integers for immediate
reduce. This is one of the cases where reference counting can be a
huge win because the interpreter immediately notices that the integer
object is unreachable and available for future reuse.
For LuaJIT, the interpreted code is essentially a garbage collector
benchmark.
Interesting. I suspected something related to allocations. Is the lua
code the gc benchmark or rather the allocator benchmark? Running with
collectgarbage('stop') makes it even slower.
Turning off GC actually helps, about 20%. You have to be careful to time
this correctly, don't forget to put os.exit() at the end of your script.

(or ffi.cdef("void exit(int);");ffi.C.exit(0) to be extra sure).

Allocator still amounts for a lot of time, but majority is the path
full of hardships taken:

Samples: 13K of event 'cycles', Event count (approx.): 9939866374
16.93% luajit-ljx luajit-ljx [.] lj_cconv_ct_ct
15.22% luajit-ljx luajit-ljx [.] carith_int64
11.85% luajit-ljx luajit-ljx [.] carith_checkarg
10.41% luajit-ljx luajit-ljx [.] mmcall
8.28% luajit-ljx luajit-ljx [.] lj_alloc_malloc

LuaJIT interpreter handles this case as an ordinary metamethod implemented
in C.

Python optimizes the common case (and has a dedicated boxed int allocator).
Post by lex pops
There is no way to avoid the allocation for the result of the 'sum +
i' expression, I suppose?
Correct, your ffi ctype is boxed, meaning every result allocates.
Post by lex pops
lex
Юрий Соколов
2014-08-24 14:13:13 UTC
Permalink
Did you tried

sum = ffi.new("uint64_t[1]")
for i = 1, 10000000 do
sum[0] = sum[0] + i
end
?
Post by Karel Tuma
Post by lex pops
Post by Florian Weimer
Post by lex pops
Since python uses boxed integers, I expected luajit to meet or beat
python's time.
IIRC, Python caches recently freed boxed integers for immediate
reduce. This is one of the cases where reference counting can be a
huge win because the interpreter immediately notices that the integer
object is unreachable and available for future reuse.
For LuaJIT, the interpreted code is essentially a garbage collector
benchmark.
Interesting. I suspected something related to allocations. Is the lua
code the gc benchmark or rather the allocator benchmark? Running with
collectgarbage('stop') makes it even slower.
Turning off GC actually helps, about 20%. You have to be careful to time
this correctly, don't forget to put os.exit() at the end of your script.
(or ffi.cdef("void exit(int);");ffi.C.exit(0) to be extra sure).
Allocator still amounts for a lot of time, but majority is the path
Samples: 13K of event 'cycles', Event count (approx.): 9939866374
16.93% luajit-ljx luajit-ljx [.] lj_cconv_ct_ct
15.22% luajit-ljx luajit-ljx [.] carith_int64
11.85% luajit-ljx luajit-ljx [.] carith_checkarg
10.41% luajit-ljx luajit-ljx [.] mmcall
8.28% luajit-ljx luajit-ljx [.] lj_alloc_malloc
LuaJIT interpreter handles this case as an ordinary metamethod implemented
in C.
Python optimizes the common case (and has a dedicated boxed int allocator).
Post by lex pops
There is no way to avoid the allocation for the result of the 'sum +
i' expression, I suppose?
Correct, your ffi ctype is boxed, meaning every result allocates.
Post by lex pops
lex
Karel Tuma
2014-08-24 14:29:05 UTC
Permalink
Post by Юрий Соколов
Did you tried
sum = ffi.new("uint64_t[1]")
for i = 1, 10000000 do
sum[0] = sum[0] + i
end
Yes. Makes it even worse (intermediate uint64_t result of dereferencing sum[0],
plus additional two MM invocations to access [0]).
lex pops
2014-08-25 17:45:23 UTC
Permalink
Post by Karel Tuma
Post by lex pops
Post by Florian Weimer
Post by lex pops
Since python uses boxed integers, I expected luajit to meet or beat
python's time.
IIRC, Python caches recently freed boxed integers for immediate
reduce. This is one of the cases where reference counting can be a
huge win because the interpreter immediately notices that the integer
object is unreachable and available for future reuse.
For LuaJIT, the interpreted code is essentially a garbage collector
benchmark.
Interesting. I suspected something related to allocations. Is the lua
code the gc benchmark or rather the allocator benchmark? Running with
collectgarbage('stop') makes it even slower.
Turning off GC actually helps, about 20%. You have to be careful to time
this correctly, don't forget to put os.exit() at the end of your script.
Ah that's true. Why does this make a difference?
Post by Karel Tuma
Allocator still amounts for a lot of time, but majority is the path
Samples: 13K of event 'cycles', Event count (approx.): 9939866374
16.93% luajit-ljx luajit-ljx [.] lj_cconv_ct_ct
15.22% luajit-ljx luajit-ljx [.] carith_int64
11.85% luajit-ljx luajit-ljx [.] carith_checkarg
10.41% luajit-ljx luajit-ljx [.] mmcall
8.28% luajit-ljx luajit-ljx [.] lj_alloc_malloc
LuaJIT interpreter handles this case as an ordinary metamethod implemented
in C.
Very interesting. Thank you for the profile. I looked at the source
for those functions and am starting to get a better idea of whats
going on.

lex

Loading...