int64 sum, python vs luajit interpreter

Discussion:

lex pops

2014-08-22 20:37:58 UTC

Hello,

Why does the luajit 'interpreter' run slower than the python
interpreter in this example?

==== sum.lua ====
local ffi = require('ffi')

local function sum_int()
local sum = 0LL
for i=1,10000000 do
sum = sum + i
end
return sum
end

print(sum_int())

==== sum.py ====
def f():
sum = 0
for x in xrange(10000001):
sum += x
return sum

print(f())
====

$ time ./luajit -joff /tmp/sum.lua
50000005000000LL

real 0m0.923s
user 0m0.920s

$ time python /tmp/sum.py
50000005000000

real 0m0.354s
user 0m0.350s

====

Since python uses boxed integers, I expected luajit to meet or beat
python's time. Turning on the jit or using numbers instead of int64s
makes luajit faster by a wide margin.

lex

Florian Weimer

2014-08-22 20:45:40 UTC

Permalink

Post by lex pops
Since python uses boxed integers, I expected luajit to meet or beat
python's time.

IIRC, Python caches recently freed boxed integers for immediate
reduce. This is one of the cases where reference counting can be a
huge win because the interpreter immediately notices that the integer
object is unreachable and available for future reuse.

For LuaJIT, the interpreted code is essentially a garbage collector
benchmark.

Tudor Bosman

2014-08-22 20:51:38 UTC

Permalink

Also, ffi int64_t are boxed as well.

-Tudor.

Post by Florian Weimer

Post by lex pops
Since python uses boxed integers, I expected luajit to meet or beat
python's time.

IIRC, Python caches recently freed boxed integers for immediate
reduce. This is one of the cases where reference counting can be a
huge win because the interpreter immediately notices that the integer
object is unreachable and available for future reuse.
For LuaJIT, the interpreted code is essentially a garbage collector
benchmark.

lex pops

2014-08-22 21:54:26 UTC

Permalink

Post by Florian Weimer

Post by lex pops
Since python uses boxed integers, I expected luajit to meet or beat
python's time.

IIRC, Python caches recently freed boxed integers for immediate
reduce. This is one of the cases where reference counting can be a
huge win because the interpreter immediately notices that the integer
object is unreachable and available for future reuse.
For LuaJIT, the interpreted code is essentially a garbage collector
benchmark.

Interesting. I suspected something related to allocations. Is the lua
code the gc benchmark or rather the allocator benchmark? Running with
collectgarbage('stop') makes it even slower.

There is no way to avoid the allocation for the result of the 'sum +
i' expression, I suppose?

lex

Karel Tuma

2014-08-23 15:02:41 UTC

Permalink

Post by lex pops

Post by Florian Weimer

Post by lex pops
Since python uses boxed integers, I expected luajit to meet or beat
python's time.

IIRC, Python caches recently freed boxed integers for immediate
reduce. This is one of the cases where reference counting can be a
huge win because the interpreter immediately notices that the integer
object is unreachable and available for future reuse.
For LuaJIT, the interpreted code is essentially a garbage collector
benchmark.

Interesting. I suspected something related to allocations. Is the lua
code the gc benchmark or rather the allocator benchmark? Running with
collectgarbage('stop') makes it even slower.

Turning off GC actually helps, about 20%. You have to be careful to time
this correctly, don't forget to put os.exit() at the end of your script.

(or ffi.cdef("void exit(int);");ffi.C.exit(0) to be extra sure).

Allocator still amounts for a lot of time, but majority is the path
full of hardships taken:

Samples: 13K of event 'cycles', Event count (approx.): 9939866374
16.93% luajit-ljx luajit-ljx [.] lj_cconv_ct_ct
15.22% luajit-ljx luajit-ljx [.] carith_int64
11.85% luajit-ljx luajit-ljx [.] carith_checkarg
10.41% luajit-ljx luajit-ljx [.] mmcall
8.28% luajit-ljx luajit-ljx [.] lj_alloc_malloc

LuaJIT interpreter handles this case as an ordinary metamethod implemented
in C.

Python optimizes the common case (and has a dedicated boxed int allocator).

Post by lex pops
There is no way to avoid the allocation for the result of the 'sum +
i' expression, I suppose?

Correct, your ffi ctype is boxed, meaning every result allocates.

Post by lex pops
lex

Юрий Соколов

2014-08-24 14:13:13 UTC

Permalink

Did you tried

sum = ffi.new("uint64_t[1]")
for i = 1, 10000000 do
sum[0] = sum[0] + i
end
?

Post by Karel Tuma

Post by lex pops

Post by Florian Weimer

Post by lex pops
Since python uses boxed integers, I expected luajit to meet or beat
python's time.

IIRC, Python caches recently freed boxed integers for immediate
reduce. This is one of the cases where reference counting can be a
huge win because the interpreter immediately notices that the integer
object is unreachable and available for future reuse.
For LuaJIT, the interpreted code is essentially a garbage collector
benchmark.

Interesting. I suspected something related to allocations. Is the lua
code the gc benchmark or rather the allocator benchmark? Running with
collectgarbage('stop') makes it even slower.

Turning off GC actually helps, about 20%. You have to be careful to time
this correctly, don't forget to put os.exit() at the end of your script.
(or ffi.cdef("void exit(int);");ffi.C.exit(0) to be extra sure).
Allocator still amounts for a lot of time, but majority is the path
Samples: 13K of event 'cycles', Event count (approx.): 9939866374
16.93% luajit-ljx luajit-ljx [.] lj_cconv_ct_ct
15.22% luajit-ljx luajit-ljx [.] carith_int64
11.85% luajit-ljx luajit-ljx [.] carith_checkarg
10.41% luajit-ljx luajit-ljx [.] mmcall
8.28% luajit-ljx luajit-ljx [.] lj_alloc_malloc
LuaJIT interpreter handles this case as an ordinary metamethod implemented
in C.
Python optimizes the common case (and has a dedicated boxed int allocator).

Post by lex pops
There is no way to avoid the allocation for the result of the 'sum +
i' expression, I suppose?

Correct, your ffi ctype is boxed, meaning every result allocates.

Post by lex pops
lex

Karel Tuma

2014-08-24 14:29:05 UTC

Permalink

Post by Ð®ÑÐ¸Ð¹ Ð¡Ð¾ÐºÐ¾Ð»Ð¾Ð²
Did you tried
sum = ffi.new("uint64_t[1]")
for i = 1, 10000000 do
sum[0] = sum[0] + i
end

Yes. Makes it even worse (intermediate uint64_t result of dereferencing sum[0],
plus additional two MM invocations to access [0]).

lex pops

2014-08-25 17:45:23 UTC

Permalink

Post by Karel Tuma

Post by lex pops

Post by Florian Weimer

Post by lex pops
Since python uses boxed integers, I expected luajit to meet or beat
python's time.

IIRC, Python caches recently freed boxed integers for immediate
reduce. This is one of the cases where reference counting can be a
huge win because the interpreter immediately notices that the integer
object is unreachable and available for future reuse.
For LuaJIT, the interpreted code is essentially a garbage collector
benchmark.

Interesting. I suspected something related to allocations. Is the lua
code the gc benchmark or rather the allocator benchmark? Running with
collectgarbage('stop') makes it even slower.

Turning off GC actually helps, about 20%. You have to be careful to time
this correctly, don't forget to put os.exit() at the end of your script.

Ah that's true. Why does this make a difference?

Post by Karel Tuma
Allocator still amounts for a lot of time, but majority is the path
Samples: 13K of event 'cycles', Event count (approx.): 9939866374
16.93% luajit-ljx luajit-ljx [.] lj_cconv_ct_ct
15.22% luajit-ljx luajit-ljx [.] carith_int64
11.85% luajit-ljx luajit-ljx [.] carith_checkarg
10.41% luajit-ljx luajit-ljx [.] mmcall
8.28% luajit-ljx luajit-ljx [.] lj_alloc_malloc
LuaJIT interpreter handles this case as an ordinary metamethod implemented
in C.

Very interesting. Thank you for the profile. I looked at the source
for those functions and am starting to get a better idea of whats
going on.

lex