Discussion:
LuaJIT Matrix Algebra Function
Beddell, Thomas Edmund
2014-08-07 12:54:52 UTC
Hi,

I made a 3D math library in Lua and am very pleased with the performance in LuaJIT 2.0.3. In my test with LuaJIT 2.0.3 I can do 300,000 4x4 matrix inverse-and-multiply operations in 0.36 seconds.
The same code took 6 seconds in vanilla Lua 5.0. However, according to my co-worker, "equivalent" code takes 0.04 seconds in C++.
I also tried the FFI, both with float[16] arrays for the values and with a struct with metamethods, but both were twice as slow as the Lua table version.

Being within an order of magnitude of C is not bad at all for a scripting language. Is there room for improvement in my code, or should I consider it as good as it can be?

I include the test program below.

Best regards,

Thomas Beddell

-- double 4x4, 1-based, column major
matrix = {}

-- Source for own metamethods
matrix.__index = matrix

setmetatable(matrix, matrix)


-- Create a matrix object. Tested OK
matrix.__call = function(self, ...)

-- Can initialize values from argument
local m = {...}

if #m == 0 then m = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} end

-- Look in matrix for metamethods
setmetatable(m, matrix)

return m

end

-- Set matrix to identity matrix. Tested OK
matrix.identity = function(self)

self = matrix()

for i=1, 16, 5 do

self[i] = 1

end

return self

end

-- Inverse of matrix. Tested OK
matrix.inverse = function(self)

local out = {}

out[1] = self[6] * self[11] * self[16] -
self[6] * self[12] * self[15] -
self[10] * self[7] * self[16] +
self[10] * self[8] * self[15] +
self[14] * self[7] * self[12] -
self[14] * self[8] * self[11]

out[5] = -self[5] * self[11] * self[16] +
self[5] * self[12] * self[15] +
self[9] * self[7] * self[16] -
self[9] * self[8] * self[15] -
self[13] * self[7] * self[12] +
self[13] * self[8] * self[11]

out[9] = self[5] * self[10] * self[16] -
self[5] * self[12] * self[14] -
self[9] * self[6] * self[16] +
self[9] * self[8] * self[14] +
self[13] * self[6] * self[12] -
self[13] * self[8] * self[10]

out[13] = -self[5] * self[10] * self[15] +
self[5] * self[11] * self[14] +
self[9] * self[6] * self[15] -
self[9] * self[7] * self[14] -
self[13] * self[6] * self[11] +
self[13] * self[7] * self[10]

out[2] = -self[2] * self[11] * self[16] +
self[2] * self[12] * self[15] +
self[10] * self[3] * self[16] -
self[10] * self[4] * self[15] -
self[14] * self[3] * self[12] +
self[14] * self[4] * self[11]

out[6] = self[1] * self[11] * self[16] -
self[1] * self[12] * self[15] -
self[9] * self[3] * self[16] +
self[9] * self[4] * self[15] +
self[13] * self[3] * self[12] -
self[13] * self[4] * self[11]

out[10] = -self[1] * self[10] * self[16] +
self[1] * self[12] * self[14] +
self[9] * self[2] * self[16] -
self[9] * self[4] * self[14] -
self[13] * self[2] * self[12] +
self[13] * self[4] * self[10]

out[14] = self[1] * self[10] * self[15] -
self[1] * self[11] * self[14] -
self[9] * self[2] * self[15] +
self[9] * self[3] * self[14] +
self[13] * self[2] * self[11] -
self[13] * self[3] * self[10]

out[3] = self[2] * self[7] * self[16] -
self[2] * self[8] * self[15] -
self[6] * self[3] * self[16] +
self[6] * self[4] * self[15] +
self[14] * self[3] * self[8] -
self[14] * self[4] * self[7]

out[7] = -self[1] * self[7] * self[16] +
self[1] * self[8] * self[15] +
self[5] * self[3] * self[16] -
self[5] * self[4] * self[15] -
self[13] * self[3] * self[8] +
self[13] * self[4] * self[7]

out[11] = self[1] * self[6] * self[16] -
self[1] * self[8] * self[14] -
self[5] * self[2] * self[16] +
self[5] * self[4] * self[14] +
self[13] * self[2] * self[8] -
self[13] * self[4] * self[6]

out[15] = -self[1] * self[6] * self[15] +
self[1] * self[7] * self[14] +
self[5] * self[2] * self[15] -
self[5] * self[3] * self[14] -
self[13] * self[2] * self[7] +
self[13] * self[3] * self[6]

out[4] = -self[2] * self[7] * self[12] +
self[2] * self[8] * self[11] +
self[6] * self[3] * self[12] -
self[6] * self[4] * self[11] -
self[10] * self[3] * self[8] +
self[10] * self[4] * self[7]

out[8] = self[1] * self[7] * self[12] -
self[1] * self[8] * self[11] -
self[5] * self[3] * self[12] +
self[5] * self[4] * self[11] +
self[9] * self[3] * self[8] -
self[9] * self[4] * self[7]

out[12] = -self[1] * self[6] * self[12] +
self[1] * self[8] * self[10] +
self[5] * self[2] * self[12] -
self[5] * self[4] * self[10] -
self[9] * self[2] * self[8] +
self[9] * self[4] * self[6]

out[16] = self[1] * self[6] * self[11] -
self[1] * self[7] * self[10] -
self[5] * self[2] * self[11] +
self[5] * self[3] * self[10] +
self[9] * self[2] * self[7] -
self[9] * self[3] * self[6]

local det = self[1] * out[1] + self[2] * out[5] + self[3] * out[9] + self[4] * out[13]

if det == 0 then return self end

det = 1.0 / det

for i = 1, 16 do

out[i] = out[i] * det

end

return matrix(unpack(out))

end

-- Multiply matrix by a matrix. Tested OK
matrix.__mul = function(self, m)

local out = matrix()

for i=0, 12, 4 do

for j=1, 4 do

out[i+j] = m[j] * self[i+1] + m[j+4] * self[i+2] + m[j+8] * self[i+3] + m[j+12] * self[i+4]

end

end

return out

end


-- Test
local t = os.clock()
local mOut
local m = matrix():identity()

for i=1, 300000 do

mOut = m * m:inverse()

end

local time = os.clock() - t

print(time)
Javier Guerra Giraldez
2014-08-07 13:15:16 UTC
On Thu, Aug 7, 2014 at 7:54 AM, Beddell, Thomas Edmund
Post by Beddell, Thomas Edmund
Is there room for improvement in my code or should I consider it as good as can be?
where does the profiler tell you it's spending time?

at first sight, i'm guessing the allocate, resize, reallocate, discard
flow at matrix.inverse()

but as soon as it's working, you shouldn't touch a line until you profile it.
--
Javier
Beddell, Thomas Edmund
2014-08-07 15:41:52 UTC
Thanks for the reply.

"where does the profiler tells you it's spending time?"

I am compiling from visual studio 2010 using a custom build tool to call luajit.exe v 2,03 and running the compiled bytecode from C++.
I could find no information about built-in profiling for LuaJIT 2.03 but have now compiled LuaJIT 2.1 which seems to have a built-in profiler.
I will post back when I get some data out of the profiler. If I misunderstood you and there is a way of profiling in 2.03 then please can you tell
Alex
2014-08-07 15:46:56 UTC
The profiler is only in the 2.1 branch.
--
Sincerely,
Alex Parrill
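
For reference, a minimal sketch of driving the 2.1 profiler, assuming the jit.p module that ships with the 2.1 branch (the script name is a placeholder):

-- from the command line:
--   luajit -jp=f matrix_bench.lua      (the "f" mode aggregates samples by function)
-- or programmatically around the hot loop:
local profile = require("jit.p")
profile.start("f")
-- ... run the 300,000-iteration benchmark loop here ...
profile.stop()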
Beddell, Thomas Edmund
2014-08-07 15:49:47 UTC
Thanks, I am now using the 2.1 branch. Hopefully I can get it to spit out some timing info.

Javier Guerra Giraldez
2014-08-07 16:43:09 UTC
On Thu, Aug 7, 2014 at 10:49 AM, Beddell, Thomas Edmund
Post by Beddell, Thomas Edmund
Thanks, I am now using the 2.1 branch. Hopefully I can get it to spit out some timing info.
i guess you might also get a significant extra boost. i think there were some improvements to the allocation sink optimization. since you're creating lots of small short-lived tables it can make a difference.
--
Javier
Beddell, Thomas Edmund
2014-08-07 17:26:22 UTC
Some improvement, it now takes 3.2 secs
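
A complementary way to cut the short-lived-table churn Javier mentions, rather than relying on the sink optimization, is to reuse a caller-supplied destination in the hot loop. A minimal sketch against the 1-based matrix module from the first post (mul_into is an illustrative name, not from the thread):

-- same computation as matrix.__mul(a, b), but written into dst instead of a fresh table
matrix.mul_into = function(dst, a, b)
    for i = 0, 12, 4 do
        for j = 1, 4 do
            dst[i+j] = b[j] * a[i+1] + b[j+4] * a[i+2] + b[j+8] * a[i+3] + b[j+12] * a[i+4]
        end
    end
    return dst
end

local tmp = matrix()
local m = matrix():identity()
local mOut
for i = 1, 300000 do
    -- m:inverse() still allocates a fresh matrix; the same trick could be applied there too
    mOut = matrix.mul_into(tmp, m, m:inverse())
end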
Aleksandar Kordic
2014-08-07 17:52:22 UTC
Here is an FFI version:
https://gist.github.com/AlexKordic/99113326daeeb33b584a

I am wondering what "unsupported C type conversion" means in this case?

Output:

luajit.exe -jv c:\benchmark\matrix_bench_ffi.lua
LuaJIT 2.0.3 -- Copyright (C) 2005-2014 Mike Pall. http://luajit.org/
JIT: ON CMOV SSE2 SSE3 SSE4.1 fold cse dce fwd dse narrow loop abc sink fuse

[TRACE 1 matrix_bench_ffi.lua:152 loop]
[TRACE --- matrix_bench_ffi.lua:38 -- leaving loop in root trace at
matrix_bench_ffi.lua:37]
[TRACE 2 matrix_bench_ffi.lua:38 loop]
[TRACE 3 (2/3) matrix_bench_ffi.lua:37 -> 2]
[TRACE --- (1/3) matrix_bench_ffi.lua:153 -- NYI: unsupported C type
conversion at matrix_bench_ffi.lua:35]
[TRACE --- (1/3) matrix_bench_ffi.lua:153 -- NYI: unsupported C type
conversion at matrix_bench_ffi.lua:35]
[TRACE --- (1/3) matrix_bench_ffi.lua:153 -- NYI: unsupported C type
conversion at matrix_bench_ffi.lua:35]
[TRACE --- (1/3) matrix_bench_ffi.lua:153 -- NYI: unsupported C type
conversion at matrix_bench_ffi.lua:35]
[TRACE 4 (1/3) matrix_bench_ffi.lua:153 -- fallback to interpreter]
[TRACE --- (3/1) matrix_bench_ffi.lua:42 -- NYI: unsupported C type
conversion at matrix_bench_ffi.lua:50]
[TRACE --- (3/1) matrix_bench_ffi.lua:42 -- NYI: unsupported C type
conversion at matrix_bench_ffi.lua:50]
[TRACE --- (3/1) matrix_bench_ffi.lua:42 -- NYI: unsupported C type
conversion at matrix_bench_ffi.lua:50]
[TRACE --- (3/1) matrix_bench_ffi.lua:42 -- NYI: unsupported C type
conversion at matrix_bench_ffi.lua:50]
[TRACE 5 (3/1) matrix_bench_ffi.lua:42 -- fallback to interpreter]
[TRACE --- matrix_bench_ffi.lua:24 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:166 -- NYI: unsupported C type conversion
at matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:24 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:166 -- NYI: unsupported C type conversion
at matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:34 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:35]
[TRACE --- matrix_bench_ffi.lua:49 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:24 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:166 -- NYI: unsupported C type conversion
at matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:34 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:35]
[TRACE --- matrix_bench_ffi.lua:49 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:24 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:34 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:35]
[TRACE --- matrix_bench_ffi.lua:166 -- NYI: unsupported C type conversion
at matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:49 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:34 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:35]
[TRACE --- matrix_bench_ffi.lua:24 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:35]
[TRACE --- matrix_bench_ffi.lua:49 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:166 -- NYI: unsupported C type conversion
at matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:34 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:35]
[TRACE --- matrix_bench_ffi.lua:24 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:35]
[TRACE --- matrix_bench_ffi.lua:49 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:166 -- NYI: unsupported C type conversion
at matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:34 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:35]
[TRACE --- matrix_bench_ffi.lua:24 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:35]
[TRACE --- matrix_bench_ffi.lua:49 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:166 -- NYI: unsupported C type conversion
at matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:34 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:35]
[TRACE --- matrix_bench_ffi.lua:24 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:49 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:166 -- NYI: unsupported C type conversion
at matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:34 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:35]
[TRACE --- matrix_bench_ffi.lua:24 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:49 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:166 -- NYI: unsupported C type conversion
at matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:34 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:35]
[TRACE --- matrix_bench_ffi.lua:24 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:35]
[TRACE --- matrix_bench_ffi.lua:49 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:166 -- NYI: unsupported C type conversion
at matrix_bench_ffi.lua:50]
[TRACE --- matrix_bench_ffi.lua:34 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:35]
[TRACE --- matrix_bench_ffi.lua:24 -- NYI: unsupported C type conversion at
matrix_bench_ffi.lua:35]
[TRACE --- matrix_bench_ffi.lua:49 -- blacklisted at
matrix_bench_ffi.lua:24]
[TRACE --- matrix_bench_ffi.lua:166 -- blacklisted at
matrix_bench_ffi.lua:24]
[TRACE --- matrix_bench_ffi.lua:34 -- blacklisted at
matrix_bench_ffi.lua:24]
[TRACE --- matrix_bench_ffi.lua:49 -- blacklisted at
matrix_bench_ffi.lua:24]
time 8.625 iterations per sec: 34782.608695652
Beddell, Thomas Edmund
2014-08-07 19:25:50 UTC
That is 20 times slower than using Lua tables! I tried it this way before, with similar results.
The NYI ("not yet implemented") C type conversion seems to be the cause of the problem. Blacklisting doesn't sound good either ;)

Thanks for the hint!
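
For what it's worth, the trace-abort information can also be captured from inside the program, assuming the jit.v module bundled with LuaJIT (the programmatic equivalent of the -jv flag used above):

local verbose = require("jit.v")
verbose.on("trace.log")   -- or verbose.on() to print to the console
-- ... run the benchmark ...
verbose.off()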

Alex
2014-08-07 19:48:45 UTC
Here's a fully compilable FFI version, based off Kordic's:

https://gist.github.com/ColonelThirtyTwo/56a06902c49f57edd70d

It runs much faster. Primarily, large allocations (>128 bytes) aren't JIT-compiled [1]. Calls to malloc are, though, so this code allocates the matrices with that. Additionally, caching the `entries` field in a local actually hurts performance [2].

[1] http://luajit.org/ext_ffi_semantics.html in the Current Status section
[2] http://wiki.luajit.org/Numerical-Computing-Performance-Guide, confirmed
with my tests.
--
Sincerely,
Alex Parrill
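
A minimal sketch of the malloc-backed allocation idea described above (this is not Alex's actual gist; the declarations and helper names here are illustrative):

local ffi = require("ffi")

-- declare the C runtime allocator so ffi.C can resolve it
ffi.cdef[[
void *malloc(size_t size);
void free(void *ptr);
]]

local SIZE = 16 * ffi.sizeof("double")
local matp_t = ffi.typeof("double *")

-- allocate one 4x4 matrix on the C heap as a double*, with free() attached as a GC finalizer
local function new_matrix()
    local p = ffi.cast(matp_t, ffi.C.malloc(SIZE))
    ffi.gc(p, ffi.C.free)
    ffi.fill(p, SIZE)          -- malloc'd memory is uninitialized; zero it
    return p
end

local m = new_matrix()
for i = 0, 15, 5 do m[i] = 1 end   -- identity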
Beddell, Thomas Edmund
2014-08-07 20:11:17 UTC
Hi Alex,

I tried it and it says “cannot resolve symbol malloc. The specified procedure cannot be found”. Do I have to declare it on the C side?

Best regards,

Thomas Beddell
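
A possible workaround, assuming the problem is only that the default ffi.C namespace in the embedding host doesn't expose malloc: resolve it from the Windows C runtime explicitly. An untested sketch (the library name depends on which CRT the host links against):

local ffi = require("ffi")

ffi.cdef[[
void *malloc(size_t size);
void free(void *ptr);
]]

-- look the allocator up in msvcrt.dll instead of the default symbol namespace
local C = ffi.load("msvcrt")
local m = ffi.gc(ffi.cast("double *", C.malloc(16 * ffi.sizeof("double"))), C.free)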

Aleksandar Kordic
2014-08-07 20:17:49 UTC
Great pointers, thanks.

Do you know what conversion is present in my code? Or does "unsupported C type conversion" refer to the large-allocation NYI?
Evan Wies
2014-08-07 20:17:46 UTC
I don't think it is the large allocation... that struct is of size 128. I think the "NYI unsupported C type" is due to the fact that it is an array within a struct?

[I've seen similar things with small char[] arrays within structs and then either initializing or copying them, which I've talked about here:
http://www.freelists.org/post/luajit/Luajit-struct-vs-table-performance,2
"The gotcha with this is that initialization of aggregates is NYI. To get around this, I create one "seed" object, then construct new aggregates by copying the seed (which is JIT-compiled)."]

This gist removes the struct wrapping (and hence the metatable niceness, since you can't metatype primitives or arrays), leaving you with just the double[16] and some functions. But performance on my system with LuaJIT 2.1 goes from 4.8 to 0.05 seconds (61,789/sec to 5,442,571/sec).

https://gist.github.com/neomantra/0e0d57287f8ed9953353/e1fe5a0a311daf6c28b2af6e40ae537f4799736a

I think Alex's malloc approach achieves a similar goal (allowing full compilation). The difference is that my gist uses an object whose initialization and copy LuaJIT can compile, whereas Alex's uses a struct and manages the copy/initialization himself. The "let LuaJIT do it" way is slightly faster on my system (Alex's is 4,345,118/sec for me).

Regards,
Evan
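
A minimal sketch of the same plain-double[16] approach, for readers who don't open the gist (this is not Evan's gist; the function names are illustrative and the multiply uses the same column-major indexing convention as the code earlier in the thread):

local ffi = require("ffi")

local mat4_t = ffi.typeof("double[16]")   -- plain cdata array: no struct wrapper, no metatype

local function mat4_identity()
    local m = mat4_t()                    -- zero-filled on creation
    for i = 0, 15, 5 do m[i] = 1 end
    return m
end

-- dst, a, b are double[16]; same operand order as matrix.__mul in the first post
-- (dst must not alias a or b)
local function mat4_mul(dst, a, b)
    for i = 0, 12, 4 do
        for j = 0, 3 do
            dst[i+j] = b[j]*a[i] + b[j+4]*a[i+1] + b[j+8]*a[i+2] + b[j+12]*a[i+3]
        end
    end
    return dst
end

local m, out = mat4_identity(), mat4_t()
mat4_mul(out, m, m)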
Beddell, Thomas Edmund
2014-08-07 20:47:30 UTC
Hi Evan,

I got 0.09 secs. That is 3 times faster than my Lua table version and half as fast as C++!
It is a shame to lose the metamethods, though. I did read somewhere that metatype works with vectors. Does that mean std::vector?

I haven't yet got Alex's version to run. I added malloc.h but it's still not working. Probably due to my project configuration.

Best regards,

Thomas Beddell

Evan Wies
2014-08-07 21:02:59 UTC
I think vector in this context is referring to "vector types, declared with the GCC mode or vector_size attribute", not anything to do with the STL.

https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html

-Evan
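
As an illustration, a sketch of what such a declaration could look like through the FFI; this assumes the parser's documented support for the vector_size attribute and for attaching metatypes to vector types, and is untested here:

local ffi = require("ffi")

-- a GCC-style vector of four floats; this is what the docs mean by "vector", not std::vector
ffi.cdef[[
typedef float float4 __attribute__((vector_size(16)));
]]

-- ffi.metatype accepts vector types as well as structs
local float4 = ffi.metatype("float4", { __index = {} })
local v = float4()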
Beddell, Thomas Edmund
2014-08-08 08:36:02 UTC
“I think vector in this context is referring to "vector types, declared with the GCC mode or vector_size attribute", not anything to do with the STL.”

It looks like this has been added to Visual Studio 2013.

http://blogs.msdn.com/b/vcblog/archive/2013/07/12/introducing-vector-calling-convention.aspx
Evan Wies
2014-08-07 21:18:56 UTC
Oh, on the metatype tip, you could also put the double[16] cdata itself inside a Lua table (perhaps as the sole element at t[1]) and then assign that table a metatable. Something like:

local ffi = require("ffi")
local v_t = ffi.typeof("double[16]")

local function _new_matrix()
    return setmetatable( {v_t()}, MatrixMetatable )
end
....
function MatrixMetatable.__index.identity( self )
    local m = self[1]
    for i = 0, 15, 5 do m[i] = 1 end
    return self
end

I bet after JITing, the performance wouldn't be that much worse....

-Evan
Beddell, Thomas Edmund
2014-08-08 08:38:33 UTC
“Oh, on the metatype tip, you could also put the double[16] cdata itself inside a Lua table”

Thanks, I’ll try it.

Beddell, Thomas Edmund
2014-08-08 11:14:31 UTC
Hi Evan,

Your FFI array of doubles took 0.09 secs.
I added your trick of making local copies of variables, and that took the Lua table version down to 0.17 secs.
The hybrid approach, with Lua tables containing an array of doubles, was exactly half way at 0.13 secs.

C++ “equivalent” program: 0.043 secs
C++ “equivalent” program using DirectX functions (SSE2): 0.017 secs

Needless to say, my co-worker was “Impressed”!

Thanks for your great input. I include the hybrid code below.

Best regards,

Thomas Beddell


local ffi = require("ffi")

local mtype = ffi.typeof("double[16]")

-- double 4x4, 1-based, column major
matrix = {}

-- Source for own metamethods
matrix.__index = matrix

setmetatable(matrix, matrix)

-- Create a matrix object. Tested OK
matrix.__call = function(self, ...)

-- Look in matrix for metamethods
return setmetatable({mtype(...)}, matrix)

end

-- Copy of matrix. Must be a method on matrix so that self:copy() below works
matrix.copy = function(self)

local mout = matrix()
local out = mout[1]
local m = self[1]

for i=0, 15 do

out[i] = m[i]

end

return mout

end

-- Set matrix to identity matrix. Tested OK
matrix.identity = function(self)

self = matrix()
local m = self[1]

for i=0, 15, 5 do

m[i] = 1

end

return self

end

-- Inverse of matrix. Tested OK
matrix.inverse = function(self)

local m = self[1]
local outm = matrix()
local out = outm[1]

out[0] = m[5] * m[10] * m[15] -
m[5] * m[11] * m[14] -
m[9] * m[6] * m[15] +
m[9] * m[7] * m[14] +
m[13] * m[6] * m[11] -
m[13] * m[7] * m[10]

out[4] = -m[4] * m[10] * m[15] +
m[4] * m[11] * m[14] +
m[8] * m[6] * m[15] -
m[8] * m[7] * m[14] -
m[12] * m[6] * m[11] +
m[12] * m[7] * m[10]

out[8] = m[4] * m[9] * m[15] -
m[4] * m[11] * m[13] -
m[8] * m[5] * m[15] +
m[8] * m[7] * m[13] +
m[12] * m[5] * m[11] -
m[12] * m[7] * m[9]

out[12] = -m[4] * m[9] * m[14] +
m[4] * m[10] * m[13] +
m[8] * m[5] * m[14] -
m[8] * m[6] * m[13] -
m[12] * m[5] * m[10] +
m[12] * m[6] * m[9]

out[1] = -m[1] * m[10] * m[15] +
m[1] * m[11] * m[14] +
m[9] * m[2] * m[15] -
m[9] * m[3] * m[14] -
m[13] * m[2] * m[11] +
m[13] * m[3] * m[10]

out[5] = m[0] * m[10] * m[15] -
m[0] * m[11] * m[14] -
m[8] * m[2] * m[15] +
m[8] * m[3] * m[14] +
m[12] * m[2] * m[11] -
m[12] * m[3] * m[10]

out[9] = -m[0] * m[9] * m[15] +
m[0] * m[11] * m[13] +
m[8] * m[1] * m[15] -
m[8] * m[3] * m[13] -
m[12] * m[1] * m[11] +
m[12] * m[3] * m[9]

out[13] = m[0] * m[9] * m[14] -
m[0] * m[10] * m[13] -
m[8] * m[1] * m[14] +
m[8] * m[2] * m[13] +
m[12] * m[1] * m[10] -
m[12] * m[2] * m[9]

out[2] = m[1] * m[6] * m[15] -
m[1] * m[7] * m[14] -
m[5] * m[2] * m[15] +
m[5] * m[3] * m[14] +
m[13] * m[2] * m[7] -
m[13] * m[3] * m[6]

out[6] = -m[0] * m[6] * m[15] +
m[0] * m[7] * m[14] +
m[4] * m[2] * m[15] -
m[4] * m[3] * m[14] -
m[12] * m[2] * m[7] +
m[12] * m[3] * m[6]

out[10] = m[0] * m[5] * m[15] -
m[0] * m[7] * m[13] -
m[4] * m[1] * m[15] +
m[4] * m[3] * m[13] +
m[12] * m[1] * m[7] -
m[12] * m[3] * m[5]

out[14] = -m[0] * m[5] * m[14] +
m[0] * m[6] * m[13] +
m[4] * m[1] * m[14] -
m[4] * m[2] * m[13] -
m[12] * m[1] * m[6] +
m[12] * m[2] * m[5]

out[3] = -m[1] * m[6] * m[11] +
m[1] * m[7] * m[10] +
m[5] * m[2] * m[11] -
m[5] * m[3] * m[10] -
m[9] * m[2] * m[7] +
m[9] * m[3] * m[6]

out[7] = m[0] * m[6] * m[11] -
m[0] * m[7] * m[10] -
m[4] * m[2] * m[11] +
m[4] * m[3] * m[10] +
m[8] * m[2] * m[7] -
m[8] * m[3] * m[6]

out[11] = -m[0] * m[5] * m[11] +
m[0] * m[7] * m[9] +
m[4] * m[1] * m[11] -
m[4] * m[3] * m[9] -
m[8] * m[1] * m[7] +
m[8] * m[3] * m[5]

out[15] = m[0] * m[5] * m[10] -
m[0] * m[6] * m[9] -
m[4] * m[1] * m[10] +
m[4] * m[2] * m[9] +
m[8] * m[1] * m[6] -
m[8] * m[2] * m[5]

local det = m[0] * out[0] + m[1] * out[4] + m[2] * out[8] + m[3] * out[12]

if det == 0 then return self:copy() end

det = 1.0 / det

for i = 0, 15 do

out[i] = out[i] * det

end

return outm

end

-- Multiply matrix by a matrix. Tested OK
matrix.__mul = function(self, m)

local outm = matrix()
local out = outm[1]
local me, se = m[1], self[1]

for i=0, 12, 4 do

for j=0, 3 do

out[i+j] = me[j] * se[i] + me[j+4] * se[i+1] + me[j+8] * se[i+2] + me[j+12] * se[i+3]

end

end

return outm

end

-- Test
local mOut
local m = matrix():identity()
local ITERATIONS = 300000

local t = os.clock()
for i=1, ITERATIONS do

mOut = m * m:inverse()

end

local time = os.clock() - t
print("time", time)
Eliot Eikenberry
2014-09-08 01:34:52 UTC
I just got my Oculus Rift DK2 and have been working on setting everything up to run in Lua, moving a bunch of my old games into Lua to use with it, and had a reason to do the same with matrix (and vector, quaternion, etc.).

On my machine, the code in Evan's gist runs in 0.058 on average.
Your hybrid code above runs at about 0.088.
The code at this gist runs at about 0.038:
https://gist.github.com/Wolftousen/99a0c6835a1130b96965

Unfortunately, since I'm using OpenGL and double-type matrices aren't supported unless you are running version 4, I have to use floats instead of doubles, which slows it back down to about 0.059. But regardless, it's awesome that we can get stuff so close to C speeds.

Josh Simmons
2014-09-08 02:02:46 UTC
I didn't look too deeply, so I might have missed something, but I'm not sure why you don't change your matrix to

struct mat4 { double data[16]; };

to avoid the extra allocations and indirection.
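
For concreteness, a sketch of what that might look like with ffi.metatype (illustrative only, not from the thread; note Evan's earlier caveat that initializing the array inside a struct can hit an NYI path, so the constructor here relies on the default zero fill):

local ffi = require("ffi")

ffi.cdef[[
typedef struct { double data[16]; } mat4;
]]

local mat4                          -- forward declaration so methods can construct new mat4s
local mt = {
    __index = {
        identity = function(self)
            for i = 0, 15, 5 do self.data[i] = 1 end
            return self
        end,
    },
    __mul = function(a, b)
        local out = mat4()
        -- same column-major indexing convention as earlier in the thread
        for i = 0, 12, 4 do
            for j = 0, 3 do
                out.data[i+j] = b.data[j]*a.data[i] + b.data[j+4]*a.data[i+1]
                              + b.data[j+8]*a.data[i+2] + b.data[j+12]*a.data[i+3]
            end
        end
        return out
    end,
}
mat4 = ffi.metatype("mat4", mt)

local m = mat4():identity()
local p = m * m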
Eliot Eikenberry
2014-09-08 02:14:34 UTC
There are pointer swaps going on to avoid doing array copies.
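
As a rough illustration of that idea (not taken from the gist): keep two preallocated buffers and swap which one is "current" after each operation, instead of copying sixteen doubles or allocating a new array.

local ffi = require("ffi")

local buf_a = ffi.new("double[16]")
local buf_b = ffi.new("double[16]")
local cur, scratch = buf_a, buf_b

-- op is any function with the signature op(dst, src, ...) that writes its result into dst
local function apply(op, ...)
    op(scratch, cur, ...)
    cur, scratch = scratch, cur   -- the result buffer becomes the current matrix
    return cur
end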