Discussion:
x64 outlining of tasks
soumith
2014-08-08 14:15:50 UTC
Permalink
Mike,

I understand that you don't want to port 2.1 to use more than a 31-bit address
space because of the garbage collector, but there are several valid use cases
that don't need a lot of Lua objects, but where the memory itself is
allocated outside the first 2GB, especially when LuaJIT is intertwined with
other applications.

I think several of us in the community are willing to work on it together
to at least fix the "lower 2GB only" issue, but it would be good to get an
outline of the task involved and a nice breakdown with the files/locations
that we have to watch out for.

If you could give us such an outline, I could start a GitHub branch that we
can work on to fix this.

Thanks,
Soumith
Karel Tuma
2014-08-15 18:24:49 UTC
Permalink
Hi,
Post by soumith
I understand that you don't want to port 2.1 to use more than a 31-bit address
space because of the garbage collector
Just a heads up - modifying LuaJIT to support a 4GB heap turns out to be not particularly
difficult, albeit a bit hackish [1]:

1. Make GC accounting 64-bit, as 4GB heap sizes won't fit in signed 32-bit integers.

2. Modify the allocator. On Linux, MAP_32BIT is useless as it actually
provides only 1GB of heap (0x40000000-0x80000000).

Getting a good flat 32-bit address space allocator on Linux is non-trivial, as the kernel
is too smart for our own good. I went with using the system malloc() and convincing
glibc to never use mmap; the brk() heap lives in the lower 32 bits. I'd welcome it if someone knows of
a better (but simple) approach.
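
As a rough illustration of the two steps above (not the code from [1]), here is a minimal sketch assuming glibc on 64-bit Linux. The helper names fits_int32, below_4gb and prefer_brk_heap are hypothetical. mallopt(M_MMAP_MAX, 0) asks glibc not to serve large requests via mmap, but glibc may still fall back to mmap if brk() fails, and whether the brk() heap really sits below 4GB depends on how the executable is linked and loaded (position-independent executables, for example, are typically loaded much higher).

```c
#include <malloc.h>   /* mallopt, M_MMAP_MAX (glibc-specific) */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Step 1: a 4GB total does not fit a signed 32-bit counter,
 * so heap accounting needs a 64-bit type. */
static int fits_int32(uint64_t bytes) {
  return bytes <= (uint64_t)INT32_MAX;
}

/* Returns 1 if the object [p, p+len) lies entirely below the 4GB boundary. */
static int below_4gb(const void *p, size_t len) {
  const uintptr_t limit = (uintptr_t)1 << 32;
  uintptr_t a = (uintptr_t)p;
  return a < limit && len <= limit - a;
}

/* Step 2: ask glibc malloc to stop using mmap for large blocks, so that
 * requests are served from the brk() heap. Returns 1 on success. */
static int prefer_brk_heap(void) {
  return mallopt(M_MMAP_MAX, 0) == 1;
}
```

An allocator hook built on this would call prefer_brk_heap() once at startup and then reject (or retry) any block for which below_4gb() returns 0.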
Post by soumith
but there are several valid use cases
that don't need a lot of Lua objects, but where the memory itself is
allocated outside the first 2GB. Especially when LuaJIT is intertwined with
other applications.
If 3rd party code polluting the lower 32-bit address space is part of your problem,
perhaps you might want to investigate various tricks to avoid that: either
linker options to have _bss/brk() above 32-bit, or some brute-force approach, like
mapping a page in front of sbrk(0) and calling malloc (virtually all
implementations fall back to mmap in that case).
Post by soumith
I think several of us in the community are willing to work on it together
to at least fix the "lower 2GB only" issue,
If you want an answer for 4GB and beyond, there are a couple of approaches; most
are rather complex.

One of the simpler ones would be to introduce a level of indirection in
GCtab and GCstr.

This would incur runtime overhead, but as long as it's configurable...
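
One possible reading of the indirection idea, sketched for a string-like object: a small header stays with the GC (in low memory), while the payload is reached through a full 64-bit pointer and may live anywhere. This is an illustration only, not LuaJIT's actual GCstr layout. IndirectStr and the istr_* functions are hypothetical names, and plain malloc() stands in for both allocators.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header: small, fixed size, kept where the GC can address it. */
typedef struct {
  uint32_t len;
  char *data;   /* full-width pointer: payload is not tied to low memory */
} IndirectStr;

static IndirectStr *istr_new(const char *src, uint32_t len) {
  assert(src != NULL || len == 0);
  IndirectStr *s = malloc(sizeof *s);       /* stand-in: low-memory GC allocator */
  if (!s) return NULL;
  s->data = malloc((size_t)len + 1);        /* stand-in: unconstrained allocator */
  if (!s->data) { free(s); return NULL; }
  if (len) memcpy(s->data, src, len);
  s->data[len] = '\0';
  s->len = len;
  return s;
}

/* Every payload access pays one extra pointer load: the runtime overhead. */
static const char *istr_data(const IndirectStr *s) {
  return s->data;
}

static void istr_free(IndirectStr *s) {
  if (s) { free(s->data); free(s); }
}
```

The extra load in istr_data() is the cost mentioned above, which is why making the scheme a build-time option seems attractive.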
Post by soumith
but it would be good to get an
outline of the task involved and a nice breakdown with the files/locations
that we have to watch out for.
Why not just read the source code?
Post by soumith
If you could give us such an outline, I could start a github branch that we
can work on to fix this.
If you make it past 4GB (say, using the indirection approach), feel free to pullreq :)
Post by soumith
Thanks,
Soumith
-- k

[1] https://github.com/katlogic/ljx/commit/d2c8588b6b9acf94fe9c5718798b95ae1e6b9321
Coda Highland
2014-08-15 19:41:37 UTC
Permalink
Post by Karel Tuma
If 3rd party code polluting the lower 32-bit address space is part of your problem,
perhaps you might want to investigate various tricks to avoid that: either
linker options to have _bss/brk() above 32-bit, or some brute-force approach, like
mapping a page in front of sbrk(0) and calling malloc (virtually all
implementations fall back to mmap in that case).
If the third party code is the process responsible for LOADING your
code, those options don't work. It's the unfortunate reason why I had
to drop LuaJIT in a previous commercial project. :(

/s/ Adam
Karel Tuma
2014-08-15 20:37:25 UTC
Permalink
Post by Coda Highland
If the third party code is the process responsible for LOADING your
code, those options don't work. It's the unfortunate reason why I had
to drop LuaJIT in a previous commercial project. :(
/s/ Adam
Care to elaborate? In the event that a significant chunk of the low 32-bit address space was already allocated
at the time of luaL_newstate(), there is indeed nothing that can be done; it's gone for good.

However, I am convinced it is possible to fairly portably trick *anything* else into no longer using
precious low memory after luaL_newstate(), when the allocator initializes... mmap(PROT_NONE)
or VirtualAlloc(MEM_RESERVE) to the rescue.
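
A minimal sketch of the mmap(PROT_NONE) variant, assuming 64-bit Linux: walk the low 4GB in fixed-size steps, ask for each chunk at a hinted address, and keep only the mappings the kernel actually placed there. The names reserve_at, reserve_low_memory and commit_region are hypothetical, and the chunk size is assumed to be a multiple of the page size.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define LOW_LIMIT ((uintptr_t)1 << 32)   /* end of the low 4GB */

/* Try to reserve len bytes at exactly addr, with no access rights.
 * Without MAP_FIXED the address is only a hint, so check the result. */
static int reserve_at(uintptr_t addr, size_t len) {
  void *p = mmap((void *)addr, len, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  if (p == MAP_FAILED) return 0;
  if ((uintptr_t)p != addr) { munmap(p, len); return 0; }
  return 1;
}

/* Reserve every free chunk-sized slot below 4GB; store the start
 * addresses in out[] and return how many were reserved. */
static size_t reserve_low_memory(size_t chunk, uintptr_t *out, size_t max_out) {
  size_t n = 0;
  assert(chunk > 0);   /* a zero step would never advance */
  for (uintptr_t a = chunk; a + chunk <= LOW_LIMIT && n < max_out; a += chunk) {
    if (reserve_at(a, chunk)) out[n++] = a;
  }
  return n;
}

/* Make part of a reserved chunk usable when the allocator needs it. */
static void *commit_region(uintptr_t addr, size_t len) {
  return mprotect((void *)addr, len, PROT_READ | PROT_WRITE) == 0
             ? (void *)addr : NULL;
}
```

Once the slots are reserved, other code in the process that calls mmap or malloc can no longer be handed those addresses, and the Lua allocator commits them on demand. Slots that were already taken before the call stay lost, as noted above.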
Coda Highland
2014-08-15 20:49:55 UTC
Permalink
Post by Karel Tuma
Post by Coda Highland
If the third party code is the process responsible for LOADING your
code, those options don't work. It's the unfortunate reason why I had
to drop LuaJIT in a previous commercial project. :(
/s/ Adam
Care to elaborate? In the event that a significant chunk of the low 32-bit address space was already allocated
at the time of luaL_newstate(), there is indeed nothing that can be done; it's gone for good.
However, I am convinced it is possible to fairly portably trick *anything* else into no longer using
precious low memory after luaL_newstate(), when the allocator initializes... mmap(PROT_NONE)
or VirtualAlloc(MEM_RESERVE) to the rescue.
OSX adds an additional wrinkle -- there's a linker flag you have to
set on the application to define the lowest virtual memory address
available to the application. This, like the calling process having
already loaded a ton of stuff before you ever get instantiated, is
absolutely fatal, with no workaround available.

Yes, your workaround would work IN THEORY, but there's absolutely no
guarantee that there's going to be any available memory in that
address range, either because the OS won't give it to you, or because
other plugins will have already seized it from you.

/s/ Adam
Karel Tuma
2014-08-15 21:27:10 UTC
Permalink
Post by Coda Highland
OSX adds an additional wrinkle -- there's a linker flag you have to
set on the application to define the lowest virtual memory address
available to the application. This, like the calling process having
already loaded a ton of stuff before you ever get instantiated, is
absolutely fatal, with no workaround available.
Agreed, simple hacks won't cut it since those are inherently OS/libc specific.
Post by Coda Highland
Yes, your workaround would work IN THEORY, but there's absolutely no
guarantee that there's going to be any available memory in that
address range, either because the OS won't give it to you, or because
other plugins will have already seized it from you.
What about IN PRACTICE? What is the output of the following program:

https://gist.github.com/katlogic/4721e91ff5a845d2b939

I'm also curious whether you know of any failure modes that could occur
when the allocator uses chunks preallocated like that.

This approach also means the allocator would need to be thread-aware,
so that the reserve routine is called exactly once per process (and the chunks are then
distributed to the per-lua_State dlmallocs in an atomic fashion). Chances are
mere CAS atomics would do.
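
A rough sketch of how plain CAS could cover both parts, using C11 <stdatomic.h>: one compare-and-swap elects the single thread that publishes the reserved range, and a CAS loop hands out pieces of it to callers. All names here (pool_init_once, pool_take and the pool_* variables) are hypothetical, and the caller is assumed to pass in a range that has already been reserved.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

enum { POOL_UNINIT, POOL_BUSY, POOL_READY };

static _Atomic int pool_state = POOL_UNINIT;
static uintptr_t pool_end;              /* written once, before POOL_READY */
static _Atomic uintptr_t pool_next;     /* bump pointer into the range */

/* Only the thread that wins the CAS sets up the pool; later callers
 * (and their arguments) are ignored once the pool is ready. */
static void pool_init_once(uintptr_t base, size_t size) {
  int expected = POOL_UNINIT;
  if (atomic_compare_exchange_strong(&pool_state, &expected, POOL_BUSY)) {
    pool_end = base + size;
    atomic_store(&pool_next, base);
    atomic_store(&pool_state, POOL_READY);
  } else {
    while (atomic_load(&pool_state) != POOL_READY) { /* spin until ready */ }
  }
}

/* Atomically carve len bytes off the pool; returns 0 when exhausted. */
static uintptr_t pool_take(size_t len) {
  assert(atomic_load(&pool_state) == POOL_READY);
  uintptr_t cur = atomic_load(&pool_next);
  for (;;) {
    if (len > pool_end - cur) return 0;
    /* On failure, cur is reloaded with the current value and we retry. */
    if (atomic_compare_exchange_weak(&pool_next, &cur, cur + len)) return cur;
  }
}
```

Each lua_State's allocator would then call pool_take() for its own chunk. This sketch never returns memory to the pool, which a real allocator would have to address.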

PS:

I'm not planning on implementing this since I neither care about nor have OSX;
however, this could serve as a blueprint for how to go about it,
as per the OP's request.
Coda Highland
2014-08-15 21:35:32 UTC
Permalink
Post by Karel Tuma
Post by Coda Highland
OSX adds an additional wrinkle -- there's a linker flag you have to
set on the application to define the lowest virtual memory address
available to the application. This, like the calling process having
already loaded a ton of stuff before you ever get instantiated, is
absolutely fatal, with no workaround available.
Agreed, simple hacks won't cut it since those are inherently OS/libc specific.
Post by Coda Highland
Yes, your workaround would work IN THEORY, but there's absolutely no
guarantee that there's going to be any available memory in that
address range, either because the OS won't give it to you, or because
other plugins will have already seized it from you.
https://gist.github.com/katlogic/4721e91ff5a845d2b939
I'm also curious whether you know of any failure modes that could occur
when the allocator uses chunks preallocated like that.
This approach also means the allocator would need to be thread-aware,
so that the reserve routine is called exactly once per process (and the chunks are then
distributed to the per-lua_State dlmallocs in an atomic fashion). Chances are
mere CAS atomics would do.
I'm not planning on implementing this since I neither care about nor have OSX;
however, this could serve as a blueprint for how to go about it,
as per the OP's request.
I no longer have access to the codebase in question, as I've moved to
a new job since then. However, my understanding of the relevant
mechanics in OSX is that if you don't pass the aforementioned linker
flag to the host application (remember, I'm talking about being a
plugin where you don't have access to do so) then line 24 will ALWAYS
evaluate to true.

/s/ Adam
