memory allocation policy

Discussion:

Laurent Deniau

2014-09-22 14:37:58 UTC

Dear All,

We are facing a performance problem with memory management shared between LJ and C that we do not clearly understand. We are trying to figure out the best policy for intensive memory allocation.

Basically, we are implementing some algebra for complex objects with sizes ranging from a simple 6D vector to complex objects of a few megabytes, manipulated in C (for performance and parallel computation reasons), that share complex and large descriptors of their internal representation (hooked pointers). To simplify the problem, you can consider these objects as matrices of user-defined sizes ranging from 2x2 to 500x500, used in hundreds of formulas (i.e. operator overloading is not an option). We use c = a*b on small objects to benchmark the memory allocation overhead and consider only the speed ratio R between the different scenarios (small R = fast, high R = slow):

1- result is pre-allocated:
1.1 mul(a,b,c) in LJ : R=8
1.2 mul(a,b,c) in C : R=1

C is faster for this kind of calculation by a factor of 8, and the speed of mul is critical for us. Hence in the following, c = a*b is always handled by LJ __mul, which calls the C version of mul (i.e. R=1) for the underlying multiplication. The remaining choice is how the result of a*b is allocated:

2- result allocated in LJ: R=2
3- result allocated in C (malloc+finaliser): R=30
4- result allocated in C (pool+finaliser): R=27

Case 2 shows that LJ handles the allocations very fast and R=2 would be acceptable, i.e. half of the time would be spent in allocation for small objects (negligible for large objects). But this case quickly reaches the memory limit of LJ for large objects because all the memory is managed by LJ.

Case 4 vs case 3 shows that even a highly optimised pool allocator in C (3e8 malloc+free/sec) does not solve the problem, so the problem is not the C malloc+free cycle itself.

Case 3 shows the problem under investigation:
C malloc+free combined with __gc leads to R=30 (97% of the time), which is not acceptable. To me, such a slowdown could only be explained by one of the following:
- the gc collects (calls the finaliser) too late, as it is not aware that objects waiting for destruction still consume a significant amount of memory.
- we fall back on the interpreter at some level
- the gc uses a slow mechanism to invoke the finaliser
- ???

I suspect that this problem is not new and some solutions have already been found by others. It would be nice if someone could explain to me why the delegation of memory management to C gives such bad results and how to solve the problem. Thanks.

Best regards,
Laurent.

Mike Pall

2014-09-24 10:07:08 UTC

Post by Laurent Deniau
- the gc collects (calls the finaliser) too late as it is not
aware that objects waiting for destruction still consume a
significant amount of memory.

That's certainly the case as it only sees tiny pointer objects.
Log the memory consumption over time and compare the actual need
vs. the use.
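The mismatch can be made visible with a few lines of LuaJIT. This is only a sketch: the thread does not show the actual allocation code, so it assumes the common ffi.gc(ffi.C.malloc(n), ffi.C.free) idiom for case 3, and the 1 MiB block size and block count are hypothetical. It compares what collectgarbage("count") reports with what was really requested from malloc.

```lua
local ffi = require("ffi")

ffi.cdef[[
void *malloc(size_t size);
void free(void *ptr);
]]

local BLOCK = 1024 * 1024   -- hypothetical object size: 1 MiB
local NBLOCKS = 16          -- hypothetical number of live objects

collectgarbage()
local before_kb = collectgarbage("count")

local blocks = {}
for i = 1, NBLOCKS do
  local p = ffi.C.malloc(BLOCK)
  assert(p ~= nil, "malloc failed")
  -- The GC only tracks the small pointer cdata, not the 1 MiB behind it.
  blocks[i] = ffi.gc(p, ffi.C.free)
end

local seen_kb = collectgarbage("count") - before_kb
local real_kb = NBLOCKS * BLOCK / 1024

print(string.format("GC sees %.1f KB, malloc handed out %d KB", seen_kb, real_kb))
```

The GC paces itself on the first number, so garbage backed by large C blocks builds up long before a collection cycle is triggered.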

Post by Laurent Deniau
- we fall back on the interpreter at some level

You can check with -jv, -jdump and the profiler in v2.1 (-jp=vl).

Post by Laurent Deniau
- the gc uses a slow mechanism to invoke the finaliser

A call to a GC finalizer is not compiled. You can find out if this
is the problem by profiling with the GC running vs. stopped. You
probably need to shorten your test to avoid running out of memory.

Post by Laurent Deniau
I suspect that this problem is not new and some solutions have
already been found by others. It would be nice if someone could
explain to me why the delegation of memory management to C gives
such bad results and how to solve the problem. Thanks.

It's not hard to do a pool allocator in Lua, using the FFI.
Allocate a big pool with mmap or malloc, then deal out chunks on
new and put them back on the free list in the finalizer. The
finalizer is a Lua function in this case, so it can be compiled.
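A minimal sketch of such a pool, under stated assumptions: the element type vec6_t, the capacity N and the names pool_new, pool_release and pool_free_count are all hypothetical, and the backing block comes from malloc and is never freed here. Only standard FFI calls are used (ffi.cdef, ffi.cast, ffi.new, ffi.sizeof, ffi.gc).

```lua
local ffi = require("ffi")

ffi.cdef[[
void *malloc(size_t size);
typedef struct { double v[6]; } vec6_t;   /* hypothetical small object */
]]

local N = 1024                            -- pool capacity (assumption)
local vec6_p = ffi.typeof("vec6_t *")

-- One big block obtained from C; never freed in this sketch.
local raw = ffi.C.malloc(N * ffi.sizeof("vec6_t"))
assert(raw ~= nil, "malloc failed")
local base = ffi.cast(vec6_p, raw)

-- Free list kept as a stack of slot indices.
local free_idx = ffi.new("int32_t[?]", N)
local nfree = N
for i = 0, N - 1 do free_idx[i] = i end

-- Finalizer: a plain Lua function that pushes the slot back on the free list.
local function pool_release(p)
  free_idx[nfree] = p - base      -- pointer difference = slot index
  nfree = nfree + 1
end

-- Deal out one chunk; the GC calls pool_release once the handle is dead.
local function pool_new()
  if nfree == 0 then
    collectgarbage()              -- give pending finalizers a chance to run
    assert(nfree > 0, "pool exhausted")
  end
  nfree = nfree - 1
  return ffi.gc(base + free_idx[nfree], pool_release)
end

local function pool_free_count() return nfree end
```

Slots still come back only when the GC gets around to running the finalizers, so this sketch forces a full collection when the pool runs dry. Whether that is cheap enough is exactly what the profiling suggested in this message would have to show.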

But better check the new and free functions in isolation for any
inefficiencies by looking at the machine code with -jdump.

--Mike

Laurent Deniau

2014-09-24 19:37:49 UTC

Post by Mike Pall

Post by Laurent Deniau
- the gc collects (calls the finaliser) too late as it is not
aware that objects waiting for destruction still consume a
significant amount of memory.

That's certainly the case as it only sees tiny pointer objects.
Log the memory consumption over time and compare the actual need
vs. the use.

Yes, this was already observed by counting ctor vs dtor calls on the C side. Is there a way to tune the gc (some API)?

Post by Mike Pall

Post by Laurent Deniau
- we fall back on the interpreter at some level

You can check with -jv, -jdump and the profiler in v2.1 (-jp=vl).

We will investigate on that side, thanks.

Post by Mike Pall

Post by Laurent Deniau
- the gc uses a slow mechanism to invoke the finaliser

A call to a GC finalizer is not compiled. You can find out if this
is the problem by profiling with the GC running vs. stopped. You
probably need to shorten your test to avoid running out of memory.

This is basically what we do, except that the finaliser is the C function itself. Do you mean that if I wrap the call to the C function in a Lua function, both the finaliser and the ffi call will be compiled?

Post by Mike Pall
But better check the new and free functions in isolation for any
inefficiencies by looking at the machine code with -jdump.

Yes, we will investigate. In the meantime, we will use two-level memory management: a high level using the LJ GC when the object is exposed to the LJ world (user-defined lifetime) and a low level in C for local temporaries (short lifetime).

Best,
Laurent.

Cosmin Apreutesei

2014-09-24 20:50:03 UTC

Post by Laurent Deniau
This is basically what we do, except that the finaliser is the C function itself. Do you mean that if I wrap the call to the C function in a Lua function, both the finaliser and the ffi call will be compiled?

I think he means implementing a pool in Lua, so the function you pass
to ffi.gc() is Lua all the way down. Passing ffi.C.free to
ffi.gc(), with or without wrapping it in a Lua function, won't make
the call compiled, but I think more importantly, it will leave you out of
memory pretty fast, because the gc doesn't know how much memory you've
allocated for a single object (and unfortunately there's no API to
tell it), and it needs that information in order to decide on the
number of objects to be collected in a single step.
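Since the GC cannot be told about external sizes, one possible workaround is to keep a byte counter on the Lua side and force a collection once a budget is exceeded. Nobody in the thread proposes this, so treat it as an assumption rather than an established fix. collectgarbage is the standard Lua API; the names track_alloc and track_free and the 64 MiB budget are hypothetical.

```lua
local MIN_BUDGET = 64 * 1024 * 1024   -- hypothetical budget in bytes
local budget = MIN_BUDGET
local external_bytes = 0              -- bytes currently held outside the Lua heap
local forced_cycles = 0

-- Call from the finalizer (or explicit release) of each C-allocated object.
local function track_free(nbytes)
  external_bytes = external_bytes - nbytes
end

-- Call right after each C-side allocation of nbytes.
local function track_alloc(nbytes)
  external_bytes = external_bytes + nbytes
  if external_bytes > budget then
    collectgarbage()                  -- full cycle, runs pending finalizers
    forced_cycles = forced_cycles + 1
    -- Let the budget follow the live set, so long-lived data does not
    -- trigger a full collection on every allocation.
    budget = math.max(MIN_BUDGET, 2 * external_bytes)
  end
end
```

The standard tuning knobs, collectgarbage("setpause", ...) and collectgarbage("setstepmul", ...), exist as well, but they pace the collector on the size of the Lua heap, which is the quantity that stays tiny here.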

Javier Guerra Giraldez

2014-09-24 21:05:41 UTC

On Wed, Sep 24, 2014 at 3:50 PM, Cosmin Apreutesei

Post by Cosmin Apreutesei
I think he means implementing a pool in Lua, so the function you pass
to ffi.gc() is Lua all the way down.

I think the solution has more than one part (both are in Mike's answer):

1.- release resources as soon as possible. The issue (and this
happens both in Lua and LuaJIT), is that Lua only sees very small
objects, barely bigger than a pointer; so there's not much pressure to
collect garbage. The solution is to add some 'release' method to your
objects and call them the moment they're not needed.

2.- (de)allocators are slow, and LuaJIT doesn't compile calls to __gc.
But if you don't wait for the GC to release, you can do much faster.
What we do in SnabbSwitch is to allocate a big FFI array and then
handle freelists (just an array of pointers to the elements). just
getting an element from the freelist and returning it later is _much_
faster than any general-purpose allocator, no garbage is generated and
there's no fragmentation.
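A minimal sketch of that freelist idea with explicit release and no finalizer involved. The element type mat2_t, the capacity N and the names get and put are hypothetical, and this is not SnabbSwitch's actual code. Here the backing array is created with ffi.new, so it lives in LuaJIT-managed memory.

```lua
local ffi = require("ffi")

ffi.cdef[[
typedef struct { double m[4]; } mat2_t;   /* hypothetical 2x2 element */
]]

local N = 256                                -- capacity (assumption)
local store = ffi.new("mat2_t[?]", N)        -- one big FFI array, allocated once
local freelist = ffi.new("mat2_t *[?]", N)   -- pointers to the unused elements
local nfree = N
for i = 0, N - 1 do freelist[i] = store + i end

-- Take an element off the freelist.
local function get()
  assert(nfree > 0, "freelist empty")
  nfree = nfree - 1
  return freelist[nfree]
end

-- Explicit release: the caller hands the element back when done with it.
local function put(p)
  freelist[nfree] = p
  nfree = nfree + 1
end
```

Both operations are a couple of array accesses, but the scheme relies on the caller knowing when an element is no longer needed.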

--
Javier

Laurent Deniau

2014-09-25 12:43:30 UTC

Post by Javier Guerra Giraldez
On Wed, Sep 24, 2014 at 3:50 PM, Cosmin Apreutesei

Post by Cosmin Apreutesei
I think he means implementing a pool in Lua, so the function you pass
to ffi.gc() is Lua all the way down.

It was my first observation in my post. Our first try was to bet on LJ GC speed (with success) until we had to scale to larger problems.

If I could call a release method explicitly, I would not have the problem even with a pool managed in C (I consider the cost of ffi calls to be negligible).

An intermediate approach would be to explicitly tag temporaries for immediate destruction or stealing, but it does not work without explicit intervention of the _user_. The common problem with this approach is:

assume a, b are matrices

c = a*b -- c refers to a temporary created by *
d = 2*c -- * steals the temporary referenced by c
e = 3*c -- boom, c is not valid

In C++, we can overload operator= (+copy-ctor+move-ctor+move-assign) to clear the tmp flag, but not in Lua AFAIK...

A possible (error-prone) solution would be to force the use of:
c = get(a*b)
d = get(2*c)
e = get(3*c)

In LuaJIT, we can use __index to write:
c = (a*b).get
d = (2*c).get
e = (3*c).get

where get is not a defined field, so the access falls through to __index:
__index = function (self, key)
  if key == "get" then
    self.tmp = false
    return self
  end
  error(…)
end
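The hazard and the .get escape hatch can be reproduced with a toy in plain Lua. This is only an illustration under assumptions: objects are tables wrapping a single number instead of C matrices, only scalar * object is supported, and the names Vec and new are hypothetical.

```lua
local Vec = {}

local function new(val, tmp)
  return setmetatable({ val = val, tmp = tmp }, Vec)
end

-- "get" is not a real field, so reading it lands here and clears the flag.
Vec.__index = function(self, key)
  if key == "get" then
    self.tmp = false
    return self
  end
  error("no field '" .. tostring(key) .. "'")
end

-- Toy multiplication: exactly one operand is a plain number.
Vec.__mul = function(x, y)
  if type(x) == "number" then x, y = y, x end
  if x.tmp then
    x.val = x.val * y   -- steal: reuse the temporary's storage in place
    return x
  end
  return new(x.val * y, true)
end

local a = new(2, false)

-- Without get: c is still tagged as a temporary and gets clobbered.
local c = 5 * a
local d = 2 * c         -- d is the very same object as c
local e = 3 * c         -- works on the already modified value

-- With get: every named result is detached from the temporaries.
local c2 = (5 * a).get
local d2 = (2 * c2).get
local e2 = (3 * c2).get
```

In the first sequence c, d and e all alias one object, which is the "boom" case. In the second, each name owns its own value, at the price of the user never forgetting .get.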

Post by Javier Guerra Giraldez
2.- (de)allocators are slow, and LuaJIT doesn't compile calls to __gc.
But if you don't wait for the GC to release, you can do much faster.

If I know when to release, I don't have the problem...

Post by Javier Guerra Giraldez
What we do in SnabbSwitch is to allocate a big FFI array and then
handle freelists (just an array of pointers to the elements). just
getting an element from the freelist and returning it later is _much_
faster than any general-purpose allocator, no garbage is generated and
there's no fragmentation.

This is what the test with R=22 does, but in C.

I don't see the difference between managing it in C or LuaJIT for this purpose. I do see why Mike proposes to manage it in LuaJIT, but that only helps if the bottleneck comes from the speed of the interpreter. And I suspect that the interpreter alone would not kill the performance by a factor of 30; the problem is elsewhere.

Best,
Laurent.

Юрий Соколов

2014-09-26 04:09:19 UTC

Have you tried some kind of "autorelease" pool? The allocation function puts
all allocated objects into a list (or pointers into an array); when the calculation
is finished, you mark the result as "needed" and then call "free all objects in the list
that aren't marked as needed".

This is how people in Objective-C have simplified reference counting for
years: the autorelease pool just decrements the reference count, and incrementing
the reference count once more works as "mark as needed".
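A minimal sketch of the idea in plain Lua, assuming hypothetical names alloc, keep and drain and using plain tables as stand-ins for C-backed objects. Where the sketch flips the live flag, a real library would hand the storage back to its pool.

```lua
local pending = {}     -- objects allocated since the last drain
local released = 0     -- number of objects released so far

-- Every allocation is registered in the pending list.
local function alloc(val)
  local obj = { val = val, needed = false, live = true }
  pending[#pending + 1] = obj
  return obj
end

-- Mark a result that must survive the next drain.
local function keep(obj)
  obj.needed = true
  return obj
end

-- Release everything that was not marked, then empty the list.
local function drain()
  for i = #pending, 1, -1 do
    local obj = pending[i]
    if not obj.needed then
      obj.live = false             -- stand-in for returning storage to a pool
      released = released + 1
    end
    pending[i] = nil
  end
end
```

The open question with this scheme is who calls drain, and when.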

Laurent Deniau

2014-09-26 06:05:30 UTC

On Sep 26, 2014, at 6:09 AM, Юрий Соколов <***@gmail.com> wrote:

Post by Юрий Соколов
Have you tried some kind of "autorelease" pool: allocation function puts all allocated objects into list (or pointers into array),

These are usually implemented as stacks (growing dynamic arrays), since their purpose is to provide a user-defined stack of objects whose release is deferred. It is also an elegant library solution for managing stack unwinding with exceptions (i.e. non-local jumps).

Post by Юрий Соколов
when calculation finished,

From the point of view of the library, how do you know that the calculation is finished?

Post by Юрий Соколов
you mark result as "needed" then call "free all objects in a list that aren't marked as needed".

Semantically, it is not different from the "get" approach in my previous message, just more conservative. You can use autorelease pools internally in the lib to group the "get" over a set of statements, but this is not their main purpose and may be very conservative (like the GC) or error-prone.

Post by Юрий Соколов
It is how people in Objective C used to simplify reference counting for years: autorelease pool just decrements reference count, and incrementing reference count once more works as "mark as needed".

I have implemented a DSL in C similar to Objective-C (but better ;-) called C Object System, which uses autorelease pools intensively. It is very useful, but the class of problem it solves is different: it's an explicit deferred non-local release.

Best,
Laurent.

Юрий Соколов

2014-09-28 21:57:04 UTC

C Object System looks interesting. Is it used in production? Why did
development stop (at least on GitHub)?

Laurent Deniau

2014-09-29 08:12:38 UTC

On Sep 28, 2014, at 11:57 PM, Юрий Соколов <***@gmail.com> wrote:

Post by Юрий Соколов
C Object System looks interesting. Is it used in production? Why did development stop (at least on GitHub)?

[off topic]
It was used in production by 4 companies around 2009 that were looking for an OO system in C (COS is much more). I also discussed it with a couple of engineers developing Objective-C at Apple, and they concluded that my approach (unification of all concepts involved) was the right way. Then I started to develop the standard library and the documentation system (DocStr was the last module under development). Finally, I stopped developing everything for personal and professional reasons. I use it from time to time for rapid development in C, nothing serious. Around 2013-2014, I started to use LuaJIT because it is a nice competitor to COS and it has a community! If you want to discuss COS, I suggest doing it elsewhere.
[off topic]

Best, Laurent.

Юрий Соколов

2014-09-24 18:33:23 UTC

Can you test with HASH_BIAS changed to zero in src/lj_tab.h?
It will not completely repair R=27, but it would still be interesting to know the result.

Laurent Deniau

2014-09-24 19:15:25 UTC

Post by Юрий Соколов
Can you test with HASH_BIAS changed to zero in src/lj_tab.h?
It will not completely repair R=27, but it would still be interesting to know the result.

I get R=22 (25% improvement). Is this zero value better for all cases, or did you want to know whether I was facing a special case of high collision rate?

Best,
Laurent

Юрий Соколов

2014-09-24 21:20:14 UTC

HASH_BIAS (0) is certainly better for ffi.gc when many small allocations are
put in. I suppose that most of the time HASH_BIAS (0) is not worse than
the current value, but I have no proof. Perhaps @agentzh could test it more
thoroughly, and Mike will consider his conclusion.

Perhaps it would be better to use a structure other than a Lua table for
the internals of ffi.gc, but I doubt whether Mike would accept it.
