Laurent Deniau
2014-09-22 14:37:58 UTC
Dear All,
We are facing a performance problem with memory management shared between LJ and C that we do not clearly understand. We are trying to figure out the best policy for intensive memory allocation.
Basically, we are implementing some algebra for complex objects, with sizes ranging from a simple 6D vector to objects of a few megabytes, manipulated in C (for performance and parallel computation reasons), that share complex and large descriptors of their internal representation (hooked pointers). To simplify the problem, you can consider these objects as matrices of user-defined sizes ranging from 2x2 to 500x500, used in hundreds of formulas (i.e. operator overloading is not an option). We use c = a*b on small objects to benchmark the memory allocation overhead, and consider only the speed ratio R between the different scenarios (small R = fast, high R = slow):
1- result is pre-allocated:
1.1 mul(a,b,c) in LJ : R=8
1.2 mul(a,b,c) in C : R=1
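For reference, scenario 1.2 could look like the following minimal C sketch. It assumes square, row-major matrices of doubles; the name mat_mul and its signature are hypothetical and only illustrate "result pre-allocated by the caller", not our actual code.

```c
#include <assert.h>
#include <stddef.h>

/* c = a*b for square n x n row-major matrices.
   c is pre-allocated by the caller and must not alias a or b,
   so no allocation happens inside the multiplication. */
void mat_mul(const double *a, const double *b, double *c, size_t n)
{
    assert(a && b && c && c != a && c != b);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += a[i*n + k] * b[k*n + j];
            c[i*n + j] = s;
        }
    }
}
```

With this shape, the cost measured in scenario 1 is the arithmetic alone, which is what R=1 refers to.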
C is faster for this kind of calculation by a factor of 8, and the speed of mul is critical for us. Hence, in the following, c = a*b is always handled by the LJ __mul metamethod, which calls the C version of mul (i.e. R=1) for the underlying multiplication. The remaining choice is how the result of a*b is allocated:
2- result allocated in LJ: R=2
3- result allocated in C (malloc+finaliser): R=30
4- result allocated in C (pool+finaliser): R=27
Case 2 shows that LJ handles the allocations very quickly, and R=2 would be acceptable: half of the time would be spent in allocation for small objects (negligible for large objects). But this case quickly reaches the memory limit of LJ for large objects, because all the memory is managed by LJ.
Case 4 vs case 3 shows that even a highly optimised pool allocator in C (3e8 malloc+free/sec) does not solve the problem, so the problem is not the C malloc+free cycle itself.
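To make "pool" in case 4 concrete, here is a minimal sketch of a fixed-size free-list pool in C. It is not our actual allocator: the names pool_t, pool_init, pool_alloc, pool_free and pool_destroy are hypothetical, it is single-threaded, and it falls back to malloc when the free list is empty.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A freed block is reused to store the link to the next free block. */
typedef struct pool_node { struct pool_node *next; } pool_node;

typedef struct {
    size_t     block_size;  /* size of every block served by this pool */
    pool_node *free_list;   /* singly linked list of recycled blocks */
} pool_t;

void pool_init(pool_t *p, size_t block_size)
{
    assert(p && block_size > 0);
    /* a block must be large enough to hold the free-list link */
    p->block_size = block_size < sizeof(pool_node) ? sizeof(pool_node) : block_size;
    p->free_list = NULL;
}

void *pool_alloc(pool_t *p)
{
    pool_node *n = p->free_list;
    if (n) {                       /* fast path: pop a recycled block */
        p->free_list = n->next;
        return n;
    }
    return malloc(p->block_size);  /* slow path: ask the system */
}

void pool_free(pool_t *p, void *ptr)
{
    if (!ptr) return;
    pool_node *n = ptr;            /* push the block back on the free list */
    n->next = p->free_list;
    p->free_list = n;
}

void pool_destroy(pool_t *p)
{
    while (p->free_list) {         /* release recycled blocks to the system */
        pool_node *n = p->free_list;
        p->free_list = n->next;
        free(n);
    }
}
```

In case 4 the finaliser would return the block through something like pool_free instead of free, so an alloc/free pair costs only a few pointer operations.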
Case 3 shows the problem under investigation:
C malloc+free combined with __gc leads to R=30 (97% of the time), which is not acceptable. To me, such a slowdown could only be explained by a few things:
- the gc collects (calls the finaliser) too late, as it is not aware that objects waiting for destruction still consume a significant amount of memory.
- we fall back on the interpreter at some level
- the gc uses a slow mechanism to invoke the finaliser
- ???
I suspect that this problem is not new and that others have already found solutions. It would be nice if someone could explain to me why delegating memory management to C gives such bad results, and how to solve the problem. Thanks.
Best regards,
Laurent.