Ticket #1853 (closed defect: fixed)
mallopt hint does not always work
|Reported by:||jsquyres||Owned by:||jsquyres|
|Priority:||blocker||Milestone:||Open MPI 1.3.2|
It seems that the mallopt() hint we use to ensure that memory is never given back to the OS does not always work. Yes, we were warned that this was just a hint, but in all of our testing we never found a case where it failed. IBM and LANL have since uncovered cases where the mallopt hint does not work (I don't know if they have a simple reproducer, though), and others have confirmed this. In the worst case, this can lead to silent data corruption. Much sadness.
The problem occurs only when all of the following are true:
- Open MPI v1.3.0 and v1.3.1
- Using the openib BTL
- Using mpi_leave_pinned=1 (which is the default)
Other scenarios are unaffected by this bug.
Note that this issue can be worked around in 1.3.0 and 1.3.1 by either one of the following methods:
- Set the MCA parameter mpi_leave_pinned to 0. This will likely hurt performance if your application re-uses the same communication buffers repeatedly, as most popular MPI performance benchmarks do.
- Link in the openmpi-malloc library when creating MPI applications (e.g., add "-lopenmpi-malloc" to the link line) and leave mpi_leave_pinned at its default value (likely to be 1 when using OpenFabrics-based networks). Linking in this library should have no noticeable performance impact.
It is only necessary to do one of the above methods, not both.
The issue is fairly complex; the short version is that if the internal allocator ever allocates a 2nd heap, the mallopt hint will only apply to the 1st heap. Hence, memory from the 2nd (or Nth) heap may actually get returned to the OS, which can then effectively corrupt OMPI's internal registration cache.
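For context, a hint of this kind might look like the sketch below. It is not the actual Open MPI code: this ticket does not say which mallopt() parameters are used, so M_TRIM_THRESHOLD and M_MMAP_MAX are assumptions, and request_no_memory_return() is a hypothetical name. The sketch is glibc-specific.

```c
#include <malloc.h>   /* glibc-specific: mallopt(), M_TRIM_THRESHOLD, M_MMAP_MAX */

/*
 * Ask the allocator to keep freed memory rather than return it to the OS.
 * Returns 1 if both hints were accepted, 0 otherwise.
 * "Accepted" only means the parameters were set; as described above, it
 * does not guarantee that every heap honors them.
 */
int request_no_memory_return(void)
{
    /* Assumed parameter: -1 becomes a huge threshold, so trimming should never trigger. */
    int trim_ok = mallopt(M_TRIM_THRESHOLD, -1);

    /* Assumed parameter: allow zero mmap()ed chunks, so free() has nothing to munmap(). */
    int mmap_ok = mallopt(M_MMAP_MAX, 0);

    return trim_ok && mmap_ok;
}
```

A return value of 1 is exactly what makes the failure silent: the allocator accepts the hint, yet memory from a 2nd (or Nth) heap can still be handed back.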
Brian, Brad, and I have talked about this at length -- it seems we need to start using ptmalloc by default again. The problem is how to enable ptmalloc by default without affecting users who don't use OpenFabrics networks. Brian came up with a good compromise -- using a weak symbol in glibc (I don't know the name offhand -- I need to look it up), you can have some code run before the very first allocation completes in a process. Using this hook, we can use the built-in glibc memory allocator hooks to effectively replace the entire built-in allocator with our internal ptmalloc.
Additionally, our internal ptmalloc can be name-shifted to ompi_<foo>(), so it can be safe to link into any MPI application -- it won't be activated unless we use the setup hooks during process startup.
During process startup, we're effectively in signal context (can't call malloc, etc.), but we can probably do something like this:
- look for an environment variable (getenv); if it's 1, enable the hook. If it's 0, disable the hook.
- if the env variable is not found, see if /sys/class/infiniband exists (e.g., via stat()). If so, enable the hook.
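The two checks above could be sketched roughly as follows. This is an illustration under stated assumptions, not a committed design: should_enable_hook() is a hypothetical name, the environment variable name is left to the caller because none has been chosen here, and treating any value other than "0" or "1" like an unset variable is my own assumption.

```c
#include <stdlib.h>     /* getenv */
#include <sys/stat.h>   /* stat */

/*
 * Decide whether to install the allocator hook.
 * env_name and sysfs_path are parameters only so the sketch can be
 * exercised; real startup code would presumably hard-code both.
 * Only getenv() and stat() are used, neither of which should need malloc().
 * Returns 1 to enable the hook, 0 to leave the default allocator alone.
 */
int should_enable_hook(const char *env_name, const char *sysfs_path)
{
    const char *value = getenv(env_name);
    struct stat sb;

    /* An explicit "1" or "0" from the user always wins. */
    if (value != NULL && value[0] != '\0' && value[1] == '\0') {
        if (value[0] == '1') return 1;
        if (value[0] == '0') return 0;
    }

    /* Variable unset (or, by assumption, set to anything else): probe sysfs. */
    return stat(sysfs_path, &sb) == 0 ? 1 : 0;
}
```

A call might look like should_enable_hook("OMPI_EXAMPLE_MALLOC_HOOK", "/sys/class/infiniband"), where the variable name is made up for the example.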
Additionally, since the mallopt hint does not work reliably, we might as well rip out all the mallopt code.
I have created an hg repository to do this work: