Ticket #1853 (closed defect: fixed)

Opened 8 years ago

Last modified 8 years ago

mallopt hint does not always work

Reported by: jsquyres
Owned by: jsquyres
Priority: blocker
Milestone: Open MPI 1.3.2
Version: trunk
Keywords:
Cc: brad.benton@…,lenny.verkhovsky@…,mike.ompi@…,terry.dontje@…,don.kerr@…,pasha@…,jon@…,bwbarre@…,rhc@…

Description

It seems that the mallopt() hint we use to ensure that memory is never given back to the OS does not always work. Yes, we were warned that this was just a hint, but in all of our testing, we didn't find places where it didn't work. IBM and LANL have uncovered cases where the mallopt hint does not work (I don't know if they have a simple reproducer, though). This has also been confirmed by others. In the worst case, this can lead to silent data corruption. Much sadness.

The problem occurs in this scenario only:

  • Open MPI v1.3.0 and v1.3.1
  • Using the openib BTL
  • Using mpi_leave_pinned=1 (which is the default)

Other scenarios are unaffected by this bug.

Note that this issue can be worked around in 1.3.0 and 1.3.1 by either one of the following methods:

  1. Set the MCA parameter mpi_leave_pinned to 0. This will likely have a negative performance impact if your application re-uses the same communication buffers repeatedly, as most popular MPI performance benchmarks do.
  2. Link the openmpi-malloc library into MPI applications (e.g., add "-lopenmpi-malloc" to the link line) and leave mpi_leave_pinned at its default value (likely to be 1 when using OpenFabrics-based networks). Linking in this library should have no noticeable performance impact.

It is only necessary to do one of the above methods, not both.
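Concretely, the two workarounds look something like this (the MCA parameter and library names are from the ticket text above; the application and file names are placeholders):

```sh
# Workaround 1: disable leave_pinned at run time
mpirun --mca mpi_leave_pinned 0 -np 4 ./my_app

# Workaround 2: link the openmpi-malloc library at build time
mpicc my_app.c -o my_app -lopenmpi-malloc
```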


The issue is fairly complex; the short version is that if the internal allocator ever allocates a 2nd heap, the mallopt hint will only apply to the 1st heap. Hence, memory from the 2nd (or Nth) heap may actually get returned to the OS, which can then effectively corrupt OMPI's internal registration cache.

Brian, Brad, and I have talked about this at length -- it seems we need to start using ptmalloc by default again. The problem is how to enable ptmalloc by default without affecting users who don't use OpenFabrics networks. Brian came up with a good compromise: using a weak symbol in glibc (I don't know the name offhand -- I need to look it up), you can have some code run before the very first allocation in a process completes. Using this hook, we can use the built-in glibc memory allocator hooks to effectively replace the entire built-in allocator with our internal ptmalloc.

Additionally, our internal ptmalloc can be name-shifted to ompi_<foo>(), so it can be safe to link into any MPI application -- it won't be activated unless we use the setup hooks during process startup.

In the process startup, we're effectively in signal context (can't call malloc, etc.), but we can probably do something like this:

  • look for an environment variable (getenv); if it's 1, enable the hook. If it's 0, disable the hook.
  • if the env variable is not found, see if /sys/class/infiniband exists (e.g., via stat()). If so, enable the hook.

Additionally, since the mallopt hint does not work reliably, we might as well rip out all the mallopt code.

I have created an hg repository to do this work:

http://www.open-mpi.org/hg/hgwebdir.cgi/jsquyres/mallopt/

Change History

comment:1 Changed 8 years ago by jsquyres

A possible solution is now committed to the mercurial repo noted above. Further testing is required (e.g., ensure --without-memory-manager still works, etc.), and some more tweaks are possible. But the basics are there and running through MTT right now.

comment:2 Changed 8 years ago by jsquyres

Ok, I've tweaked the hg a bit (as of http://www.open-mpi.org/hg/hgwebdir.cgi/jsquyres/mallopt/rev/97e63a378368 ) and I'm fairly convinced that it's all working properly.

If I hear nothing back from other testers by noonish US eastern tomorrow, I'll commit it to the trunk.

comment:3 Changed 8 years ago by jsquyres

  • Status changed from new to closed
  • Resolution set to fixed

(In [20921]) Per http://www.open-mpi.org/community/lists/announce/2009/03/0029.php and https://svn.open-mpi.org/trac/ompi/ticket/1853, mallopt() hints do not always work -- it is possible for memory to be returned to the OS and therefore OMPI's registration cache becomes invalid.

This commit removes all use of mallopt() and uses a different way to integrate ptmalloc2 than we have done in the past. In particular, we use almost exactly the same technique as MX:

  • Remove all uses of mallopt, to include the opal/memory mallopt component.
  • Name-shift all of OMPI's internal ptmalloc2 public symbols (e.g., malloc -> opal_memory_ptmalloc2_malloc).
  • At run-time, use the existing glibc allocator malloc hook function pointers to fully hijack the glibc allocator with our own name-shifted ptmalloc2.
  • Make the decision whether to hijack the glibc allocator at run time (vs. at link time, as previous ptmalloc2 integration attempts have done). Look at the OMPI_MCA_mpi_leave_pinned and OMPI_MCA_mpi_leave_pinned_pipeline environment variables and the existence of /sys/class/infiniband to determine if we should install the hooks or not.
  • As an added bonus, we can now tell if libopen-pal is linked statically or dynamically, and if we're linked statically, we assume that munmap intercept support doesn't work.

See the opal/mca/memory/ptmalloc2/README-open-mpi.txt file for all the gory details about the implementation.

Fixes #1853.
