Changeset 22592


Timestamp: 02/10/10 16:53:26 (7 years ago)
Author: jsquyres
Message:

After a lot of discussion and testing, this commit fixes some
long-standing bugs (see trac ticket list below). They're currently
somewhat obscure bugs, but are becoming much more relevant in a world
where OpenFabrics devices fail and you replace them with a newer model
(i.e., the cluster is homogeneous... except for where you had to
replace one or two OpenFabrics devices, and the same model is no
longer available).

This commit includes a lengthy comment (that we spent a lot of
time writing!) about what exactly it does and does not do. The
previous code was rather short and incredibly subtle. The new
code is slightly longer, but is both much more explicit and much more
painstakingly documented.

This commit fixes multiple trac tickets. The real one that we fix is
#1707; the others are fixed as a side-effect. In short: fixing #1707
prevents Bad Things from happening later in the startup sequence.

Fixes #1707, #2164, #1574.

cmr:v1.4.2:reviewer=pasha
cmr:v1.5:reviewer=pasha
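
The decision rules that the new comment describes can be sketched as a small standalone C model. This is only an illustration under stated assumptions, not the openib BTL code: the names rq_state_t, rq_state_init, rq_check_device, and rq_check_all are hypothetical, receive_queues values are plain strings such as "A" and "B", and a NULL INI value stands for "no receive_queues in INI file".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the per-process receive_queues check.
   None of these names come from the Open MPI source tree. */
typedef enum {
    RQ_SOURCE_DEFAULT,     /* built-in default MCA param value */
    RQ_SOURCE_MCA,         /* user set btl_openib_receive_queues */
    RQ_SOURCE_DEVICE_INI   /* first device's INI file value */
} rq_source_t;

typedef struct {
    const char *receive_queues;  /* value currently in effect */
    rq_source_t source;          /* where that value came from */
    int devices_count;           /* devices accepted so far */
} rq_state_t;

/* mca_param may be NULL (not specified); default_value must not be. */
static void rq_state_init(rq_state_t *st, const char *mca_param,
                          const char *default_value)
{
    assert(NULL != st && NULL != default_value);
    st->receive_queues = (NULL != mca_param) ? mca_param : default_value;
    st->source = (NULL != mca_param) ? RQ_SOURCE_MCA : RQ_SOURCE_DEFAULT;
    st->devices_count = 0;
}

/* Returns 0 if the device is accepted, -1 if it conflicts. */
static int rq_check_device(rq_state_t *st, const char *ini_value)
{
    /* A user-specified MCA param overrides every INI value. */
    if (RQ_SOURCE_MCA == st->source) {
        goto good;
    }

    if (0 == st->devices_count) {
        /* The first device may replace the default with its INI value. */
        if (NULL != ini_value &&
            0 != strcmp(ini_value, st->receive_queues)) {
            st->receive_queues = ini_value;
            st->source = RQ_SOURCE_DEVICE_INI;
        }
    } else if (NULL != ini_value) {
        /* Later devices with an INI value must agree with it. */
        if (0 != strcmp(ini_value, st->receive_queues)) {
            return -1;
        }
    } else if (RQ_SOURCE_DEVICE_INI == st->source) {
        /* No INI value here, but the first device set one: conflict. */
        return -1;
    }

good:
    st->devices_count++;
    return 0;
}

/* Runs the check over n devices in order.  Returns -1 if every device
   is accepted (and stores the value in effect in *chosen), otherwise
   the index of the first rejected device (*chosen is left untouched). */
static int rq_check_all(const char *mca_param, const char *default_value,
                        const char **ini_values, int n,
                        const char **chosen)
{
    rq_state_t st;
    int i;

    rq_state_init(&st, mca_param, default_value);
    for (i = 0; i < n; ++i) {
        if (0 != rq_check_device(&st, ini_values[i])) {
            return i;
        }
    }
    if (NULL != chosen) {
        *chosen = st.receive_queues;
    }
    return -1;
}
```

Fed the sample cases from the comment, this model accepts cases 1 through 6 and rejects device 1 in cases 7 through 10. That makes the order dependence easy to see: the outcome is fixed by whatever device 0 did.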

Location: trunk/ompi/mca/btl/openib
Files: 2 edited

  • trunk/ompi/mca/btl/openib/btl_openib_component.c

    r22320 r22592  
    10811081    } 
    10821082 
    1083     mca_btl_openib_component.devices_count++; 
    10841083    return OMPI_SUCCESS; 
    10851084} 
     
    18041803        } 
    18051804 
    1806         /* If the user specified btl_openib_receive_queues MCA param, it 
    1807            overrides all device INI params */ 
    1808         if (BTL_OPENIB_RQ_SOURCE_MCA != 
    1809             mca_btl_openib_component.receive_queues_source && 
    1810             NULL != values.receive_queues) { 
    1811             /* If a prior device's INI values set a different value for 
    1812                receive_queues, this is unsupported (see 
    1813                https://svn.open-mpi.org/trac/ompi/ticket/1285) */ 
    1814             if (BTL_OPENIB_RQ_SOURCE_DEVICE_INI == 
    1815                 mca_btl_openib_component.receive_queues_source) { 
    1816                 if (0 != strcmp(values.receive_queues, 
     1805        /* Check to ensure that all devices used in this process have 
     1806           compatible receive_queues values (we check elsewhere to see 
     1807           if all devices used in other processes in this job have 
     1808           compatible receive_queues values). 
     1809 
     1810           Not only is the check complex, but the reasons behind what 
     1811           it does (and does not do) are complex.  Before explaining 
     1812           the code below, here's some notes: 
     1813 
     1814           1. The openib BTL component only supports 1 value of the 
     1815              receive_queues between all of its modules. 
     1816 
     1817              --> This could be changed to allow every module to have 
     1818                  its own receive_queues.  But that would be a big 
     1819                  deal; no one has time to code this up right now. 
     1820 
     1821           2. The receive_queues value can be specified either as an 
     1822              MCA parameter or in the INI file.  Specifying the value 
     1823              as an MCA parameter overrides all INI file values 
     1824              (meaning: that MCA param value will be used for all 
     1825              openib BTL modules in the process). 
     1826 
     1827           Effectively, the first device through init_one_device() 
     1828           gets to decide what the receive_queues will be for all the 
     1829           modules in this process.  This is an unfortunate artifact 
     1830           of the openib BTL startup sequence (see below for more 
     1831           details).  The first device will choose the receive_queues 
     1832           value from: (in priority order):  
     1833 
     1834           1. If the btl_openib_receive_queues MCA param was 
     1835              specified, use that. 
     1836           2. If this device has a receive_queues value specified in 
     1837              the INI file, use that. 
     1838           3. Otherwise, use the default MCA param value for 
     1839              btl_openib_receive_queues. 
     1840 
     1841           If any successive device has a different value specified in 
     1842           the INI file, we show_help and return up the stack that 
     1843           this device failed. 
     1844 
     1845           In the case that the user does not specify a 
     1846           mca_btl_openib_receive_queues value, the short description 
     1847           of what is allowed is that either a) no devices specify a 
     1848           receive_queues value in the INI file (in which case we use 
     1849           the default MCA param value), b) all devices specify the 
     1850           same receive_queues value in the INI value, or c) some/all 
     1851           devices specify the same receive_queues value in the INI 
     1852           value as the default MCA param value. 
     1853 
     1854           Let's take some sample cases to explain this more clearly... 
     1855 
     1856           THESE ARE THE "GOOD" CASES 
     1857           -------------------------- 
     1858 
     1859           Case 1: no INI values 
     1860           - MCA parameter: not specified 
     1861           - default receive_queues: value A 
     1862           - device 0: no receive_queues in INI file 
     1863           - device 1: no receive_queues in INI file 
     1864           - device 2: no receive_queues in INI file 
     1865           --> use receive_queues value A with all devices 
     1866 
     1867           Case 2: all INI values the same (same as default) 
     1868           - MCA parameter: not specified 
     1869           - default receive_queues: value A 
     1870           - device 0: receive_queues value A in the INI file 
     1871           - device 1: receive_queues value A in the INI file 
     1872           - device 2: receive_queues value A in the INI file 
     1873           --> use receive_queues value A with all devices 
     1874 
     1875           Case 3: all INI values the same (but different than default) 
     1876           - MCA parameter: not specified 
     1877           - default receive_queues: value A 
     1878           - device 0: receive_queues value B in the INI file 
     1879           - device 1: receive_queues value B in the INI file 
     1880           - device 2: receive_queues value B in the INI file 
     1881           --> use receive_queues value B with all devices 
     1882 
     1883           Case 4: some INI unspecified, but rest same as default 
     1884           - MCA parameter: not specified 
     1885           - default receive_queues: value A 
     1886           - device 0: receive_queues value A in the INI file 
     1887           - device 1: no receive_queues in INI file 
     1888           - device 2: receive_queues value A in the INI file 
     1889           --> use receive_queues value A with all devices 
     1890 
     1891           Case 5: some INI unspecified (including device 0), but rest same as default 
     1892           - MCA parameter: not specified 
     1893           - default receive_queues: value A 
     1894           - device 0: no receive_queues in INI file 
     1895           - device 1: no receive_queues in INI file 
     1896           - device 2: receive_queues value A in the INI file 
     1897           --> use receive_queues value A with all devices 
     1898 
     1899           Case 6: different default/INI values, but MCA param is specified 
     1900           - MCA parameter: value D 
     1901           - default receive_queues: value A 
     1902           - device 0: no receive_queues in INI file 
     1903           - device 1: receive_queues value B in INI file 
     1904           - device 2: receive_queues value C in INI file 
     1905           --> use receive_queues value D with all devices 
     1906 
     1907           What this means is that this selection process is 
     1908           unfortunately tied to the order of devices.  :-( Device 0 
     1909           effectively sets what the receive_queues value will be for 
     1910           that process.  If any later device disagrees, that's 
     1911           problematic and we have to error/abort. 
     1912 
     1913           ALL REMAINING CASES WILL FAIL 
     1914           ----------------------------- 
     1915 
     1916           Case 7: one INI value (different than default) 
     1917           - MCA parameter: not specified 
     1918           - default receive_queues: value A 
     1919           - device 0: receive_queues value B in INI file 
     1920           - device 1: no receive_queues in INI file 
     1921           - device 2: no receive_queues in INI file 
     1922           --> Jeff thinks that it would be great to use 
     1923               receive_queues value B with all devices.  However, it 
     1924               shares one of the problems cited in case 8, below.  So 
     1925               we need to fail this scenario; print an error and 
     1926               abort. 
     1927            
     1928           Case 8: one INI value, different than default 
     1929           - MCA parameter: not specified 
     1930           - default receive_queues: value A 
     1931           - device 0: no receive_queues in INI file 
     1932           - device 1: receive_queues value B in INI file 
     1933           - device 2: no receive_queues in INI file 
     1934 
     1935           --> Jeff thinks that it would be great to use 
     1936               receive_queues value B with all devices.  However, it 
     1937               has (at least) 2 problems: 
     1938 
     1939               1. The check for local receive_queue compatibility is 
     1940                  done here in init_one_device().  By the time we call 
     1941                  init_one_device() for device 1, we have already 
     1942                  called init_one_device() for device 0, meaning that 
     1943                  device 0's QPs have already been created and setup 
     1944                  using the MCA parameter's default receive_queues 
     1945                  value.  So if device 1 *changes* the 
     1946                  component.receive_queues value, then device 0 and 
     1947                  device 1 now have different receive_queue sets (more 
     1948                  specifically: the QPs setup for device 0 are now 
     1949                  effectively lost).  This is Bad. 
     1950 
     1951                  It would be great if we didn't have this restriction 
     1952                  -- either by letting each module have its own 
     1953                  receive_queues value or by scanning all devices and 
     1954                  figuring out a final receive_queues value *before* 
     1955                  actually setting up any QPs.  But that's not the 
     1956                  current flow of the code (patches would be greatly 
     1957                  appreciated here, of course!).  Unfortunately, no 
     1958                  one has time to code this up right now, so we're 
     1959                  leaving this as explicitly documented for some 
     1960                  future implementer... 
     1961 
     1962               2. Consider a scenario with server 1 having HCA A/subnet 
     1963                  X, and server 2 having HCA B/subnet X and HCA 
     1964                  C/subnet Y.  And let's assume: 
     1965 
     1966                  Server 1: 
     1967                  HCA A: no receive_queues in INI file 
     1968 
     1969                  Server 2: 
     1970                  HCA B: no receive_queues in INI file 
     1971                  HCA C: receive_queues specified in INI file 
     1972                 
     1973                  A will therefore use the default receive_queues 
     1974                  value.  B and C will use C's INI receive_queues. 
     1975                  But note that modex [currently] only sends around 
     1976                  vendor/part IDs for OpenFabrics devices -- not the 
     1977                  actual receive_queues value (it was felt that 
     1978                  including the final receive_queues string value in 
     1979                  the modex would dramatically increase the size of 
     1980                  the modex).  So processes on server 1 will get the 
     1981                  vendor/part ID for HCA B, look it up in the INI 
     1982                  file, see that it has no receive_queues value 
     1983                  specified, and then assume that it uses the default 
     1984                  receive_queues value.  Hence, procs on server 1 will 
     1985                  try to connect HCA A-->HCA B with the wrong 
     1986                  receive_queues value.  Bad.  Further, the error 
     1987                  won't be discovered by checks like this because A 
     1988                  won't check C's receive_queues because C is on a 
     1989                  different subnet. 
     1990 
     1991                  This could be fixed, of course; either by a) send 
     1992                  the final receive_queues value in the modex (perhaps 
     1993                  compressing or encoding it so that it can be much 
     1994                  shorter than the string -- the current vendor/part 
     1995                  ID stuff takes 8 bytes for each device), or b) 
     1996                  replicating the determination process of each host 
     1997                  in each process (i.e., procs on server 1 would see 
     1998                  both B and C, and use them both to figure out what 
     1999                  the "final" receive_queues value is for B). 
     2000                  Unfortunately, no one has time to code this up right 
     2001                  now, so we're leaving this as explicitly documented 
     2002                  for some future implementer... 
     2003 
     2004               Because of both of these problems, this case is 
     2005               problematic and must fail with a show_help error. 
     2006 
     2007           Case 9: two devices with same INI value (different than default) 
     2008           - MCA parameter: not specified 
     2009           - default receive_queues: value A 
     2010           - device 0: no receive_queues in INI file 
     2011           - device 1: receive_queues value B in INI file 
     2012           - device 2: receive_queues value B in INI file 
     2013           --> per case 8, fail with a show_help message. 
     2014            
     2015           Case 10: two devices with different INI values 
     2016           - MCA parameter: not specified 
     2017           - default receive_queues: value A 
     2018           - device 0: no receive_queues in INI file 
     2019           - device 1: receive_queues value B in INI file 
     2020           - device 2: receive_queues value C in INI file 
     2021           --> per case 8, fail with a show_help message. 
     2022 
     2023        */ 
     2024 
     2025        /* If the MCA param was specified, skip all the checks */ 
     2026        if (BTL_OPENIB_RQ_SOURCE_MCA == 
     2027            mca_btl_openib_component.receive_queues_source) { 
     2028            goto good; 
     2029        } 
     2030 
     2031        /* If we're the first device and we have a receive_queues 
     2032           value from the INI file *that is different than the 
     2033           already-existing default value*, then set the component to 
     2034           use that. */ 
     2035        if (0 == mca_btl_openib_component.devices_count) { 
     2036            if (NULL != values.receive_queues && 
     2037                0 != strcmp(values.receive_queues, 
     2038                            mca_btl_openib_component.receive_queues)) { 
     2039                if (NULL != mca_btl_openib_component.receive_queues) { 
     2040                    free(mca_btl_openib_component.receive_queues); 
     2041                } 
     2042                mca_btl_openib_component.receive_queues = 
     2043                    strdup(values.receive_queues); 
     2044                mca_btl_openib_component.receive_queues_source = 
     2045                    BTL_OPENIB_RQ_SOURCE_DEVICE_INI; 
     2046            } 
     2047        } 
     2048 
     2049        /* If we're not the first device, then we have to conform to 
     2050           either the default value if the first device didn't set 
     2051           anything, or to whatever the first device decided. */ 
     2052        else { 
     2053            /* In all cases, if this device has a receive_queues value 
     2054               in the INI, then it must agree with 
     2055               component.receive_queues. */ 
     2056            if (NULL != values.receive_queues) { 
     2057                if (0 != strcmp(values.receive_queues,  
    18172058                                mca_btl_openib_component.receive_queues)) { 
    18182059                    orte_show_help("help-mpi-btl-openib.txt", 
    1819                                    "conflicting receive_queues", true, 
     2060                                   "locally conflicting receive_queues", true, 
     2061                                   opal_install_dirs.pkgdatadir, 
    18202062                                   orte_process_info.nodename, 
    1821                                    ibv_get_device_name(device->ib_dev), 
    1822                                    device->ib_dev_attr.vendor_id, 
    1823                                    device->ib_dev_attr.vendor_part_id, 
    1824                                    values.receive_queues, 
    18252063                                   ibv_get_device_name(receive_queues_device->ib_dev), 
    18262064                                   receive_queues_device->ib_dev_attr.vendor_id, 
    18272065                                   receive_queues_device->ib_dev_attr.vendor_part_id, 
    18282066                                   mca_btl_openib_component.receive_queues, 
    1829                                    opal_install_dirs.pkgdatadir); 
     2067                                   ibv_get_device_name(device->ib_dev), 
     2068                                   device->ib_dev_attr.vendor_id, 
     2069                                   device->ib_dev_attr.vendor_part_id, 
     2070                                   values.receive_queues); 
    18302071                    ret = OMPI_ERR_RESOURCE_BUSY; 
    18312072                    goto error; 
    18322073                } 
    1833             } else { 
    1834                 if (NULL != mca_btl_openib_component.receive_queues) { 
    1835                     free(mca_btl_openib_component.receive_queues); 
    1836                 } 
    1837                 receive_queues_device = device; 
    1838                 mca_btl_openib_component.receive_queues = 
    1839                     strdup(values.receive_queues); 
    1840                 mca_btl_openib_component.receive_queues_source = 
    1841                     BTL_OPENIB_RQ_SOURCE_DEVICE_INI; 
    1842             } 
    1843         } 
     2074            } 
     2075 
     2076            /* If this device doesn't have an INI receive_queues 
     2077               value, then if the component.receive_queues value came 
     2078               from the default, we're ok.  But if the 
     2079               component.receive_queues value came from the 1st 
     2080               device's INI file, we must error. */ 
     2081            else if (BTL_OPENIB_RQ_SOURCE_DEVICE_INI == 
     2082                mca_btl_openib_component.receive_queues_source) { 
     2083                orte_show_help("help-mpi-btl-openib.txt", 
     2084                               "locally conflicting receive_queues", true, 
     2085                               opal_install_dirs.pkgdatadir, 
     2086                               orte_process_info.nodename, 
     2087                               ibv_get_device_name(receive_queues_device->ib_dev), 
     2088                               receive_queues_device->ib_dev_attr.vendor_id, 
     2089                               receive_queues_device->ib_dev_attr.vendor_part_id, 
     2090                               mca_btl_openib_component.receive_queues, 
     2091                               ibv_get_device_name(device->ib_dev), 
     2092                               device->ib_dev_attr.vendor_id, 
     2093                               device->ib_dev_attr.vendor_part_id, 
     2094                               mca_btl_openib_component.default_recv_qps); 
     2095                ret = OMPI_ERR_RESOURCE_BUSY; 
     2096                goto error; 
     2097            } 
     2098        } 
     2099 
     2100        receive_queues_device = device; 
     2101 
     2102    good: 
     2103        mca_btl_openib_component.devices_count++; 
    18442104        return OMPI_SUCCESS; 
    18452105    } 
  • trunk/ompi/mca/btl/openib/help-mpi-btl-openib.txt

    r22402 r22592  
    561561only single active port was found. Disabling APM over ports 
    562562# 
    563 [conflicting receive_queues] 
    564 Open MPI detected two different sets of OpenFabrics receives queues on 
    565 the same host (in the openib BTL).  Open MPI currently only supports 
    566 one set of OF receive queues in an MPI job, even if you have different 
    567 types of OpenFabrics adapters on the same host. 
    568  
    569   Local host:      %s 
    570   Adapter 1: %s (vendor 0x%x, part ID %d) 
    571   Queues:    %s 
    572   Adapter 2: %s (vendor 0x%x, part ID %d) 
    573   Queues:    %s 
    574  
    575 Note that these receive queues values may have come from the Open MPI 
    576 adapter default settings file: 
     563[locally conflicting receive_queues] 
     564Open MPI detected two devices on a single server that have different 
     565"receive_queues" parameter values (in the openib BTL).  Open MPI 
     566currently only supports one OpenFabrics receive_queues value in an MPI 
     567job, even if you have different types of OpenFabrics adapters on the 
     568same host. 
     569 
     570Device 2 (in the details shown below) will be ignored for the duration 
     571of this MPI job. 
     572 
     573You can fix this issue by one or more of the following: 
     574 
     575  1. Set the MCA parameter btl_openib_receive_queues to a value that 
     576     is usable by all the OpenFabrics devices that you will use. 
     577  2. Use the btl_openib_if_include or btl_openib_if_exclude MCA 
     578     parameters to select exactly which OpenFabrics devices to use in 
     579     your MPI job. 
     580 
     581Finally, note that the "receive_queues" values may have been set by 
     582the Open MPI device default settings file.  You may want to look in 
     583this file and see if your devices are getting receive_queues values 
     584from this file: 
    577585 
    578586    %s/mca-btl-openib-device-params.ini 
     587 
     588Here is more detailed information about the receive_queues value 
     589conflict: 
     590 
     591  Local host:     %s 
     592  Device 1:       %s (vendor 0x%x, part ID %d) 
     593  Receive queues: %s 
     594  Device 2:       %s (vendor 0x%x, part ID %d) 
     595  Receive queues: %s 
    579596# 
    580597[eager RDMA and progress threads] 