One-sided Communication Operations ================================== Global Arrays provide one-sided, noncollective communication operations that allow to access data in global arrays without cooperation with the process or processes that hold the referenced data. These processes do not know what data items in their own memory are being accessed or updated by remote processes. Moreover, since the GA interface uses global array indices to reference nonlocal data, the calling process does not even have to know process ids and location in memory where the refernenced data resides. The one-sided operations that Global Arrays provide can be summarized into three categories: ============================= ======================================= Operation Process ============================= ======================================= Remote blockwise write/read ``ga_put, ga_get`` Remote atomic update ``ga_acc, ga_read_inc, ga_scatter_acc`` Remote elementwise write/read ``ga_scatter, ga_gather`` ============================= ======================================= Put/Get ------- *Put* and *get* are two powerful operations for interprocess communication, performing remote write and read. Because of their one-sided nature, they don’t need cooperation from the process(es) that owns the data. The semantics of these operations do not require the user to specify which remote process or processes own the accessed portion of a global array. The data is simply accessed as if it were in shared memory. Put copies data from the local array to the global array section, which is: - n-D Fortran subroutine: `nga_put `__\ (g_a, lo, hi, buf, ld) - 2-D Fortran subroutine: `ga_put `__\ (g_a, ilo, ihi, jlo, jhi, buf, ld) - C: void `NGA_Put `__\ (int g_a, int lo[], int hi[], void \*buf, int ld[]) - C++: void GA::GlobalArray::put(int lo[], int hi[], void \*buf, int ld[]) All the arguments are provided in one call: ``lo`` and ``hi`` specify where the data should go in the global array; ``ld`` specifies the stride information of the local array ``buf``. The local array should have the same number of dimensions as the global array; however, it is really required to present the n-dimensional view of the local memory buffer, that by itself might be one-dimensional. The operation is transparent to the user, which means the user doesn’t have to worry about where the region defined by ``lo`` and ``hi`` is located. It can be in the memory of one or many remote processes, owned by the local process, or even mixed (part of it belongs to remote processes and part of it belongs to a local process). *Get* is the reverse operation of *put*. It copies data from a global array section to the local array. It is: - n-D Fortran subroutine: `nga_get `__\ (g_a, lo, hi, buf, ld) - 2-D Fortran subroutine: `ga_get `__\ (g_a, ilo, ihi, jlo, jhi, buf, ld) - C: void `NGA_get `__\ (int g_a, int lo[], int hi[], void \*buf, int ld[]) - C++: void GA::GlobalArray::get(int lo[], int hi[], void \*buf, int ld[]) Similar to *put*, ``lo`` and ``hi`` specify where the data should come from in the global array, and ``ld`` specifies the stride information of the local array ``buf``. The local array is assumed to have the same number of dimensions as the global array. Users don't need to worry about where the region defined by ``lo`` and ``hi`` is physically located. For a ``ga_get`` operation transferring data from the (11:15,1:5) section of a 2-dimensional 15x10 global array into a local buffer 5x10 array we have: (In Fortran notation) *lo*\ ={11,1}, *hi*\ ={15,5}, *ld*\ ={10} .. figure:: images/GA_get_example.png :width: 40% Accumulate and Read-and-increment --------------------------------- It is often useful in a put operation to combine the data moved to the target process with the data that resides at that process, rather then replacing the data there. *Accumulate* and *read_inc* perform an *atomic* remote update to a patch (a section of the global array) in the global array and an element in the global array, respectively. They don’t need the cooperation of the process(es) who owns the data. Since the operations are atomic, the same portion of a global array can be referenced by these operations issued by multiple processes and the GA will assure the correct and consistent result of the updates. *Accumulate* combines the data from the local array with data in the global array section, which is - n-D Fortran subroutine: `nga_acc `__\ (g_a, lo, hi, buf, ld, alpha) - 2-D Fortran subroutine: `ga_acc `__\ (g_a, ilo, ihi, jlo, jhi, buf, ld, alpha) - C: void `NGA_Acc `__\ (int g_a, int lo[], int hi[], void \*buf, int ld[], void \*alpha) - C++: void NGA::GlobalArray::acc(int lo[], int hi[], void \*buf, int ld[], void \*alpha) The local array is assumed to have the same number of dimensions as the global array. Users don’t need to worry about where the region defined by lo and hi is physically located. The function performs *global array section (lo[], hi[])* += *alpha \* buf* Read_inc remotely updates a particular element in the global array, which is - n-D Fortran subroutine: `nga_read_inc `__\ (g_a, subscript, inc) - 2-D Fortran subroutine: `ga_read_inc `__\ (g_a, i, j, inc) - C: long `NGA_Read_inc `__\ (int g_a, int subscript[], long inc) - C++: long GA::GlobalArray::readInc(int subscript[], long inc) This function applies to integer arrays only. It atomically reads and increments an element in an integer array. It performs *a(subsripts)* += *inc* and returns the original value (before the update) of *a(subscript)*. Scatter/Gather -------------- *Scatter* and *gather* transfer a specified set of elements to and from global arrays. They are one-sided: that is they don’t need the cooperation of the process(es) who own the referenced elements in the global array. Scatter puts array elements into a global array, which is - n-D Fortran subroutine: `nga_scatter `__\ (g_a, v, subsarray, n) - 2-D Fortran subroutine: `ga_scatter `__\ (g_a, v, i, j, n) - C: void `NGA_Scatter `__\ (int g_a, void \*v, int \*subsarray[], int n) - C++: void GA::GlobalArray::scatter(void \*v, int \*subsarray[], int n) It performs (in C notation) :: for(k=0; k<=n; k++) { a[subsArray[k][0]][subsArray[k][1]][subsArray[k][2]] ... = v[k]; } Example: Scatter the 5 elements into a 10x10 global array :: Element 1: v[0]=5; subsArray[0][0]=2; subsArray[0][1]=3; Element 2: v[1]=3; subsArray[1][0]=3; subsArray[1][1]=4; Element 3: v[2]=8; subsArray[2][0]=8; subsArray[2][1]=5; Element 4: v[3]=7; subsArray[3][0]=3; subsArray[3][1]=7; Element 5: v[4]=2; subsArray[4][0]=6; subsArray[4][1]=3; After the scatter operation, the five elements would be scattered into the global array as shown in the following figure. .. figure:: images/scatter-GA.png :width: 80% :align: center *Gather* is the reverse operation of scatter. It gets the array elements from a global array into a local array. - n-D Fortran subroutine: `nga_gather `__\ (g_a, v, subsarray, n) - 2-D Fortran subroutine: `ga_gather `__\ ga_gather(g_a, v, i, j, n) - C: void `NGA_Gather `__\ (int g_a, void \*v, int \*subsarray[], int n) - C++: void GA::GlobalArray::gather(void \*v, int \*subsarray[], int n) It performs (in C notation) :: for(k=0; k<=n; k++) { v[k] = a[subsArray[k][0]][subsArray[k][1]][subsArray[k][2]] ...; } Periodic Interfaces ------------------- Periodic interfaces to the one-sided operations have been added to Global Arrays in version 3.1 to support some computational fluid dynamics problems on multidimensional grids. They provide an index translation layer that allows you to use put, get, and accumulate operations, possibly extending beyond the boundaries of a global array. The references that are outside of the boundaries are wrapped up inside the global array. To better illustrate these operations, look at the following example: *Example*: Assume a two dimensional global array g_a with dimensions 5 X 5. .. figure:: images/periodic1.png :width: 50% To access a patch [2:4,-1:3], one can assume that the array is wrapped over in the second dimension, as shown in the following figure .. figure:: images/periodic2.png :width: 70% Therefore the patch [2:4, -1:3] is :: 17 22 2 7 12 18 23 3 8 13 19 24 4 9 14 Periodic operations extend the boudary of each dimension in two directions, toward the lower bound and toward the upper bound. For any dimension with lo(i) to hi(i), where 1 < i < ndim, it extends the range from :: [lo(i) : hi(i)] to [(lo(i)-1-(hi(i)-lo(i)+1)) : (lo(i)-1)], [lo(i) : hi(i)], and [(hi(i)+1) : (hi(i)+1+(hi(i)-lo(i)+1))], or [(lo(i)-1-(hi(i)-lo(i)+1)) : (hi(i)+1+(hi(i)-lo(i)+1))]. Even though the patch spans in a much large range, the length must always be less, or equal to (hi(i)-lo(i)+1)). *Example*: For a 2 x 2 array as shown in the following figure, where the dimensions are [1:2, 1:2], periodic operations would look at the range of each of the dimensions as [-1:4, -1:4]. .. figure:: images/periodic3.png :width: 80% Current version of GA supports three periodic operations. They are - periodic get, - periodic put, and - periodic accumulate *Periodic Get* copies data from a global array section to a local array, which is almost the same as regular get, except the indices of the patch can be outside the boundaries of each dimension. - Fortran subroutine: `nga_periodic_get `__\ (g_a, lo, hi, buf, ld) - C: void `NGA_Periodic_get `__\ (int g_a, int lo[], int hi[], void \*buf, int ld[]) - C++: void GA::GlobalArray::periodicGet(int lo[], int hi[], void \*buf, int ld[]) Similar to regular *get*, ``lo`` and ``hi`` specify where the data should come from in the global array, and ``ld`` specifies the stride information of the local array ``buf``. *Example*: Let us look at the first example in this section. It is 5 x 5 two dimensional global array. Assume that the local buffer is an 4x3 array. Also assume that :: 1o[0] = -1, hi[0] = 2, lo[1] = 4, hi[1] = 6, and ld[0] = 4. .. figure:: images/periodic1.png :width: 50% The local buffer ``buf`` is :: 19 24 4 20 25 5 16 21 1 17 22 2 Periodic Put is the reverse operations of Periodic Get. It copies data from the local array to the global array section, which is - Fortran subroutine: `nga_periodic_put `__\ (g_a, lo, hi, buf, ld) - C: void `NGA_Periodic_put `__\ (int g_a, int lo[], int hi[], void \*buf, int ld[]) - C++: void GA::GlobalArray::periodicPut(int lo[], int hi[], void \*buf, int ld[]) Similar to regular *put*, ``lo`` and ``hi`` specify where the data should go in the global array; ``ld`` specifies the stride information of the local array ``buf``. *Periodic Put/Get* (also include the *Accumulate*, which will be discussed later in this section) divide the patch into several smaller patches. For those smaller patches that are outside the global aray, adjust the indices so that they rotate back to the original array. After that call the regular *Put/Get/Accumulate*, for each patch, to complete the operations. *Example*: Look at the example for periodic get. Because it is a 5 x 5 global array, the valid indices for each dimension are :: dimension 0: [1 : 5] dimension 1: [1 : 5] The specified lo and hi are apparently out of the range of each dimension: :: dimemsion 0: [-1 : 2] --> [-1 : 0] -- wrap back --> [4 : 5] [ 1 : 2] ok dimension 1: [ 4 : 6] --> [ 4 : 5] ok [ 6 : 6] -- wrap back --> [1 : 1] Hence, there will be four smaller patches after the adjustment. They are :: patch 0: [4 : 5, 4 : 5] patch 1: [4 : 5, 1 : 1] patch 2: [1 : 2, 4 : 5] patch 3: [1 : 2, 1 : 1] as shown in the following figure .. figure:: images/periodic4.png :width: 70% Of course the destination addresses of each samller patch in the local buffer also need to be calculated. Similar to regular *Accumulate, Periodic Accumulate* combines the data from the local array with data in the global array section, which is - Fortran subroutine: `nga_periodic_acc `__\ (g_a, lo, hi, buf, ld, alpha) - C: void `NGA_Periodic_acc `__\ (int g_a, int lo[], int hi[], void \*buf, int ld[], void \*alpha) - C++: void GA::GlobalArray::periodicAcc(int lo[], int hi[], void \*buf, int ld[], void \*alpha) The local array is assumed to have the same number of dimensions as the global array. Users don’t need to worry about where the region defined by ``lo`` and ``hi`` is physically located. The function performs *global array section (lo[], hi[]) += alpha \* buf* *Example*: Let us look at the same example as above. There is a 5 x 5 two dimensional global array. Assume that the local buffer is an 4x3 array. Also assume that :: 1o[0] = -1, hi[0] = 2, lo[1] = 4, hi[1] = 6, and ld[0] = 4. .. figure:: images/periodic1.png :width: 50% The local buffer buf is :: 1 5 9 4 6 5 3 2 1 7 8 2 and ``alpha = 2``. After the Periodic Accumulate operation, the global array will be .. figure:: images/periodic5.png :width: 50% Non-blocking operations ----------------------- The non-blocking operations (get/put/accumulate) are derived from the blocking interface by adding a handle argument that identifies an instance of the non-blocking request. Nonblocking operations initiate a communication call and then return control to the application. A return from a nonblocking operation call indicates a mere initiation of the data transfer process and the operation can be completed locally by making a call to the wait (e.g. nga_nbwait) routine. The wait function completes a non-blocking one-sided operation locally. Waiting on a nonblocking put or an accumulate operation assures that data was injected into the network and the user buffer can be now be reused. Completing a get operation assures data has arrived into the user memory and is ready for use. Wait operation ensures only local completion. Unlike their blocking counterparts, the nonblocking operations are not ordered with respect to the destination. Performance being one reason, the other reason is that by ensuring ordering we incur additional and possibly unnecessary overhead on applications that do not require their operations to be ordered. For cases where ordering is necessary, it can be done by calling a fence operation. The fence operation is provided to the user to confirm remote completion if needed. *Example*: Let us take a simple case for illustration. Say, there are two global arrays i.e. one array stores pressure and the other stores temperature. If there are two computation phases (first phase computes pressure and second phase computes temperature), then we can overlap communication with computation, thus hiding latency. .. code-block:: cpp . . . . . . . . . nga_get (get_pressure_array) nga_nbget(initiates data transfer to get temperature_array, and returns immediately) /* hiding latency - communication is overlapped with computation */ compute_pressure() nga_nbwait(temperature_array - completes data transfer) compute_temperature() . . . . . . . . The non-blocking APIs are derived from the blocking interface by adding a handle argument that identifies an instance of the non-blocking request. - n-D Fortran subroutine: `nga_nbput `__\ (g_a, lo, hi, buf, ld, nbhandle) - n-D Fortran subroutine: `nga_nbget `__\ (g_a, lo, hi, buf, ld, nbhandle) - n-D Fortran subroutine: `nga_nbacc `__\ (g_a, lo, hi, buf, ld, alpha, nbhandle) - n-D Fortran subroutine: `nga_nbwait `__\ (nbhandle) - 2-D Fortran subroutine: `ga_nbput `__\ (g_a, ilo, ihi, jlo, jhi, buf, ld, nbhandle) - 2-D Fortran subroutine: `ga_nbget `__\ (g_a, ilo, ihi, jlo, jhi, buf, ld, nbhandle) - 2-D Fortran subroutine: `ga_nbacc `__\ (g_a, ilo, ihi, jlo, jhi, buf, ld, alpha, nbhandle) - 2-D Fortran subroutine: `ga_nbwait `__\ (nbhandle) - C: void `NGA_NbPut `__\ (int g_a, int lo[], int hi[], void \*buf, int ld[], ga_nbhdl_t\* nbhandle) - C: void `NGA_NbGet `__\ (int g_a, int lo[], int hi[], void \*buf, int ld[], ga_nbhdl_t\* nbhandle) - C: void `NGA_NbAcc `__\ (int g_a, int lo[], int hi[], void \*buf, int ld[], void \*alpha, ga_nbhdl_t\* nbhandle) - C int `NGA_NbWait `__\ (ga_nbhdl_t\* nbhandle) - C++: void GA::GlobalArray::nbPut(int lo[], int hi[], void \*buf, int ld[], ga_nbhdl_t\* nbhandle) - C++: void GA::GlobalArray::nbGet(int lo[], int hi[], void \*buf, int ld[], ga_nbhdl_t\* nbhandle) - C++: void GA::GlobalArray::nbAcc(int lo[], int hi[], void \*buf, int ld[], void \*alpha, ga_nbhdl_t\* nbhandle) - C++ int GA::GlobalArray::NbWait(ga_nbhdl_t\* nbhandle)