One-sided Communication Operations
==================================

Global Arrays provide one-sided, noncollective communication operations
that allow to access data in global arrays without cooperation with the
process or processes that hold the referenced data. These processes do
not know what data items in their own memory are being accessed or
updated by remote processes. Moreover, since the GA interface uses
global array indices to reference nonlocal data, the calling process
does not even have to know process ids and location in memory where the
refernenced data resides.

The one-sided operations that Global Arrays provide can be summarized
into three categories:

============================= =======================================
Operation                     Process
============================= =======================================
Remote blockwise write/read   ``ga_put, ga_get``
Remote atomic update          ``ga_acc, ga_read_inc, ga_scatter_acc``
Remote elementwise write/read ``ga_scatter, ga_gather``
============================= =======================================

Put/Get 
-------

*Put* and *get* are two powerful operations for interprocess
communication, performing remote write and read. Because of their
one-sided nature, they don’t need cooperation from the process(es) that
owns the data. The semantics of these operations do not require the user
to specify which remote process or processes own the accessed portion of
a global array. The data is simply accessed as if it were in shared
memory.

Put copies data from the local array to the global array section, which
is:

- n-D Fortran subroutine: `nga_put <https://hpc.pnl.gov/globalarrays/api/f_op_api.html#PUT>`__\ (g_a, lo, hi, buf, ld) 

- 2-D Fortran subroutine: `ga_put <https://hpc.pnl.gov/globalarrays/api/f_op_api.html#PUT>`__\ (g_a, ilo, ihi, jlo, jhi, buf, ld) 

- C:           void `NGA_Put <https://hpc.pnl.gov/globalarrays/api/c_op_api.html#PUT>`__\ (int g_a, int lo[], int hi[], void \*buf, int ld[]) 

- C++:         void GA::GlobalArray::put(int lo[], int hi[], void \*buf, int ld[]) 

All the arguments are
provided in one call: ``lo`` and ``hi`` specify where the data should go
in the global array; ``ld`` specifies the stride information of the
local array ``buf``. The local array should have the same number of
dimensions as the global array; however, it is really required to
present the n-dimensional view of the local memory buffer, that by
itself might be one-dimensional.

The operation is transparent to the user, which means the user doesn’t
have to worry about where the region defined by ``lo`` and ``hi`` is
located. It can be in the memory of one or many remote processes, owned
by the local process, or even mixed (part of it belongs to remote
processes and part of it belongs to a local process).

*Get* is the reverse operation of *put*. It copies data from a global
array section to the local array. It is:

- n-D Fortran subroutine: `nga_get <https://hpc.pnl.gov/globalarrays/api/f_op_api.html#ga_get>`__\ (g_a, lo, hi, buf, ld) 

- 2-D Fortran subroutine: `ga_get <https://hpc.pnl.gov/globalarrays/api/f_op_api.html#ga_get>`__\ (g_a, ilo, ihi, jlo, jhi, buf, ld) 

- C: void `NGA_get <https://hpc.pnl.gov/globalarrays/api/c_op_api.html#ga_get>`__\ (int g_a, int lo[], int hi[], void \*buf, int ld[]) 

- C++: void GA::GlobalArray::get(int lo[], int hi[], void \*buf, int ld[]) 
  
Similar to *put*, ``lo`` and ``hi`` specify where the data should come from in the global array,
and ``ld`` specifies the stride information of the local array ``buf``.
The local array is assumed to have the same number of dimensions as the
global array. Users don't need to worry about where the region defined
by ``lo`` and ``hi`` is physically located.

For a ``ga_get`` operation transferring data from the (11:15,1:5)
section of a 2-dimensional 15x10 global array into a local buffer 5x10
array we have: (In Fortran notation)

*lo*\ ={11,1}, *hi*\ ={15,5}, *ld*\ ={10}

.. figure:: images/GA_get_example.png
   :width: 40%

Accumulate and Read-and-increment 
---------------------------------

It is often useful in a put operation to combine the data moved to the
target process with the data that resides at that process, rather then
replacing the data there. *Accumulate* and *read_inc* perform an
*atomic* remote update to a patch (a section of the global array) in the
global array and an element in the global array, respectively. They
don’t need the cooperation of the process(es) who owns the data. Since
the operations are atomic, the same portion of a global array can be
referenced by these operations issued by multiple processes and the GA
will assure the correct and consistent result of the updates.

*Accumulate* combines the data from the local array with data in the
global array section, which is

- n-D Fortran subroutine: `nga_acc <https://hpc.pnl.gov/globalarrays/api/f_op_api.html#ga_acc>`__\ (g_a, lo, hi, buf, ld, alpha) 

- 2-D Fortran subroutine: `ga_acc <https://hpc.pnl.gov/globalarrays/api/f_op_api.html#ga_acc>`__\ (g_a, ilo, ihi, jlo, jhi, buf, ld, alpha)

- C:           void `NGA_Acc <https://hpc.pnl.gov/globalarrays/api/c_op_api.html#ga_acc>`__\ (int g_a, int lo[], int hi[], void \*buf, int ld[], void \*alpha)

- C++:         void NGA::GlobalArray::acc(int lo[], int hi[], void \*buf, int ld[], void \*alpha) 

The local array is
assumed to have the same number of dimensions as the global array. Users
don’t need to worry about where the region defined by lo and hi is
physically located. The function performs

*global array section (lo[], hi[])* += *alpha \* buf*

Read_inc remotely updates a particular element in the global array,
which is

- n-D Fortran subroutine: `nga_read_inc <https://hpc.pnl.gov/globalarrays/api/f_op_api.html#ga_read_inc>`__\ (g_a, subscript, inc) 

- 2-D Fortran subroutine: `ga_read_inc <https://hpc.pnl.gov/globalarrays/api/f_op_api.html#ga_read_inc>`__\ (g_a, i, j, inc) 

- C:           long `NGA_Read_inc <https://hpc.pnl.gov/globalarrays/api/c_op_api.html#ga_read_inc>`__\ (int g_a, int subscript[], long inc) 

- C++:         long GA::GlobalArray::readInc(int subscript[], long inc)

This function applies to integer arrays only. It atomically reads and
increments an element in an integer array. It performs

*a(subsripts)* += *inc*

and returns the original value (before the update) of *a(subscript)*.

Scatter/Gather 
--------------

*Scatter* and *gather* transfer a specified set of elements to and from
global arrays. They are one-sided: that is they don’t need the
cooperation of the process(es) who own the referenced elements in the
global array.

Scatter puts array elements into a global array, which is

- n-D Fortran subroutine: `nga_scatter <https://hpc.pnl.gov/globalarrays/api/f_op_api.html#ga_scatter>`__\ (g_a, v, subsarray, n) 

- 2-D Fortran subroutine: `ga_scatter <https://hpc.pnl.gov/globalarrays/api/f_op_api.html#ga_scatter>`__\ (g_a, v, i, j, n) 

- C:           void `NGA_Scatter <https://hpc.pnl.gov/globalarrays/api/c_op_api.html#ga_scatter>`__\ (int g_a, void \*v, int \*subsarray[], int n) 

- C++:         void GA::GlobalArray::scatter(void \*v, int \*subsarray[], int n)

It performs (in C notation)

::

   for(k=0; k<=n; k++) {
       a[subsArray[k][0]][subsArray[k][1]][subsArray[k][2]] ... = v[k];
   }

Example: Scatter the 5 elements into a 10x10 global array

::

   Element 1: v[0]=5; subsArray[0][0]=2; subsArray[0][1]=3;
   Element 2: v[1]=3; subsArray[1][0]=3; subsArray[1][1]=4;
   Element 3: v[2]=8; subsArray[2][0]=8; subsArray[2][1]=5;
   Element 4: v[3]=7; subsArray[3][0]=3; subsArray[3][1]=7;
   Element 5: v[4]=2; subsArray[4][0]=6; subsArray[4][1]=3;

After the scatter operation, the five elements would be scattered into
the global array as shown in the following figure.

.. figure:: images/scatter-GA.png
   :width: 80%
   :align: center

*Gather* is the reverse operation of scatter. It gets the array elements
from a global array into a local array.

- n-D Fortran subroutine: `nga_gather <https://hpc.pnl.gov/globalarrays/api/f_op_api.html#ga_gather>`__\ (g_a, v, subsarray, n) 

- 2-D Fortran subroutine: `ga_gather <https://hpc.pnl.gov/globalarrays/api/f_op_api.html#ga_gather>`__\ ga_gather(g_a, v, i, j, n) 

- C:           void `NGA_Gather <https://hpc.pnl.gov/globalarrays/api/c_op_api.html#ga_gather>`__\ (int g_a, void \*v, int \*subsarray[], int n) 

- C++:         void GA::GlobalArray::gather(void \*v, int \*subsarray[], int n)

It performs (in C notation)

::

   for(k=0; k<=n; k++) {
       v[k] = a[subsArray[k][0]][subsArray[k][1]][subsArray[k][2]] ...;
   }

Periodic Interfaces 
-------------------

Periodic interfaces to the one-sided operations have been added to
Global Arrays in version 3.1 to support some computational fluid
dynamics problems on multidimensional grids. They provide an index
translation layer that allows you to use put, get, and accumulate
operations, possibly extending beyond the boundaries of a global array.
The references that are outside of the boundaries are wrapped up inside
the global array. To better illustrate these operations, look at the
following example:

*Example*: Assume a two dimensional global array g_a with dimensions 5 X 5.

.. figure:: images/periodic1.png
   :width: 50%

To access a patch [2:4,-1:3], one can assume that the array is wrapped
over in the second dimension, as shown in the following figure

.. figure:: images/periodic2.png
   :width: 70%

Therefore the patch [2:4, -1:3] is

::

   17 22 2 7 12
   18 23 3 8 13
   19 24 4 9 14

Periodic operations extend the boudary of each dimension in two
directions, toward the lower bound and toward the upper bound. For any
dimension with lo(i) to hi(i), where 1 < i < ndim, it extends the range
from

::

   [lo(i) : hi(i)]
   to
   [(lo(i)-1-(hi(i)-lo(i)+1)) : (lo(i)-1)], [lo(i) : hi(i)],
   and
   [(hi(i)+1) : (hi(i)+1+(hi(i)-lo(i)+1))],
   or
   [(lo(i)-1-(hi(i)-lo(i)+1)) : (hi(i)+1+(hi(i)-lo(i)+1))].

Even though the patch spans in a much large range, the length must
always be less, or equal to (hi(i)-lo(i)+1)).

*Example*: For a 2 x 2 array as shown in the following figure, where the
dimensions are [1:2, 1:2], periodic operations would look at the range
of each of the dimensions as [-1:4, -1:4].

.. figure:: images/periodic3.png
   :width: 80%

Current version of GA supports three periodic operations. They are

-  periodic get,

-  periodic put, and

-  periodic accumulate

*Periodic Get* copies data from a global array section to a local array,
which is almost the same as regular get, except the indices of the patch
can be outside the boundaries of each dimension.

- Fortran subroutine: `nga_periodic_get <https://hpc.pnl.gov/globalarrays/api/f_op_api.html#ga_periodic_get>`__\ (g_a, lo, hi, buf, ld) 

- C:       void `NGA_Periodic_get <https://hpc.pnl.gov/globalarrays/api/c_op_api.html#ga_periodic_get>`__\ (int g_a, int lo[], int hi[], void \*buf, int ld[]) 

- C++:     void GA::GlobalArray::periodicGet(int lo[], int hi[], void \*buf, int ld[])

Similar to regular *get*, ``lo`` and ``hi`` specify where the data
should come from in the global array, and ``ld`` specifies the stride
information of the local array ``buf``.

*Example*: Let us look at the first example in this section. It is 5 x 5
two dimensional global array. Assume that the local buffer is an 4x3
array.

Also assume that

::

   1o[0] = -1, hi[0] = 2,
   lo[1] =  4, hi[1] = 6, and
   ld[0] =  4.

.. figure:: images/periodic1.png
   :width: 50%

The local buffer ``buf`` is

::

       19 24 4
       20 25 5
       16 21 1
       17 22 2

Periodic Put is the reverse operations of Periodic Get. It copies data
from the local array to the global array section, which is

- Fortran subroutine: `nga_periodic_put <https://hpc.pnl.gov/globalarrays/api/f_op_api.html#nga_periodic_put>`__\ (g_a, lo, hi, buf, ld) 

- C:       void `NGA_Periodic_put <https://hpc.pnl.gov/globalarrays/api/c_op_api.html#nga_periodic_put>`__\ (int g_a, int lo[], int hi[], void \*buf, int ld[]) 

- C++:     void GA::GlobalArray::periodicPut(int lo[], int hi[], void \*buf, int ld[])

Similar to regular *put*, ``lo`` and ``hi`` specify where the data
should go in the global array; ``ld`` specifies the stride information
of the local array ``buf``.

*Periodic Put/Get* (also include the *Accumulate*, which will be
discussed later in this section) divide the patch into several smaller
patches. For those smaller patches that are outside the global aray,
adjust the indices so that they rotate back to the original array. After
that call the regular *Put/Get/Accumulate*, for each patch, to complete
the operations.

*Example*: Look at the example for periodic get. Because it is a 5 x 5
global array, the valid indices for each dimension are

::

       dimension 0: [1 : 5]
       dimension 1: [1 : 5]

The specified lo and hi are apparently out of the range of each
dimension:

::

       dimemsion 0: [-1 : 2] --> [-1 : 0] -- wrap back --> [4 : 5] [ 1 : 2] ok
       dimension 1: [ 4 : 6] --> [ 4 : 5] ok [ 6 : 6] -- wrap back --> [1 : 1]

Hence, there will be four smaller patches after the adjustment. They are

::

       patch 0: [4 : 5, 4 : 5]
       patch 1: [4 : 5, 1 : 1]
       patch 2: [1 : 2, 4 : 5]
       patch 3: [1 : 2, 1 : 1]

as shown in the following figure

.. figure:: images/periodic4.png
   :width: 70%

Of course the destination addresses of each samller patch in the local
buffer also need to be calculated.

Similar to regular *Accumulate, Periodic Accumulate* combines the data
from the local array with data in the global array section, which is

- Fortran subroutine: `nga_periodic_acc <https://hpc.pnl.gov/globalarrays/api/f_op_api.html#ga_periodic_acc>`__\ (g_a, lo, hi, buf, ld, alpha) 

- C:       void `NGA_Periodic_acc <https://hpc.pnl.gov/globalarrays/api/c_op_api.html#ga_periodic_acc>`__\ (int g_a, int lo[], int hi[], void \*buf, int ld[], void \*alpha) 

- C++:     void GA::GlobalArray::periodicAcc(int lo[], int hi[], void \*buf, int ld[], void \*alpha)

The local array is assumed to have the same number of dimensions as the
global array. Users don’t need to worry about where the region defined
by ``lo`` and ``hi`` is physically located. The function performs

*global array section (lo[], hi[]) += alpha \* buf*

*Example*: Let us look at the same example as above. There is a 5 x 5
two dimensional global array. Assume that the local buffer is an 4x3
array.

Also assume that

::

       1o[0] = -1, hi[0] = 2,
       lo[1] =  4, hi[1] = 6, and
       ld[0] =  4.

.. figure:: images/periodic1.png
   :width: 50%

The local buffer buf is

::

       1 5 9
       4 6 5
       3 2 1
       7 8 2

and ``alpha = 2``.

After the Periodic Accumulate operation, the global array will be

.. figure:: images/periodic5.png
   :width: 50%

Non-blocking operations
-----------------------

The non-blocking operations (get/put/accumulate) are derived from the
blocking interface by adding a handle argument that identifies an
instance of the non-blocking request. Nonblocking operations initiate a
communication call and then return control to the application. A return
from a nonblocking operation call indicates a mere initiation of the
data transfer process and the operation can be completed locally by
making a call to the wait (e.g. nga_nbwait) routine.

The wait function completes a non-blocking one-sided operation locally.
Waiting on a nonblocking put or an accumulate operation assures that
data was injected into the network and the user buffer can be now be
reused. Completing a get operation assures data has arrived into the
user memory and is ready for use. Wait operation ensures only local
completion. Unlike their blocking counterparts, the nonblocking
operations are not ordered with respect to the destination. Performance
being one reason, the other reason is that by ensuring ordering we incur
additional and possibly unnecessary overhead on applications that do not
require their operations to be ordered. For cases where ordering is
necessary, it can be done by calling a fence operation. The fence
operation is provided to the user to confirm remote completion if
needed.

*Example*: Let us take a simple case for illustration. Say, there    
are two global arrays i.e. one array stores pressure and the other   
stores temperature. If there are two computation phases (first phase 
computes pressure and second phase computes temperature), then we    
can overlap communication with computation, thus hiding latency.     

.. code-block:: cpp

 . . . . . . . . .                                                    
                                                                      
 nga_get (get_pressure_array)                                         
                                                                      
 nga_nbget(initiates data transfer to get temperature_array, and returns immediately)                                        

  /* hiding latency - communication is overlapped with computation */                                                                       
 compute_pressure() 
                                                                      
 nga_nbwait(temperature_array - completes data transfer)              
                                                                      
 compute_temperature()                                                
                                                                      
 . . . . . . . .                                                      


The non-blocking APIs are derived from the blocking interface by adding
a handle argument that identifies an instance of the non-blocking
request.

- n-D Fortran subroutine: `nga_nbput <https://hpc.pnl.gov/globalarrays/api/f_op_api.html#nga_nbput>`__\ (g_a, lo, hi, buf, ld, nbhandle) 

- n-D Fortran subroutine: `nga_nbget <https://hpc.pnl.gov/globalarrays/api/f_op_api.html#nga_nbget>`__\ (g_a, lo, hi, buf, ld, nbhandle) 

- n-D Fortran subroutine: `nga_nbacc <https://hpc.pnl.gov/globalarrays/api/f_op_api.html#nga_nbacc>`__\ (g_a, lo, hi, buf, ld, alpha, nbhandle)

- n-D Fortran subroutine: `nga_nbwait <https://hpc.pnl.gov/globalarrays/api/f_op_api.html#nga_nbwait>`__\ (nbhandle)

- 2-D Fortran subroutine: `ga_nbput <https://hpc.pnl.gov/globalarrays/api/f_op_api.html#ga_nbput>`__\ (g_a, ilo, ihi, jlo, jhi, buf, ld, nbhandle)

- 2-D Fortran subroutine: `ga_nbget <https://hpc.pnl.gov/globalarrays/api/f_op_api.html#ga_nbget>`__\ (g_a, ilo, ihi, jlo, jhi, buf, ld, nbhandle)

- 2-D Fortran subroutine: `ga_nbacc <https://hpc.pnl.gov/globalarrays/api/f_op_api.html#ga_nbacc>`__\ (g_a, ilo, ihi, jlo, jhi, buf, ld, alpha, nbhandle) 

- 2-D Fortran subroutine: `ga_nbwait <https://hpc.pnl.gov/globalarrays/api/f_op_api.html#ga_nbwait>`__\ (nbhandle)

- C:   void `NGA_NbPut <https://hpc.pnl.gov/globalarrays/api/c_op_api.html#ga_nbput>`__\ (int g_a, int lo[], int hi[], void \*buf, int ld[], ga_nbhdl_t\* nbhandle) 

- C:   void `NGA_NbGet <https://hpc.pnl.gov/globalarrays/api/c_op_api.html#ga_nbget>`__\ (int g_a, int lo[], int hi[], void \*buf, int ld[], ga_nbhdl_t\* nbhandle) 

- C:   void `NGA_NbAcc <https://hpc.pnl.gov/globalarrays/api/c_op_api.html#ga_nbacc>`__\ (int g_a, int lo[], int hi[], void \*buf, int ld[], void \*alpha, ga_nbhdl_t\* nbhandle) 

- C    int `NGA_NbWait <https://hpc.pnl.gov/globalarrays/api/c_op_api.html#ga_nbwait>`__\ (ga_nbhdl_t\* nbhandle)

- C++: void GA::GlobalArray::nbPut(int lo[], int hi[], void \*buf, int ld[], ga_nbhdl_t\* nbhandle) 

- C++: void GA::GlobalArray::nbGet(int lo[], int hi[], void \*buf, int ld[], ga_nbhdl_t\* nbhandle) 

- C++: void GA::GlobalArray::nbAcc(int lo[], int hi[], void \*buf, int ld[], void \*alpha, ga_nbhdl_t\* nbhandle) 

- C++ int GA::GlobalArray::NbWait(ga_nbhdl_t\* nbhandle)