===============================================================================
Peer memory support over Mellanox OFED                
README
      Dec 2013


===============================================================================
Table of Contents
===============================================================================
1. Overview   
2. Peer memory API



===============================================================================
1. Overview
===============================================================================
In General:
------------

MLNX_OFED 2.1 introduced an API between IB CORE and peer memory clients
(e.g. GPU cards) that gives the HCA read/write access to peer memory for
data buffers. As a result, RDMA-based (InfiniBand/RoCE) applications can use
the peer device's computing power and the RDMA interconnect at the same time,
without copying data between the P2P devices.

This README describes the steps required to develop a peer memory client over
Mellanox OFED. Specifically, it focuses on the flow and the API used to achieve
this task.


===============================================================================
2. Peer memory API
===============================================================================


Flow:
------
Each peer memory client should register itself with the IB CORE (ib_core)
module and provide a set of callbacks that manage its basic memory
functionality, such as get/put pages, get_page_size and dma map/unmap.
These callbacks are quite similar to the HOST memory ones; each one is
described in detail below.

Peer client structure:
-------------------------------------------------------------------------------
struct peer_memory_client {
       char   name[IB_PEER_MEMORY_NAME_MAX];
       char   version[IB_PEER_MEMORY_VER_MAX];
       int (*acquire) (unsigned long addr, size_t size, void *peer_mem_private_data,
                                  char *peer_mem_name, void **client_context);
       int (*get_pages) (unsigned long addr,
                       size_t size, int write, int force,
                       struct sg_table *sg_head,
                       void *client_context, void *core_context);
       int (*dma_map) (struct sg_table *sg_head, void *client_context,
                     struct device *dma_device, int dmasync, int *nmap);
       int (*dma_unmap) (struct sg_table *sg_head, void *client_context,
                        struct device  *dma_device);
       void (*put_pages) (struct sg_table *sg_head, void *client_context);
       unsigned long (*get_page_size) (void *client_context);
       void (*release) (void *client_context);

};
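
As a hedged illustration, the sketch below shows how a peer driver might
populate this structure. It is a userspace mock so that it compiles standalone:
struct sg_table and struct device are only forward-declared, the two size
macros use assumed values, and the "example_peer" name, the 4096-byte page
size and all example_* callbacks are hypothetical stubs, not part of MLNX_OFED.

```c
#include <stddef.h>

/* Stand-ins so the sketch compiles outside the kernel tree. */
struct sg_table;
struct device;
#define IB_PEER_MEMORY_NAME_MAX 64 /* assumed value */
#define IB_PEER_MEMORY_VER_MAX  16 /* assumed value */

struct peer_memory_client {
    char name[IB_PEER_MEMORY_NAME_MAX];
    char version[IB_PEER_MEMORY_VER_MAX];
    int (*acquire)(unsigned long addr, size_t size,
                   void *peer_mem_private_data,
                   char *peer_mem_name, void **client_context);
    int (*get_pages)(unsigned long addr, size_t size, int write, int force,
                     struct sg_table *sg_head,
                     void *client_context, void *core_context);
    int (*dma_map)(struct sg_table *sg_head, void *client_context,
                   struct device *dma_device, int dmasync, int *nmap);
    int (*dma_unmap)(struct sg_table *sg_head, void *client_context,
                     struct device *dma_device);
    void (*put_pages)(struct sg_table *sg_head, void *client_context);
    unsigned long (*get_page_size)(void *client_context);
    void (*release)(void *client_context);
};

/* Hypothetical stubs; a real driver does device-specific work here. */
static int example_acquire(unsigned long addr, size_t size, void *priv,
                           char *name, void **client_context)
{
    (void)addr; (void)size; (void)priv; (void)name; (void)client_context;
    return 0; /* 0: the address does not belong to this peer */
}

static int example_get_pages(unsigned long addr, size_t size, int write,
                             int force, struct sg_table *sg_head,
                             void *client_context, void *core_context)
{
    (void)addr; (void)size; (void)write; (void)force;
    (void)sg_head; (void)client_context; (void)core_context;
    return 0;
}

static int example_dma_map(struct sg_table *sg_head, void *client_context,
                           struct device *dma_device, int dmasync, int *nmap)
{
    (void)sg_head; (void)client_context; (void)dma_device; (void)dmasync;
    *nmap = 0; /* nothing is mapped by this stub */
    return 0;
}

static int example_dma_unmap(struct sg_table *sg_head, void *client_context,
                             struct device *dma_device)
{
    (void)sg_head; (void)client_context; (void)dma_device;
    return 0;
}

static void example_put_pages(struct sg_table *sg_head, void *client_context)
{
    (void)sg_head; (void)client_context;
}

static unsigned long example_get_page_size(void *client_context)
{
    (void)client_context;
    return 4096; /* assumed peer page size */
}

static void example_release(void *client_context)
{
    (void)client_context;
}

/* The object a peer driver would hand to IB CORE at registration. */
static struct peer_memory_client example_client = {
    .name          = "example_peer",
    .version       = "1.0",
    .acquire       = example_acquire,
    .get_pages     = example_get_pages,
    .dma_map       = example_dma_map,
    .dma_unmap     = example_dma_unmap,
    .put_pages     = example_put_pages,
    .get_page_size = example_get_page_size,
    .release       = example_release,
};
```

Designated initializers keep the sketch independent of member order, which
helps if the structure layout changes between releases.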


APIs:
-------------------------------------------------------------------------------
void *ib_register_peer_memory_client(struct peer_memory_client *peer_client,
                                    invalidate_peer_memory *invalidate_callback);

                                    
                                    
Description:
Each peer driver registers its callbacks upon loading. The callbacks provided
as part of peer_client may be used later by the IB core when processing
peer memory.


Parameters:
peer_client         [IN]  - Structure filled with peer information,
                            name/version/callbacks.
invalidate_callback [OUT] - IB core callback function, to be called by the peer
                            when an allocation must be invalidated.

                                                
Return value:
A valid registration handle pointer on success, NULL otherwise.

Notes:
The name must be unique among registered peers.
The version and some statistics for RDMA operations performed by the peer are
exposed by IB CORE via sysfs entries under:
/sys/kernel/mm/memory_peers/<peer_name>/
-------------------------------------------------------------------------------
void ib_unregister_peer_memory_client(void *reg_handle);

Description:
On unload, the peer client should unregister itself from IB CORE.

Parameters:
reg_handle [IN] - registration handle previously returned by registration.
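
The registration and unregistration calls can be sketched as a load/unload
pair. This is a minimal userspace mock, not driver code: the bodies of
ib_register_peer_memory_client and ib_unregister_peer_memory_client below are
stand-ins written only so the sketch can run (the real functions are provided
by ib_core), the structure is trimmed to its name field, and every example_*
and mock_* name is hypothetical.

```c
#include <stddef.h>

/* Trimmed stand-in for struct peer_memory_client (callbacks omitted). */
struct peer_memory_client {
    char name[64];
};

typedef int (*invalidate_peer_memory)(void *reg_handle, void *core_context);

/* ---- Mock of the IB core side, for illustration only ---- */
static int mock_invalidate(void *reg_handle, void *core_context)
{
    (void)reg_handle; (void)core_context;
    return 0;
}

void *ib_register_peer_memory_client(struct peer_memory_client *peer_client,
                                     invalidate_peer_memory *invalidate_callback)
{
    if (peer_client == NULL || peer_client->name[0] == '\0')
        return NULL;
    *invalidate_callback = mock_invalidate;
    return peer_client; /* any non-NULL opaque handle will do here */
}

void ib_unregister_peer_memory_client(void *reg_handle)
{
    (void)reg_handle;
}

/* ---- Peer driver side ---- */
static struct peer_memory_client example_client = { .name = "example_peer" };
static void *example_reg_handle;
static invalidate_peer_memory example_invalidate;

/* Called when the peer driver loads (module init in a real driver). */
int example_peer_load(void)
{
    example_reg_handle =
        ib_register_peer_memory_client(&example_client, &example_invalidate);
    /* NULL means registration failed; a real driver would return an errno. */
    return example_reg_handle ? 0 : -1;
}

/* Called when the peer driver unloads (module exit in a real driver). */
void example_peer_unload(void)
{
    if (example_reg_handle != NULL) {
        ib_unregister_peer_memory_client(example_reg_handle);
        example_reg_handle = NULL;
        example_invalidate = NULL;
    }
}
```

The driver keeps both the handle and the invalidate function pointer, since
both are needed later if it has to ask IB core to invalidate an allocation.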

-------------------------------------------------------------------------------
int (*acquire) (unsigned long addr, size_t size, void *peer_mem_private_data,
                char *peer_mem_name, void **client_context);


Description:
Given a virtual address, the peer driver should be able to identify whether
this address resides in its physical memory. If it does, further memory
management calls for that address are routed to that peer's callbacks.

Parameters:
addr                  [IN]  - virtual address to check for ownership by the peer.
size                  [IN]  - size of memory area starting at addr.
peer_mem_private_data [IN]  - private data that the peer has already set on
                              ib_ucontext; NULL if it was not set.
                              This parameter typically helps peers that run in
                              kernel space and have access to ib_ucontext: they
                              can set private data there and use it later to
                              identify their memory.
peer_mem_name         [IN]  - peer name, if it was already set on ib_ucontext;
                              NULL if it was not set.
                              This parameter is normally used along with
                              peer_mem_private_data.

client_context        [OUT] - opaque peer data that holds the peer context for
                              that address range; it is passed in on further
                              calls for that memory.
                                                
Return value:
1 if the virtual address belongs to the peer device, 0 otherwise.
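
One possible shape for an acquire callback is a range lookup over the
allocations the peer already tracks. The sketch below is a userspace
illustration under stated assumptions: struct example_alloc, the
example_allocs table and its addresses are hypothetical, and requiring the
whole range [addr, addr + size) to fit inside one allocation is a choice made
for this example, not something the API mandates.

```c
#include <stddef.h>

/* Hypothetical record of one peer allocation; not part of the API. */
struct example_alloc {
    unsigned long start; /* first virtual address of the allocation */
    size_t len;          /* length in bytes */
};

/* Hypothetical table of memory currently owned by this peer. */
static struct example_alloc example_allocs[] = {
    { 0x10000000UL, 0x4000 },
    { 0x20000000UL, 0x1000 },
};

int example_acquire(unsigned long addr, size_t size,
                    void *peer_mem_private_data,
                    char *peer_mem_name, void **client_context)
{
    size_t i;

    /* Not used by this sketch; both may be NULL. */
    (void)peer_mem_private_data;
    (void)peer_mem_name;

    for (i = 0; i < sizeof(example_allocs) / sizeof(example_allocs[0]); i++) {
        struct example_alloc *a = &example_allocs[i];

        /* Containment test written so that it cannot overflow. */
        if (addr >= a->start && size <= a->len &&
            addr - a->start <= a->len - size) {
            *client_context = a; /* handed back on later callbacks */
            return 1;            /* address belongs to this peer */
        }
    }
    return 0; /* not ours */
}
```

With the table above, a 0x1000-byte range at 0x10001000 is claimed and its
client_context points at the first entry, while a range starting at
0x30000000 is rejected.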
-------------------------------------------------------------------------------
int (*get_pages) (unsigned long addr,
                  size_t size, int write, int force,
                  struct sg_table *sg_head,
                  void *client_context, void *core_context);
                       


Description:
This function is called for the peer device that owns the virtual address
range. The peer is expected to pin the physical pages of the given address
range (if not already pinned) and to fill the sg_table with information about
the physical pages associated with that range. This function is equivalent to
get_user_pages() for host memory.

Parameters:
addr           [IN]  - start virtual address of the given allocation.
size           [IN]  - size of memory area starting at addr.
write          [IN]  - indicates whether pages will be written to by the caller.
                       Same meaning as in the kernel API get_user_pages; can be
                       ignored if not relevant.
force          [IN]  - indicates whether to force write access even if the user
                       mapping is read-only. Same meaning as in the kernel API
                       get_user_pages; can be ignored if not relevant.
sg_head        [OUT] - pointer to head of struct sg_table. The peer should
                       allocate the entries required to describe its pages,
                       then fill them with its physical addresses and sizes.
                       The peer is expected to use sg_alloc_table and
                       sg_set_page for this.
client_context [IN]  - peer context for the given allocation, returned as part
                       of the acquire call.
core_context   [IN]  - opaque IB core context, to be passed to the invalidate
                       callback if this address range must be freed.


Return value:
0 on success, otherwise an errno return code.
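
Before allocating sg_table entries, a get_pages implementation needs to know
how many pages the range covers. The helper below is a sketch of that
arithmetic only, assuming one entry per peer page and a power-of-two page
size; example_npages is a hypothetical name, and the actual sg_alloc_table
and sg_set_page calls are left out because they need kernel context.

```c
#include <stddef.h>

/*
 * Number of pages of size page_size touched by [addr, addr + size).
 * page_size must be a power of two. The range does not have to be
 * page aligned at either end.
 */
unsigned long example_npages(unsigned long addr, size_t size,
                             unsigned long page_size)
{
    unsigned long first, last;

    if (size == 0)
        return 0;

    first = addr & ~(page_size - 1);              /* page holding first byte */
    last  = (addr + size - 1) & ~(page_size - 1); /* page holding last byte */

    return (last - first) / page_size + 1;
}
```

For example, a 0x1000-byte range starting at 0x1800 with 4 KB pages straddles
a page boundary, so it needs two entries even though its size is a single
page.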
-------------------------------------------------------------------------------
int (*dma_map) (struct sg_table *sg_head, void *client_context,
                struct device *dma_device, int dmasync, int *nmap);


Description:
This function is called to let the peer driver fill the sg_table with its DMA
information for that address range.
The parameters have the same meaning as the host kernel ones used when calling
dma_map_sg.

Parameters:
sg_head        [OUT] - pointer to head of struct sg_table. The peer should set
                       dma_address and dma_length on each scatter-gather entry.
client_context [IN]  - peer context for the given allocation.
dma_device     [IN]  - current device for the operation.
dmasync        [IN]  - flush in-flight DMA when the memory region is written.
                       Same meaning as with host memory mapping; can be ignored
                       if not relevant.
nmap           [OUT] - number of mapped/set entries.

Return value:
0 on success, otherwise an errno return code.
-------------------------------------------------------------------------------
int (*dma_unmap) (struct sg_table *sg_head, void *client_context,
                        struct device  *dma_device);
                        

Description:
This API is the opposite of the dma_map API; it should take the actions needed
to unmap the given sg_table.

Parameters:
sg_head        [IN] - pointer to head of struct sg_table.
client_context [IN] - peer context for that given allocation.
dma_device     [IN] - current device for that operation.

                        
Return value:
0 on success, otherwise an errno return code.
-------------------------------------------------------------------------------
void (*put_pages) (struct sg_table *sg_head, void *client_context);

Description:
This API is the opposite of the get_pages API; it should free the pages, as
put_page does for host memory.

Parameters:
sg_head        [IN] - pointer to head of struct sg_table.
client_context [IN] - peer context for that given allocation.

-------------------------------------------------------------------------------
unsigned long (*get_page_size) (void *client_context);

Description:
This API returns the page size for the given allocation.

Parameters:
client_context [IN] - peer context for that given allocation.

Return value:
Page size in bytes.

-------------------------------------------------------------------------------
void (*release) (void *client_context);

Description:
This API is the opposite of the acquire call; it lets the peer release all
resources.

Parameters:
client_context [IN] - peer context for that given allocation.
-------------------------------------------------------------------------------
typedef int (*invalidate_peer_memory)(void *reg_handle,
                                    void *core_context);
                                  
Description:
Function pointer returned by IB core as part of a successful peer registration.
The peer should call it when a given memory allocation, represented by
core_context, should be deleted.

Parameters:
reg_handle   [IN] - peer handle.
core_context [IN] - core context that represents a given allocation, as it was
                    passed in to the get_pages call.

Return value:
0 on success, otherwise an errno return code.
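
The sketch below illustrates one way a peer might tie get_pages and the
invalidate callback together: it saves core_context when pages are pinned and
passes it back when the peer memory goes away. It is a userspace mock under
stated assumptions: struct example_ctx and all example_* names are
hypothetical, and mock_invalidate stands in for the function pointer that
IB core returns at registration.

```c
#include <stddef.h>

typedef int (*invalidate_peer_memory)(void *reg_handle, void *core_context);

/* Hypothetical per-allocation context kept by the peer driver. */
struct example_ctx {
    void *core_context; /* saved from get_pages; NULL while not pinned */
};

/* Both are obtained at registration time in a real driver. */
static void *example_reg_handle;
static invalidate_peer_memory example_invalidate;

/* To be called from the peer's get_pages callback. */
void example_remember_core_context(struct example_ctx *ctx,
                                   void *core_context)
{
    ctx->core_context = core_context;
}

/*
 * To be called when the peer must free memory that IB core may still be
 * using. Returns 0 on success or when there is nothing to invalidate.
 */
int example_on_peer_free(struct example_ctx *ctx)
{
    int ret;

    if (ctx->core_context == NULL || example_invalidate == NULL)
        return 0;

    ret = example_invalidate(example_reg_handle, ctx->core_context);
    if (ret == 0)
        ctx->core_context = NULL; /* do not invalidate twice */
    return ret;
}

/* Mock of the IB core callback, so the sketch can be exercised. */
static void *mock_last_invalidated;

static int mock_invalidate(void *reg_handle, void *core_context)
{
    (void)reg_handle;
    mock_last_invalidated = core_context;
    return 0;
}
```

Clearing the saved core_context after a successful call is a design choice
of this sketch; it guards against asking IB core to invalidate the same
allocation a second time.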

-------------------------------------------------------------------------------

