===========================
NVMe Live Migration Support
===========================

Introduction
------------
To support live migration of a virtual function (VF), the NVMe device provides
its own implementation: five vendor-specific admin commands and a capability
flag in the vendor-specific field of the identify controller data structure.
Software uses these admin commands to query the size of the device migration
state data, to save and load that data, and to suspend and resume a given VF
device. The commands are submitted to the admin queue of the NVMe PF device and
are ignored if placed in the admin queue of a VF device. The reason is that, in
a virtualization scenario, the NVMe VF device is passed to the virtual machine,
so its admin queue is not available for the hypervisor to submit live migration
commands. Software can use the capability flag in the identify controller data
structure to detect whether the NVMe device supports live migration. The
following sections describe the format of the commands and of the capability
flag.

Definition of opcode for live migration commands
------------------------------------------------

+---------------------------+-----------+-----------+------------+
|                           |           |           |            |
|     Opcode by Field       |           |           |            |
|                           |           |           |            |
+--------+---------+--------+           |           |            |
|        |         |        | Combined  | Namespace |            |
|    07  |  06:02  | 01:00  |  Opcode   | Identifier|  Command   |
|        |         |        |           |    used   |            |
+--------+---------+--------+           |           |            |
|Generic | Function|  Data  |           |           |            |
|command |         |Transfer|           |           |            |
+--------+---------+--------+-----------+-----------+------------+
|                                                                |
|                     Vendor Specific Opcode                     |
+--------+---------+--------+-----------+-----------+------------+
|        |         |        |           |           | Query the  |
|   1b   |  10001  |  00    |   0xC4    |           | data size  |
+--------+---------+--------+-----------+-----------+------------+
|        |         |        |           |           | Suspend the|
|   1b   |  10010  |  00    |   0xC8    |           |    VF      |
+--------+---------+--------+-----------+-----------+------------+
|        |         |        |           |           | Resume the |
|   1b   |  10011  |  00    |   0xCC    |           |    VF      |
+--------+---------+--------+-----------+-----------+------------+
|        |         |        |           |           | Save the   |
|   1b   |  10100  |  10    |   0xD2    |           |device data |
+--------+---------+--------+-----------+-----------+------------+
|        |         |        |           |           | Load the   |
|   1b   |  10101  |  01    |   0xD5    |           |device data |
+--------+---------+--------+-----------+-----------+------------+
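
Each combined opcode in the table is the three bit fields packed into one byte:
bit 07, bits 06:02 and bits 01:00. The following C sketch recomputes the
combined values from the fields. The enum and function names are illustrative
only and are not taken from any driver or firmware source.

.. code-block:: c

    #include <stdint.h>

    /* Combined opcodes from the table above (names are illustrative). */
    enum nvme_lm_opcode {
        NVME_LM_QUERY_DATA_SIZE = 0xC4,
        NVME_LM_SUSPEND         = 0xC8,
        NVME_LM_RESUME          = 0xCC,
        NVME_LM_SAVE_DATA       = 0xD2,
        NVME_LM_LOAD_DATA       = 0xD5,
    };

    /* Pack bit 07, bits 06:02 (function) and bits 01:00 (data transfer). */
    static inline uint8_t nvme_lm_make_opcode(unsigned generic,
                                              unsigned function,
                                              unsigned transfer)
    {
        return (uint8_t)(((generic & 0x1u) << 7) |
                         ((function & 0x1fu) << 2) |
                         (transfer & 0x3u));
    }

For example, ``nvme_lm_make_opcode(1, 0x14, 0x2)`` yields 0xD2, the SAVE_DATA
opcode. In the NVMe specification, data transfer value 10b means a transfer
from controller to host and 01b means a transfer from host to controller, which
matches the save and load directions.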

Definition of QUERY_DATA_SIZE command
-------------------------------------

+---------+------------------------------------------------------------------------------------+
|         |                                                                                    |
|   Bytes |                                    Description                                     |
|         |                                                                                    |
+---------+------------------------------------------------------------------------------------+
|         |                                                                                    |
|         |                                                                                    |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  Bits     |Description                                                         | |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  07:00    |Opcode(OPC):set to 0xC4 to indicate a query command                 | |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  09:08    |Fused Operation(FUSE):Please see NVMe SPEC for more details[1]      | |
|         | +-----------+--------------------------------------------------------------------+ |
|  03:00  | |  13:10    |Reserved                                                            | |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  15:14    |PRP or SGL for Data Transfer(PSDT): See NVMe SPEC for details[1]    | |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  31:16    |Command Identifier(CID)                                             | |
|         | +-----------+--------------------------------------------------------------------+ |
|         |                                                                                    |
|         |                                                                                    |
+---------+------------------------------------------------------------------------------------+
|  39:04  |  Reserved                                                                          |
+---------+------------------------------------------------------------------------------------+
|  41:40  |  VF index: means which VF controller internal data size to query                   |
+---------+------------------------------------------------------------------------------------+
|  63:42  |  Reserved                                                                          |
+---------+------------------------------------------------------------------------------------+

The QUERY_DATA_SIZE command is used to query the NVMe VF internal data size for live migration.
When the NVMe firmware receives the command, it will return the size of NVMe VF internal
data. The data size depends on how many IO queues are created.

Definition of SUSPEND command
-----------------------------

+---------+------------------------------------------------------------------------------------+
|         |                                                                                    |
|   Bytes |                                    Description                                     |
|         |                                                                                    |
+---------+------------------------------------------------------------------------------------+
|         |                                                                                    |
|         |                                                                                    |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  Bits     |Description                                                         | |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  07:00    |Opcode(OPC):set to 0xC8 to indicate a suspend command               | |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  09:08    |Fused Operation(FUSE):Please see NVMe specification for details[1]  | |
|         | +-----------+--------------------------------------------------------------------+ |
|  03:00  | |  13:10    |Reserved                                                            | |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  15:14    |PRP or SGL for Data Transfer(PSDT):See NVMe SPEC for details[1]     | |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  31:16    |Command Identifier(CID)                                             | |
|         | +-----------+--------------------------------------------------------------------+ |
|         |                                                                                    |
|         |                                                                                    |
+---------+------------------------------------------------------------------------------------+
|  39:04  |  Reserved                                                                          |
+---------+------------------------------------------------------------------------------------+
|  41:40  |  VF index: means which VF controller to suspend                                    |
+---------+------------------------------------------------------------------------------------+
|  63:42  |  Reserved                                                                          |
+---------+------------------------------------------------------------------------------------+

The SUSPEND command is used to suspend the NVMe VF controller. When the NVMe firmware receives
this command, it will suspend the NVMe VF controller.

Definition of RESUME command
----------------------------

+---------+------------------------------------------------------------------------------------+
|         |                                                                                    |
|   Bytes |                                    Description                                     |
|         |                                                                                    |
+---------+------------------------------------------------------------------------------------+
|         |                                                                                    |
|         |                                                                                    |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  Bits     |Description                                                         | |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  07:00    |Opcode(OPC):set to 0xCC to indicate a resume command                | |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  09:08    |Fused Operation(FUSE):Please see NVMe SPEC for details[1]           | |
|         | +-----------+--------------------------------------------------------------------+ |
|  03:00  | |  13:10    |Reserved                                                            | |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  15:14    |PRP or SGL for Data Transfer(PSDT):See NVMe SPEC for details[1]     | |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  31:16    |Command Identifier(CID)                                             | |
|         | +-----------+--------------------------------------------------------------------+ |
|         |                                                                                    |
|         |                                                                                    |
+---------+------------------------------------------------------------------------------------+
|  39:04  |  Reserved                                                                          |
+---------+------------------------------------------------------------------------------------+
|  41:40  |  VF index: means which VF controller to resume                                     |
+---------+------------------------------------------------------------------------------------+
|  63:42  |  Reserved                                                                          |
+---------+------------------------------------------------------------------------------------+

The RESUME command is used to resume the NVMe VF controller. When firmware receives this command,
it will restart the NVMe VF controller.
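
The QUERY_DATA_SIZE, SUSPEND and RESUME tables share one layout: the opcode and
command identifier in bytes 03:00 and the VF index in bytes 41:40, with every
other byte reserved. A minimal C sketch of filling such a 64-byte command is
shown here. The function name is hypothetical, FUSE and PSDT are simply left at
zero, and the numbering base of the VF index is not defined by this document.

.. code-block:: c

    #include <stdint.h>
    #include <string.h>

    #define NVME_LM_CMD_SIZE 64   /* size of one admin command */

    /*
     * Fill a command for QUERY_DATA_SIZE (0xC4), SUSPEND (0xC8) or
     * RESUME (0xCC). Multi-byte fields are stored little-endian.
     * FUSE and PSDT are left at zero in this sketch (an assumption).
     */
    static void nvme_lm_build_simple_cmd(uint8_t cmd[NVME_LM_CMD_SIZE],
                                         uint8_t opcode, uint16_t cid,
                                         uint16_t vf_index)
    {
        memset(cmd, 0, NVME_LM_CMD_SIZE);      /* reserved bytes stay zero */
        cmd[0]  = opcode;                      /* bytes 03:00, bits 07:00  */
        cmd[2]  = (uint8_t)(cid & 0xff);       /* bytes 03:00, bits 31:16  */
        cmd[3]  = (uint8_t)(cid >> 8);
        cmd[40] = (uint8_t)(vf_index & 0xff);  /* bytes 41:40: VF index    */
        cmd[41] = (uint8_t)(vf_index >> 8);
    }

Writing the fields byte by byte keeps the encoding independent of the host byte
order. The resulting buffer would be placed in the PF admin queue, since the
device ignores these commands on a VF admin queue.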

Definition of SAVE_DATA command
-------------------------------

+---------+------------------------------------------------------------------------------------+
|         |                                                                                    |
|   Bytes |                                    Description                                     |
|         |                                                                                    |
+---------+------------------------------------------------------------------------------------+
|         |                                                                                    |
|         |                                                                                    |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  Bits     |Description                                                         | |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  07:00    |Opcode(OPC):set to 0xD2 to indicate a save command                  | |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  09:08    |Fused Operation(FUSE):Please see NVMe SPEC for details[1]           | |
|         | +-----------+--------------------------------------------------------------------+ |
|  03:00  | |  13:10    |Reserved                                                            | |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  15:14    |PRP or SGL for Data Transfer(PSDT):See NVMe SPEC for details[1]     | |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  31:16    |Command Identifier(CID)                                             | |
|         | +-----------+--------------------------------------------------------------------+ |
|         |                                                                                    |
|         |                                                                                    |
+---------+------------------------------------------------------------------------------------+
|  23:04  | Reserved                                                                           |
+---------+------------------------------------------------------------------------------------+
|  31:24  | PRP Entry1:the first PRP entry for the command or a PRP List Pointer               |
+---------+------------------------------------------------------------------------------------+
|  39:32  | PRP Entry2:the second address entry(reserved,page base address or PRP List Pointer)|
+---------+------------------------------------------------------------------------------------+
|  41:40  | VF index: means which VF controller internal data to save                          |
+---------+------------------------------------------------------------------------------------+
|  63:42  | Reserved                                                                           |
+---------+------------------------------------------------------------------------------------+

The SAVE_DATA command is used to save the NVMe VF internal data for live migration. When firmware
receives this command, it will save the admin queue states, save some registers, drain IO SQs
and CQs, save every IO queue state, disable the VF controller, and transfer all data to the
host memory through DMA.

Definition of LOAD_DATA command
-------------------------------

+---------+------------------------------------------------------------------------------------+
|         |                                                                                    |
|   Bytes |                                    Description                                     |
|         |                                                                                    |
+---------+------------------------------------------------------------------------------------+
|         |                                                                                    |
|         |                                                                                    |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  Bits     |Description                                                         | |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  07:00    |Opcode(OPC):set to 0xD5 to indicate a load command                  | |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  09:08    |Fused Operation(FUSE):Please see NVMe SPEC for details[1]           | |
|         | +-----------+--------------------------------------------------------------------+ |
|  03:00  | |  13:10    |Reserved                                                            | |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  15:14    |PRP or SGL for Data Transfer(PSDT): See NVMe SPEC for details[1]    | |
|         | +-----------+--------------------------------------------------------------------+ |
|         | |  31:16    |Command Identifier(CID)                                             | |
|         | +-----------+--------------------------------------------------------------------+ |
|         |                                                                                    |
|         |                                                                                    |
+---------+------------------------------------------------------------------------------------+
|  23:04  | Reserved                                                                           |
+---------+------------------------------------------------------------------------------------+
|  31:24  | PRP Entry1:the first PRP entry for the command or a PRP List Pointer               |
+---------+------------------------------------------------------------------------------------+
|  39:32  | PRP Entry2:the second address entry(reserved,page base address or PRP List Pointer)|
+---------+------------------------------------------------------------------------------------+
|  41:40  | VF index: means which VF controller internal data to load                          |
+---------+------------------------------------------------------------------------------------+
|  47:44  | Size: means the size of the device's internal data to be loaded                    |
+---------+------------------------------------------------------------------------------------+
|  63:48  | Reserved                                                                           |
+---------+------------------------------------------------------------------------------------+

The LOAD_DATA command is used to restore the NVMe VF internal data. When firmware receives this
command, it will read the device's internal data from the host memory through DMA, restore the
admin queue states and some registers, and restore every IO queue state.
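
The SAVE_DATA and LOAD_DATA tables can be summarized as a single C structure, a
sketch of which follows. The structure and field names are illustrative. Bytes
43:42 are not listed in the LOAD_DATA table, so treating them as reserved is an
assumption, and the size field applies to LOAD_DATA only (SAVE_DATA reserves
bytes 63:42).

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    /*
     * Layout shared by SAVE_DATA (0xD2) and LOAD_DATA (0xD5).
     * Multi-byte fields are little-endian in the command itself, so a
     * big-endian host would need to convert them before submission.
     */
    struct nvme_lm_data_cmd {
        uint8_t  opcode;      /* byte 0: 0xD2 or 0xD5                    */
        uint8_t  flags;       /* byte 1: FUSE (09:08) and PSDT (15:14)   */
        uint16_t cid;         /* bytes 03:02: command identifier         */
        uint8_t  rsvd4[20];   /* bytes 23:04: reserved                   */
        uint64_t prp1;        /* bytes 31:24: PRP Entry1                 */
        uint64_t prp2;        /* bytes 39:32: PRP Entry2                 */
        uint16_t vf_index;    /* bytes 41:40: VF index                   */
        uint8_t  rsvd42[2];   /* bytes 43:42: assumed reserved           */
        uint32_t size;        /* bytes 47:44: data size, LOAD_DATA only  */
        uint8_t  rsvd48[16];  /* bytes 63:48: reserved                   */
    };

    /* The command must be exactly 64 bytes long. */
    _Static_assert(sizeof(struct nvme_lm_data_cmd) == 64,
                   "live migration data command must be 64 bytes");

With natural alignment, every field already lands on the offset given in the
tables, so no packing attribute is needed in this sketch.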

Extensions of the vendor-specific field in the identify controller data structure
---------------------------------------------------------------------------------

+---------+------+------+------+-------------------------------+
|         |      |      |      |                               |
|  Bytes  | I/O  |Admin | Disc |        Description            |
|         |      |      |      |                               |
+---------+------+------+------+-------------------------------+
|         |      |      |      |                               |
| 01:00   |  M   |  M   |  R   | PCI Vendor ID(VID)            |
+---------+------+------+------+-------------------------------+
|         |      |      |      |                               |
| 03:02   |  M   |  M   |  R   | PCI Subsystem Vendor ID(SSVID)|
+---------+------+------+------+-------------------------------+
|         |      |      |      |                               |
|  ...    | ...  | ...  | ...  |  ...                          |
+---------+------+------+------+-------------------------------+
|         |      |      |      |                               |
|  3072   |  O   |  O   |  O   | Live Migration Support        |
+---------+------+------+------+-------------------------------+
|         |      |      |      |                               |
|4095:3073|  O   |  O   |  O   | Vendor Specific               |
+---------+------+------+------+-------------------------------+

According to the NVMe specification, bytes 3072 to 4095 of the identify controller data
structure are vendor specific. The NVMe device uses byte 3072 to indicate whether live
migration is supported: 0x00 means live migration is not supported, 0x01 means live
migration is supported, and all other values are reserved.
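
A minimal C sketch of the capability check, assuming the caller has already
fetched the 4096-byte identify controller data structure into a buffer. The
macro and function names are hypothetical.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define NVME_IDENTIFY_CTRL_SIZE 4096  /* identify controller data size */
    #define NVME_LM_CAP_OFFSET      3072  /* Live Migration Support byte   */
    #define NVME_LM_CAP_SUPPORTED   0x01

    /* Only the exact value 0x01 counts; reserved values are unsupported. */
    static bool nvme_lm_supported(const uint8_t *id_ctrl, size_t len)
    {
        if (id_ctrl == NULL || len < NVME_IDENTIFY_CTRL_SIZE)
            return false;
        return id_ctrl[NVME_LM_CAP_OFFSET] == NVME_LM_CAP_SUPPORTED;
    }

Comparing against the exact value, rather than testing for non-zero, keeps the
check safe if the reserved values are given a meaning later.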

[1] https://nvmexpress.org/wp-content/uploads/NVMe-NVM-Express-2.0a-2021.07.26-Ratified.pdf
