# VFIO MLX5 Sample Application

A sample application demonstrating the usage of the libvfio-mlx5 library for managing Mellanox ConnectX devices through VFIO in user space.

## Table of Contents

- [Overview](#overview)
- [Prerequisites](#prerequisites)
- [System Setup](#system-setup)
  - [Device Binding](#device-binding)
  - [SRIOV Configuration](#sriov-configuration)
- [Building the Application](#building-the-application)
- [Usage](#usage)
  - [Command Line Options](#command-line-options)
  - [Examples](#examples)
- [Monitoring and Statistics](#monitoring-and-statistics)
- [Advanced: VF Delegation in Multi-PF Environments](#advanced-vf-delegation-in-multi-pf-environments)
- [Troubleshooting](#troubleshooting)
- [Signal Handling](#signal-handling)
- [License](#license)

## Overview

The VFIO MLX5 sample application (`vfio_mlx5`) provides a reference implementation for using the libvfio-mlx5 library. It demonstrates:

- Device initialization and management
- VFIO setup and memory mapping
- Event processing and statistics collection
- Multi-device support with up to 8 ConnectX devices
- Virtual Function (VF) management

## Prerequisites

### Hardware Requirements
- Mellanox ConnectX network adapters
- IOMMU-capable system
- For hitless upgrade: ConnectX7 or newer device with multi-PF support (see [VF delegation](#advanced-vf-delegation-in-multi-pf-environments))

### Software Requirements
- Linux kernel with VFIO support (CONFIG_VFIO_PCI=y)
- VFIO-PCI driver loaded
- Root privileges for device binding and memory operations

### Build Dependencies
- GCC or Clang compiler
- Meson build system
- Ninja build tool

### vfio-mlx5 library documentation

- Check the library documentation in `libvfio-mlx5/README.md`
- Review the source code in `samples/vfio_mlx5.c` for implementation details.

## System Setup

### Device Binding

Before running the sample application, you must bind your ConnectX devices to the VFIO-PCI driver.

1. **Unbind from the native driver:**
```bash
# Find your ConnectX device
lspci | grep Mellanox

# Example: Unbind device 0000:17:00.0 from mlx5_core
CX_DEVICE="0000:17:00.0"
echo $CX_DEVICE | sudo tee /sys/bus/pci/devices/$CX_DEVICE/driver/unbind
```

2. **Bind to VFIO-PCI:**
```bash
CX_DEVICE="0000:17:00.0"
CX_DEVICE_ID=$(lspci -ns $CX_DEVICE | cut -d " " -f3 | sed 's/:/ /g')
echo "$CX_DEVICE_ID" | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id
echo "$CX_DEVICE" | sudo tee /sys/bus/pci/drivers/vfio-pci/bind

# Verify binding to vfio-pci
sudo lspci -nks $CX_DEVICE | grep "driver in use"
```
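The verification step above can also be scripted by reading sysfs directly. The sketch below assumes the standard `/sys/bus/pci/devices/<BDF>/driver` symlink; the function name `pci_driver_of` and the `SYSFS_PCI` override variable are hypothetical helpers introduced here for illustration, not part of the sample application.

```bash
# Hypothetical helper: print the driver currently bound to a PCI device,
# or "none" when no driver is bound.
# SYSFS_PCI may be overridden (e.g. for testing); it defaults to the real sysfs path.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

pci_driver_of() {
    local dev="$1"
    local link="$SYSFS_PCI/$dev/driver"
    if [ -L "$link" ]; then
        # The symlink target ends with the driver name, e.g. .../drivers/vfio-pci
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Example: warn unless the device is bound to vfio-pci
# [ "$(pci_driver_of "$CX_DEVICE")" = "vfio-pci" ] || echo "not bound to vfio-pci" >&2
```

Checking the symlink avoids parsing `lspci` output, which can be convenient in setup scripts that bind several devices in a loop.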

### SRIOV Configuration

To enable Virtual Functions (VFs) on your ConnectX device:

1. **Enable SRIOV in VFIO-PCI (if not already enabled):**
```bash
# Optional: disable driver auto-probe for newly created VFs
echo 0 | sudo tee /sys/bus/pci/devices/$CX_DEVICE/sriov_drivers_autoprobe

echo 1 | sudo tee /sys/module/vfio_pci/parameters/enable_sriov
```

2. **Create Virtual Functions:**
```bash
# Run this after binding the device to vfio-pci and starting the vfio-mlx5 application
CX_DEVICE="0000:17:00.0"
# Create 2 VFs (adjust number as needed)
echo 2 | sudo tee /sys/bus/pci/devices/$CX_DEVICE/sriov_numvfs

# Verify VF creation
lspci | grep "Virtual Function" | grep Mellanox
```

3. **Run the sample application before binding VFs to the native driver:**
```bash
sudo ./builddir/samples/vfio_mlx5 --device=$CX_DEVICE --nvfs=2
```
**Note: If you pass a VF to a QEMU VM, you MUST set the VF token in the sample application:**
```bash
CX_DEVICE="0000:17:00.0"
CX_VF_DEV="0000:17:00.4"
# Create a vf_token for the VF
vf_token=$(uuidgen)
sudo ./builddir/samples/vfio_mlx5 --device="$CX_DEVICE,vf_token=$vf_token" --nvfs=2

qemu-system-x86_64  ... -device vfio-pci,host=$CX_VF_DEV,vf-token=$vf_token
```

4. **Optionally bind VFs to the native driver:**
```bash
# Find VF PCI addresses
CX_VF_DEV="0000:17:00.4"
echo $CX_VF_DEV | sudo tee /sys/bus/pci/devices/$CX_VF_DEV/driver/unbind
echo "mlx5_core" | sudo tee /sys/bus/pci/devices/$CX_VF_DEV/driver_override
echo $CX_VF_DEV | sudo tee /sys/bus/pci/drivers/mlx5_core/bind

# The vfio-mlx5 application then receives page requests for the VF, identified by func_id (VF index + 1):
# [INFO   ] fwp(0000:17:00.0)   : FW_REQ_GIVE   (0):func_id 2, npages 8
# [INFO   ] fwp(0000:17:00.0)   : GIVE          (0):Posting func_id (2) npages (8)
# [INFO   ] fwp(0000:17:00.0)   : GIVE_SUCCESS  (0):Completed: func_id(2) npages(8)
```
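
The "Find VF PCI addresses" step above can be automated. The kernel exposes one `virtfnN` symlink per VF under the PF's sysfs directory, so a minimal sketch can map each VF index to its PCI address and to the `func_id` described above (VF index + 1). The function name `list_vfs` and the `SYSFS_PCI` override are hypothetical names used only for this example.

```bash
# SYSFS_PCI may be overridden (e.g. for testing); it defaults to the real sysfs path.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

# Hypothetical helper: print "<vf index> <func_id> <VF PCI address>" for each VF of a PF.
# func_id follows the rule stated above: VF index + 1.
list_vfs() {
    local pf="$1" link idx
    for link in "$SYSFS_PCI/$pf"/virtfn*; do
        [ -L "$link" ] || continue      # no VFs: the glob is left unexpanded
        idx="${link##*virtfn}"
        echo "$idx $((idx + 1)) $(basename "$(readlink "$link")")"
    done | sort -n
}

# Example: list_vfs "$CX_DEVICE"
```

The numeric sort keeps `virtfn10` after `virtfn9`, which plain glob ordering would not.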
## Building the Application

1. **Configure the build:**
```bash
meson setup builddir
# Or with specific compiler:
CC=clang meson setup builddir
```

2. **Build:**
```bash
ninja -C builddir
```

3. **The sample application will be available at:**
```
builddir/samples/vfio_mlx5
```

## Usage

### Command Line Options

```
Usage: vfio_mlx5 [OPTIONS]
Options:
  --help                          Show help message
  --device=PCIBDF, -d PCIBDF     Add PCI device (up to 8 devices)
                                 Use PCIBDF,vf_token=... for VF tokens
  --memsize=SIZE, -m SIZE        Memory size (4K aligned, default 128MB)
                                 Supports K/M/G suffixes
  --nvfs=NUM, -n NUM             Number of VFs to enable (default 0)
  --file=FILE, -f FILE           Log output to file (default stdout)
  --stats=INTERVAL, -s INTERVAL  Stats collection interval in seconds
  --noiommu                      Enable NoIOMMU mode (for testing)
```
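
Since `--memsize` must be 4K aligned and accepts K/M/G suffixes, a wrapper script may want to validate the value before launching the application. The sketch below is one way to do that; `parse_memsize` is a hypothetical helper, and treating the suffixes as powers of 1024 is an assumption, not a statement about how `vfio_mlx5` parses the option.

```bash
# Hypothetical helper: convert a size such as 8192, 512K, 128M or 1G to bytes
# and require the result to be a positive multiple of 4096.
# Assumption: K/M/G are binary multipliers (1024, 1024^2, 1024^3).
parse_memsize() {
    local arg="$1" num mult bytes
    case "$arg" in
        *K) mult=1024;       num="${arg%K}" ;;
        *M) mult=1048576;    num="${arg%M}" ;;
        *G) mult=1073741824; num="${arg%G}" ;;
        *)  mult=1;          num="$arg" ;;
    esac
    # Reject empty values, non-digits, and leading zeros (which would parse as octal)
    case "$num" in
        ''|*[!0-9]*|0?*) echo "invalid size: $arg" >&2; return 1 ;;
    esac
    bytes=$((num * mult))
    if [ "$bytes" -le 0 ] || [ $((bytes % 4096)) -ne 0 ]; then
        echo "size is not 4K aligned: $arg" >&2
        return 1
    fi
    echo "$bytes"
}

# Example: parse_memsize 512M >/dev/null && sudo ./builddir/samples/vfio_mlx5 --device=$CX_DEVICE --memsize=512M
```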

### Examples

#### Basic Single Device Operation
```bash
# Single device with default memory
sudo ./builddir/samples/vfio_mlx5 --device=$CX_DEVICE

# Single device with 1GB memory and 2 VFs
sudo ./builddir/samples/vfio_mlx5 --device=$CX_DEVICE --memsize=1G --nvfs=2

```

#### Multi-Device Operation
```bash
# Two devices with 2 VFs each; the 512MB of memory is shared
# between both devices and their VFs
sudo ./builddir/samples/vfio_mlx5 \
    --device=0000:17:00.0 \
    --device=0000:18:00.0 \
    --memsize=512M \
    --nvfs=2
```

#### Using VF Tokens and Attaching a VF to a QEMU VM (optional)
```bash
# Single device with 1GB memory and 2 VFs, with one VF attached to a QEMU VM
CX_DEVICE="0000:17:00.0"
CX_VF_DEV="0000:17:00.4"
vf_token=$(uuidgen)
sudo ./builddir/samples/vfio_mlx5 --device="$CX_DEVICE,vf_token=$vf_token" --memsize=1G --nvfs=2
qemu-system-x86_64  ... -device vfio-pci,host=$CX_VF_DEV,vf-token=$vf_token
```
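
The example above generates the token with `uuidgen`, and the same value has to reach both the application and QEMU. When the token is stored or passed between scripts, a quick format check can catch a truncated or mistyped value before anything is launched. `is_uuid` below is a hypothetical helper that only checks the 8-4-4-4-12 hexadecimal layout produced by `uuidgen`.

```bash
# Hypothetical helper: succeed only if the argument has the textual form of a UUID
# (8-4-4-4-12 hexadecimal digits), as printed by uuidgen.
is_uuid() {
    printf '%s\n' "$1" |
        grep -Eq '^[0-9a-fA-F]{8}-([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}$'
}

# Example: is_uuid "$vf_token" || { echo "vf_token is not a UUID" >&2; exit 1; }
```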

#### Statistics Collection
```bash
# Enable statistics monitoring every second, with logging redirected to a file
sudo ./builddir/samples/vfio_mlx5 \
    --device=0000:17:00.0 \
    --stats=1 \
    --file=/tmp/mlx5.log
```

#### NoIOMMU Mode (Testing Only)
```bash
# For systems without IOMMU or testing environments
sudo ./builddir/samples/vfio_mlx5 \
    --device=0000:17:00.0 \
    --memsize=1G \
    --noiommu
```

## Monitoring and Statistics

The sample application provides real-time statistics including:

- **Page Events**: Firmware page allocation/deallocation events
- **Memory Statistics**: Page allocator utilization and performance
- **Page ownership**: Firmware page usage and driver metrics

Statistics are displayed periodically when `--stats` is specified.
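
The format of the periodic statistics output is not reproduced here, but the page events the application logs (the `GIVE_SUCCESS` lines shown in the SRIOV Configuration section) can be summarized offline from a log written with `--file`. The sketch below totals completed page grants per `func_id`; `pages_given_per_func` is a hypothetical helper, and it assumes the log lines keep the `func_id(N) npages(M)` layout shown earlier.

```bash
# Hypothetical helper: read an application log on stdin and print
# "<func_id> <total pages given>" for every func_id seen in GIVE_SUCCESS lines.
# Assumes lines of the form: ... GIVE_SUCCESS (0):Completed: func_id(2) npages(8)
pages_given_per_func() {
    sed -n 's/.*GIVE_SUCCESS.*func_id(\([0-9]*\)) npages(\([0-9]*\)).*/\1 \2/p' |
        awk '{ pages[$1] += $2 } END { for (f in pages) print f, pages[f] }' |
        sort -n
}

# Example: pages_given_per_func < /tmp/mlx5.log
```

This only accounts for pages given to the firmware; it does not subtract pages that are later reclaimed.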

## Advanced: VF Delegation in Multi-PF Environments

### Overview

With the latest ConnectX firmware (internal version **28.46.0332** or newer), the sample application automatically supports VF delegation in multi-PF environments.

When VFs are created on a PRIMARY PF managed by the application, they are automatically delegated to adjacent ACCESS PFs running the native mlx5_core driver for centralized network management.

### Automatic Delegation

The library detects and configures VF delegation automatically, through the following steps:

1. **Adjacent PF Detection**: The library queries for adjacent PFs on the same ConnectX device
2. **Capability Verification**: It checks that the firmware supports `delegate_vhca_management_profiles` (requires the latest firmware)
3. **Automatic Delegation**: VFs created on the PRIMARY PF are delegated to adjacent ACCESS PFs
4. **Centralized Management**: VFs become manageable through standard devlink/switchdev and DOCA SDK on the ACCESS PF

### Architecture and Data Flow

```
+-------------------------------------------------------------+
|                    ConnectX Device                          |
|                 (Latest Firmware)                           |
+-------------------------------------------------------------+
|    PRIMARY PF       |           ACCESS PF                   |
|   (vfio-pci)        |         (mlx5_core)                   |
|                     |                                       |
|  +-------------+    |    +-----------------------------+    |
|  |libvfio-mlx5 |    |    |    mlx5_core driver         |    |
|  |             |    |    |    + switchdev mode         |    |
|  | Creates VFs |-------->|  Receives delegated VFs     |    |
|  | Delegates   |    |    |  Manages via devlink        |    |
|  | Management  |    |    |  Provides representors      |    |
|  +-------------+    |    +-----------------------------+    |
|                     |                                       |
|  VFs: 17:00.2       |         VF Representors:              |
|       17:00.3       |         pf0vf0, pf0vf1                |
|       17:00.4   ------------>  pf0vf2, pf0vf3               |
|       17:00.5       |         (switchdev ports)             |
+-------------------------------------------------------------+

Application Flow:
1. Sample app binds to PRIMARY PF (vfio-pci)
2. Sample app enables VFs on PRIMARY PF
3. Library automatically delegates VFs to ACCESS PF
4. ACCESS PF: VF representors appear as switchdev ports
5. Network admin: manages via devlink/tc/ovs/DOCA tools

Multi Access-PF for hitless upgrade (not supported yet):
- Multiple Access PFs use an active/standby model:
   - The first Access PF assumes ownership of the VF traffic pipeline (Active Access PF)
   - DOCA SDK is available on the Active and the Standby Access PF simultaneously
   - VF traffic goes through the Active Access PF FDB table (steering rules), for both the slow path and the fast/offload path
   - Standby Access PFs stay in standby mode for hitless upgrade (not supported yet)
   - Duplicating steering rules (e.g. via DOCA SDK) on the Standby Access PF is outside the scope of this document
   - Switching between Active and Standby Access PFs is outside the scope of this document
```

### Configuration Steps

#### HW/Software Requirements

- ConnectX7 or newer device with multi-PF support
- Firmware version **28.46.0332** (internal version) or newer
- For a switchdev/DOCA-enabled driver, one of the following:
  - Upstream Linux kernel 6.16 or later
  - DOCA-Host 25.07 or later, e.g. **doca-host-x.y.z-25.07**

#### Step 1: Configure Multi-PF device and SRIOV on PRIMARY PF

```bash
# Unbind PRIMARY PF from current driver
PRIMARY_PF="0000:17:00.0"
echo $PRIMARY_PF | sudo tee /sys/bus/pci/devices/$PRIMARY_PF/driver/unbind

# Configure multi-PF device via fw tools
sudo mlxconfig -d $PRIMARY_PF -y s NUM_OF_PF=4 # use NUM_OF_PF=2 for Single-Port devices

# Configure sriov on PRIMARY PF via fw tools (per PF sriov, needs 2 steps)
sudo mlxconfig -d $PRIMARY_PF -y s NUM_OF_VFS=0 SRIOV_EN=1 PF_NUM_OF_VF_VALID=1
sudo mlxconfig -d $PRIMARY_PF -y s SRIOV_EN=1 PF_NUM_OF_VF_VALID=1 PF_NUM_OF_VF=2

# Reboot the system to apply the changes, or perform a FW reset followed by a PCI rescan
sudo mlxfwreset -d $PRIMARY_PF reset -l3 -y

# Bus-level reset: remove the parent device of the PRIMARY PF, then rescan the PCI bus
BUS=$(dirname "$(dirname "$(readlink -f /sys/bus/pci/devices/$PRIMARY_PF/class)")")
echo 1 | sudo tee $BUS/remove
echo 1 | sudo tee /sys/bus/pci/rescan

# Verify that multi-PF and SRIOV are enabled
lspci | grep "Mellanox"
```

#### Step 2: Enable SRIOV on PRIMARY PF PCI function
```bash
# Enable SRIOV on vfio-pci if not already enabled
echo 1 | sudo tee /sys/module/vfio_pci/parameters/enable_sriov

# Turn off driver auto-probe for the new VFs
echo 0 | sudo tee /sys/bus/pci/devices/$PRIMARY_PF/sriov_drivers_autoprobe

# Create VFs on PRIMARY PF
echo 2 | sudo tee /sys/bus/pci/devices/$PRIMARY_PF/sriov_numvfs

# Verify VF creation
lspci | grep "Virtual Function" | grep Mellanox
```


#### Step 3: Run the sample application on PRIMARY PF

```bash
PRIMARY_PF="0000:17:00.0"
# unbind the PRIMARY PF from current driver
echo $PRIMARY_PF | sudo tee /sys/bus/pci/devices/$PRIMARY_PF/driver/unbind

# bind the PRIMARY PF to vfio-pci
echo $PRIMARY_PF | sudo tee /sys/bus/pci/drivers/vfio-pci/bind

# verify the PRIMARY PF is bound to vfio-pci
lspci  -nks $PRIMARY_PF | grep "driver in use"

# run the sample application with vf_token
vf_token=$(uuidgen)
sudo ./builddir/samples/vfio_mlx5 --device=$PRIMARY_PF,vf_token=$vf_token --nvfs=2

# Expected output showing VF delegation:
#[INFO   ] dev(0000:17:00.0)   : Adjacent functions count 1/2; management profiles 0x1
#[INFO   ] dev(0000:17:00.0)   :          pci: 17:00.2, vhca_id: 0x2, func_id: 0x0, host: 00, bus_assigned: 1
#[INFO   ] dev(0000:17:00.0)   : Setting up vf 0
#[INFO   ] dev(0000:17:00.0)   : Delegating vf[0]: vhca_id 0x4 to adj_pf[0]: vhca_id 0x2 pci: 17:00.2
#[INFO   ] dev(0000:17:00.0)   : Setting up vf 1
#[INFO   ] dev(0000:17:00.0)   : Delegating vf[1]: vhca_id 0x5 to adj_pf[0]: vhca_id 0x2 pci: 17:00.2

# 17:00.0 is the PRIMARY PF
# 17:00.2 is the ACCESS PF
# vf0/vf1 are now delegated to the ACCESS PF 17:00.2

```
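
To confirm from a script that every VF was delegated, a log written with `--file` can be scanned for the `Delegating vf[...]` lines shown in the expected output above. The sketch below assumes that log format stays as shown; `delegated_vfs` and `check_delegation` are hypothetical helper names.

```bash
# Hypothetical helper: read an application log on stdin and print
# "<vf index> <adjacent PF PCI address>" for each delegation line.
# Assumes lines of the form:
#   ... Delegating vf[0]: vhca_id 0x4 to adj_pf[0]: vhca_id 0x2 pci: 17:00.2
delegated_vfs() {
    sed -n 's/.*Delegating vf\[\([0-9]*\)].* pci: \([0-9a-fA-F:.]*\).*/\1 \2/p'
}

# Hypothetical helper: succeed only if the log on stdin contains exactly
# the expected number of delegation lines (normally the --nvfs value).
check_delegation() {
    local expected="$1" count
    count=$(delegated_vfs | wc -l)
    [ "$count" -eq "$expected" ]
}

# Example: check_delegation 2 < /tmp/mlx5.log || echo "not all VFs were delegated" >&2
```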

#### Step 4: Run ACCESS PF in switchdev mode

```bash
ACCESS_PF="0000:17:00.2"

# Make sure latest upstream mlx5_core driver or latest OFED is installed
# bind the ACCESS PF to mlx5_core or passthrough to a VM
echo $ACCESS_PF | sudo tee /sys/bus/pci/drivers/mlx5_core/bind

# Set the ACCESS PF to switchdev mode
devlink dev eswitch set pci/$ACCESS_PF mode switchdev
devlink port show pci/$ACCESS_PF

# Example output (captured on a system where the ACCESS PF is 0000:00:03.0):
# pci/0000:00:03.0/180224: type eth netdev enp0s3npf0vf1 flavour pcivf controller 0 pfnum 0 vfnum 1 external false splittable false
#   function:
#     hw_addr 00:00:00:00:00:00
# pci/0000:00:03.0/180225: type eth netdev enp0s3npf0vf2 flavour pcivf controller 0 pfnum 0 vfnum 2 external false splittable false
#   function:
#     hw_addr 00:00:00:00:00:00
# auxiliary/mlx5_core.eth.0/196607: type eth netdev enp0s3np0 flavour physical port 0 splittable false

# Two VF representors were discovered on the ACCESS PF:
# port pci/$ACCESS_PF/180224 ==> enp0s3npf0vf1
# port pci/$ACCESS_PF/180225 ==> enp0s3npf0vf2

# Set the VFs mac addresses and active port state
devlink port function set pci/$ACCESS_PF/180224 hw_addr 00:00:00:00:00:11 state active
devlink port function set pci/$ACCESS_PF/180225 hw_addr 00:00:00:00:00:12 state active

# Run your favorite network tool to manage the VFs traffic
# e.g. tc, ovs, linux bridge, DOCA SDK, etc.

```
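
Rather than copying port indexes by hand, the text output of `devlink port show` can be filtered for VF representor ports. The sketch below parses the plain-text format shown above; `vf_ports` is a hypothetical helper, and the parsing assumes the `netdev` and `vfnum` keywords appear on the same line as `flavour pcivf`, as in that sample output.

```bash
# Hypothetical helper: read `devlink port show` text output on stdin and print
# "<port handle> <representor netdev> <vfnum>" for each VF representor port.
vf_ports() {
    awk '/flavour pcivf/ {
        port = $1
        sub(/:$/, "", port)                 # strip the trailing colon from the handle
        netdev = ""; vfnum = ""
        for (i = 2; i < NF; i++) {
            if ($i == "netdev") netdev = $(i + 1)
            if ($i == "vfnum")  vfnum  = $(i + 1)
        }
        print port, netdev, vfnum
    }'
}

# Example: activate every VF port found on the ACCESS PF
# devlink port show pci/$ACCESS_PF | vf_ports | while read -r port netdev vfnum; do
#     devlink port function set "$port" state active
# done
```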

#### Step 5: Load/Bind VFs normally

```bash
PRIMARY_PF="0000:17:00.0"
CX_VF_DEV="0000:17:00.4"

# Option 1: bind the VF to mlx5_core (on the host)
echo "mlx5_core" | sudo tee /sys/bus/pci/devices/$CX_VF_DEV/driver_override
echo $CX_VF_DEV | sudo tee /sys/bus/pci/drivers/mlx5_core/bind

# Option 2: Pass the VF to a VM, see 'Device Binding' section
# Important: vfio-mlx5 application must be running with a VF token, see step 3
# Pass the VF to a qemu VM with the same VF token
qemu-system-x86_64 ... -device vfio-pci,host=$CX_VF_DEV,vf-token=$vf_token
```

#### Step 6: Tear down the VF delegation

```bash
# Unbind the VFs first; otherwise they will experience FW errors
# Disable switchdev mode on the ACCESS PF
devlink dev eswitch set pci/$ACCESS_PF mode legacy
# Stop the sample application
```

## Troubleshooting

### Common Issues

1. **Permission Denied**
   - Solution: Run with `sudo` or ensure proper VFIO permissions

2. **Device Busy**
   - Check if device is bound to another driver
   - Verify VFIO group permissions
   - Ensure that the VF token is set and that the same token is used by the application and any VMs

3. **Memory Allocation Failures**
   - Reduce memory size with `--memsize` option
   - Check available system memory
   - Consider using hugepages for large allocations
   - The OOM killer may terminate the application abruptly; a FW reset may then be
     required to avoid device health errors reported by the FW.

4. **IOMMU Errors**
   - Verify IOMMU is enabled in BIOS/UEFI
   - Check kernel command line for `intel_iommu=on` or `amd_iommu=on`
   - For testing without an IOMMU, use the `--noiommu` flag; this requires noiommu mode
     to be enabled in the kernel.
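
The IOMMU checks above can be scripted as a quick pre-flight test. The sketch below inspects a kernel command line string and counts the groups under `/sys/kernel/iommu_groups`; both function names are hypothetical helpers, and an absent command-line flag does not always mean the IOMMU is off, since some platforms enable it by default.

```bash
# Hypothetical helper: succeed if the given kernel command line contains
# intel_iommu=on or amd_iommu=on as a separate word.
cmdline_enables_iommu() {
    local cmdline="$1" word
    for word in $cmdline; do
        case "$word" in
            intel_iommu=on|amd_iommu=on) return 0 ;;
        esac
    done
    return 1
}

# Hypothetical helper: print the number of IOMMU groups exposed by the kernel.
# The directory may be overridden for testing; 0 suggests no active IOMMU.
iommu_group_count() {
    local root="${1:-/sys/kernel/iommu_groups}"
    find "$root" -mindepth 1 -maxdepth 1 -type d 2>/dev/null | wc -l
}

# Example:
# cmdline_enables_iommu "$(cat /proc/cmdline)" || echo "no explicit IOMMU flag on the command line" >&2
# [ "$(iommu_group_count)" -gt 0 ] || echo "no IOMMU groups found" >&2
```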

### Debugging

1. **Enable detailed logging:**
```bash
sudo ./builddir/samples/vfio_mlx5 --device=0000:17:00.0 --file=/tmp/debug.log
```

2. **Check kernel messages:**
```bash
sudo dmesg | grep -i vfio
sudo dmesg | grep -i iommu
```

3. **Verify device binding:**
```bash
lspci -nks 0000:17:00.0
cat /sys/bus/pci/devices/0000:17:00.0/driver/uevent
```

## Signal Handling

The application handles SIGINT (Ctrl+C) gracefully:
- Stops the event processing loop
- Terminates statistics collection
- Cleanly shuts down all devices
- Releases memory and VFIO resources

This ensures proper cleanup even when interrupted during operation.

## License

This project is licensed under the BSD-3-Clause License - see the [LICENSE](LICENSE) file for details.
