# libvfio-mlx5 Library Documentation

A user-space library for managing Mellanox ConnectX devices through VFIO.

## Overview

libvfio-mlx5 provides APIs for managing ConnectX devices in user space through VFIO, enabling hitless device management and external memory control.

## Key Features

- External memory management
- Multi-device support (up to 8 devices)
- Virtual Function management
- Event-driven architecture
- Device health monitoring

## API Functions

For comprehensive API documentation with detailed parameter descriptions, return values, and usage notes, see [`include/vfio_mlx5.h`](include/vfio_mlx5.h).

### Initialization
- `vfio_mlx5_init(persistent_memory, size, iova, num_devices)` - Initialize library with external memory
    - `persistent_memory` - Persistent memory pointer
    - `size` - Size of the persistent memory
    - `iova` - IOVA/DMA address of the persistent memory (see note below)
    - `num_devices` - Number of devices to prepare for

- `vfio_mlx5_uninit(handle)` - Clean up vfio_mlx5 handle
    - `handle` - Handle returned from `vfio_mlx5_init()`

**Note:** IOVA selection:

- On some ARM systems, the low IOVA range up to `0x08000000` (128 MB) may be reserved by the platform/IOMMU. Mapping persistent memory at `iova = 0` with size > 128 MB will overlap this region and cause DMA mapping failures.
- To avoid this, use a sufficiently high IOVA base address.

Example:

```c
/* On ARM avoid reserved IOVA below 0x08000000 (128 MB). Use a high base. */
#define DEFAULT_IOVA (0x40000000ULL) /* 1 GB */
```
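
A minimal sketch of how an application might check a candidate IOVA range against a reserved window before calling `vfio_mlx5_init()`. The helpers `parse_reserved_region()` and `iova_overlaps()` are hypothetical and are not part of libvfio-mlx5. The sketch assumes the reserved windows are read from the Linux sysfs file `/sys/kernel/iommu_groups/<group>/reserved_regions`, where each line holds an inclusive start address, an inclusive end address, and a type.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: parse one "<start> <end> <type>" line.
 * Both bounds are inclusive. Returns 0 on success, -1 on malformed input. */
int parse_reserved_region(const char *line, uint64_t *start, uint64_t *end)
{
    if (sscanf(line, "%" SCNx64 " %" SCNx64, start, end) != 2)
        return -1;
    return (*start <= *end) ? 0 : -1;
}

/* Hypothetical helper: return 1 if the mapping [iova, iova + size)
 * intersects the inclusive window [resv_start, resv_end], else 0. */
int iova_overlaps(uint64_t iova, uint64_t size,
                  uint64_t resv_start, uint64_t resv_end)
{
    uint64_t last;

    if (size == 0)
        return 0;
    last = iova + (size - 1);   /* last byte of the mapping */
    if (last < iova)            /* clamp if the addition wrapped around */
        last = UINT64_MAX;
    return iova <= resv_end && last >= resv_start;
}
```

With the reserved range described above, `iova_overlaps(0, 256ULL << 20, 0, 0x07ffffff)` reports an overlap, while the same size at `DEFAULT_IOVA` does not.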


### Device Management
- `vfio_mlx5_device_add(handle, bdf, fd, nvfs)` - Add ConnectX device
    - `handle` - Handle returned from `vfio_mlx5_init()`
    - `bdf` - BDF of the device
    - `fd` - File descriptor of the device (vfio-pci device)
    - `nvfs` - Number of VFs to setup/delegate to adjacent Access PFs (see VF delegation section below)

- `vfio_mlx5_device_del(dev)` - Remove device
    - `dev` - Device returned from `vfio_mlx5_device_add()`
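
The exact BDF string format the library accepts is not specified here; the examples in this document use the full `domain:bus:device.function` form (e.g. `0000:17:00.0`). As a sketch under that assumption, an application could sanity-check a BDF string before passing it to `vfio_mlx5_device_add()`. `struct pci_bdf` and `parse_bdf()` are hypothetical names, not library API, and this is a basic shape check rather than a strict validator.

```c
#include <stdio.h>

/* Hypothetical container for a decoded PCI address. */
struct pci_bdf {
    unsigned int domain;
    unsigned int bus;
    unsigned int device;
    unsigned int function;
};

/* Parse "DDDD:BB:DD.F" (hexadecimal fields).
 * Returns 0 on success, -1 if the string does not have that shape. */
int parse_bdf(const char *s, struct pci_bdf *out)
{
    unsigned int dom, bus, dev, fn;
    char extra;

    /* Exactly four fields must convert; a fifth means trailing characters. */
    if (sscanf(s, "%4x:%2x:%2x.%1x%c", &dom, &bus, &dev, &fn, &extra) != 4)
        return -1;
    /* PCI allows 32 devices per bus and 8 functions per device. */
    if (dev > 0x1f || fn > 0x7)
        return -1;

    out->domain = dom;
    out->bus = bus;
    out->device = dev;
    out->function = fn;
    return 0;
}
```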

### Event Processing
- `vfio_mlx5_events_process(dev)` - Process device events
    - `dev` - Device returned from `vfio_mlx5_device_add()`

### Statistics
- `vfio_mlx5_dev_stats(dev, stats)` - Get device statistics
    - `dev` - Device returned from `vfio_mlx5_device_add()`
    - `stats` - Pointer to `mlx5_dev_stats` structure to store the statistics

### Logging
- `vfio_mlx5_log_set(level, outf, errf)` - Configure logging
    - `level` - Logging level
    - `outf` - File pointer for output
    - `errf` - File pointer for error

### Suspend/Resume
- `vfio_mlx5_dev_index(dev)` - Get the index of a device
- `vfio_mlx5_device_suspend(dev)` - Suspend a specific device
- `vfio_mlx5_suspend(vmh)` - Suspend a vfio_mlx5 handle
- `vfio_mlx5_resume(persist_storage)` - Resume a vfio_mlx5 handle from a new storage area
    - Returns a new handle that replaces the old one
- `vfio_mlx5_dev_get(vmh, index)` - Get a new device handle from the new `vmh` after resume, by index
    - Returns a new device handle
- `vfio_mlx5_device_resume(vmh, dev, device_fd)` - Resume a device; a new VFIO device fd may be supplied
## Device Health Monitoring and Statistics

### Statistics Overview

The library provides comprehensive device statistics through `vfio_mlx5_dev_stats()` for monitoring device health and performance:

**Page Events (`mlx5_pg_events_t`):**
- `FWP_EVENT_GIVE_BOOT` - Pages given to firmware during device initialization
- `FWP_EVENT_FW_REQ_GIVE` - Firmware requested additional pages (normal operation)
- `FWP_EVENT_FW_REQ_TAKE` - Firmware released pages back to driver
- `FWP_EVENT_GIVE_SUCCESS/ERROR` - Success/failure of page allocation operations
- `FWP_EVENT_TAKE_SUCCESS/ERROR` - Success/failure of page deallocation operations
- `FWP_EVENT_CANT_GIVE` - Driver unable to satisfy firmware page request (potential issue)
- `FWP_EVENT_FW_REQ_DROP` - Firmware request dropped (potential issue)
- `*_ERROR` - Any non-zero `*_ERROR` events indicate device operation failures

**Memory Statistics (`mlx5_pg_alloc_stats_t`):**
- `total_pages` - Total pages managed by the internal allocator
- `free_pages` - Currently available pages for allocation
- `allocs/frees` - Allocation and free operation counters
- `allocs_failed` - Failed allocations (indicates memory pressure)
- `double_frees` - Detected double-free errors (indicates bugs)

**Page ownership:**
- `firmware_pages` - Pages currently held by device firmware
- `driver_pages` - Pages used by driver structures

**Health Record (`mlx5_health_record`):**

Contains the firmware (FW) health buffer and should be collected on error events.
When `vfio_mlx5_events_process()` or `vfio_mlx5_dev_stats()` returns a negative value,
an error event has occurred and the health record contains the error details.

### Thread Safety and Usage Patterns

**Stats collection is thread-safe** and can run in parallel with `vfio_mlx5_events_process()` per device.

**Recommended Usage: thread-per-device**
```c
void *device_thread(void *arg) {
    struct vfio_mlx5_dev *dev = vfio_mlx5_device_add(...);
    struct mlx5_dev_stats stats;

    while (running) {
        // Process events
        int ret = vfio_mlx5_events_process(dev);

        if (ret < 0) {
            // Device error detected - collect final stats
            vfio_mlx5_dev_stats(dev, &stats);
            log_device_failure(device_bdf, &stats);

            // Gracefully remove device
            vfio_mlx5_device_del(dev);
            break;
        }
        // Periodic health check (recommended)
        if (should_check_health()) {
            vfio_mlx5_dev_stats(dev, &stats);

            // Check for ERROR events (must be logged and reported).
            // Shown for FWP_EVENT_GIVE_ERROR; repeat for every other *_ERROR event.
            if (stats.page_events[FWP_EVENT_GIVE_ERROR] > previous_give_errors) {
                log_error("Device %s: Failed with EVENT %d", device_bdf, FWP_EVENT_GIVE_ERROR);
            }

            // Monitor for concerning patterns (warnings)
            if (stats.page_stats.allocs_failed > previous_failed) {
                log_warning("Device %s: Memory pressure detected", device_bdf);
            }

            if (stats.page_events[FWP_EVENT_CANT_GIVE] > previous_cant_give) {
                log_warning("Device %s: Unable to satisfy firmware requests", device_bdf);
            }
        }
    }
    return NULL;
}

// Dedicated monitoring thread (optional)
void *stats_monitoring_thread(void *arg) {
    while (running) {
        for (int i = 0; i < num_devices; i++) {
            struct mlx5_dev_stats stats;
            vfio_mlx5_dev_stats(devices[i], &stats);

            // Log stats, update dashboards, trigger alerts, etc.
            process_device_stats(device_bdfs[i], &stats);
        }
        sleep(stats_interval);
    }
    return NULL;
}
```

### Health Monitoring Guidelines

**Normal Operation Indicators:**
- `FWP_EVENT_FW_REQ_GIVE/TAKE` events are normal during operation
- `allocs_failed` should remain zero or very low
- `firmware_pages` should stabilize after initialization

**Warning Signs:**
- Increasing `allocs_failed` indicates memory pressure
- Non-zero `FWP_EVENT_CANT_GIVE` suggests resource exhaustion
- Sudden increases in `double_frees` indicate potential corruption

**Error Events (Require Logging and Reporting):**
- `FWP_EVENT_GIVE_ERROR` - Failed to allocate pages to firmware
- `FWP_EVENT_TAKE_ERROR` - Failed to reclaim pages from firmware
- `FWP_EVENT_CANT_GIVE_ERROR` - Failed to notify firmware of resource exhaustion
- `FWP_EVENT_FW_REQ_DROP_ERROR` - Failed to process firmware request
- Any non-zero `*_ERROR` events indicate device operation failures

**Critical Errors:**
- `vfio_mlx5_events_process()` returning negative values
- Requires immediate stats collection and device removal
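
The guidelines above can be folded into a small classifier that compares two consecutive stats snapshots. This is only a sketch: `struct health_counters` is a hypothetical, simplified copy of the counters named in this document, not the library's `mlx5_dev_stats` layout, and the application is assumed to fill it in after each `vfio_mlx5_dev_stats()` call.

```c
#include <stdint.h>

/* Hypothetical counters copied out of the library's stats by the caller. */
struct health_counters {
    uint64_t error_events;  /* sum of all *_ERROR page events */
    uint64_t allocs_failed; /* failed page allocations */
    uint64_t cant_give;     /* FWP_EVENT_CANT_GIVE count */
    uint64_t double_frees;  /* detected double frees */
};

enum health_level { HEALTH_OK = 0, HEALTH_WARNING = 1, HEALTH_ERROR = 2 };

/* Classify what happened between two snapshots of the same device. */
enum health_level classify_health(const struct health_counters *prev,
                                  const struct health_counters *cur)
{
    /* Any new *_ERROR event must be logged and reported. */
    if (cur->error_events > prev->error_events)
        return HEALTH_ERROR;

    /* Growing warning counters point at memory pressure or corruption. */
    if (cur->allocs_failed > prev->allocs_failed ||
        cur->cant_give > prev->cant_give ||
        cur->double_frees > prev->double_frees)
        return HEALTH_WARNING;

    return HEALTH_OK;
}
```

Comparing deltas rather than absolute values keeps an old, already-reported failure from raising the same alert on every poll.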

### Error Recovery Workflow

```c
int handle_device_error(struct vfio_mlx5_dev *dev, const char *bdf) {
	struct mlx5_dev_stats final_stats;

	// 1. Collect final statistics for debugging
	vfio_mlx5_dev_stats(dev, &final_stats);
	log_device_stats("DEVICE_FAILURE", bdf, &final_stats);

	// Report any ERROR events that occurred
	if (final_stats.page_events[FWP_EVENT_GIVE_ERROR] > 0) {
		log_error("Device %s had %lu firmware page allocation failures",
			bdf, final_stats.page_events[FWP_EVENT_GIVE_ERROR]);
	}
	if (final_stats.page_events[FWP_EVENT_TAKE_ERROR] > 0) {
		log_error("Device %s had %lu firmware page reclaim failures",
			bdf, final_stats.page_events[FWP_EVENT_TAKE_ERROR]);
	}
	if (final_stats.page_events[FWP_EVENT_CANT_GIVE_ERROR] > 0) {
		log_error("Device %s had %lu firmware notification failures",
			bdf, final_stats.page_events[FWP_EVENT_CANT_GIVE_ERROR]);
	}
	if (final_stats.page_events[FWP_EVENT_FW_REQ_DROP_ERROR] > 0) {
		log_error("Device %s had %lu firmware request processing failures",
			bdf, final_stats.page_events[FWP_EVENT_FW_REQ_DROP_ERROR]);
	}

	// Check other failure patterns (memory exhaustion)
	if (final_stats.page_stats.allocs_failed > 0) {
		log_error("Device %s failed with %lu memory allocation failures",
			bdf, final_stats.page_stats.allocs_failed);
	}

	// 2. Gracefully remove device
	vfio_mlx5_device_del(dev);

	// 3. Implement recovery logic (restart, failover, etc.)
	return initiate_device_recovery(bdf);
}
```

## Complete Usage Example

```c
#include "vfio_mlx5.h"

// Initialize with external memory for multiple devices
struct vfio_mlx5_handle *vmh = vfio_mlx5_init(mem, size, iova, num_devices);

// Thread function for each device
void *device_thread(void *arg) {
    struct device_info *info = (struct device_info *)arg;

    // Add device (thread-safe)
    struct vfio_mlx5_dev *dev = vfio_mlx5_device_add(vmh, info->bdf, info->fd, info->nvfs);

    // Process events for this device
    while (running) {
        vfio_mlx5_events_process(dev);
    }

    // Cleanup
    vfio_mlx5_device_del(dev);
    return NULL;
}

// Create one thread per device (recommended)
for (int i = 0; i < num_devices; i++) {
    pthread_create(&threads[i], NULL, device_thread, &device_info[i]);
}

// Wait for threads and cleanup
for (int i = 0; i < num_devices; i++) {
    pthread_join(threads[i], NULL);
}
vfio_mlx5_uninit(vmh);
```

## Integration Patterns

### VFIO Legacy Container
- Standard VFIO container/group model
- IOMMU Type1 mapping

### IOMMUFD
- Modern IOMMUFD-based VFIO
- Direct IOMMU management

### Virtualization
- QEMU/KVM integration
- QEMU with vf_token support

## VF Delegation in Multi-PF Environments


### Overview

With recent ConnectX firmware (internal version **28.46.0332** or newer),
the sample application automatically supports VF delegation in multi-PF environments.

When VFs are created on a PRIMARY PF managed by the application, they are automatically
delegated to adjacent ACCESS PFs running the native mlx5_core driver for centralized network management.

### Library Behavior with Latest Firmware

The library automatically detects and configures VF delegation when the following conditions are met:

1. **Adjacent PF Detection**: Library queries for adjacent PFs on the same ConnectX device
2. **Capability Verification**: Checks firmware support for `delegate_vhca_management_profiles`
3. **Automatic Delegation**: VFs created on PRIMARY PF are delegated to adjacent ACCESS PFs
4. **Centralized Management**: VFs become manageable through standard devlink/switchdev and DOCA SDK on ACCESS PF

### Architecture and Data Flow

```
+-------------------------------------------------------------+
|                    ConnectX Device                          |
|                 (Latest Firmware)                           |
+-------------------------------------------------------------+
|    PRIMARY PF       |           ACCESS PF                   |
|   (vfio-pci)        |         (mlx5_core)                   |
|                     |                                       |
|  +-------------+    |    +-----------------------------+    |
|  |libvfio-mlx5 |    |    |    mlx5_core driver         |    |
|  |             |    |    |    + switchdev mode         |    |
|  | Creates VFs |-------->|  Receives delegated VFs     |    |
|  | Delegates   |    |    |  Manages via devlink        |    |
|  | Management  |    |    |  Provides representors      |    |
|  +-------------+    |    +-----------------------------+    |
|                     |                                       |
|  VFs: 17:00.2       |         VF Representors:              |
|       17:00.3       |         pf0vf0, pf0vf1                |
|       17:00.4   ------------>  pf0vf2, pf0vf3               |
|       17:00.5       |         (switchdev ports)             |
+-------------------------------------------------------------+

```

**Library Flow:**
1. `vfio_mlx5_device_add()` detects adjacent PFs
2. VFs are enabled on the PRIMARY PF
3. Each VF is delegated to the ACCESS PFs
4. On the ACCESS PF, VF representors appear as switchdev ports
5. The network admin manages them via standard devlink commands

**Multi Access-PF for hitless upgrade (not supported yet):**
- Multiple Access PFs follow an active/standby model:
    - The first Access PF assumes ownership of the VF traffic pipeline (Active Access PF)
    - DOCA SDK will be available on the Active Access PF and the Standby Access PF simultaneously
    - VF traffic goes through the Active Access PF FDB table (steering rules), for both the slow path and the fast/offload path
    - Standby Access PFs remain in standby mode for hitless upgrade (not supported yet)
    - Duplicating steering rules (e.g. via DOCA SDK) on the Standby Access PF is outside the scope of this document
    - Switching between Active and Standby Access PFs is outside the scope of this document

### API Integration

The VF delegation is transparent to the application - no additional API calls are required:

```c
// Standard library usage - delegation happens automatically
struct vfio_mlx5_handle *vmh = vfio_mlx5_init(mem, size, iova, 1);

// When nvfs > 0, VFs are created and automatically delegated
struct vfio_mlx5_dev *dev = vfio_mlx5_device_add(vmh, "0000:17:00.0", fd, 4);

// Library logs show delegation status:
// [INFO] Adjacent functions count 1; management profiles 0x1
// [INFO] Delegating vf[0]: vhca_id 0x2 to adj_pf[0]: vhca_id 0x1
// [INFO] Delegating vf[1]: vhca_id 0x3 to adj_pf[0]: vhca_id 0x1
```

### Delegation Requirements

#### HW/Software Requirements

- ConnectX-7 or newer device with multi-PF support
- Firmware version **28.46.0332** (internal version) or newer
- For a switchdev/DOCA-enabled driver, one of the following:
  - Upstream Linux kernel 6.16 or later
  - DOCA-Host 25.07 or later, e.g. **doca-host-x.y.z-25.07**
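
A deployment script or application may want to compare a firmware version string against the minimum above. The sketch below assumes the version is available as a `major.minor.subminor` string; how that string is obtained is not covered in this document. `fw_version_at_least()` is a hypothetical helper, not library API. It compares the fields numerically, and whether versions compare meaningfully across different device families is not stated here.

```c
#include <stdio.h>

/* Returns 1 if `version` >= min_major.min_minor.min_sub, 0 if it is older,
 * and -1 if the string is not of the form "N.N.N". */
int fw_version_at_least(const char *version, unsigned int min_major,
                        unsigned int min_minor, unsigned int min_sub)
{
    unsigned int major, minor, sub;
    char extra;

    /* Exactly three numeric fields, with nothing after them. */
    if (sscanf(version, "%u.%u.%u%c", &major, &minor, &sub, &extra) != 3)
        return -1;

    if (major != min_major)
        return major > min_major;
    if (minor != min_minor)
        return minor > min_minor;
    return sub >= min_sub;
}
```

For the requirement above, the check would be `fw_version_at_least(ver, 28, 46, 332)`.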

**System Configuration:**
- Multi-PF ConnectX device with at least 2 PFs
- PRIMARY PF bound to vfio-pci driver
- Secondary PFs (ACCESS PFs) bound to mlx5_core driver in switchdev mode
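
One way to verify the driver bindings listed above is to read the `driver` symlink that Linux exposes for each PCI device under `/sys/bus/pci/devices/<bdf>/`. The sketch below is an illustration only; `pci_bound_driver()` is a hypothetical helper and not part of libvfio-mlx5. The devices directory is a parameter so the helper can be exercised against a test directory.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Copy the name of the driver bound to `bdf` into `name`.
 * `pci_devices_dir` is normally "/sys/bus/pci/devices".
 * Returns 0 on success, -1 if no driver is bound or on any error. */
int pci_bound_driver(const char *pci_devices_dir, const char *bdf,
                     char *name, size_t name_len)
{
    char path[512];
    char target[512];
    const char *base;
    ssize_t len;
    int n;

    n = snprintf(path, sizeof(path), "%s/%s/driver", pci_devices_dir, bdf);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    /* The link target ends with the driver name, e.g. ".../drivers/vfio-pci". */
    len = readlink(path, target, sizeof(target) - 1);
    if (len < 0)
        return -1;
    target[len] = '\0';

    base = strrchr(target, '/');
    base = base ? base + 1 : target;
    if (strlen(base) + 1 > name_len)
        return -1;
    strcpy(name, base);
    return 0;
}
```

An application could call this for the PRIMARY PF and expect `vfio-pci`, and for each ACCESS PF and expect `mlx5_core`.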

### Benefits for Applications

1. **Transparent Operation**: No code changes required for delegation
2. **Hybrid Control**: Application controls PRIMARY PF, kernel manages VFs
3. **Standard Tooling**: VFs manageable via devlink, tc, ovs, DOCA, etc.
4. **Scalability**: Efficient resource utilization across PFs
5. **Flexibility**: Delegation can be disabled by removing the device through the device removal API (`vfio_mlx5_device_del()`).
6. **Hitless Upgrade**:
    - Kexec support is trivial for user-space applications.
    - Delegation can target multiple Access PFs in an active/standby model,
      allowing hitless upgrade between the Access PFs' software stacks
      (multi Access-PF is not supported yet).

## Memory Management

Memory must be externally allocated and mapped:
- Minimum 128MB recommended
- 4KB alignment required
- Persistent across kexec operations for hitless upgrade support
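
A minimal sketch of checking a region against the constraints above before handing it to `vfio_mlx5_init()`. `region_map()` and `region_check()` are hypothetical helpers, not library API. The anonymous mapping is only a stand-in that yields page-aligned memory: it is not persistent across kexec and is not mapped for DMA, both of which the application must arrange separately.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for MAP_ANONYMOUS on strict compilers */
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define REGION_ALIGN     4096u                  /* 4KB alignment from the list above */
#define REGION_MIN_BYTES (128u * 1024u * 1024u) /* 128MB recommended minimum */

/* Stand-in allocator: page-aligned, zero-filled, NOT kexec-persistent. */
void *region_map(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Returns 0 if the region meets the constraints, 1 if it is aligned but
 * smaller than the recommended minimum, -1 if it is NULL, empty or misaligned. */
int region_check(const void *base, size_t size)
{
    if (base == NULL || size == 0)
        return -1;
    if ((uintptr_t)base % REGION_ALIGN != 0)
        return -1;
    return size < REGION_MIN_BYTES ? 1 : 0;
}
```

Treating an undersized region as a warning rather than an error mirrors the text, which recommends the 128MB minimum but does not state it as a hard requirement.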

## Thread Safety

**Thread-per-device model supported and recommended**

### Supported Threading Model

- **Device Addition**: `vfio_mlx5_device_add()` can be called concurrently from multiple threads
- **Event Processing**: `vfio_mlx5_events_process()` can be called concurrently, one thread per device
- **Per-Device Operations**: Each device can be managed by its own dedicated thread

### Recommended Usage Pattern

```c
// Thread-per-device approach (RECOMMENDED)
void *device_thread(void *arg) {
    struct device_info *info = (struct device_info *)arg;

    // Add device (can be done concurrently with other device additions)
    struct vfio_mlx5_dev *dev = vfio_mlx5_device_add(vmh, info->bdf, info->fd, info->nvfs);

    // Process events for this device only
    while (running) {
        int ret = vfio_mlx5_events_process(dev);
        if (ret < 0) {
            // Handle device error (see error recovery workflow)
            break;
        }
    }

    // Clean up device
    vfio_mlx5_device_del(dev);
    return NULL;
}

// Main thread creates one thread per device
for (int i = 0; i < num_devices; i++) {
    pthread_create(&device_threads[i], NULL, device_thread, &device_info[i]);
}
```

## Limitations

- **Shared Resources**: All devices share the same DMA page allocator

### Suspend/Resume
The suspend/resume API is primarily useful when the device's entire DMA space is re-mapped (e.g. in kexec hitless upgrade scenarios).
The user must suspend both the devices and the vmh (vfio-mlx5 handle) before the process is suspended,
then resume the vmh and the devices on a new storage area (a new virtual mapping of the DMA space). The vmh and device handles
must be re-obtained through the suspend/resume API.

#### Process A
```c
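struct vfio_mlx5_dev *dev_map[MAX_DEVS] = {0};
struct vfio_mlx5_handle *vmh = vfio_mlx5_init(storage, ...);

for (int i = 0; i < MAX_DEVS; i++) {
    struct vfio_mlx5_dev *dev = vfio_mlx5_device_add(vmh, ...);
    // Index the map by the library's device index so it can be rebuilt after resume
    dev_map[vfio_mlx5_dev_index(dev)] = dev;
}

// Suspend every device, then the handle
for (int i = 0; i < MAX_DEVS; i++)
    vfio_mlx5_device_suspend(dev_map[i]);

vfio_mlx5_suspend(vmh);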
```

#### Process B (map to new storage)
```c
struct vfio_mlx5_dev *dev_map[MAX_DEVS] = {0};

struct vfio_mlx5_handle *new_vmh = vfio_mlx5_resume(new_storage);
// Re-fill the device map from the new vmh
for (int i = 0; i < MAX_DEVS; i++) {
    dev_map[i] = vfio_mlx5_dev_get(new_vmh, i);
    // Signature as listed under API Functions: (vmh, dev, device_fd)
    vfio_mlx5_device_resume(new_vmh, dev_map[i], new_vfio_device_fd);
}

// Continue managing the same devices with the new vmh and new dev handles
```


## Best Practices

1. **Use thread-per-device model** for optimal performance and scalability
2. **Monitor device health frequently** - collect stats periodically and/or after event processing for early problem detection
3. **Use hugepages** for large allocations (>= 512MB)
4. **Process events via epoll/interrupt mode** rather than polling
5. **Implement proper error recovery** - collect stats on device errors before removal
6. **Follow proper cleanup order**: stop event processing → remove error devices

For complete examples, see `samples/vfio_mlx5.c`