11. Debug & Troubleshoot guide
DPDK applications can be designed to have simple or complex pipeline processing
stages, making use of single or multiple threads. Applications can use poll mode
hardware devices, which also helps offload CPU cycles. It is common to find
solutions designed with
- single or multiple primary processes
- single primary and single secondary
- single primary and multiple secondaries
In all the above cases, it is tedious to isolate, debug, and understand various
behaviors which occur randomly or periodically. The goal of this guide is to
consolidate a few commonly seen issues for reference, and then to isolate and
identify the root cause through step-by-step debugging at various stages.
Note
It is difficult to cover all possible issues in a single attempt. With
feedback and suggestions from the community, more cases can be covered.
11.1. Application Overview
By making use of the application model as a reference, we can discuss multiple
causes of issues in the guide. Let us assume the sample makes use of a single
primary process, with various processing stages running on multiple cores. The
application may also make use of a Poll Mode Driver, and libraries like service
cores, mempool, mbuf, eventdev, cryptodev, QoS, and ethdev.
The overview of an application modeled using PMD is shown in
Link.
11.2. Bottleneck Analysis
Factors that lead to the design decisions could be the platform, scale factor,
and targets. These distinct preferences lead to multiple combinations that are
built using PMDs and libraries of DPDK. While the compiler, library mode, and
optimization flags are components that are held constant, they affect the
application too.
11.2.1. Is there a mismatch in the packet (received < desired) rate?
RX Port and associated core Link.
- Is the configuration for RX set up correctly?
- Identify if the port speed and duplex match the desired values with
rte_eth_link_get.
- Check that DEV_RX_OFFLOAD_JUMBO_FRAME is set with rte_eth_dev_info_get.
- If drops do not occur for the port's unique MAC address, check promiscuous
mode with rte_eth_promiscuous_get.
- Is the drop isolated to certain NICs only?
- Make use of rte_eth_dev_stats to identify the cause of the drops.
- If there are mbuf drops, check whether nb_desc for the RX descriptor ring is
sufficient for the application.
- If rte_eth_dev_stats shows drops on specific RX queues, ensure the RX lcore
threads have enough cycles for rte_eth_rx_burst on the port queue pair.
- If packets are redirected to a specific port queue pair, ensure the RX lcore
thread gets enough cycles.
- Check the RSS configuration with rte_eth_dev_rss_hash_conf_get if the
spread is uneven and causing drops.
- If PMD stats are not updating, then there might be an offload or configuration
which is dropping the incoming traffic.
- Are drops still seen?
- If there are multiple port queue pairs, it might be the RX thread, RX
distributor, or event RX adapter not having enough cycles.
- If drops are seen for the RX adapter or RX distributor, try using
rte_prefetch_non_temporal, which informs the core that the mbuf in the
cache is temporary (a combined sketch of the RX checks follows this list).
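The RX checks above map to a handful of ethdev calls. Below is a minimal
sketch, assuming an already started port identified by port_id and an older
DPDK release in which the DEV_RX_OFFLOAD_JUMBO_FRAME flag referenced by this
guide is still defined; the printed fields are illustrative.

    #include <stdio.h>
    #include <inttypes.h>
    #include <rte_ethdev.h>

    static void
    rx_sanity_check(uint16_t port_id)
    {
        struct rte_eth_link link;
        struct rte_eth_dev_info dev_info;
        struct rte_eth_stats stats;

        /* Speed, duplex and link status as negotiated on the wire. */
        rte_eth_link_get(port_id, &link);
        printf("port %u: speed %u Mbps, %s duplex, link %s\n",
               port_id, link.link_speed,
               link.link_duplex == ETH_LINK_FULL_DUPLEX ? "full" : "half",
               link.link_status ? "up" : "down");

        /* Does the device expose the jumbo frame RX offload capability? */
        if (rte_eth_dev_info_get(port_id, &dev_info) == 0 &&
            !(dev_info.rx_offload_capa & DEV_RX_OFFLOAD_JUMBO_FRAME))
            printf("port %u: no jumbo frame RX offload capability\n", port_id);

        /* Drops only for non-unique MAC addresses point at promiscuous mode. */
        printf("port %u: promiscuous mode %d\n",
               port_id, rte_eth_promiscuous_get(port_id));

        /* Basic drop counters: missed, errored and mbuf allocation failures. */
        if (rte_eth_stats_get(port_id, &stats) == 0)
            printf("port %u: imissed %" PRIu64 " ierrors %" PRIu64
                   " rx_nombuf %" PRIu64 "\n",
                   port_id, stats.imissed, stats.ierrors, stats.rx_nombuf);
    }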
11.2.2. Are there packet drops at receive or transmit?
RX-TX port and associated cores Link.
- At RX
- Identify if there are multiple RX queues configured for the port by
nb_rx_queues using rte_eth_dev_info_get.
- If rte_eth_dev_stats shows drops in q_errors, check whether the RX thread
is configured to fetch packets from the port queue pair.
- If rte_eth_dev_stats shows drops in rx_nombuf, check whether the RX
thread has enough cycles to consume the packets from the queue.
- At TX
- If the TX rate is falling behind the application fill rate, identify if
there are enough descriptors with rte_eth_dev_info_get for TX.
- Check that nb_pkts in rte_eth_tx_burst is set for multiple packets.
- Check whether rte_eth_tx_burst invokes the vector function call for the PMD.
- If oerrors are getting incremented, TX packet validations are failing.
Check if there are queue-specific offload failures.
- If the drops occur for large size packets, check the MTU and multi-segment
support configured for the NIC (the relevant drop counters are sketched below).
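A hedged sketch of reading those counters follows; it assumes the first
RTE_ETHDEV_QUEUE_STAT_CNTRS queues of the port are mapped to the per-queue
counters, which is the default mapping for most PMDs.

    #include <stdio.h>
    #include <inttypes.h>
    #include <rte_ethdev.h>

    static void
    rx_tx_drop_check(uint16_t port_id)
    {
        struct rte_eth_stats stats;
        uint16_t q;

        if (rte_eth_stats_get(port_id, &stats) != 0)
            return;

        /* rx_nombuf: the RX thread is not consuming packets fast enough. */
        if (stats.rx_nombuf)
            printf("port %u: %" PRIu64 " mbuf allocation failures (rx_nombuf)\n",
                   port_id, stats.rx_nombuf);

        /* q_errors: drops isolated to particular port queue pairs. */
        for (q = 0; q < RTE_ETHDEV_QUEUE_STAT_CNTRS; q++)
            if (stats.q_errors[q])
                printf("port %u queue %u: %" PRIu64 " errors\n",
                       port_id, q, stats.q_errors[q]);

        /* oerrors: TX packet validation or offload failures. */
        if (stats.oerrors)
            printf("port %u: %" PRIu64 " failed TX packets (oerrors)\n",
                   port_id, stats.oerrors);
    }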
11.2.3. Are there object drops at the producer point for the ring library?
Producer point for ring Link.
- Performance issue isolation at producer
- Use rte_ring_dump to validate that the single producer flag RING_F_SP_ENQ is
set where a single producer is expected.
- There should be a sufficient rte_ring_free_count at any point in time.
- Extreme stalls in the dequeue stage of the pipeline will cause
rte_ring_full to be true (see the sketch below).
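A minimal producer-side sketch, assuming r is the ring connecting two pipeline
stages:

    #include <stdio.h>
    #include <rte_ring.h>

    static void
    ring_producer_check(struct rte_ring *r)
    {
        /* Dumps flags (including RING_F_SP_ENQ), size and used/free counts. */
        rte_ring_dump(stdout, r);

        /* A ring that is almost always full points at a stalled dequeue stage. */
        if (rte_ring_full(r))
            printf("ring %s: full, free count %u\n",
                   r->name, rte_ring_free_count(r));
    }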
11.2.4. Are there object drops at the consumer point for the ring library?
Consumer point for ring Link.
- Performance issue isolation at consumer
- Use rte_ring_dump to validate that the single consumer flag RING_F_SC_DEQ is
set where a single consumer is expected.
- If the desired burst dequeue falls behind the actual dequeue, the enqueue
stage is not filling up the ring as required.
- Extreme stalls in the enqueue will lead to rte_ring_empty being true (see the
sketch below).
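The consumer-side counterpart below is a sketch only; BURST_SIZE and the objs
array are illustrative and not part of the ring API.

    #include <stdio.h>
    #include <rte_ring.h>

    #define BURST_SIZE 32 /* illustrative burst size */

    static unsigned int
    ring_consumer_dequeue(struct rte_ring *r, void **objs)
    {
        unsigned int n;

        /* A persistently empty ring means the enqueue stage is falling behind. */
        if (rte_ring_empty(r))
            printf("ring %s: empty, enqueue stage is not keeping up\n", r->name);

        /* Short bursts quantify the gap between desired and actual dequeue. */
        n = rte_ring_dequeue_burst(r, objs, BURST_SIZE, NULL);
        if (n < BURST_SIZE)
            printf("ring %s: dequeued %u of %u\n", r->name, n, BURST_SIZE);

        return n;
    }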
11.2.5. Is there a variance in packet or object processing rate in the pipeline?
Memory objects close to NUMA Link.
- Stalls in the processing pipeline can be attributed to MBUF release delays.
These can be narrowed down to
- Heavy processing cycles at single or multiple processing stages.
- Cache is spread due to the increased stages in the pipeline.
- CPU thread responsible for TX is not able to keep up with the burst of
traffic.
- Extra cycles to linearize multi-segment buffer and software offload like
checksum, TSO, and VLAN strip.
- Packet buffer copy in fast path also results in stalls in MBUF release if
not done selectively.
- Application logic sets the reference count via rte_pktmbuf_refcnt_set higher
than the desired value and frequently uses rte_pktmbuf_prefree_seg, so the
MBUF is not released back to the mempool.
- Lower performance between the pipeline processing stages can be due to
- The NUMA instance for packets or objects from the NIC, mempool, and ring
should be the same.
- Drops on a specific socket are due to insufficient objects in the pool.
Use rte_mempool_get_count or rte_mempool_avail_count to monitor
when the drops occur.
- Try prefetching the content in processing pipeline logic to minimize the
stalls.
- Performance issues can be due to special cases
- Check if the MBUF is contiguous with rte_pktmbuf_is_contiguous, as certain
offloads require it.
- Use rte_mempool_cache_create for user threads that require access to
mempool objects.
- If the variance is absent for larger huge pages, then try rte_mem_lock_page
on the objects, packets, and lookup tables to isolate the issue. The mempool
and MBUF checks are sketched below.
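A small monitoring sketch for those checks, assuming mp is the packet mempool
and m a received mbuf:

    #include <stdio.h>
    #include <rte_mempool.h>
    #include <rte_mbuf.h>

    static void
    mempool_and_mbuf_check(struct rte_mempool *mp, struct rte_mbuf *m)
    {
        /* Few available objects on a socket correlate with drops there. */
        printf("mempool %s: %u available, %u in use\n",
               mp->name, rte_mempool_avail_count(mp),
               rte_mempool_in_use_count(mp));

        /* Certain offloads require a single contiguous segment. */
        if (!rte_pktmbuf_is_contiguous(m))
            printf("mbuf %p: multi-segment (%u segments)\n",
                   (void *)m, m->nb_segs);
    }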
11.2.8. Are the execution cycles for dynamic service functions too infrequent?
Service functions on service cores Link.
- Performance issue isolation
- Services configured for parallel execution should have
rte_service_lcore_count equal to rte_service_lcore_count_services.
- A service to run in parallel on all cores should return
RTE_SERVICE_CAP_MT_SAFE for rte_service_probe_capability, and
rte_service_map_lcore_get should return a unique lcore.
- If the execution cycles for dynamic service functions are not frequent:
- If services share the lcore, the overall execution should fit within the
cycle budget.
- Configuration issue isolation
- Check if the service is running with rte_service_runstate_get.
- Generic debug via rte_service_dump (a combined sketch follows below).
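A sketch of these checks for a single service id, obtained earlier for example
via rte_service_get_by_name, might look as follows:

    #include <stdio.h>
    #include <stdint.h>
    #include <rte_service.h>

    static void
    service_check(uint32_t id)
    {
        /* Is the service actually in the running state? */
        if (rte_service_runstate_get(id) != 1)
            printf("service %u is not running\n", id);

        /* Only MT-safe services may be mapped to several lcores in parallel. */
        if (rte_service_probe_capability(id, RTE_SERVICE_CAP_MT_SAFE) != 1)
            printf("service %u is not MT safe; map it to a single lcore\n", id);

        printf("%d service lcores configured\n", rte_service_lcore_count());

        /* Generic per-service statistics (calls, cycles). */
        rte_service_dump(stdout, id);
    }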
11.2.10. Is there a variance in traffic manager?
Traffic Manager on TX interface Link.
- Identify whether the cause of a variance from expected behavior is
insufficient CPU cycles. Use rte_tm_capabilities_get to fetch features
for hierarchies, WRED and priority schedulers that can be offloaded to hardware.
- Undesired flow drops can be narrowed down to WRED, priority, and rate
limiters.
- Isolate the flow in which the undesired drops occur. Use
rte_tm_get_number_of_leaf_nodes and the flow table to pin down the leaf
where the drops occur.
- Check the stats using rte_tm_stats_update and rte_tm_node_stats_read
for drops in the hierarchy, scheduler and WRED configurations, as sketched below.
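The following sketch assumes the PMD implements the rte_tm API and that
node_id identifies a leaf node found through the application's flow table;
the printed capability fields are illustrative.

    #include <stdio.h>
    #include <inttypes.h>
    #include <rte_tm.h>

    static void
    tm_drop_check(uint16_t port_id, uint32_t node_id)
    {
        struct rte_tm_error error;
        struct rte_tm_capabilities cap;
        struct rte_tm_node_stats stats;
        uint64_t stats_mask;
        uint32_t n_leaf;

        /* Which hierarchies, WRED and priority schedulers can be offloaded. */
        if (rte_tm_capabilities_get(port_id, &cap, &error) == 0)
            printf("port %u: max %u levels, %u nodes\n",
                   port_id, cap.n_levels_max, cap.n_nodes_max);

        if (rte_tm_get_number_of_leaf_nodes(port_id, &n_leaf, &error) == 0)
            printf("port %u: %u leaf nodes\n", port_id, n_leaf);

        /* Per-leaf counters (stats_mask says which fields are valid);
         * stats.leaf.n_pkts_dropped[] narrows drops down to WRED or shapers. */
        if (rte_tm_node_stats_read(port_id, node_id, &stats, &stats_mask,
                                   0, &error) == 0)
            printf("node %u: %" PRIu64 " pkts, %" PRIu64 " bytes\n",
                   node_id, stats.n_pkts, stats.n_bytes);
    }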
11.2.12. Does the issue still persist?
The issue can be further narrowed down to the following causes.
- If there is vendor- or application-specific metadata, check for errors due
to metadata error flags. Dumping the private metadata in the objects can give
insight into details for debugging.
- If multiple processes are used for either data or configuration, check for
possible errors in the secondary process where the configuration fails and
possible data corruption in the data plane.
- Random drops in RX or TX when another application is started are an indication
of the effect of a noisy neighbor. Try using the cache allocation technique
to minimize the effect between applications.
11.3. How to develop custom code to debug?
- For an application that runs as the primary process only, debug functionality
is added in the same process. This can be invoked by a timer call-back, a
service core, or a signal handler (a minimal signal-handler sketch follows
this list).
- For an application that runs as multiple processes, debug functionality can be
added in a standalone secondary process.
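As an illustration of the first case, the sketch below registers a SIGUSR1
handler in the primary process that dumps port statistics and the known
mempools and rings; a production application would typically only set a flag
in the handler and perform the dumps from a normal thread, since these calls
are not async-signal-safe.

    #include <signal.h>
    #include <stdio.h>
    #include <inttypes.h>
    #include <rte_ethdev.h>
    #include <rte_mempool.h>
    #include <rte_ring.h>

    static void
    debug_dump(int signum)
    {
        uint16_t port_id;
        struct rte_eth_stats stats;

        (void)signum;

        /* Basic RX/TX counters for every probed port. */
        RTE_ETH_FOREACH_DEV(port_id) {
            if (rte_eth_stats_get(port_id, &stats) == 0)
                printf("port %u: ipackets %" PRIu64 " imissed %" PRIu64
                       " oerrors %" PRIu64 "\n", port_id,
                       stats.ipackets, stats.imissed, stats.oerrors);
        }

        /* Built-in dumps for all mempools and rings known to this process. */
        rte_mempool_list_dump(stdout);
        rte_ring_list_dump(stdout);
    }

    /* In main(), after rte_eal_init(): signal(SIGUSR1, debug_dump); */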