NVMe I/O timeouts

I tried this patch but I still get timeouts :weary:

This is a stab in the dark, but since the USB 3.0 chipset also uses PCIe, I wonder: if you remove all USB devices and access the machine only over SSH, are the timeouts fewer or gone?

I’m only thinking about this because firmware later than the one available for the VF2 fixed some PCIe issues. It is probably a red herring (idioms do not translate into all languages, hence the link). But my thinking is that if you unplug all USB devices, that lane should generate no PCIe traffic. At the very least it would cross one item off the list of possible causes, or shift it much further down the list.

I have no USB devices plugged in and I only access it via ssh.

I have the same issue, both with a VF2 1.3b and a 1.2a board, but only with the upstream kernel branch.
With the 5.15 kernel included in the Wayland Debian image I do not get any timeouts.
I just tried the current 6.4-rc1 (JH7110_VisionFive2_upstream branch as of just now) and am still getting timeouts.
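
To make the comparison between kernels less anecdotal, the timeout warnings in the kernel log can be counted per boot. The following is only a minimal sketch: the function name `count_nvme_timeouts` is hypothetical, and the regular expression assumes the warning contains the controller name followed by `QID <n> timeout`, which may differ between kernel versions.

```python
import re
from collections import Counter

# Assumed message shape; the exact wording can differ between kernel versions.
TIMEOUT_RE = re.compile(r"nvme (nvme\d+): .*QID (\d+) timeout")


def count_nvme_timeouts(log_text):
    """Count timeout warnings, keyed by (controller name, queue id)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts
```

Saving the output of `dmesg` to a file on each kernel and passing the file contents to this function would give one number per queue to compare between the 5.15 and upstream builds.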

@Wrybane thanks for reporting that, an interesting data point.

Just for completeness, what brand/model of NVMe are you using? There is some suggestion that this affects some NVMe drives more than others…

This occurs with the current kernel as well.
If you search the forum you will find discussions on this.

WD Red SN700 500GB,
Firmware version 111150WD

Another interesting thing: I built U-Boot and OpenSBI from upstream sources, as they seem to have enough support to boot from SD card, and I got significantly more NVMe timeouts with that. So many that booting with root on the NVMe took over 5 minutes before I could log in. So I’m wondering whether the StarFive versions of OpenSBI/U-Boot take some extra power management steps that upstream Linux also doesn’t do yet, and whether that affects this?

I can’t provide a solution, but here is at least a workaround that seems to do the job. Lower all the timeouts to the absolute minimum, so that the delay when a timeout occurs is also minimized:

nvme_core.io_timeout=1 nvme_core.max_retries=1 nvme_core.shutdown_timeout=1

With this I get reasonable to good performance and the “QID timeout, completion polled” warning does not happen very often. I know it’s an ugly hack.
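
To confirm that the values from the workaround above are actually in effect after a reboot, the parameters can be read back from sysfs. This is only a sketch: it assumes the `nvme_core` parameters are exposed under `/sys/module/nvme_core/parameters`, and the helper name `read_nvme_core_params` is hypothetical.

```python
from pathlib import Path

# Parameter names taken from the workaround above.
PARAMS = ("io_timeout", "max_retries", "shutdown_timeout")


def read_nvme_core_params(base="/sys/module/nvme_core/parameters"):
    """Return {name: int or None}; None means the file was missing or unreadable."""
    values = {}
    for name in PARAMS:
        try:
            values[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


print(read_nvme_core_params())
```

If every entry prints as `1`, the settings were picked up. `None` entries suggest the path assumption does not hold on that kernel.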
