NVMe I/O timeouts

I tried this patch but I still get timeouts :weary:

This is a stab in the dark, but since the USB 3.0 chipset also uses PCIe, I wonder whether the timeouts become less frequent or disappear if you remove all USB devices and access the machine only over SSH.

I’m only thinking about this because later firmware than the one available for the VF2 fixed some PCIe issues. It is probably a red herring (idioms do not translate into all languages, hence the link). But my thinking is that if you unplug all USB devices, there should be no PCIe traffic generated on that lane. At the very least it would cross one item off the list of possible causes, or shift it much further down the list.

I have no USB devices plugged in and I only access it via ssh.

I have the same issue with both a VF2 1.3b and a 1.2a board, but only with the upstream kernel branch.
With the 5.15 kernel included in the Wayland Debian image I do not get any timeouts.
I just tried the current 6.4-rc1 (JH7110_VisionFive2_upstream branch as of just now) and I am still getting timeouts.

@Wrybane thanks for reporting that, an interesting data point.

Just for completeness, what brand/model of NVMe drive are you using? There is some suggestion that this affects some NVMe drives more than others…

This occurs with the current kernel as well.
If you search the forum you will find discussions on this.

WD Red SN700 500GB,
Firmware version 111150WD

Another interesting thing is this: I built U-Boot and OpenSBI from upstream sources, as they seem to have enough support to boot from SD card, and I got a significantly increased number of NVMe timeouts with that. So much so that booting with root on the NVMe took over 5 minutes before I could log in. So I’m wondering whether the StarFive versions of OpenSBI/U-Boot take some extra power management steps that upstream Linux also doesn’t do yet, and whether that affects this?

I can’t provide a solution, but here is at least a workaround that seems to do the job. Lower all the timeouts to the absolute minimum, so that the delay when a timeout occurs is also minimized:

nvme_core.io_timeout=1 nvme_core.max_retries=1 nvme_core.shutdown_timeout=1

With this I get reasonable to good performance and the “QID timeout, completion polled” warning does not happen very often. I know it’s an ugly hack.
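To check whether those values actually took effect after a reboot, something like the following might help. It is only a minimal sketch: it assumes the running kernel exposes the nvme_core module parameters under /sys/module/nvme_core/parameters, and the function name and PARAMS tuple are made up for illustration.

```python
from pathlib import Path

# The three parameters from the workaround above. The sysfs location is an
# assumption: it only exists if nvme_core exposes its module parameters there.
PARAMS = ("io_timeout", "max_retries", "shutdown_timeout")


def read_nvme_core_params(base="/sys/module/nvme_core/parameters"):
    """Return {name: value as string, or None if it cannot be read}."""
    values = {}
    for name in PARAMS:
        path = Path(base) / name
        try:
            values[name] = path.read_text().strip()
        except OSError:
            # Parameter not exposed, or not readable by this user.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_nvme_core_params().items():
        shown = value if value is not None else "unavailable"
        print(f"nvme_core.{name} = {shown}")
```

If a parameter shows up as unavailable, the kernel either does not expose it or the options never reached the kernel command line.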

Can anybody solve the “nvme i/o timeout” problem right now? I recently found that it is probably related to MSI interrupts: after adding pci=nomsi to the kernel cmdline, the problem is solved. But is there any idea for a real solution that doesn’t require disabling MSI?
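When comparing results between boards, it can be useful to confirm that pci=nomsi really ended up on the running kernel's command line. Here is a rough sketch that parses /proc/cmdline; the helper names are hypothetical, and it only looks at the command line, not at what the PCIe driver actually does with interrupts.

```python
from pathlib import Path


def pci_options(cmdline):
    """Collect the comma-separated options of every pci= argument, in order."""
    options = []
    for token in cmdline.split():
        if token.startswith("pci="):
            options.extend(opt for opt in token[len("pci="):].split(",") if opt)
    return options


def msi_disabled(cmdline):
    """True if 'nomsi' appears among the pci= options."""
    return "nomsi" in pci_options(cmdline)


def read_cmdline(path="/proc/cmdline"):
    """Return the kernel command line, or an empty string if unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return ""


if __name__ == "__main__":
    print("pci=nomsi active:", msi_disabled(read_cmdline()))
```

Running it before and after editing the boot configuration shows whether the option was picked up.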

For me, all NVMe-related issues went away after I replaced my power supply with a stronger one. If you have a different one with a few more amps, perhaps you could give that a try.

Thank you for your reply! But I’ve tried two power supplies (3A max and 7.25A max), and neither solves the problem. By the way, the ISO I used is from How to create Linux.iso · starfive-tech/edk2 Wiki · GitHub. I also ran into a USB problem, which can also be solved by adding pci=nomsi. What are the underlying reasons?

The PCIe bridge and USB share the same PCIe link. That’s why :wink:
