I tried this patch but I still get timeouts
This is a stab in the dark, but since the USB 3.0 chipset also uses PCIe, I wonder: if you remove all USB devices and access the machine only over SSH, are the timeouts reduced or gone?
I’m only thinking about this because later firmware than the one available for the VF2 fixed some PCIe issues. It is probably a red herring (idioms do not translate into all languages, hence the link). But my thinking is that if you unplug all USB devices there should be no PCIe traffic generated by that lane. At the very least it would cross one item off the list of possible causes, or shift it much further down the list.
I have no USB devices plugged in and I only access it via ssh.
I have the same issue, both with a vf2 1.3b and a 1.2a board. Only with the upstream kernel branch.
With the 5.15 kernel included in the wayland debian image I do not get any timeouts.
Just tried the current 6.4-rc1 (JH7110_VisionFive2_upstream branch from just now), and still getting timeouts.
@Wrybane thanks for reporting that, an interesting data point.
Just for completeness, what brand/model of NVMe are you using? There is some suggestion that this affects some NVMe drives more than others…
This occurs with the current kernel as well.
If you search the forum you will find discussions on this.
WD Red SN700 500GB,
Firmware version 111150WD
Another interesting thing: I built U-Boot and OpenSBI from upstream sources, since they seem to have enough support to boot from SD card, and I saw a significantly increased number of NVMe timeouts with them. So much so that booting with root on the NVMe took over 5 minutes before I could log in. So I’m wondering whether the StarFive versions of OpenSBI/U-Boot take some extra power-management steps that upstream Linux also doesn’t yet perform, and whether that affects this?
I can’t provide a solution, but here is at least a workaround that seems to do the job: lower all the timeouts to the absolute minimum, so that the delay when a timeout occurs is also minimized:
nvme_core.io_timeout=1 nvme_core.max_retries=1 nvme_core.shutdown_timeout=1
With this I get reasonable-to-good performance and the “QID timeout, completion polled” warning does not happen very often. I know it’s an ugly hack.
Has anybody solved the “nvme i/o timeout” issue yet? I recently found that the problem is probably related to MSI interrupts: adding pci=nomsi to the kernel command line solves it. But is there any idea for a real solution that doesn’t disable MSI?
For me, all NVMe-related issues went away after I replaced my power supply with a stronger one. If you have a different one with a few more amps, perhaps you could give that a try.
Thank you for your reply! But I’ve tried two power supplies (3 A max and 7.25 A max), and neither solves the problem. By the way, the ISO I used is from How to create Linux.iso · starfive-tech/edk2 Wiki · GitHub. I also encountered a USB problem, which can likewise be solved by adding pci=nomsi. What are the underlying reasons?
The PCIe bridge and USB share the same PCIe link. That’s why the same workaround affects both.
Since upgrading my Linux kernel to the latest version 6.6.20 of the VisionFive 2 source code (Linux StarFive 6.6.20-starfive #10 SMP Thu Oct 3 00:10:16 WEST 2024 riscv64 GNU/Linux), my NVMe timeouts have not reoccurred.
Hopefully this has solved the problem for me.
Unfortunately, after 3 days I am seeing the timeouts again:
[199859.796160] nvme nvme0: I/O 188 QID 4 timeout, completion polled
[211087.707444] nvme nvme0: I/O 198 QID 4 timeout, completion polled
Could it be vibration or EMI related? Maybe try to correlate the times it happens with external events?
e.g. your mobile phone (sitting near the machine) ringing or receiving a text message.
Or maybe strong infrasound vibrations from a nearby rock quarry that blast like clockwork every Tuesday morning at 11:00.
It tends to happen when I am compiling chromium or when I have heavy IO.
There are some memory-ordering issues in the PCIe controller implementation.
I’m still waiting for starfive’s update on the issue. I own 2 vf2, but I won’t trust my data with it, especially with NVMe storage, until starfive can thoroughly investigate the problem and figure out what’s going on.
So it happens when the CPU is drawing its maximum current, and the NVMe is writing (which draws its maximum current) and reading a lot, which also makes it use a lot of power.
Just out of interest, if you install cpupower, what does the following command display:
$ sudo cpupower frequency-info
(or if using cpufreq instead, use “cpufreq-info”)
If the governor policy is not set to performance (it is probably set to ondemand), change it temporarily (until you next reboot) with:
$ sudo cpupower frequency-set -g performance
And see if the timeout happens more often.
If it glitches more when the CPU is running at the maximum frequency all the time (highest voltage, which when active means it can draw the maximum current), it would suggest that you might have a power issue.
The other possibility I can think of is that, with such high currents, it might be temperature related: under higher loads the temperature is higher, and there is possibly a poor solder joint on the board or NVMe that loses contact at high temperature. I have no idea what cooling you have in place, so maybe try additional cooling.
EDIT: I’m just saying where I would look if I had the same problem. I could be wrong, I probably am.
I always run in performance mode to speed up compiling code on Gentoo:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
performance
performance
performance
performance