NVMe I/O timeouts

Looks like I’m one of the lucky ones: my few-years-old, low-end 128 GB Toshiba NVMe in my VF2, running off a normal RPi 5.1 V / 3 A power brick, has shown no issues yet :slight_smile: But maybe I have not put enough load on it beyond -j4 kernel builds and the like.

I also get it with my Patriot P300 NVMe:

[134995.705555] nvme nvme0: I/O 181 QID 2 timeout, completion polled
[135489.136238] nvme nvme0: I/O 63 QID 4 timeout, completion polled

Same here, with another Patriot P300.

Running “find /usr -type f | xargs md5sum” was sufficient to reproduce it. This does no explicit writing, but I suspect the access times in the file inodes were being updated.

The system seems to recover OK each time (after a delay of about 30 seconds). Maybe it is a race condition involving a lost interrupt from the device, e.g. two queues signalling completion at once and only one being noticed/serviced?

[10307.600100] nvme nvme0: I/O 8 QID 4 timeout, completion polled
[10338.399635] nvme nvme0: I/O 6 QID 4 timeout, completion polled
[10369.039234] nvme nvme0: I/O 5 QID 4 timeout, completion polled
[10404.058725] nvme nvme0: I/O 9 QID 4 timeout, completion polled
[10454.158009] nvme nvme0: I/O 11 QID 4 timeout, completion polled
[10497.677395] nvme nvme0: I/O 9 QID 1 timeout, completion polled
[10531.596896] nvme nvme0: I/O 4 QID 4 timeout, completion polled
[10564.876405] nvme nvme0: I/O 5 QID 1 timeout, completion polled
[10598.795926] nvme nvme0: I/O 5 QID 4 timeout, completion polled
[10638.955338] nvme nvme0: I/O 13 QID 2 timeout, completion polled
[10684.554657] nvme nvme0: I/O 13 QID 1 timeout, completion polled
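
For anyone wanting to compare logs, here is a small sketch for tallying these messages per queue. The function name nvme_timeouts_per_queue is made up for illustration, and it assumes the exact message format shown in the log above. As a side note, the gaps between the messages above are never shorter than about 30 seconds, which would be consistent with the Linux nvme driver's default I/O timeout of 30 seconds (the nvme_core.io_timeout module parameter), though that is only my reading and not something confirmed in this thread.

```shell
#!/bin/sh
# Count "I/O ... QID ... timeout" kernel messages per NVMe queue ID (QID).
# Reads kernel log lines on stdin and prints one "QID n: count" line per queue.
nvme_timeouts_per_queue() {
    awk '
        /nvme[0-9]+: I\/O [0-9]+ QID [0-9]+ timeout/ {
            # The queue number is the field right after the literal "QID".
            for (i = 1; i < NF; i++)
                if ($i == "QID") count[$(i + 1)]++
        }
        END {
            for (q in count) printf "QID %s: %d\n", q, count[q]
        }
    ' | sort -k2,2n
}

# Usage on a live system (dmesg may need sudo on some distributions):
#   dmesg | nvme_timeouts_per_queue
```

Fed the eleven lines above, this reports three timeouts on QID 1, one on QID 2 and seven on QID 4, so the stalls are not confined to a single queue.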

Does adding norelatime,noatime to the mount options of the root partition help, or at least reduce the number of timeouts? (There is no need for nodiratime, because noatime implies it.)
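
To check whether the option is already active before changing anything, something like the following could be used. It is only a sketch: has_mount_option is a hypothetical helper name, and the findmnt and mount lines in the comments assume util-linux and a root filesystem mounted at /.

```shell
#!/bin/sh
# Succeeds if the comma-separated mount option list $1 contains option $2.
# Matching is on whole options, so "atime" does not match "noatime".
has_mount_option() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage on a live system:
#   opts=$(findmnt -no OPTIONS /)
#   has_mount_option "$opts" noatime || sudo mount -o remount,noatime /
```

A remount like this only lasts until the next reboot; to make it permanent the option would also have to go into the options field for / in /etc/fstab.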

Patriot P310, Arch 5.15.2-cwt12 default kernel options, btrfs default.
No timeouts running the above command over roughly 20,500 files.

Could it be a firmware bug?
With nvme-cli installed ($ sudo apt install nvme-cli), do commands like the following show the firmware version?

nvme id-ctrl /dev/nvme0 
nvme id-ctrl /dev/nvme0 --vendor-specific
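
If only the model and firmware revision are of interest, those two fields can be pulled out of the id-ctrl text. This is a rough sketch that assumes the plain "field : value" layout nvme-cli prints by default; the function name nvme_model_and_fw is invented here.

```shell
#!/bin/sh
# Print "<model><TAB><firmware revision>" from `nvme id-ctrl` text on stdin.
# Relies on the "mn" (model number) and "fr" (firmware revision) lines.
nvme_model_and_fw() {
    awk -F' *: *' '
        $1 == "mn" { sub(/[ \t]+$/, "", $2); model = $2 }   # strip trailing padding
        $1 == "fr" { sub(/[ \t]+$/, "", $2); fw = $2 }
        END { printf "%s\t%s\n", model, fw }
    '
}

# Usage on a live system:
#   sudo nvme id-ctrl /dev/nvme0 | nvme_model_and_fw
```

The kernel also exposes the same value without nvme-cli, in /sys/class/nvme/nvme0/firmware_rev.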

Here is the output for my device:

StarFive ~ # nvme id-ctrl /dev/nvme0 
NVME Identify Controller:
vid       : 0x126f
ssvid     : 0x126f
sn        : *****************************
mn        : Patriot M.2 P300 256GB                  
fr        : V0513A0 
rab       : 6
ieee      : 000001
cmic      : 0
mdts      : 6
cntlid    : 0x1
ver       : 0x10300
rtd3r     : 0x249f0
rtd3e     : 0x13880
oaes      : 0x200
ctratt    : 0
rrls      : 0
cntrltype : 0
fguid     : 00000000-0000-0000-0000-000000000000
crdt1     : 0
crdt2     : 0
crdt3     : 0
nvmsr     : 0
vwci      : 0
mec       : 0
oacs      : 0x7
acl       : 4
aerl      : 7
frmw      : 0x12
lpa       : 0x3
elpe      : 63
npss      : 0
avscc     : 0
apsta     : 0
wctemp    : 356
cctemp    : 358
mtfa      : 100
hmpre     : 16384
hmmin     : 8192
tnvmcap   : 0
unvmcap   : 0
rpmbs     : 0
edstt     : 0
dsto      : 0
fwug      : 4
kas       : 0
hctma     : 0x1
mntmt     : 273
mxtmt     : 358
sanicap   : 0
hmminds   : 0
hmmaxd    : 0
nsetidmax : 0
endgidmax : 0
anatt     : 0
anacap    : 0
anagrpmax : 0
nanagrpid : 0
pels      : 0
domainid  : 0
megcap    : 0
sqes      : 0x66
cqes      : 0x44
maxcmd    : 0
nn        : 1
oncs      : 0x15
fuses     : 0
fna       : 0x1
vwc       : 0x1
awun      : 0
awupf     : 0
icsvscc   : 0
nwpc      : 0
acwu      : 0
ocfs      : 0
sgls      : 0
mnan      : 0
maxdna    : 0
maxcna    : 0
subnqn    : 
ioccsz    : 0
iorcsz    : 0
icdoff    : 0
fcatt     : 0
msdbd     : 0
ofcs      : 0
ps      0 : mp:6.00W operational enlat:0 exlat:0 rrt:0 rrl:0
            rwt:0 rwl:0 idle_power:- active_power:-
            active_power_workload:-
StarFive ~ # nvme id-ctrl /dev/nvme0 --vendor-specific
NVME Identify Controller:
vid       : 0x126f
ssvid     : 0x126f
sn        : *************************************
mn        : Patriot M.2 P300 256GB                  
fr        : V0513A0 
rab       : 6
ieee      : 000001
cmic      : 0
mdts      : 6
cntlid    : 0x1
ver       : 0x10300
rtd3r     : 0x249f0
rtd3e     : 0x13880
oaes      : 0x200
ctratt    : 0
rrls      : 0
cntrltype : 0
fguid     : 00000000-0000-0000-0000-000000000000
crdt1     : 0
crdt2     : 0
crdt3     : 0
nvmsr     : 0
vwci      : 0
mec       : 0
oacs      : 0x7
acl       : 4
aerl      : 7
frmw      : 0x12
lpa       : 0x3
elpe      : 63
npss      : 0
avscc     : 0
apsta     : 0
wctemp    : 356
cctemp    : 358
mtfa      : 100
hmpre     : 16384
hmmin     : 8192
tnvmcap   : 0
unvmcap   : 0
rpmbs     : 0
edstt     : 0
dsto      : 0
fwug      : 4
kas       : 0
hctma     : 0x1
mntmt     : 273
mxtmt     : 358
sanicap   : 0
hmminds   : 0
hmmaxd    : 0
nsetidmax : 0
endgidmax : 0
anatt     : 0
anacap    : 0
anagrpmax : 0
nanagrpid : 0
pels      : 0
domainid  : 0
megcap    : 0
sqes      : 0x66
cqes      : 0x44
maxcmd    : 0
nn        : 1
oncs      : 0x15
fuses     : 0
fna       : 0x1
vwc       : 0x1
awun      : 0
awupf     : 0
icsvscc   : 0
nwpc      : 0
acwu      : 0
ocfs      : 0
sgls      : 0
mnan      : 0
maxdna    : 0
maxcna    : 0
subnqn    : 
ioccsz    : 0
iorcsz    : 0
icdoff    : 0
fcatt     : 0
msdbd     : 0
ofcs      : 0
ps      0 : mp:6.00W operational enlat:0 exlat:0 rrt:0 rrl:0
            rwt:0 rwl:0 idle_power:- active_power:-
            active_power_workload:-
vs[]:
       0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
0000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 "................"
00d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0110: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0140: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0150: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0160: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0170: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0190: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0210: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0220: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0230: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0240: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0250: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0260: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0270: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0290: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0310: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0320: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0330: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0340: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0350: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0360: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0370: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0390: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
StarFive ~ # 

If I were you I would remove the serial number but leave the rest; with enough data points, something in this output might help track down the cause of the timeouts. It looks like your firmware revision is “V0513A0”.
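
Redacting the serial before posting can be done mechanically. A minimal sketch, assuming the "sn : value" line format of the nvme id-ctrl output shown above; redact_nvme_serial is just an illustrative name.

```shell
#!/bin/sh
# Replace the value on the "sn" (serial number) line of `nvme id-ctrl`
# output with a placeholder; every other line passes through unchanged.
redact_nvme_serial() {
    sed -E 's/^(sn[[:space:]]*:).*/\1 [redacted]/'
}

# Usage on a live system:
#   sudo nvme id-ctrl /dev/nvme0 | redact_nvme_serial
```

Other tools print the serial under different labels, so their output would still need checking by eye.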

Serial Number has been removed :slightly_smiling_face:

I found a second brand of NVMe drive (SSD M.2 NVMe Aoluska Gen 3.0 x4 2400Mb/s Leitura 256GB) that shares the same firmware revision. It uses a Silicon Motion SM2263 NVMe SSD controller, which I am guessing is where the firmware runs.
So the controller is one of these two:

Product   Host Standards         Flash Interface  ECC Support            Flash VCCQ Support  DRAM  TCG/AES  Package
SM2263EN  PCIe Gen3 x4 NVMe 1.3  4-CH             Configurable LDPC ECC  1.8V/1.2V           Yes   Yes      TFBGA288 (12 x 12mm)
SM2263XT  PCIe Gen3 x4 NVMe 1.3  4-CH             Configurable LDPC ECC  1.8V/1.2V           --    Yes      TFBGA288 (12 x 12mm)

My guess would be the SM2263XT, since Patriot Memory does not mention any DRAM in its marketing.

I checked the Silicon Motion website and there is no sign of a later firmware (“site:siliconmotion.com firmware”).
I also checked the Patriot Memory website and they have none either (“site:patriotmemory.com firmware”).
Aoluska does not appear to have a website at all.

The PCI vendor ID 0x126f is assigned to Silicon Motion, Inc., which corroborates that it is the manufacturer of the controller chip used.
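
The vendor ID can also be cross-checked from the PCI side. A sketch, assuming pciutils is installed; pci_devices_by_vendor is a hypothetical helper that filters `lspci -nn` output for a given vendor ID.

```shell
#!/bin/sh
# Keep only the `lspci -nn` lines whose [vendor:device] tag has vendor ID $1
# (four hex digits, e.g. 126f). Exits non-zero if nothing matches.
pci_devices_by_vendor() {
    grep -i "\[$1:[0-9a-f]\{4\}]"
}

# Usage on a live system:
#   lspci -nn | pci_devices_by_vendor 126f
```

lspci can do the same filtering natively with `lspci -nn -d 126f:`; the helper is only handy when working from saved output.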

Yes, according to hwinfo it has a Silicon Motion SSD controller:

NVME 00.0: 10600 Disk
  [Created at block.255]
  Unique ID: GP4z.dfVB1eXouQ4
  Parent ID: xKWB._aNoHWEPua6
  SysFS ID: /class/block/nvme0n1
  SysFS BusID: nvme0
  SysFS Device Link: /devices/platform/soc/2c000000.pcie/pci0001:00/0001:00:00.0/0001:01:00.0/nvme/nvme0
  Hardware Class: disk
  Model: "Silicon Motion SM2263EN/SM2263XT SSD Controller"
  Vendor: pci 0x126f "Silicon Motion, Inc."
  Device: pci 0x2263 "SM2263EN/SM2263XT SSD Controller"
  SubVendor: pci 0x126f "Silicon Motion, Inc."
  SubDevice: pci 0x2263 
  Serial ID: "P300ABBB22111823091"
  Driver: "nvme"
  Driver Modules: "nvme"
  Device File: /dev/nvme0n1
  Device Number: block 259:0
  Geometry (Logical): CHS 244198/64/32
  Size: 500118192 sectors a 512 bytes
  Capacity: 238 GB (256060514304 bytes)
  Config Status: cfg=new, avail=yes, need=no, active=unknown
  Attached to: #10 (Non-Volatile memory controller)

I have not had a chance to try it but this patch may help.

I tried this patch but I still get timeouts :weary:

This is a stab in the dark, but since the USB 3.0 chipset also uses PCIe, I wonder: if you remove all USB devices and access the machine only over SSH, do the timeouts become less frequent or disappear?

I’m only thinking about this because firmware later than the one available for the VF2 fixed some PCIe issues. It is probably a red herring (idioms do not translate into all languages, hence the link). But my thinking is that if you unplug all USB devices, no PCIe traffic should be generated on that lane. At the very least it would cross one item off the list of possible causes, or move it much further down.

I have no USB devices plugged in and I only access it via ssh.

I have the same issue with both a VF2 1.3B and a 1.2A board, but only with the upstream kernel branch.
With the 5.15 kernel included in the Wayland Debian image I do not get any timeouts.
I just tried the current 6.4-rc1 (JH7110_VisionFive2_upstream branch as of just now) and still get timeouts.

@Wrybane thanks for reporting that, an interesting data point.

Just for completeness, what brand/model of NVMe are you using? There is some suggestion that this affects some NVMe drives more than others…

This occurs with the current kernel as well.
If you search the forum you will find discussions on this.

WD Red SN700 500GB,
Firmware version 111150WD

Another interesting thing: I built U-Boot and OpenSBI from upstream sources, as they seem to have enough support to boot from SD card, and with them I saw significantly more NVMe timeouts, so many that booting with root on the NVMe took over 5 minutes before I could log in. So I’m wondering whether the StarFive versions of OpenSBI/U-Boot perform some extra power-management steps that the upstream versions, and perhaps upstream Linux too, do not yet do, and whether that affects this.
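
For comparing the two setups, it might help to dump the power-management-related settings the kernel exposes on each. This is a sketch under assumptions: show_param is a made-up helper, and the three sysfs paths are standard Linux module parameters that may be absent depending on kernel configuration. Whether any of them is related to the timeouts here is pure speculation. Note that the P300 output earlier in the thread reports npss 0 and apsta 0, i.e. a single power state and no autonomous power state transitions, so the APST latency setting probably does not matter for that drive.

```shell
#!/bin/sh
# Print "label=value" for a sysfs-style parameter file, or a placeholder
# if the file does not exist on this kernel.
show_param() {
    # $1: label to print, $2: path to the parameter file
    if [ -r "$2" ]; then
        printf '%s=%s\n' "$1" "$(cat "$2")"
    else
        printf '%s=(not available)\n' "$1"
    fi
}

show_param nvme_apst_max_latency_us /sys/module/nvme_core/parameters/default_ps_max_latency_us
show_param nvme_io_timeout_s        /sys/module/nvme_core/parameters/io_timeout
show_param pcie_aspm_policy         /sys/module/pcie_aspm/parameters/policy
```

Running it once under the vendor U-Boot/OpenSBI and once under the upstream builds, with the same kernel, would at least show whether these kernel-visible settings differ between the two.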
