NVMe I/O timeouts

I saw someone on Twitter who had the same NVMe I/O timeout problem as me. I added these parameters to the kernel command line and my problem is fixed.

pcie_aspm.policy=performance pcie_aspm=off pcie_port_pm=off nvme_core.default_ps_max_latency_us=0 nvme_core.io_timeout=255 nvme_core.max_retries=10 nvme_core.shutdown_timeout=10
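
To confirm after a reboot that the parameters really made it onto the kernel command line, something like the following could be used. It is only a minimal sketch in Python: the helper names missing_params and check_running_kernel are made up for illustration, and it only checks that each string is present in /proc/cmdline, not that the kernel or driver honoured it.

```python
# The parameter list is copied from the post above.
WANTED = [
    "pcie_aspm.policy=performance",
    "pcie_aspm=off",
    "pcie_port_pm=off",
    "nvme_core.default_ps_max_latency_us=0",
    "nvme_core.io_timeout=255",
    "nvme_core.max_retries=10",
    "nvme_core.shutdown_timeout=10",
]


def missing_params(cmdline, wanted=WANTED):
    """Return the wanted parameters that are absent from a kernel command line."""
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]


def check_running_kernel(path="/proc/cmdline"):
    """Print every wanted parameter the running kernel was booted without."""
    with open(path) as f:
        for param in missing_params(f.read()):
            print("not set:", param)
```

Calling check_running_kernel() on the board prints nothing when all seven parameters are present.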

Please try them if you have the same problem.


I’m having the same NVMe issue with my VisionFive 2. I’ve been struggling with it for a week.

After some research I added nvme_core.default_ps_max_latency_us=0 to /boot/boot/extlinux/extlinux.conf.
This got rid of the I/O errors, but now I get random reboots instead.
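
For anyone who prefers to script that edit, here is a rough Python sketch of adding parameters to the append line(s) of an extlinux.conf. The function name add_kernel_params is hypothetical, and it assumes the usual extlinux layout where kernel arguments sit on lines starting with the append keyword. It works on text only, so reading and writing the real file (and keeping a backup) is left to the caller.

```python
def add_kernel_params(conf_text, params):
    """Add params to every 'append' line, skipping any that are already there."""
    out = []
    for line in conf_text.splitlines():
        words = line.split()
        # extlinux keywords are case-insensitive, so accept APPEND as well.
        if words and words[0].lower() == "append":
            extra = [p for p in params if p not in words[1:]]
            if extra:
                line = line.rstrip() + " " + " ".join(extra)
        out.append(line)
    return "\n".join(out) + "\n"


# Made-up example entry; the real file on the board will look different.
example = (
    "label l0\n"
    "    kernel /vmlinuz\n"
    "    append root=/dev/nvme0n1p4 rw\n"
)
print(add_kernel_params(example, ["nvme_core.default_ps_max_latency_us=0"]))
```

Running it twice on the same text adds the parameter only once.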

I’ll try adding your other parameters.

Thank you.


Hmm, I tried setting the parameters that @cwt suggests above. The system coped fine with apt-get, but unfortunately it repeatedly failed to survive a git clone of the Linux tree. That’s very unfortunate, as it suggests that the NVMe cannot be used under load right now. Has anyone had more success?


I believe the “fix” above with the kernel parameters is misleading.

At least for me, the real issue was that I was using a weak power supply. After attaching an 18 W USB-C charger, the NVMe works as expected.

@cwt and @zu2, please confirm, so we can put this rumor to rest.


I always use a powerful 65W PD supply. Without those parameters my NVMe always times out under high I/O load. So they may have fixed my problem, specifically for my hardware.


My problem was that I was using an “Anker Nano II 30W”.
(I haven’t checked yet whether the cause is the power supply or the cable.)

It seems to have been solved by switching to Sanwa Supply’s PD65W. It has been running stably for over 3 days.


Hi,
I’ve got a few NVMe timeouts in my dmesg as well, with or without a 12V PD supply. I’m currently running off a 9V 15W capable charger, with a measured stable 5V supply at the board. These are quite problematic when they occur, but they happen very rarely:

nvme nvme0: I/O 805 QID 3 timeout, completion polled
nvme nvme0: I/O 277 QID 4 timeout, completion polled
nvme nvme0: I/O 283 QID 4 timeout, completion polled
nvme nvme0: I/O 319 QID 4 timeout, completion polled
nvme nvme0: I/O 774 QID 3 timeout, completion polled
nvme nvme0: I/O 296 QID 4 timeout, completion polled
nvme nvme0: I/O 794 QID 3 timeout, completion polled
nvme nvme0: I/O 801 QID 3 timeout, completion polled
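
To see whether the timeouts cluster on particular queues, the messages can be tallied per queue ID. This is a small Python sketch; the function name timeouts_per_queue is made up here, and the regular expression is written against the message format shown above, so it may need adjusting for other kernel versions.

```python
import re
from collections import Counter

# Matches e.g. "nvme nvme0: I/O 805 QID 3 timeout, completion polled",
# with or without a leading dmesg timestamp.
TIMEOUT_RE = re.compile(
    r"nvme (?P<ctrl>nvme\d+): I/O (?P<tag>\d+) QID (?P<qid>\d+) timeout, completion polled"
)


def timeouts_per_queue(log_text):
    """Count timeout messages per (controller, queue ID) pair."""
    counts = Counter()
    for match in TIMEOUT_RE.finditer(log_text):
        counts[(match["ctrl"], int(match["qid"]))] += 1
    return counts
```

Fed the eight lines above, it counts four timeouts each on queues 3 and 4 of nvme0. The input can be any saved dmesg or journal text.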

Usually this happens during heavy load, meaning lots of I/O, like running a few tars in parallel with building gcc. Then they start to pop up. A linear load like cat /dev/nvme0n1 >/dev/null triggers nothing, though. I tested with both 5V 15W USB-C PD and 12V 30W USB-C PD, and the result is the same.

The problem might be related to the supply, since I never got the NVMe to work off USB-A 5V supplies, even powerful ones. Boot just hangs while trying to mount the rootfs from an unresponsive NVMe. With a USB monitor, whenever I see the voltage at around 4.9V or less, the NVMe becomes unresponsive. Funny how PD fixes this. Does it have a remote feedback mechanism?
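
One possible explanation for why a PD supply at a higher voltage behaves better, offered as a guess rather than a diagnosis: for the same power, a higher supply voltage means less current, so less voltage is lost in the cable and connectors. The Python sketch below just works through Ohm's law. The 10 W load and 0.25 ohm resistance are invented numbers, and the calculation is simplified by taking the current from the nominal supply voltage.

```python
def cable_drop_volts(load_watts, supply_volts, cable_ohms):
    """Approximate voltage lost in the cable: I = P / V, then V_drop = I * R."""
    current_amps = load_watts / supply_volts
    return current_amps * cable_ohms


# Hypothetical values: 10 W for board plus SSD, 0.25 ohm for cable and connectors.
for volts in (5.0, 9.0, 12.0):
    drop = cable_drop_volts(10.0, volts, 0.25)
    print(f"{volts:4.1f} V supply: {drop:.2f} V lost in the cable")
```

With these made-up numbers a 5 V supply loses 0.5 V before it reaches the board, while a 12 V supply loses less than half of that and leaves the board's regulators far more headroom.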


I have something similar, with a Patriot P300 NVMe.
I already reported this here:

It only seems to happen during ‘heavy’ operations: installing packages, cloning repos, etc.

I’m using a 45W Samsung charger and cable. I’d be surprised if it were a power problem, though maybe the VF2 can generate peak loads that cause enough of a voltage drop over the cable or board copper to trigger this.

Edit: I’ll try the solution @cwt posted at the top of this, and report results.


Looks like I’m a lucky one with my few-years-old low-end 128 GB Toshiba NVMe in my VF2, running off a normal RPi 5.1V/3A power brick :slight_smile: … no issues yet, but maybe I haven’t yet put enough load on it beyond -j4 kernel builds etc. …


I also get it with my Patriot P300 NVMe

[134995.705555] nvme nvme0: I/O 181 QID 2 timeout, completion polled
[135489.136238] nvme nvme0: I/O 63 QID 4 timeout, completion polled

Same here, with another Patriot P300.

“find /usr -type f | xargs md5sum” was sufficient to reproduce it. This doesn’t do any explicit writing, but I suspect the access times in the file inodes were being updated.
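
If a scriptable version of that read-only load is handy, here is a rough Python equivalent. It is a sketch, not a drop-in replacement: the name md5_tree is made up, and it reads files one at a time in a single thread, so the I/O pattern will not match the find/xargs pipeline exactly.

```python
import hashlib
import os


def md5_tree(root):
    """Read every regular file under root and return {path: md5 hex digest}."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Like 'find -type f': skip symlinks, sockets, device nodes.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            md5 = hashlib.md5()
            try:
                with open(path, "rb") as f:
                    # Read in 1 MiB chunks so large files do not fill memory.
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        md5.update(chunk)
            except OSError:
                continue  # unreadable file, skip it
            digests[path] = md5.hexdigest()
    return digests
```

Calling md5_tree("/usr") and watching dmesg in another terminal should show whether the timeouts appear under this kind of load.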

The system seems to recover OK each time (after a delay of about 30 seconds). Maybe it is a race condition involving a lost interrupt from the device, e.g. two queues signalling completion at once and only one being noticed/serviced?

[10307.600100] nvme nvme0: I/O 8 QID 4 timeout, completion polled
[10338.399635] nvme nvme0: I/O 6 QID 4 timeout, completion polled
[10369.039234] nvme nvme0: I/O 5 QID 4 timeout, completion polled
[10404.058725] nvme nvme0: I/O 9 QID 4 timeout, completion polled
[10454.158009] nvme nvme0: I/O 11 QID 4 timeout, completion polled
[10497.677395] nvme nvme0: I/O 9 QID 1 timeout, completion polled
[10531.596896] nvme nvme0: I/O 4 QID 4 timeout, completion polled
[10564.876405] nvme nvme0: I/O 5 QID 1 timeout, completion polled
[10598.795926] nvme nvme0: I/O 5 QID 4 timeout, completion polled
[10638.955338] nvme nvme0: I/O 13 QID 2 timeout, completion polled
[10684.554657] nvme nvme0: I/O 13 QID 1 timeout, completion polled

Does adding norelatime,noatime to the options where the root partition is mounted help, or at least reduce the number of timeouts? (There is no need for nodiratime, because noatime implies it.)
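
To check which options the root filesystem is actually mounted with, /proc/mounts can be parsed. This is a minimal Python sketch; the function name mount_options is hypothetical and the sample line in the code is made up.

```python
def mount_options(mounts_text, mount_point="/"):
    """Return the option set of the last /proc/mounts entry for mount_point."""
    options = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields are: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 4 and fields[1] == mount_point:
            options = set(fields[3].split(","))
    return options


# Made-up sample line; on a real system read the text from /proc/mounts.
sample = "/dev/nvme0n1p4 / ext4 rw,relatime 0 0\n"
print("noatime" in mount_options(sample))
```

The last matching entry is used because a later mount on the same mount point hides an earlier one.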


Patriot P310, Arch 5.15.2-cwt12, default kernel options, btrfs with default options.
No timeouts running the above command over roughly 20,500 files.


Could it be a firmware bug?
With nvme-cli installed ($ sudo apt install nvme-cli), do commands like the following list the firmware version?

nvme id-ctrl /dev/nvme0 
nvme id-ctrl /dev/nvme0 --vendor-specific

Here is the output for my device

StarFive ~ # nvme id-ctrl /dev/nvme0 
NVME Identify Controller:
vid       : 0x126f
ssvid     : 0x126f
sn        : *****************************
mn        : Patriot M.2 P300 256GB                  
fr        : V0513A0 
rab       : 6
ieee      : 000001
cmic      : 0
mdts      : 6
cntlid    : 0x1
ver       : 0x10300
rtd3r     : 0x249f0
rtd3e     : 0x13880
oaes      : 0x200
ctratt    : 0
rrls      : 0
cntrltype : 0
fguid     : 00000000-0000-0000-0000-000000000000
crdt1     : 0
crdt2     : 0
crdt3     : 0
nvmsr     : 0
vwci      : 0
mec       : 0
oacs      : 0x7
acl       : 4
aerl      : 7
frmw      : 0x12
lpa       : 0x3
elpe      : 63
npss      : 0
avscc     : 0
apsta     : 0
wctemp    : 356
cctemp    : 358
mtfa      : 100
hmpre     : 16384
hmmin     : 8192
tnvmcap   : 0
unvmcap   : 0
rpmbs     : 0
edstt     : 0
dsto      : 0
fwug      : 4
kas       : 0
hctma     : 0x1
mntmt     : 273
mxtmt     : 358
sanicap   : 0
hmminds   : 0
hmmaxd    : 0
nsetidmax : 0
endgidmax : 0
anatt     : 0
anacap    : 0
anagrpmax : 0
nanagrpid : 0
pels      : 0
domainid  : 0
megcap    : 0
sqes      : 0x66
cqes      : 0x44
maxcmd    : 0
nn        : 1
oncs      : 0x15
fuses     : 0
fna       : 0x1
vwc       : 0x1
awun      : 0
awupf     : 0
icsvscc   : 0
nwpc      : 0
acwu      : 0
ocfs      : 0
sgls      : 0
mnan      : 0
maxdna    : 0
maxcna    : 0
subnqn    : 
ioccsz    : 0
iorcsz    : 0
icdoff    : 0
fcatt     : 0
msdbd     : 0
ofcs      : 0
ps      0 : mp:6.00W operational enlat:0 exlat:0 rrt:0 rrl:0
            rwt:0 rwl:0 idle_power:- active_power:-
            active_power_workload:-
StarFive ~ # nvme id-ctrl /dev/nvme0 --vendor-specific
NVME Identify Controller:
vid       : 0x126f
ssvid     : 0x126f
sn        : *************************************
mn        : Patriot M.2 P300 256GB                  
fr        : V0513A0 
rab       : 6
ieee      : 000001
cmic      : 0
mdts      : 6
cntlid    : 0x1
ver       : 0x10300
rtd3r     : 0x249f0
rtd3e     : 0x13880
oaes      : 0x200
ctratt    : 0
rrls      : 0
cntrltype : 0
fguid     : 00000000-0000-0000-0000-000000000000
crdt1     : 0
crdt2     : 0
crdt3     : 0
nvmsr     : 0
vwci      : 0
mec       : 0
oacs      : 0x7
acl       : 4
aerl      : 7
frmw      : 0x12
lpa       : 0x3
elpe      : 63
npss      : 0
avscc     : 0
apsta     : 0
wctemp    : 356
cctemp    : 358
mtfa      : 100
hmpre     : 16384
hmmin     : 8192
tnvmcap   : 0
unvmcap   : 0
rpmbs     : 0
edstt     : 0
dsto      : 0
fwug      : 4
kas       : 0
hctma     : 0x1
mntmt     : 273
mxtmt     : 358
sanicap   : 0
hmminds   : 0
hmmaxd    : 0
nsetidmax : 0
endgidmax : 0
anatt     : 0
anacap    : 0
anagrpmax : 0
nanagrpid : 0
pels      : 0
domainid  : 0
megcap    : 0
sqes      : 0x66
cqes      : 0x44
maxcmd    : 0
nn        : 1
oncs      : 0x15
fuses     : 0
fna       : 0x1
vwc       : 0x1
awun      : 0
awupf     : 0
icsvscc   : 0
nwpc      : 0
acwu      : 0
ocfs      : 0
sgls      : 0
mnan      : 0
maxdna    : 0
maxcna    : 0
subnqn    : 
ioccsz    : 0
iorcsz    : 0
icdoff    : 0
fcatt     : 0
msdbd     : 0
ofcs      : 0
ps      0 : mp:6.00W operational enlat:0 exlat:0 rrt:0 rrl:0
            rwt:0 rwl:0 idle_power:- active_power:-
            active_power_workload:-
vs[]:
       0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
0000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 "................"
00d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0110: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0140: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0150: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0160: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0170: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0190: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0210: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0220: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0230: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0240: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0250: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0260: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0270: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0290: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0310: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0320: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0330: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0340: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0350: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0360: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0370: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0390: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
StarFive ~ # 

If I were you I would remove the serial number but leave the rest. With enough data points, something in there might help track down and identify the cause of the timeouts. It looks like your firmware revision is “V0513A0”.
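
To collect the same data points from several boards, the id-ctrl text output can be turned into a dictionary and the interesting fields compared. A rough Python sketch follows; the name parse_id_ctrl is made up, and it assumes the plain “name : value” layout shown in the output above.

```python
import re

# A field line starts at column 0 with a lowercase name, e.g. "fr        : V0513A0".
# This skips the "ps 0 : ..." power state lines, their indented continuations
# and the hex dump printed by --vendor-specific.
FIELD_RE = re.compile(r"^([a-z]\w*)\s*:\s*(.*)$")


def parse_id_ctrl(text):
    """Parse 'nvme id-ctrl' text output into a {field: value} dictionary."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields
```

For the output posted above, parse_id_ctrl(text)["fr"] gives the firmware revision and ["mn"] the model name, which are the two fields worth comparing across reports.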


Serial Number has been removed :slightly_smiling_face:


I found a second brand of NVMe drive (SSD M.2 NVMe Aoluska Gen 3.0 x4 2400Mb/s Leitura 256GB) that shares the same firmware revision. It uses a Silicon Motion SM2263 NVMe SSD controller, which I am guessing is where the firmware runs.
So the controller is one of these two:

Product   Host Standards         Flash Interface  ECC Support            Flash VCCQ Support  DRAM  TCG/AES  Package
SM2263EN  PCIe Gen3 x4 NVMe 1.3  4-CH             Configurable LDPC ECC  1.8V/1.2V           Yes   Yes      TFBGA288 (12 x 12mm)
SM2263XT  PCIe Gen3 x4 NVMe 1.3  4-CH             Configurable LDPC ECC  1.8V/1.2V           --    Yes      TFBGA288 (12 x 12mm)

My guess would be the SM2263XT, since Patriot Memory does not mention any DRAM in its marketing.

I checked the Silicon Motion website and there is no sign of a later firmware (“site:siliconmotion.com firmware”).
I also checked the Patriot Memory website and they have none either (“site:patriotmemory.com firmware”).
Aoluska does not appear to have a website.

The PCIe vendor ID 0x126f is allocated to Silicon Motion, Inc., which corroborates that they are the manufacturer of the controller chip.


Yes, it has a Silicon Motion SSD controller according to hwinfo:

NVME 00.0: 10600 Disk
  [Created at block.255]
  Unique ID: GP4z.dfVB1eXouQ4
  Parent ID: xKWB._aNoHWEPua6
  SysFS ID: /class/block/nvme0n1
  SysFS BusID: nvme0
  SysFS Device Link: /devices/platform/soc/2c000000.pcie/pci0001:00/0001:00:00.0/0001:01:00.0/nvme/nvme0
  Hardware Class: disk
  Model: "Silicon Motion SM2263EN/SM2263XT SSD Controller"
  Vendor: pci 0x126f "Silicon Motion, Inc."
  Device: pci 0x2263 "SM2263EN/SM2263XT SSD Controller"
  SubVendor: pci 0x126f "Silicon Motion, Inc."
  SubDevice: pci 0x2263 
  Serial ID: "P300ABBB22111823091"
  Driver: "nvme"
  Driver Modules: "nvme"
  Device File: /dev/nvme0n1
  Device Number: block 259:0
  Geometry (Logical): CHS 244198/64/32
  Size: 500118192 sectors a 512 bytes
  Capacity: 238 GB (256060514304 bytes)
  Config Status: cfg=new, avail=yes, need=no, active=unknown
  Attached to: #10 (Non-Volatile memory controller)


I have not had a chance to try it, but this patch may help.
