Hardware quality control

vfridman · August 9, 2023, 7:06pm

When I received my first board a month ago, it had USB problems: USB 2.0 was not working. Now one of my other boards (it’s the 1.1A variant) has a dead 1G Ethernet port. So out of the four boards I received, two are defective and have to be replaced. Am I really that unlucky, or is this an indication of a more widespread QA problem?

hexdump0815 · August 9, 2023, 7:16pm

except for your usb report i have so far never read about other vf2 hw qa problems, so i guess qa seems to be ok, as quite a few people are using a vf2 already.

did you use the proper v1.2a dtb (i guess it's v1.2a and not v1.1a as you wrote)? it is important as the eth hardware is different for v1.2a: it has two different ports, one 100m and one 1g … or, i'm not really sure anymore, maybe the difference is even handled with some u-boot setup script hacks …

best wishes - hexdump

vfridman · August 9, 2023, 7:20pm

Yes, I used everything correctly. The board was fine for a couple of weeks, then it started losing the Ethernet connection after a few hours, mostly overnight. Now the link state goes up and down within seconds and then the link is lost completely. The 100M port is fine.

strlcat · August 10, 2023, 10:05am

I have the v1.2a revision (I assume you meant v1.2a, right? v1.1a is nonexistent afaik), and its 1G port has not worked from Linux so far. It is a known issue, but the port works from U-Boot. There were some hacky dts fix attempts, but none of them have helped me as of today.

So I guess the hardware works, but the Linux driver for it is broken somewhere.

vfridman · August 10, 2023, 4:50pm

Yes, I have v1.2a. What is interesting is that the 1G port was working fine for a couple of weeks with the standard Debian build. Then the trouble started, and the way it behaved looks like a classic hardware problem: it works at first, then fails intermittently, then fails completely. It’s no big deal, as I can still exchange it. It is just concerning to have two boards out of four fail.

strlcat · August 10, 2023, 7:41pm

Out of curiosity, can you try testing it from U-Boot?

setenv ipaddr <IP of your VF2>
setenv serverip <router IP>
ping <router IP>

Ideally it would be good to have a TFTP server running somewhere in your subnet as well, and to try loading a binary over TFTP with tftpboot.

vfridman · August 10, 2023, 7:46pm

I’ll try it, but at the moment it has decided to function again, probably not for long. In any case I have a replacement board coming today and need to ship the defective one back tomorrow.

LivingLinux · August 10, 2023, 7:51pm

I guess you don’t know how statistics work. Two out of four is not enough to draw conclusions. Looks like you are just unlucky, as I haven’t seen a lot of posts of people complaining about the quality.

vfridman · August 10, 2023, 7:54pm

That is why I posted. Of course my two bad boards are not a statistic. I just wanted to know if the problem was more widespread; sounds like it’s not.

strlcat · August 10, 2023, 10:02pm

The typical symptom for me now is that packets flow in only one direction over this interface. Sending from v1.2a does not work at all, but receiving does. So v1.2a fails ARP lookups, for example, but receives network noise just fine. See my earlier inquiries about that in “V1.2A: 1G network jack not working, 100M network jack does”, and the github issue linked in that post as well, where I posted a test report about this.

Note that I did not try to fix it by updating to the current dts from git. I run quite an old rebased LTS kernel, I don’t run Debian, and I don’t need the 1G port right now (I am just waiting for the official kernel upstreaming process to complete; without open-source GPU drivers, the eMMC performance fix and the NVMe fixes there is little to see currently. I enjoy hacking it instead). I know SF tried to “livepatch” the dtb in memory directly from U-Boot, but it never helped me anyway.

Don’t get me wrong, I don’t want to persuade you to keep potentially faulty hardware, nor am I trying to defend SF’s image here. This is just my prior experience. If you observe the interface with e.g. tcpdump and see the same results as I described, then your board is probably fine and the software is messing things up. Again, AFAIK it should always work from U-Boot.

AFAIK v1.3b (the most current revision) is not affected by this at all and should work fine with both ports. It uses an improved chip, related at the driver level to the one in v1.2a but a better version, not the same 1G chip found in v1.2a.

strlcat · August 10, 2023, 10:29pm

The current troubles just lie in a lot of “hairy” issues that are still not settled. They are not so visible if you’re quite new to the platform, but they do exist, with or without a chance of getting fixed:

SD card speed capped at 21M/s (will never get fixed officially because the board layout is wrong in every revision; OpenWrt hackers might pick this up and hack the PCB in the distant future)
The same applies to eMMC, but @Stat_headcrabed managed to fix it experimentally (the eMMC just got a dedicated power line). See his topics or search for “eMMC” there
Spontaneous NVMe timeouts; I don’t know whether this will get attention, already got it, or will never be fixed
Ethernet port troubles of earlier revisions, discussed here
All Ethernet ports are completely lost when you hibernate, with no chance of recovery after resume
For me, all revisions are affected by a strange Ethernet bug causing sudden death of TCP connections. So far my investigation is stuck at the fact that somewhere in the middle of a typical TCP stream, the VF2’s eth chip sends a packet with a garbled IPv4 checksum, which causes conformant hardware to terminate the TCP connection it belongs to. Annoying sometimes; I did not try to report it because it might be my Chinese residential gateway causing this.
In the past, people complained about the Linux kernel picking the wrong memory size because a wrong dts was passed to it. Seems to be fixed now, at last
And probably others I am unaware of.

Any of these could potentially “trigger” the conclusion that a VF2 is faulty and deserves a return. Even I, when I got my 8G S.E.B. back in Dec22, had free -m telling me it saw only the first 4G half of memory. At first I was even scared that SF had sent me the wrong version and really thought about sending it back (googling the RAM chip’s marking does not even turn up a datasheet, only vague hints that it’s a 64Gbit chip). Fortunately, I tested the presence of the whole RAM from U-Boot, filling it with sdcard content and then checksumming it back, which proved that I really have all 8G available and led me to googling why Linux sees it differently. But here is the question: can an average user do the same steps and verify that their hardware is not at fault?
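For the Linux side of that check, here is a minimal sketch, assuming a POSIX shell with awk (BusyBox is enough). It only reads the MemTotal line from /proc/meminfo; it does not replace the U-Boot fill-and-checksum test. The function names and the 7000 MiB threshold are made up for illustration.

```shell
#!/bin/sh
# Print MemTotal in MiB from meminfo-formatted text on stdin.
mem_total_mib() {
    awk '/^MemTotal:/ { print int($2 / 1024) }'
}

# Succeed if the reported total is at least min_mib.
# min_mib is a hypothetical threshold chosen by the caller.
mem_at_least() {
    min_mib=$1
    total=$(mem_total_mib)
    [ -n "$total" ] && [ "$total" -ge "$min_mib" ]
}

# Usage on a live 8G board (the kernel reserves some RAM, so the
# threshold sits well below 8192):
#   mem_at_least 7000 < /proc/meminfo && echo "all RAM visible" || echo "RAM missing"
```

If this reports missing RAM while U-Boot can fill and checksum the whole range, the dts passed to the kernel is the more likely suspect than the hardware.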

You might say “just use the stock OS provided by the vendor” and you’ll be right. Still, there are issues that have not gotten much attention and keep resurfacing here and there, and they might be quite a legitimate reason for a refund.

vfridman · August 10, 2023, 10:58pm

It is possible that the issue I’m experiencing is not a hardware fault. I honestly can’t be certain that it is. This is what I’ve done to investigate and what I observed. When the board lost its network connection, I connected via UART and could see that the OS was running just fine. However, the interface had no IP address. I brought the interface down and up, and it got an address (from the DHCP server). At that point I could ping the board for maybe a minute, but after that the interface started switching down-up-down-up continuously. It was still reporting the same IP address, but I could no longer ping the board. After a couple of minutes of this the IP address was gone as well. This last fact may (just maybe) be due to how my router works: on every interface up the board would request a new address, and the router may consider this some form of attack.

It is likely that my replacement board will also be v1.2a, so we’ll see what happens.

strlcat · August 10, 2023, 11:05pm

So, what you observe is a simple loss of the IP address on the NIC?
What does dmesg say about the Ethernet interfaces at the times you observe the loss?
And the way you describe it “smells” to me like a trouble I once had with NetworkManager, which tried to manage the NICs exclusively. It happened when I manually assigned an IP of my liking (for debugging purposes), and NM happily wiped it after a minute or two, thinking there should not be one (it did not bring the NIC down, though). This happens everywhere NM is installed, even on my PC.
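To answer the dmesg question with numbers, here is a rough sketch that counts link transitions per interface in kernel log text. It assumes the driver logs lines containing "eth0: Link is Up" and "eth0: Link is Down" (the wording phylink-based drivers usually print); if your kernel words it differently, adjust the patterns. The function name is made up for illustration.

```shell
#!/bin/sh
# Count link up/down messages for one interface in kernel log text on stdin.
# Assumption: the log contains "<iface>: Link is Up" / "<iface>: Link is Down".
count_link_events() {
    iface=$1
    awk -v ifc="$iface" '
        index($0, ifc ": Link is Up")   { up++ }
        index($0, ifc ": Link is Down") { down++ }
        END { printf "up=%d down=%d\n", up, down }
    '
}

# Usage:
#   dmesg | count_link_events eth0
```

Dozens of transitions within a few minutes would match the up/down loop seen on the console; a log with no transitions at all while the IP address disappears points more towards software such as NetworkManager or the DHCP client.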

vfridman · August 10, 2023, 11:16pm

I did not run dmesg, unfortunately, but because I was connected to the console I was getting a continuous stream of messages: “interface is down” immediately followed by “interface is up”, and so on. It eventually stopped, after maybe 10 minutes, and at that point the interface was up and reported a carrier but had no IP. My trick of forcing it down and up no longer worked; there was still no IP. I did not try to assign a static IP; there is some chance that would have worked. However, I think the real problem is the interface going into that loop. There is a good chance it’s hardware related, but who can really tell.

strlcat · August 10, 2023, 11:28pm

You can never rule out a program manipulating the interface unless you boot into real single-user mode (or, even better, off an initramfs minirootfs containing just BusyBox and kernel modules). It can also be an external hardware failure (say, a bad cable with a broken wire, a plug working itself out of the socket, the router pissing off your VF2, or a wrong connection somewhere), it can be an internal hardware failure (the chip got mad), or it can be purely a software control failure (some program sitting and watching the NICs got mad).

To begin with, I would try to rule out software first. Boot a microrootfs initramfs (so nothing from the real OS can interfere) and manipulate the interface by hand:

# these are examples. Tune values to your liking
ip addr add 192.168.1.10/24 dev eth0
ip link set eth0 up
ping 192.168.1.1

If you’re not sure which NIC Linux picked up as eth0, bring eth0 down and flush its address:

ip link set eth0 down
ip addr flush dev eth0 scope global

and repeat the bring-up with eth1 instead.

While doing this, keep watching the status of your NICs with the ip link command. Also check what dmesg is currently printing.

If you need any help building an initramfs minirootfs, I can prepare one for you.
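If retyping ip link gets tedious, the same information can be polled from sysfs. This is only a sketch, assuming a POSIX shell (BusyBox sh should do) and the standard /sys/class/net/<iface>/operstate and carrier files; the function names are made up for illustration.

```shell
#!/bin/sh
# Print "<operstate> carrier=<0|1|?>" for an interface.
# The optional second argument overrides the sysfs root (handy for testing).
link_state() {
    iface=$1
    root=${2:-/sys/class/net}
    state=$(cat "$root/$iface/operstate" 2>/dev/null) || state=unknown
    # carrier cannot be read while the interface is administratively down
    carrier=$(cat "$root/$iface/carrier" 2>/dev/null) || carrier='?'
    echo "$state carrier=$carrier"
}

# Print a timestamped line every time the state changes; stop with Ctrl-C.
watch_link() {
    iface=$1
    last=
    while :; do
        now=$(link_state "$iface")
        if [ "$now" != "$last" ]; then
            echo "$(date '+%H:%M:%S') $iface $now"
            last=$now
        fi
        sleep 1
    done
}

# Usage:
#   watch_link eth0
```

It polls once per second, so flaps shorter than that can be missed; dmesg remains the more complete record.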

Cheers.

vfridman · August 10, 2023, 11:39pm

Thanks so much for your generous offer of help. The practical problem is that the damn board is behaving today, with no network glitches. I’m pretty confident it will not last; it always loses the network overnight. However, I’m getting the replacement today, and tomorrow is the last day I have to ship the problematic board back. So unless it goes bad in the next few hours, there is not much I’ll be able to do. I did change the network cable and even the switch port, but it did not help.

mzs · August 11, 2023, 12:06am

Are the two network ports connected to the same network switch? (Could the switch be attempting to automatically enable passive LACP, the Link Aggregation Control Protocol?) I guess what I am asking is: do both NICs have the same default gateway and netmask, are they in the exact same range of IP addresses, and how advanced is the network switch?

vfridman · August 11, 2023, 12:30am

I only connected one port at a time. 1G port has problems, 100M port is fine.

cwt · August 11, 2023, 12:37am

Hmm, if you truly have a strong feeling that you are the unlucky one, is it possible that something in your environment is causing or accelerating the faults in your hardware, such as a power surge or AC leaking onto the DC side of the power supply?

vfridman · August 11, 2023, 1:41am

Yes, it is possible, but I’m not sure what it could be. At the moment I’m working with 3 VisionFive and 2 Raspberry Pi SBCs. All of them use quite good power supplies. All of them have Ice Tower type coolers as well. All the power to my computers and SBCs is conditioned through two CyberPower PR2200 Sinewave UPS systems (4 kW in total). I need that much wattage because of the number and type of non-SBC systems I’m running. I actually need the SBCs for that work as well, just obviously for different applications. I replaced the first defective VisionFive that I got, and the replacement works perfectly. I will find out in the next few days whether the second replacement is going to be good.

Edit: I just got the replacement and, as expected, it’s version 1.2a. So it will be an exact replacement.