Interesting benchmark, but not really fair

Richard Jones compiled QEMU and measured the compile time in seconds (lower is better) on the following boards:

HiFive Unmatched (RISC-V)    - 3642 seconds - 16GB DDR4-2666    SDRAM - NVMe drive
VisionFive 2 (RISC-V)        -  582 seconds -  8GB LPDDR4-3200  SDRAM - NVMe drive
Sipeed Lichee Pi 4A (RISC-V) - 1376 seconds -  8GB LPDDR4X-3733 SDRAM - USB 3.0 SanDisk 500GB SSD
Raspberry Pi 4B (Arm)        - 1154 seconds -  8GB LPDDR4-2400  SDRAM - USB 3.0 SanDisk 500GB SSD

All boards have 4 cores.
All boards have 8GB of RAM, except for the HiFive Unmatched which has 16GB.

Is the above a fair benchmark? Hell no (because in reality you are largely benchmarking the performance of the storage, which would be a major bottleneck in compile times). Is it a real-world benchmark? I would say yes.

4 Likes

I’m really surprised at how badly the HiFive board did there. :thinking:

There must be something fundamentally wrong with the board or its configuration for it to be so much slower despite having 2x the memory.

Having more memory doesn’t necessarily mean faster. It can actually be slower to have more memory in some conditions (though that should not be the case with modern systems).

And the test on the Sipeed was not done with a better-configured distro because “the boot path for this board is insane”. This is such a lame excuse, as the board came with that distro, it just works with that distro, and it takes literally minutes to flash it.

Anyway, I’m not really surprised by the result. The times here test compilation speed, but also disk speed (which is really critical for compiling large projects), and both the StarFive and HiFive boards use an NVMe SSD where the rPi and Sipeed boards use an SD card.

The Sipeed would get better results on that specific test using the internal eMMC instead.

Also, the HiFive may just have terrible PCIe, leading to poor performance with the NVMe SSD.
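For what it’s worth, a crude way to check whether storage is the limiter would be a quick sequential-read timing on each board (the device names here are assumptions):

    sudo hdparm -t /dev/nvme0n1    # buffered read timing on the NVMe drive
    sudo hdparm -t /dev/mmcblk0    # the same on the SD card / eMMC

Sequential throughput is only part of the story for compiles, which are heavy on metadata and small reads, but it’s a first-order sanity check.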

Not really an interesting benchmark I fear in the end.

They used USB 3.0 SuperSpeed SSD drives for both.

In my mind this might end up adding more latency: USB 3.0 SuperSpeed adds roughly 30 microseconds of latency in each direction across the full-duplex bus, and its maximum packet size is 1024 bytes, so transferring a single 4K data block would add at least 240 microseconds of delay. I suspect that if they had used MicroSD cards in both, or the eMMC on the Lichee Pi 4A, the latency could be lower. But in reality even the size of the file and the type of file system used can change overall read and write throughput by a lot.
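As a quick sanity check of that figure in shell arithmetic (a 4096-byte block split into 1024-byte packets, ~30 microseconds each way per packet):

    echo $(( 4096 / 1024 * 2 * 30 ))    # prints 240 (microseconds per 4K block)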

e.g. (Scroll down to “Results” in both links)
BPI-M1 (SATA 2.0 - spinning rust and SSD) - http://314256.blogspot.com/2014/11/banana-pi-sata-disk-throughput-test.html
RPIB (USB 2.0 - spinning rust) - http://314256.blogspot.com/2014/03/raspberry-pi-usb-disk-throughput-test.html

What about the cache in memory?
I think compiling is not a “low-latency disk job”.

For maximum performance, AKA minimum build times, you might have to forcibly pre-load the whole of the source tree into the page cache prior to the build with:
vmtouch -f <insert_full_directory_name_here>
(You will need to install vmtouch before you can use it: sudo apt install vmtouch )
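A minimal sketch, assuming a Debian-based distro and a source tree at ~/qemu (and note that, if I read the options right, -f only follows symlinks, while -t is what actually touches the pages into the cache):

    sudo apt install vmtouch    # one-time install
    vmtouch -v ~/qemu           # report how much of the tree is already cached
    vmtouch -t ~/qemu           # read every page so it lands in the page cache
    vmtouch -l ~/qemu           # or lock it in RAM so it cannot be evicted (needs enough free memory)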

And if you did that then the HiFive Unmatched would do much better.

vmtouch: “Portable file system cache diagnostics and control. vmtouch is a tool for learning about and controlling the file system cache of unix and unix-like systems. It can discover which files the OS is caching, tell the OS to cache or evict some files or regions of files, lock files into memory so the OS won’t evict them, and more.”

You are right, I mixed up stuff there.

Though that’s even worse: an SSD via USB 3 is even less of a representative use case.

1 Like

How else do you want to connect an SSD to a Pi 4? In my opinion it only shows that the Pi 4 is a bad choice for this kind of workload.

2 Likes

I built it a few times to warm up and then timed the last build, on otherwise unloaded machines.

If he did let the boards compile for hours before the tests, then the source should have been cached as well as it ever would be. More so than in normal usage. Perhaps less than with vmtouch, since a long compile job will push the start of the tree out of cache by the end.

In any case, benchmarking the storage subsystem is perfectly valid if that’s what this compile would do, as long as it’s not hidden. I’d say it is a fair comparison to use the best interface each board offers. It wouldn’t be fair to intentionally hinder a board that comes with an NVMe slot just because it also offers an SD reader (and a poor one at that). Perhaps it would be fairest to compile all the code from an NFS mount? But then it’s not really a test of the board overall/standalone, and might even be a test of the NIC.

3 Likes

This does not only provide low latency; bandwidth is provided too.

PS: Some years ago, I compiled programs on a Loongson machine. The ext4 filesystem had hugely poor performance on it. There are a lot of things that affect performance. :face_exhaling:

1 Like

Indeed. And that in turn depends on ccache, filemon and other tools being installed and properly configured in each environment.

Without the actual command being make clean && time make ... this becomes a very poor benchmark.

If the goal was just to max the machine out temperature-wise, you can just run a normal performance benchmark beforehand, or play a game of Quake, or whatever. But it would be better to let the machine sit idle for a while and then start all the tests from ‘cold’, a more realistic scenario…
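A rough sketch of such a ‘cold’ run, assuming root access and an already-configured qemu build directory (drop_caches discards the page cache plus dentries and inodes):

    make clean
    sync                                          # flush dirty pages first
    echo 3 | sudo tee /proc/sys/vm/drop_caches    # start with a cold cache
    time make -j4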

1 Like

For what it is worth, I have a bench with 2 SBCs but just one keyboard & monitor.

One SBC is a VF2 / 8GB with a Patriot 256GB NVMe.
The other is a Pi4 / 4GB with a USB 3 SSD.

I ‘kvm’ between them by unplugging/plugging the USB/HDMI as appropriate. Both plug into the same network router and run i3, and I keep them well updated.

I really cannot tell the difference between them in use…

They ‘feel’ very similar when used via an SSH shell or directly in the GUI. The VF2 is faster with disk operations for sure, and the Pi computes faster, but the differences are marginal, and both are nice to use as workshop workhorses.

Both of them run OctoPrint for my 3d printers, which gives a nice demonstration of the differences.

  • Both systems run it with ease, error free and in the background with minimal resource use, even when printing.
  • The VF2 installs OctoPrint, via pip, plus its upgrades, dependencies and plugins noticeably faster than the Pi4 does.
  • The Pi4 imports and analyses uploaded files noticeably faster than the VF2.
    The gcode is processed to flag errors, out of bounds movement and estimated print time.
  • Both of them can connect to either of the OctoPrint local servers via Firefox, and the web UI experience is fully functional, but somewhat laggy, on both.

IMHO, the ‘owen’ benchmark is that both are indistinguishable in real-world use: somewhat slow, but very impressive for a fraction of the price of the 8-core i7 machine I’m typing this on.

2 Likes

Until you mentioned temperature, I was reading “warm up” idiomatically for some reason. I was thinking caching and such. There’s no reason to actually heat the boards, turning it into a possible cooling-solution benchmark, especially if he isn’t even monitoring for throttling and reporting it along with the results.

He doesn’t mention ccache, so I would assume he isn’t using it. And I’d go that route too, as with ccache a “compile” job really turns into a disk benchmark. Which is fine if that’s the intention from a dev’s perspective, but end users will probably rarely recompile the same big package over and over on these boards. I could be wrong there. I mean, I’ve personally compiled Rust on my board 4 or 5 times in the last couple of weeks before finally getting it to link yesterday, and only after throwing a 100GB+ swap partition at it. It used about 8GB in addition to the 7.5GB or so of RAM the system wasn’t otherwise using, and it took 3 1/2 hours just to link. So much larger than qemu.
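For what it’s worth, a quick way to tell whether ccache is actually in play on a given box (a sketch; distro layouts vary):

    which ccache && ccache -s    # print hit/miss statistics if it is installed
    readlink -f "$(which gcc)"   # some distros symlink gcc through ccache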

He did say the actual command was make clean then time make -j4.

Maybe I’m overly sensitive to GUI lag, but I don’t think GUI performance is similar between the VF2 and the Pi4. My monitor is 3840x1600. The VF2 does not support this resolution and set it to 4K. It’s entirely unusable at 4K. The Pi4 set the resolution correctly and it’s quite usable. If I set the VF2 to 1080p it’s sort of usable, but definitely worse than the Pi4. I gave up on using the GUI on the VF2; ssh all the way now.

It is time to try zram with zstd. :star_struck:

3 Likes

Willing to try. The system is running the cwt14 Arch image, which has a 4GB zram swap enabled (for my 8GB board), but it’s using the LZ4 algorithm, which appears to be the default.
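A rough sketch of switching an existing zram device to zstd via sysfs, assuming the device is zram0 and keeping the 4GB size (the algorithm can only be changed while the device is reset; if the image manages zram through zram-generator or a systemd unit, editing that config is the cleaner route):

    sudo swapoff /dev/zram0
    echo 1 | sudo tee /sys/block/zram0/reset              # must reset before changing the algorithm
    echo zstd | sudo tee /sys/block/zram0/comp_algorithm
    echo 4G | sudo tee /sys/block/zram0/disksize
    sudo mkswap /dev/zram0 && sudo swapon /dev/zram0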

2 Likes

That’s because the GPU driver is not fully implemented. It will come. Don’t forget that the GPU hardware will blow the Pi 4 out of the water.

2 Likes

@LivingLinux Hopefully, this is not just wishful thinking… :slight_smile:

Aubrey

I just received a Lichee Pi 4A and decided to confirm (or not) the finding about its poor compilation performance relative to the VF2. I ran a Linux kernel compile on the VF2 a couple of weeks ago and it took 39 minutes. I ran the exact same compile with the exact same code on the LP4A, first from eMMC and then entirely from tmpfs, i.e. RAM. Both runs took 49 minutes. So the difference is not as dramatic as with a USB SSD, but nevertheless the VF2 is about 20% faster. It’s hard to figure out why, considering that in theory it should be the other way around.

I should probably mention that this board otherwise feels faster all around, especially with the GUI, which is quite usable even on a very high-resolution monitor. I guess the accelerated GPU driver helps.

3 Likes

A couple of possibilities (just off the top of my head):

  • Which LPi4A do you have? As in how much memory, as it can have a crucial impact on performance
  • Is swap enabled?
  • The eMMC on the LPi4A may have worse performance than the SD card you are using on the VF2
  • The MMC/SD driver on the VF2 may be more optimised, and the LPi4A may have a worse one. I remember that Beagle had issues with the MMC driver, at least with their mainlining of the board
  • Any network activity?
  • What is the baseline idle load on both boards?
  • The code running on the System Controller core may have an impact on performance. I have no idea what is running on it on either board.

What’s odd is that, from my tests, the VF2 has way worse memory bandwidth than my LPi4A.

I need to re-run some pure CPU tests to get a better idea of some of the metrics.

Edit: thinking of it, you are building the Linux kernel. Are you sure you used the same compiler, the same kernel version, and especially the same kernel config?

All three parameters can have a huge impact on the build time!
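A quick sketch of how to pin all three down on each board before comparing times (assumes you are sitting in the kernel source tree):

    gcc --version | head -1    # the compiler actually used
    make kernelversion         # the kernel version of the tree being built
    sha256sum .config          # compare this hash across both boards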

1 Like