Does the JH7110 processor support the RISC-V Extension D: Standard Extension for Double-Precision Floating-Point, Version 2.2
See: RISC-V Instruction Set Manual, Volume I: RISC-V User-Level ISA | Five EmbedDev
Yes, the U74 (the JH7110's CPU core) supports RV64GC; the letter G stands for IMAFD (plus Zicsr and Zifencei), so double-precision floating point is included.
It also supports Zba (this one matters) and Zbb, see page 25 of https://starfivetech.com/uploads/u74mc_core_complex_manual_21G1.pdf
B (Zba, Zbb) isn't part of the RVA20 profile, so you shouldn't distribute binaries compiled for it (they wouldn't work on other RISC-V RV64GC cores), but you might consider using these options for your own software. For GCC, use -march=rv64gc_zba_zbb -mcpu=sifive-u74 -mtune=sifive-u74 in addition to your other favorite optimization flags. For LLVM/Clang, this seems to work: -march=native -mtune=native (I don't generally use Clang).
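For reference, a build-configuration sketch collecting the GCC flags above (the file and target names are made up for illustration; only the flags come from this post):

```make
# Tune for the U74 cores in the JH7110, keeping Zba/Zbb.
# Warning: the resulting binary will not run on plain RV64GC hardware.
CC     ?= gcc
CFLAGS += -O2 -march=rv64gc_zba_zbb -mcpu=sifive-u74 -mtune=sifive-u74

myprog: myprog.c
	$(CC) $(CFLAGS) -o $@ $<
```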
Thank you, tommythorn.
Your PDF link goes to a document entitled "SiFive U74-MC Core Complex Manual 21G1.01.00". On page 27 [PDF sheet 27],
Section "1.4 S7 RISC-V Monitor Core" states:
The U74-MC Core Complex includes a 64-bit S7 RISC-V
Section â1.5 U7 RISC-V Application Coresâ states:
The U74-MC Core Complex includes four 64-bit U7 RISC-V cores, which each have a dual-issue, in-order execution pipeline, with a peak execution rate of two instructions per clock cycle. Each U7 core supports machine, supervisor, and user privilege modes, as well as standard Multiply (M), Single-Precision Floating Point (F), Double-Precision Floating Point (D), Atomic (A), Compressed (C), and Bit Manipulation (B) RISC-V extensions (RV64GCB).
Since I do not have my VisionFive 2 yet (still on order), I just wanted to clarify: the referenced PDF describes a "U74-MC Core Complex", which suggests five processors, and I wonder whether this PDF describes a different version than the one built into the VisionFive 2. Does the VisionFive 2 contain the U74-MC Core Complex? With all the suffixes on the various model names, it is difficult to know whether "U74" means "U74-MC" and whether "U74-MC" is the same as "U74-MC Core Complex".
yes
Btw, which revision of the U74-MC and E24 is used in the JH7110? The U74-MC documentation on the website is the 21G1 version, and the newest seems to be 21G3. This seems to affect some compiler options.
That was a really interesting question. If you look at page 137 of the SiFive U74-MC Core Complex Manual, the mimpid CSR has this information. However, looking through the kernel, OpenSBI, and U-Boot, surprisingly neither OpenSBI nor Linux prints this (though they have access to it). U-Boot appears to be the only one exposing it, via the sbi command, but lo and behold, the U-Boot that StarFive ships doesn't appear to have this enabled.
TL;DR: the easiest way to read mimpid currently appears to be writing a dummy kernel module which prints the result of sbi_get_mimpid(). I didn't go that far.
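A minimal, untested sketch of such a dummy module (assumes the headers for the running kernel are installed and that the kernel exposes sbi_get_mimpid(), as current RISC-V kernels do):

```
/* Untested sketch: print mimpid via the SBI base extension on module load. */
#include <linux/module.h>
#include <linux/init.h>
#include <asm/sbi.h>

static int __init mimpid_init(void)
{
	pr_info("mimpid: 0x%lx\n", sbi_get_mimpid());
	return 0;
}

static void __exit mimpid_exit(void) { }

module_init(mimpid_init);
module_exit(mimpid_exit);
MODULE_LICENSE("GPL");
```

After insmod, the value should appear in dmesg output.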
UPDATE: it looks like the kernel sources on Debian aren't complete; at least I wasn't able to build a dummy kernel module:
```
make -f /lib/modules/5.15.0-starfive/build/Makefile M=$PWD hello-1.ko
ERROR: Kernel configuration is invalid.
include/generated/autoconf.h or include/config/auto.conf are missing.
Run 'make oldconfig && make prepare' on kernel src to fix it.
make: *** [/lib/modules/5.15.0-starfive/build/Makefile:737: include/config/auto.conf] Error 1
```
Bare-metal programming would be easier… but I don't have a debug probe right now.
Thanks to Chris, now we know: the JH7110 is using 0x0421_0427, "21G1.02.00 / llama.02.00-general".
I tried this using gcc 15.2.0 on Debian forky, compiling zstd 1.5.7 as test software and using the time needed to compress a tar archive of Linux kernel 6.12.5 as a benchmark (I did this in the past on other hardware, so I can compare with other systems).
The command used was time zstd -T1 -c -9 linux-6.12.5.tar > linux-6.12.5.tar.zstd.9
I had a fan blowing at the system the whole time to ensure adequate cooling.
The stock version of Debian (1.5.7+dfsg-1.1) takes about 385 seconds.
A locally compiled version of zstd 1.5.7, downloaded from GitHub and compiled without any extra options, takes 357 seconds, about 7% faster than stock.
A locally compiled version with -march=rv64gc_zba_zbb -mcpu=sifive-u74 -mtune=sifive-u74 and no other options takes about 920 seconds, almost 140% slower than stock.
With the faster zstd -1 the difference was even greater: 57 seconds for the stock version versus 179 seconds for the "optimized" version, that is a factor of 3.
I am not an expert on software optimization or CPUs, but something is definitely wrong here. Either gcc is producing very badly "optimized" code, or some RISC-V instructions have an extremely slow implementation on this CPU.
I haven't reproduced this with other software so far; I took zstd as an example because I have some personal interest in compression software, and I expect zstd to both have a modern code base (unlike, for example, gzip) and be optimized for modern CPUs (the rather simple dual-issue, in-order U74 should still profit from that).
I may play around with different gcc options to find out which ones make it slow, but it always takes some time to compile and run the compression.
Let's put my results into a table (more will follow once my run is finished; I already regret not owning 10 boards).
Top row is the compression level, the left column describes the CFLAGS given to gcc, and the numbers are the run times measured using time for the given combination of zstd compression level and compilation options. Binaries were stripped after compilation; tests ran on a Toshiba XG5 NVMe SSD, model "KXG50ZNV256G".
| CFLAGS | zstd -1 | zstd -3 | zstd -6 | zstd -9 | zstd -19 |
|---|---|---|---|---|---|
| Debian stock | 58 | 75 | 246 | 385 | 8045 |
| gcc 15.2.0 without options | 47 | 71 | 212 | 357 | 6600 |
| gcc 15.2.0 -march=rv64gc_zba_zbb -mcpu=sifive-u74 -mtune=sifive-u74 | 179 | 208 | 668 | 920 | 16077 |
| gcc 15.2.0 -march=rv64gc -mcpu=sifive-u74 -mtune=sifive-u74 | 185 | 216 | 705 | 990 | 19706 |
| gcc 15.2.0 -march=rv64gc -mcpu=sifive-u74 | 184 | 214 | 700 | 992 | 19701 |
| gcc 15.2.0 -march=rv64gc | 183 | 214 | 704 | 987 | 19751 |
| gcc 15.2.0 -march=rv64gc_zba_zbb | 179 | 206 | 667 | 918 | 16071 |
| gcc 15.2.0 -march=rv64gc_zba_zbb (run in tmpfs) | | | | 916 | |
| gcc 15.2.0 -march=rv64g -misa-spec=20191213 -march=rv64imafd_zicsr_zifencei (run in tmpfs) | | | | 992 | |
| gcc 15.2.0 -march=rv64g -misa-spec=20191213 -march=rv64imafd_zicsr_zifencei (static binary, run in tmpfs) | | | | 992 | |
| gcc 15.2.0 static binary without optimization, run in tmpfs | | | | 350 | |
| clang 19.1.7 static binary without optimization, run in tmpfs | | | | 307 | |
You are testing three things at once with your benchmark: the storage, the CPU, and the quality of the compiled code.
If there was enough RAM, I would probably try to take the data storage out of the benchmark, using something as simple as a basic RAM disk.
I was always told to benchmark each component in isolation, as much as possible, to better understand where the true bottlenecks are in any system. I'm not saying that overall performance benchmarks are not important for real-world applications. But comparing a system with an older MicroSD card to one with a brand-new M.2 NVMe SSD, using a benchmark dominated by storage access times, will mostly measure the choice and age of the storage (all SSDs slow down with the number of block erase/program cycles, and older, used drives are slower when nearly full than when empty). Components benchmarked in isolation show the very best that is possible, which will typically never be achieved under real-world usage, but this does give an upper limit which well-written applications could peak at under ideal conditions.
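To measure the storage in isolation, as suggested above, a rough sketch with dd (file name and sizes are arbitrary choices, not from the thread):

```shell
# Write test: conv=fdatasync forces the data to the device instead of
# letting it sit in the page cache, so the timing reflects the disk.
dd if=/dev/zero of=ddtest.bin bs=1M count=64 conv=fdatasync 2>&1 | tail -n 1

# Read test: stream the file back out, discarding the data.
dd if=ddtest.bin of=/dev/null bs=1M 2>&1 | tail -n 1
```

dd prints throughput on its final status line; note the read number will be unrealistically high if the file is still in the page cache (dropping caches first, as root, avoids that).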
Another method that pre-caches a file or directory into memory for accelerated reads (as long as there is enough free memory for all the files to fit) is the "vmtouch" command.
This is intended: I am interested in single-core performance. This also allows me to see how a single core performs compared to other CPUs (to be precise, I limit the compression task to a single thread, so a system with only one core is still at a slight disadvantage).
CPU cache also comes into play here; unfortunately, I wouldn't know how to isolate that.
I am aware of disk speed, especially when comparing with other systems, but I think it doesn't hurt too much in this scenario, especially at the higher compression levels. Reading 1500 MB from an SSD and writing 150-240 MB (depending on compression) is almost nothing if the whole process takes over an hour.
Still, I may give your idea with the RAM disk a try.
My main goal at the moment is to figure out why an "optimized" binary is slower.
This sounds more elegant than my primitive cat linux-6.12.5.tar > /dev/null at the beginning of my test script.
My second step, once storage throughput is eliminated, would be to use "ldd" on each of the executables. My guess would be that even if you compile one executable from source, any dynamically linked libraries would not be compiled with the exact same options, and this mismatch may be partially responsible for the drop in performance.
EDIT: One gotcha about ldd is that it only displays dynamically linked shared libraries that were added at link time, but any program can also load a shared library at runtime using the dlopen() call. SDR (software-defined radio) applications are a good example: ldd shows no libraries for the supported hardware; instead, USB-attached devices are scanned for on the machine and wrapper libraries are dynamically loaded at runtime. The programmers do it this way to prevent lots of warning messages about hardware libraries being unable to initialize hardware that is not physically attached to the machine.
I've seen one system where the order of directories in the LD_LIBRARY_PATH variable negatively impacted performance: a remote NFS mount was listed before the local library paths, so any network latency seriously hurt performance. The simple fix was to list all local paths before remote paths, with the most commonly called library paths before the least called ones. But that was on a Solaris system, not Linux.
Running zstd in a tmpfs doesn't speed things up, at least not with compression level 9.
Right now I am running out of ideas. I will take a deeper look into ldd, but as far as I know this may be a rabbit hole, with libraries loading other libraries. strace may be my next attempt, to see what actually gets loaded when running the binary.
In theory, a static binary could rule out some possible causes. I will need to research how to create one (it may require recompiling every dependency, which sounds like a lot of work, unless there is a tool doing this automatically).
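Building a static binary is usually just a linker flag, no manual recompiling of dependencies needed as long as static versions of the libraries (e.g. libc.a) are installed. A toy sketch (the file names are made up; for zstd itself, passing LDFLAGS=-static to its Makefile is the usual route, untested here):

```shell
# Toy stand-in: build the same source as a static binary.
cat > hello.c <<'EOF'
#include <stdio.h>
int main(void) { puts("hello, static"); return 0; }
EOF
gcc -O2 -static -o hello-static hello.c
./hello-static
```

ldd on the result reports "not a dynamic executable", which confirms no shared libraries are involved at runtime.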
Right now, it looks as if any "optimized" compilation of zstd leads to longer runtimes; this is true for both gcc and clang (see my earlier post with the table of results). If I recompile the binary using gcc or clang without any extra options (as a dynamic or static binary), it runs faster. There is a Debian bug report about slow performance of the Debian build. The slow Debian build is something I was also able to verify on my amd64 workstation with a self-compiled binary, but unlike on RISC-V, clang did not create a faster binary than gcc on amd64 (indicating that both compilers perform similarly there). But, as on RISC-V, optimized builds were slower, so I suspect something is fundamentally wrong with the way I compile zstd.
On my JH7110, clang creates faster zstd binaries than gcc (contrary to older research papers, which saw gcc at an advantage across a wide set of software).
zstd has a built-in benchmark (zstd -b); I used this to verify my results:
```
root@visionfive2:/tmp/benchmark# zstd -b
3#Lorem ipsum : 10000000 -> 2981954 (x3.354), 12.0 MB/s, 30.1 MB/s
root@visionfive2:/tmp/benchmark# ./zstd-local-optimized-static_nooptions -b
3#Lorem ipsum : 10000000 -> 2981954 (x3.354), 13.0 MB/s, 31.0 MB/s
root@visionfive2:/tmp/benchmark# ./zstd-local-optimized-static-clang-19.1.7_nooptions -b
3#Lorem ipsum : 10000000 -> 2981954 (x3.354), 14.0 MB/s, 42.8 MB/s
root@visionfive2:/tmp/benchmark# ./zstd-local-optimized_march-rv64gc_zba_zbb -b
3#Lorem ipsum : 10000000 -> 2981954 (x3.354), 4.35 MB/s, 10.0 MB/s
root@visionfive2:/tmp/benchmark# ./zstd-local-optimized-static_march-rv64g_-misa-spec_20191213_-march_rv64imafd_zicsr_zifence -b
3#Lorem ipsum : 10000000 -> 2981954 (x3.354), 4.15 MB/s, 9.8 MB/s
```
Due to the performance regression on zstd, I decided to turn to bzip3 as a second benchmark to test the effect of optimized builds and verify my results.
Results were similar: minor (2%) speed improvements for a locally compiled version versus Debian stock when built without any CPU-specific optimization, identical for gcc and clang. With optimization, I saw massive performance degradation.
Also interesting is the fact that the optimized binaries were larger. I would have expected a more optimized binary to be smaller due to the use of more specialized CPU instructions.