Does the JH7110 Support RISC-V Extension D?

Does the JH7110 processor support the RISC-V Extension D (Standard Extension for Double-Precision Floating-Point, Version 2.2)?

See: RISC-V Instruction Set Manual, Volume I: RISC-V User-Level ISA | Five EmbedDev

Yes, the U74 (the JH7110's CPU core) supports RV64GC; the letter G stands for IMAFD.


And Zba (this one matters) and Zbb, see: page 25 of https://starfivetech.com/uploads/u74mc_core_complex_manual_21G1.pdf

B (Zba, Zbb) isn’t part of the RVA20 so you shouldn’t distribute binaries compiled for it (as they wouldn’t work on other RISC-V RV64GC cores), but you might consider using these options for your own software. For GCC, use -march=rv64gc_zba_zbb -mcpu=sifive-u74 -mtune=sifive-u74 in addition to your other favorite optimization flags. For LLVM/Clang, this seems to work: -march=native -mtune=native (I don’t generally use Clang).
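
If you want to sanity-check those flags on an actual board before adopting them, here is a minimal sketch (it assumes a Linux kernel that exposes the ISA string in /proc/cpuinfo, and uses `cc` as a stand-in for your actual gcc or clang; on the VisionFive 2 you would add the `-march=rv64gc_zba_zbb` flags from above to the compile line):

```shell
#!/bin/sh
# Sketch: check what the kernel reports and whether the toolchain accepts
# the proposed flags. On non-RISC-V hosts the first line prints a fallback.
grep -m1 '^isa' /proc/cpuinfo || echo "no ISA line (not a RISC-V kernel?)"

# Compile a trivial program; on the board, append the -march/-mcpu flags.
tmp=$(mktemp -d)
echo 'int main(void){return 0;}' > "$tmp/t.c"
cc "$tmp/t.c" -o "$tmp/t" && echo "toolchain OK"
rm -r "$tmp"
```

If the compile fails once the extension flags are added, the toolchain is too old to know about Zba/Zbb and the binary would not use them anyway.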


Thank you, tommythorn.

Your PDF link goes to a document entitled “SiFive U74-MC Core Complex Manual 21G1.01.00”. On page 27 [PDF sheet 27]

Section “1.4 S7 RISC-V Monitor Core” states:

The U74-MC Core Complex includes a 64-bit S7 RISC-V

Section “1.5 U7 RISC-V Application Cores” states:

The U74-MC Core Complex includes four 64-bit U7 RISC-V cores, which each have a dual-issue, in-order execution pipeline, with a peak execution rate of two instructions per clock cycle. Each U7 core supports machine, supervisor, and user privilege modes, as well as standard Multiply (M), Single-Precision Floating Point (F), Double-Precision Floating Point (D), Atomic (A), Compressed (C), and Bit Manipulation (B) RISC-V extensions (RV64GCB).

Since I do not have my VisionFive 2 yet (still on order), I just wanted to clarify something: the referenced PDF describes a “U74-MC Core Complex”, which implies five processors (four U7 application cores plus one S7 monitor core), and I wondered whether this PDF describes a different version than the one built into the VisionFive 2. Does the VisionFive 2 contain the U74-MC Core Complex? With all the appendages to the various model names, it is difficult to know whether “U74” means “U74-MC”, and whether “U74-MC” is the same as the “U74-MC Core Complex”.

Yes.


By the way, which revision of the U74-MC and E24 is used in the JH7110? The U74-MC documentation on the website is the 21G1 version, while the newest appears to be 21G3. This seems to affect some compiler options.


That was a really interesting question. Page 137 of the SiFive U74-MC Core Complex Manual shows that the mimpid CSR carries this information. However, looking through the kernel, OpenSBI, and U-Boot, surprisingly neither OpenSBI nor the Linux kernel prints it (though they have access to it). U-Boot appears to be the only one exposing it, via the sbi command, but lo and behold, the U-Boot that StarFive ships doesn’t appear to have this enabled.

TL;DR: the easiest way to read mimpid currently appears to be writing a dummy kernel module which prints the result of sbi_get_mimpid(). I didn’t go that far.

UPDATE: it looks like the kernel sources on Debian aren’t complete; at least I wasn’t able to build a dummy kernel module:

  make -f /lib/modules/5.15.0-starfive/build/Makefile M=$PWD hello-1.ko

  ERROR: Kernel configuration is invalid.
         include/generated/autoconf.h or include/config/auto.conf are missing.
         Run 'make oldconfig && make prepare' on kernel src to fix it.

make: *** [/lib/modules/5.15.0-starfive/build/Makefile:737: include/config/auto.conf] Error 1

Bare-metal programming would be easier…but I don’t have a debug probe right now.

Thanks to Chris, we now know: the JH7110 reports mimpid 0x0421_0427, i.e. “21G1.02.00 / llama.02.00-general”.


I tried this using gcc 15.2.0 on Debian forky, compiling zstd 1.5.7 as test software and using the time needed to compress a tar archive of Linux kernel 6.12.5 as a benchmark (I did this in the past on other hardware, so I can use it to compare with other systems).

The command used was time zstd -T1 -c -9 linux-6.12.5.tar > linux-6.12.5.tar.zstd.9
I had a fan blowing at the system the whole time to ensure adequate cooling.

The stock Debian version (1.5.7+dfsg-1.1) takes about 385 seconds.

A locally compiled version of zstd 1.5.7, downloaded from GitHub and compiled without any extra options, takes 357 seconds, about 7% faster than stock.

A locally compiled version with -march=rv64gc_zba_zbb -mcpu=sifive-u74 -mtune=sifive-u74 and no other options takes about 920 seconds, almost 140% slower than stock.

With the faster zstd -1 the difference was even greater: 57 seconds for the stock version versus 179 seconds for the “optimized” version, a factor of about 3.

I am not an expert on software optimization or CPUs, but something is definitely wrong here. Either gcc is producing very badly “optimized” code, or some RISC-V instructions have an extremely slow implementation on this CPU.

I haven’t reproduced this with other software so far. I took zstd as an example because I have some personal interest in compression software, and I expect zstd to both have a modern code base (unlike, for example, gzip) and be optimized for modern CPUs (the rather simple dual-issue, in-order U74 should still profit from it).

I may play around with different gcc options to find out which options make it slow, but it always takes some time to compile and run the compression.

Let’s put my results into a table (more will follow once my run is finished; I already regret not owning 10 boards :face_with_tongue: )

The columns are the zstd compression levels, the left column describes the CFLAGS given to gcc, and the numbers are the run times measured with time for each combination of compression level and build options. Binaries were stripped after compilation; tests ran on a Toshiba Corporation XG5 NVMe SSD, model “KXG50ZNV256G”.

Times in seconds; columns are zstd -1 / -3 / -6 / -9 / -19 (“–” = not measured):

Debian stock: 58 / 75 / 246 / 385 / 8045
gcc 15.2.0 without options: 47 / 71 / 212 / 357 / 6600
gcc 15.2.0 -march=rv64gc_zba_zbb -mcpu=sifive-u74 -mtune=sifive-u74: 179 / 208 / 668 / 920 / 16077
gcc 15.2.0 -march=rv64gc -mcpu=sifive-u74 -mtune=sifive-u74: 185 / 216 / 705 / 990 / 19706
gcc 15.2.0 -march=rv64gc -mcpu=sifive-u74: 184 / 214 / 700 / 992 / 19701
gcc 15.2.0 -march=rv64gc: 183 / 214 / 704 / 987 / 19751
gcc 15.2.0 -march=rv64gc_zba_zbb: 179 / 206 / 667 / 918 / 16071
gcc 15.2.0 -march=rv64gc_zba_zbb (run in tmpfs): – / – / – / 916 / –
gcc 15.2.0 -march=rv64g -misa-spec=20191213 -march=rv64imafd_zicsr_zifencei (run in tmpfs): – / – / – / 992 / –
gcc 15.2.0 -march=rv64g -misa-spec=20191213 -march=rv64imafd_zicsr_zifencei (static binary, run in tmpfs): – / – / – / 992 / –
gcc 15.2.0 static binary without optimization (run in tmpfs): – / – / – / 350 / –
clang 19.1.7 static binary without optimization (run in tmpfs): – / – / – / 307 / –

You are testing 3 things at once with your benchmark:

  • Single core performance only ( -T1 instead of -T$(nproc) )
  • Memory access speed (read and write)
  • Storage access speed (read and write)

If there was enough RAM, I would probably try to eliminate benchmarking the data storage, by using something as simple as a basic RAM disk.

I was always told to benchmark each component in isolation, as far as possible, to better understand where the true bottlenecks in a system are. I’m not saying that overall performance benchmarks are unimportant for real-world applications. But comparing a system with an older MicroSD card to one with a brand-new M.2 NVMe SSD using a storage-bound benchmark will be dominated by the choice and age of the storage (all SSDs slow down with the number of block erase/program cycles, and older, used drives are slower when nearly full than when empty). Components benchmarked in isolation show the very best that is possible, which will typically never be achieved under real-world usage, but it does give an upper limit that well-written applications could approach under ideal conditions.
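
A minimal sketch of the RAM-disk idea, using /dev/shm (a tmpfs mount present on most Linux distributions) so that reads and writes never touch the SSD. gzip stands in here so the sketch runs anywhere; on the board you would swap in `zstd -T1 -9` and the real tarball:

```shell
#!/bin/sh
# Sketch: take storage out of the measurement by staging the input and
# output in tmpfs. Falls back to the regular temp dir if /dev/shm is absent.
work=$(mktemp -d -p /dev/shm 2>/dev/null || mktemp -d)
head -c 1048576 /dev/urandom > "$work/input.bin"   # stand-in for the tarball
time gzip -c "$work/input.bin" > "$work/input.bin.gz"
rm -r "$work"
```

If the timings barely change compared to running from the SSD, the benchmark was CPU-bound all along.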

Another method, which pre-caches a file or directory into memory for accelerated reads (as long as there is enough free memory for all the files to fit), is the “vmtouch” command.
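
A rough sketch of both warming approaches (the file name is the tarball from this thread, and vmtouch must be installed separately):

```shell
#!/bin/sh
# Sketch: warm the page cache before timing, so the first run isn't
# penalized by cold reads from storage.
cat linux-6.12.5.tar > /dev/null     # simple pre-read into the page cache
vmtouch -t linux-6.12.5.tar          # or: touch every page explicitly
vmtouch linux-6.12.5.tar             # report how much of it is resident
```

The last line is handy for verifying that the whole file actually fits in free memory; if residency is below 100%, the benchmark will still hit storage.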

This is intentional; I am interested in single-core performance, which also allows me to see how a single core performs compared to other CPUs (to be precise, I limit the compression task to a single thread, so a system with only one core is still at a slight disadvantage).

CPU cache also comes into play here; unfortunately, I wouldn’t know how to isolate that.

I am aware of disk speed, especially when comparing with other systems, but I don’t think it hurts much in this scenario, especially at the higher compression levels. Reading 1500 MB from an SSD and writing 240-150 MB (depending on compression) is almost nothing if the whole process takes over an hour.
Still, I may give your idea with the RAM disk a try.

My main goal at the moment is to figure out why an “optimized” binary is slower.

This sounds more elegant than my primitive cat linux-6.12.5.tar > /dev/null at the beginning of my test script.

My second step, once storage throughput is eliminated, would be to run “ldd” on each of the executables. My guess is that even if you compile one executable from source code, any dynamically linked libraries will not have been compiled with the exact same options, and this mismatch may be partially responsible for the drop in performance.

EDIT: One gotcha about ldd is that it only displays dynamically linked shared libraries that were recorded at link time, but any program can load a shared library at runtime using the dlopen() call. SDR (software-defined radio) applications are a good example of this: ldd shows none of the hardware-support libraries, because USB-attached devices are scanned for on the machine and the wrapper libraries are then loaded at runtime. Programmers do it this way to avoid lots of warning messages about hardware libraries failing to initialize hardware that is not physically attached to the machine.
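
A sketch of that inspection, using /bin/sh as a stand-in for the zstd binaries from this thread (strace catches the dlopen()ed libraries that ldd cannot see):

```shell
#!/bin/sh
# Sketch: list link-time shared libraries, then watch what is actually
# opened at runtime (which also catches dlopen()ed libraries).
bin=/bin/sh                          # stand-in; use e.g. ./zstd-local here
ldd "$bin" || echo "static binary (ldd has nothing to show)"

# strace is optional; the grep keeps only shared-object opens.
# For zstd you would run e.g.: strace ... ./zstd-local --version
command -v strace >/dev/null &&
    strace -f -e trace=openat "$bin" -c true 2>&1 | grep '\.so' || true
```

Comparing the two lists between a fast and a slow binary would show whether they resolve to different library builds.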

I’ve seen one system where the order of directories in the LD_LIBRARY_PATH variable negatively impacted performance: a remote NFS mount was listed before the local library paths, so any increase in network latency seriously degraded performance. The simple fix was to list all local paths before remote paths, ordered from the most commonly called library paths to the least. But that was on a Solaris system, not Linux.


Running zstd in a tmpfs doesn’t speed things up, at least not with compression level 9.
Right now I am running out of ideas. I will take a deeper look with ldd, but as far as I know this may be a rabbit hole, with libraries loading other libraries; strace may be my next attempt, to see what actually gets loaded when running the binary.
In theory, a static binary could rule out some possible causes. I will need to research how to create one (it may require recompiling every dependency, which sounds like a lot of work, unless there is a tool that does this automatically).
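
For zstd specifically, a fully static CLI shouldn’t require rebuilding every dependency by hand, since libzstd is compiled in-tree; a sketch, assuming the 1.5.7 release tree from GitHub (the `make zstd` target exists in zstd’s build system, but treat the exact variable handling as an assumption):

```shell
#!/bin/sh
# Sketch: build a static zstd binary from the release source tree.
# libzstd is built in-tree, so -static mainly affects libc and friends.
cd zstd-1.5.7
make clean
make zstd CFLAGS="-O3" LDFLAGS="-static"
file programs/zstd        # should report "statically linked"
```

If glibc’s static archives aren’t installed, the link step fails; the libc6-dev package normally provides them on Debian.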

Right now, it looks as if any “optimized” compilation of zstd leads to longer runtimes, for both gcc and clang (see my earlier post with the table of results). If I recompile the binary with gcc or clang without any extra options (as a dynamic or static binary), it runs faster. There is a Debian bug report about the slow performance of the Debian build. The slow Debian build is something I was also able to verify on my amd64 workstation against a self-compiled binary, but unlike on RISC-V, clang did not create a faster binary than gcc on amd64 (which indicates that both compilers perform similarly there). But, as on RISC-V, optimized builds were slower, so I suspect something is fundamentally wrong with the way I compile zstd.

On my JH7110, clang creates faster zstd binaries than gcc (contrary to older research papers, which saw gcc at an advantage across a wide set of software).
zstd has a built-in benchmark (zstd -b), which I used to verify my results:

root@visionfive2:/tmp/benchmark# zstd -b
 3#Lorem ipsum       :  10000000 ->   2981954 (x3.354),   12.0 MB/s,   30.1 MB/s
root@visionfive2:/tmp/benchmark# ./zstd-local-optimized-static_nooptions -b
 3#Lorem ipsum       :  10000000 ->   2981954 (x3.354),   13.0 MB/s,   31.0 MB/s
root@visionfive2:/tmp/benchmark# ./zstd-local-optimized-static-clang-19.1.7_nooptions -b
 3#Lorem ipsum       :  10000000 ->   2981954 (x3.354),   14.0 MB/s,   42.8 MB/s
root@visionfive2:/tmp/benchmark# ./zstd-local-optimized_march-rv64gc_zba_zbb -b
 3#Lorem ipsum       :  10000000 ->   2981954 (x3.354),   4.35 MB/s,   10.0 MB/s
root@visionfive2:/tmp/benchmark# ./zstd-local-optimized-static_march-rv64g_-misa-spec_20191213_-march_rv64imafd_zicsr_zifence -b
 3#Lorem ipsum       :  10000000 ->   2981954 (x3.354),   4.15 MB/s,    9.8 MB/s

Due to the performance regression with zstd, I decided to turn to bzip3 as a second benchmark to test the effect of optimized builds and verify my results.
Results were similar: minor (2%) speed improvements for a locally compiled version versus Debian stock when built without any CPU-specific optimization, identical for gcc and clang. With optimization, I again saw massive performance degradation.
Also interesting is the fact that the optimized binaries were larger; I would have expected a more optimized binary to be smaller due to the use of more specialized CPU instructions.
