Vulkan Driver Broken? VK_KHR_SURFACE unsupported

I’ve also been compiling Godot and digging into the source code to find exactly where startup fails. At least for me, the failure happens when creating the Vulkan context, specifically when calling vkEnumerateInstanceExtensionProperties to get the extension property count (i.e. the 1st and last arguments to the function are NULL). It then reports that a call to free() fails due to a null pointer, and Godot startup stops.

I’ve been evaluating this by building the latest Vulkan-SDK / Mesa3D / Vulkan-Demos in a Risc-V64 QEMU VM (with Wayland/GNOME desktop) to see if it has the same issues with the Lavapipe Vulkan driver… and it does…

Host: Ryzen 7000 CPU / RX6000 GPU / Ubuntu 23.04
QEMU: RISC-V 64-bit virtual machine, 8-core CPU, 8 GB RAM, with virtio-gpu
Guest: Official Ubuntu 23.04 RISC-V 64 with GNOME desktop (accelerated OpenGL via the Virgl driver)

After building (in the VM) the latest Vulkan-SDK, Mesa (from git) and Sascha Willems Vulkan demos (from git), and testing the Lavapipe Vulkan driver:

Vulkan info works…

vkcube crashes (core dumps after opening a blank window)

Vulkan-demos:

  • triangle - crashes (no window opened, coredump)
  • vulkanscene - crashes (no window opened, coredump)

Files here: TBOT-Virtualization-Adventures/Risc-V-Qemu-x86_64/Ubuntu-23.04 at main · TheBrinkOfTomorrow/TBOT-Virtualization-Adventures · GitHub

All of the above works fine in both an aarch64 guest and a ppc64le guest running in Qemu VMs when set up exactly the same way…

I’ve just flashed the armbian-23.5-lunar image on my VisionFive2, and it has a framebuffer / HDMI driver, so I’ll see if I can get a working Desktop and then I’ll compile everything and run the same build and tests…

(I also just received my Sipeed Lichee Pi 4A, and the pre-installed image has a non-accelerated Xorg desktop on Debian Sid, so I’ll try that too once I flash a new image to a microSDCard (the eMMC is only 8 GB…).)

Considering it crashes in exactly the same way with both the IMG driver and LLVMpipe (the Mesa software renderer), this is not a GPU driver issue.

It happens when doing a dlopen() on a library, while the library is being initialised, so it is not directly the driver. It could be related to pthread, but something feels really fishy, and I cannot put my finger on what is happening.

Here is the backtrace:

#0  __pthread_kill_implementation (threadid=<optimized out>, signo=<optimized out>, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:43
#1  0x0000003ff7e97c32 in __pthread_kill_internal (signo=<optimized out>, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x0000003ff7e623fe in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x0000003ff7e52978 in __GI_abort () at ./stdlib/abort.c:79
#4  0x0000003ff7e8f734 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x3ff7f41150 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#5  0x0000003ff7e9f3ae in malloc_printerr (str=str@entry=0x3ff7f3c408 "free(): invalid pointer") at ./malloc/malloc.c:5660
#6  0x0000003ff7ea0aa2 in _int_free (av=<optimized out>, p=<optimized out>, have_lock=have_lock@entry=0) at ./malloc/malloc.c:4435
#7  0x0000003ff7ea2946 in __GI___libc_free (mem=<optimized out>) at ./malloc/malloc.c:3385
#8  0x0000002aad7f3d76 in ?? ()
#9  0x0000002aad7f3f40 in ?? ()
#10 0x0000002aad85ad4c in ?? ()
#11 0x0000002aad827dee in std::basic_ios<char, std::char_traits<char> >::init(std::basic_streambuf<char, std::char_traits<char> >*) ()
#12 0x0000003fea5ff9e8 in std::basic_istream<char, std::char_traits<char> >::basic_istream (__sb=<optimized out>, this=<optimized out>, __in_chrg=<optimized out>, __vtt_parm=<optimized out>)
    at /build/gcc-13-fkYlAi/gcc-13-13.1.0/build/riscv64-linux-gnu/libstdc++-v3/include/istream:97
#13 std::ios_base::Init::Init (this=0x3fea755960 <std::__ioinit>) at ../../../../../src/libstdc++-v3/src/c++98/ios_init.cc:92
#14 std::ios_base::Init::Init (this=this@entry=0x3fea755960 <std::__ioinit>) at ../../../../../src/libstdc++-v3/src/c++98/ios_init.cc:78
#15 0x0000003fea5f0078 in __static_initialization_and_destruction_0 () at ../../../../../src/libstdc++-v3/src/c++98/ios_base_init.h:12
#16 _GLOBAL__sub_I.00090_globals_io.cc(void) () at ../../../../../src/libstdc++-v3/src/c++98/globals_io.cc:109
#17 0x0000003ff7fe49ce in call_init (env=0x3ffffff328, argv=0x3ffffff318, argc=1, l=<optimized out>) at ./elf/dl-init.c:70
#18 call_init (l=<optimized out>, argc=<optimized out>, argv=0x3ffffff318, env=0x3ffffff328) at ./elf/dl-init.c:26
#19 0x0000003ff7fe4a8c in _dl_init (main_map=0x2aaf53e230, argc=<optimized out>, argv=0x3ffffff318, env=0x3ffffff328) at ./elf/dl-init.c:117
#20 0x0000003ff7f19260 in __GI__dl_catch_exception (exception=0x0, operate=0x3ff7fe9298 <call_dl_init>, args=0x3fffffb220) at ./elf/dl-error-skeleton.c:182
#21 0x0000003ff7fe945e in dl_open_worker (a=a@entry=0x3fffffb438) at ./elf/dl-open.c:808
#22 0x0000003ff7f19220 in __GI__dl_catch_exception (exception=0x3fffffb420, operate=0x3ff7fe93d4 <dl_open_worker>, args=0x3fffffb438) at ./elf/dl-error-skeleton.c:208
#23 0x0000003ff7fe96ea in _dl_open (file=0x3fffffb888 "/mnt/sd/mesa/isntalldir/lib/riscv64-linux-gnu/libvulkan_lvp.so", mode=<optimized out>, 
    caller_dlopen=0x3ff493eea4 <loader_platform_open_library+22>, nsid=-2, argc=<optimized out>, argv=0x3ffffff318, env=0x3ffffff328) at ./elf/dl-open.c:884
#24 0x0000003ff7e93a08 in dlopen_doit (a=a@entry=0x3fffffb758) at ./dlfcn/dlopen.c:56
#25 0x0000003ff7f19220 in __GI__dl_catch_exception (exception=exception@entry=0x3fffffb690, operate=0x3ff7e939b8 <dlopen_doit>, args=0x3fffffb758) at ./elf/dl-error-skeleton.c:208
#26 0x0000003ff7f192aa in __GI__dl_catch_error (objname=0x3fffffb6f8, errstring=0x3fffffb700, mallocedp=0x3fffffb6f7, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:227
#27 0x0000003ff7e9366e in _dlerror_run (operate=operate@entry=0x3ff7e939b8 <dlopen_doit>, args=args@entry=0x3fffffb758) at ./dlfcn/dlerror.c:138
#28 0x0000003ff7e93a94 in dlopen_implementation (dl_caller=0x3ff7e93a94 <___dlopen+84>, mode=<optimized out>, file=<optimized out>) at ./dlfcn/dlopen.c:71
#29 ___dlopen (file=<optimized out>, mode=<optimized out>) at ./dlfcn/dlopen.c:81
#30 0x0000003ff493eea4 in loader_platform_open_library (libPath=0x3fffffb888 "/mnt/sd/mesa/isntalldir/lib/riscv64-linux-gnu/libvulkan_lvp.so")
    at /mnt/sd/Vulkan-Loader/loader/vk_loader_platform.h:355
#31 0x0000003ff49593fe in loader_scanned_icd_add (inst=0x0, icd_tramp_list=0x3ff49c12c8 <scanned_icds>, 
    filename=0x3fffffb888 "/mnt/sd/mesa/isntalldir/lib/riscv64-linux-gnu/libvulkan_lvp.so", api_version=4198646, lib_status=0x3fffffbc90)
    at /mnt/sd/Vulkan-Loader/loader/loader.c:1618
#32 0x0000003ff495def0 in loader_icd_scan (inst=0x0, icd_tramp_list=0x3ff49c12c8 <scanned_icds>, pCreateInfo=0x0, skipped_portability_drivers=0x0) at /mnt/sd/Vulkan-Loader/loader/loader.c:3854
#33 0x0000003ff4959992 in loader_preload_icds () at /mnt/sd/Vulkan-Loader/loader/loader.c:1841
#34 0x0000003ff4964a9a in terminator_EnumerateInstanceExtensionProperties (chain=0x0, pLayerName=0x0, pPropertyCount=0x3fffffe0cc, pProperties=0x0) at /mnt/sd/Vulkan-Loader/loader/loader.c:6951
#35 0x0000003ff4967b94 in vkEnumerateInstanceExtensionProperties (pLayerName=0x0, pPropertyCount=0x3fffffe0cc, pProperties=0x0) at /mnt/sd/Vulkan-Loader/loader/trampoline.c:214
#36 0x0000002aab774ae4 in ?? ()
#37 0x0000002aab778ca2 in ?? ()
#38 0x0000002aaae3ee3c in ?? ()
#39 0x0000002aaae3fe9c in ?? ()
#40 0x0000002aaae7b368 in ?? ()
#41 0x0000002aaae83cb8 in ?? ()
#42 0x0000002aaae13728 in main ()

#42 to #36 are in Godot, which is a debug template, but it seems it was not built with debug symbols :smiley:

The Vulkan loader was built by me. It is using a build of Mesa I made with llvmpipe enabled, all on the target to avoid potential compiler issues.

If I use the system libvulkan and the IMG driver, the stack trace differs a bit, but not that much: it also happens when loading the ICD library, and in the end it crashes in libpthread.

Oh. oh …

I wonder. If I follow the crumb trail, libwayland-client comes with libpthread as a dependency, and that libwayland-client was built against glibc 2.27.
If I look at the glibc I have locally, it is 2.34 or more recent.

I’ve just spotted that: Why glibc 2.34 removed libpthread | Red Hat Developer

I wonder if this could be an issue.

Going to have to make a local build of libwayland-client just to make sure.

Nope. Rebuilt Mesa without Wayland support, and same issue.

Edit3: Starting to lose my mind. The Vulkan Samples ( GitHub - KhronosGroup/Vulkan-Samples: One stop solution for all Vulkan samples ) do build and run way further, and stop later without crashing, because it seems the software renderer builds but does not work on RV64, which is a completely different kind of problem.
What is this insane problem? Why does Vulkan-Samples build and run, while Godot triggers a segfault while loading a library? And yes, Vulkan-Samples also loads the library dynamically; it uses dlopen too.

I’m losing my marbles.

Yeah, I just tried the same (on the VM) and still see the core dumps with vkcube and the Vulkan triangle demo.
(I built Wayland, then rebuilt Mesa, and then the Vulkan-demos, just to make sure…)

I didn’t build everything with debug on, but here’s my backtrace:

(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=<optimized out>, 
    no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:43
#1  0x00007fff991eec8e in __pthread_kill_internal (signo=<optimized out>, 
    threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007fff991b9b8a in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007fff991aa048 in __GI_abort () at ./stdlib/abort.c:79
#4  0x00007fff98df0194 in __deregister_frame_info_bases (begin=<optimized out>)
    at ../../../src/libgcc/unwind-dw2-fde.c:281
#5  __deregister_frame_info_bases (begin=<optimized out>)
    at ../../../src/libgcc/unwind-dw2-fde.c:219
#6  0x00007fff98df065c in __deregister_frame (begin=<optimized out>)
    at ../../../src/libgcc/unwind-dw2-fde.c:296
#7  0x00007fff947a5926 in llvm::RTDyldMemoryManager::deregisterEHFrames() ()
   from /lib/riscv64-linux-gnu/libLLVM-15.so.1
#8  0x00007fff94705f5c in llvm::MCJIT::~MCJIT() ()
   from /lib/riscv64-linux-gnu/libLLVM-15.so.1
#9  0x00007fff947063c2 in llvm::MCJIT::~MCJIT() ()
   from /lib/riscv64-linux-gnu/libLLVM-15.so.1
#10 0x00007fff984078e4 in ?? ()
   from /opt/install-tbot-riscv64/lib/riscv64-linux-gnu/libvulkan_lvp.so

I’m definitely losing my marbles there.

I made a really simple C (compilable as C++) test app that dlopens libvulkan, dlsyms 2 symbols, calls them and prints what they return, which is basically what Godot is doing (without all the fluff Godot puts around it).

And it just works.

What Godot does around it should not matter.

Here is the sample code:

#include <stdio.h>
#include <stdint.h>
#include <dlfcn.h>

#include <vulkan/vulkan.h>

typedef VkResult (*vkEnumInstance)(const char *name, uint32_t *propCount, VkExtensionProperties *prop);
typedef VkResult (*vkEnumIVersion)(uint32_t *);

int main(int argc, char *argv[])
{
	printf("Let's try...\n");

	uint32_t count = 0;
	uint32_t version = 0;

	/* Load the Vulkan loader by hand instead of linking against it. */
	void *handle = dlopen("libvulkan.so.1", RTLD_NOW | RTLD_LOCAL);
	if (!handle) {
		fprintf(stderr, "dlopen failed: %s\n", dlerror());
		return 1;
	}

	vkEnumInstance enumMe = (vkEnumInstance)dlsym(handle, "vkEnumerateInstanceExtensionProperties");
	vkEnumIVersion enumVMe = (vkEnumIVersion)dlsym(handle, "vkEnumerateInstanceVersion");
	if (!enumMe || !enumVMe) {
		fprintf(stderr, "dlsym failed: %s\n", dlerror());
		return 1;
	}

	enumVMe(&version);

	printf("We got version %u\n", version);

	/* NULL properties pointer: only ask for the extension count. */
	enumMe(NULL, &count, NULL);

	printf("We have %u extensions!\n", count);

	printf("And if you read this, no crash happened...\n");
	return 0;
}

I’m not using the “official” Vulkan loader function to get the function pointers, since I load them by hand; using it is the next thing I will try, but again, that should not really matter.

If anyone has any ideas, they are welcome, because my brain really can’t cope right now with this complete nonsense.

Have you tried creating an issue on the official Mesa repo?

I doubt it is a Mesa issue. I think it is a compilation problem, but without a clearer understanding of what causes the issue :confused:

I looked up Godot and it appears they only added Vulkan support in 2023-03. And the very latest Imagination GPU update from StarFive is at most 2023-03-20 (img-gpu-powervr-bin-1.17.6210866.tar.gz). Until Imagination Technologies starts to publicly release source code for the IMG BXE-4-32 MC1 GPU, the only source of GPU binaries will be StarFive; maybe in the near future Pine64 for their Star64 SBC, and possibly Banana Pi once they release their board based on the JH7110, might also provide GPU updates. But it is totally possible right now that there is a problem you cannot work around. Personally, I would shelve the work until the next official StarFive release, which should include “Vulkan support (scroll down to ‘What’s Next - WIP’)”. If past history is anything to go by, it should be landing “soon”, like towards the end of this month or the start of next.

I built and ran the Vulkan-Samples from Khronos … I get the same behaviour as I do with Sascha Willems’ demos… GUI window opens with an all-black background, then crashes (core dump).

I built/ran your sample code on the x86_64 host and in the RISC-V VM, first using the default/installed libraries and paths, and secondly built/linked against all the things I’d built (Vulkan-SDK, Mesa, Wayland).
All 4 scenarios worked correctly (with lvp) - no errors.
( BTW, you’re missing an “f” in the last print statement in your sample code :wink: )

Yes, I understand the GPU situation. Hopefully Imagination releases updated source code soon for the kernel and Mesa drivers (they’ve already missed the cut-off for kernel 6.4…).

We are trying to get the Vulkan software renderer - LavaPipe - working on RISC-V, and that is what’s crashing right now… (the same build set-up and libraries that fail on RISC-V work for me on x86_64, aarch64 [QEMU] and ppc64le [QEMU], all with Ubuntu 23.04).

@mzs This has nothing to do with it. I have Godot 4 with Vulkan running on an even older Imagination GPU.
And the problem is not a performance problem. The problem is that it just doesn’t run, and this is unrelated to the proprietary driver, as you get the exact same problem with the open source one: the problem occurs with Mesa’s software renderer, which does not use Imagination’s code.

Again, the problem is not specific to the GPU driver. This is most likely a compiler issue, but I have no idea how to trigger it in a way that lets me report it properly; saying “Godot makes it crash” is clearly not a good way of doing it.

And as far as I know, the open source GPU driver should be delivered at some point, because the architecture we have on the VF2 is Rogue, and Imagination said that basically all or most Rogue GPUs would be supported (check the emails on the Mesa mailing list; they are just focusing on a small set of GPUs for now, and the project started before the announcement of the VisionFive 2).

I don’t know if this is helpful - I’m still learning to debug - but I rebuilt the Vulkan loader and vkcube (from the SDK) with debug on, and Mesa with debug on (and using clang/clang++ rather than gcc…).

Could it be an issue with LLVM(15) on Risc-V (that Mesa uses)?

gdb bt for vkcube (again, using lvp on the VM):
[Thread debugging using libthread_db enabled]                                                                                          
Using host libthread_db library "/lib/riscv64-linux-gnu/libthread_db.so.1".
Core was generated by `vkcube'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=<optimized out>, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:43
Download failed: Invalid argument.  Continuing without source file ./nptl/./nptl/pthread_kill.c.                                       
43	./nptl/pthread_kill.c: No such file or directory.
[Current thread is 1 (Thread 0x7fff78dff180 (LWP 63683))]
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=<optimized out>, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:43
#1  0x00007fff819acc8e in __pthread_kill_internal (signo=<optimized out>, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007fff81977b8a in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007fff81968048 in __GI_abort () at ./stdlib/abort.c:79
#4  0x00007fff7b05c194 in __deregister_frame_info_bases (begin=<optimized out>) at ../../../src/libgcc/unwind-dw2-fde.c:281
#5  __deregister_frame_info_bases (begin=<optimized out>) at ../../../src/libgcc/unwind-dw2-fde.c:219
#6  0x00007fff7b05c65c in __deregister_frame (begin=<optimized out>) at ../../../src/libgcc/unwind-dw2-fde.c:296
#7  0x00007fff7d5a5926 in llvm::RTDyldMemoryManager::deregisterEHFrames() () from /lib/riscv64-linux-gnu/libLLVM-15.so.1
#8  0x00007fff7d505f5c in llvm::MCJIT::~MCJIT() () from /lib/riscv64-linux-gnu/libLLVM-15.so.1
#9  0x00007fff7d5063c2 in llvm::MCJIT::~MCJIT() () from /lib/riscv64-linux-gnu/libLLVM-15.so.1
#10 0x00007fff811ebcdc in gallivm_free_ir (gallivm=0x7fff74011da0) at ../src/gallium/auxiliary/gallivm/lp_bld_init.c:218
#11 0x00007fff8125d8a0 in generate_variant (lp=0x55558a5db4f0, shader=<optimized out>, key=0x7fff78dfd678)
    at ../src/gallium/drivers/llvmpipe/lp_state_fs.c:3958
#12 llvmpipe_update_fs (lp=<optimized out>) at ../src/gallium/drivers/llvmpipe/lp_state_fs.c:4682
#13 0x00007fff81259214 in compute_vertex_info (llvmpipe=0x55558a5db4f0) at ../src/gallium/drivers/llvmpipe/lp_state_derived.c:289
#14 llvmpipe_update_derived (llvmpipe=0x55558a5db4f0) at ../src/gallium/drivers/llvmpipe/lp_state_derived.c:278
#15 0x00007fff8123f768 in llvmpipe_draw_vbo (pipe=0x55558a5db4f0, info=0x55558a5bcd90, drawid_offset=0, indirect=0x0, 
    draws=0x7fff78dfe550, num_draws=1) at ../src/gallium/drivers/llvmpipe/lp_draw_arrays.c:77
#16 0x00007fff811bca80 in lvp_execute_cmd_buffer (cmd_buffer=<optimized out>, state=0x55558a5bcd20, print_cmds=false)
    at ../src/gallium/frontends/lavapipe/lvp_execute.c:2700
#17 0x00007fff811bab04 in lvp_execute_cmds (device=<optimized out>, queue=0x55558a5bcb10, cmd_buffer=0x55558aa5a090)
    at ../src/gallium/frontends/lavapipe/lvp_execute.c:4575
#18 0x00007fff811b7058 in lvp_queue_submit (vk_queue=0x55558a5bcb10, submit=0x55558a5b36c0)
    at ../src/gallium/frontends/lavapipe/lvp_device.c:1323
#19 0x00007fff810c6baa in vk_queue_submit_final (queue=0x55558a5bcb10, submit=0x55558a5b36c0) at ../src/vulkan/runtime/vk_queue.c:377
#20 0x00007fff810c80fe in vk_queue_submit_thread_func (_data=0x55558a5bcb10) at ../src/vulkan/runtime/vk_queue.c:490
#21 0x00007fff810b8ec8 in impl_thrd_routine (p=0x0) at ../src/c11/impl/threads_posix.c:67
#22 0x00007fff819ab582 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
#23 0x00007fff819f90ea in __thread_start () at ../sysdeps/unix/sysv/linux/riscv/clone.S:85

Do both GCC and clang have this problem? If so, that might be Mesa’s fault.

Does an “strace -f insert_program_name_here_with_arguments 2>&1 | grep open” let you know what file it was trying to download (and from which directory)? Into the GPU, at a guess.

The problem you have here is a normal exit: the software rasteriser does not support RISC-V. Here your app is running, but the rendering part says “sorry, cannot work because of an unsupported CPU”, which is a different problem.

Try running it outside GDB; you should get a clear message from Mesa saying this:

WARNING: This target JIT is not designed for the host your are running. If bad thing happen, please choose a different -march switch.

It aborts rather than segfaults. This is a different problem: here the ICD library is loaded properly.

I have another Vulkan app (sadly not public) that crashes in a similar way. Looking at the backtrace, something was going wrong in some C++ init, so I looked at the compile options and noticed the use of -static-libgcc and -static-libstdc++. I decided to remove them and test, and, well, it still crashes, but that’s in the app itself and not while loading libraries!

And after a really quick check: Godot uses the same parameters.

I’m going to do a quick test, but if Godot then works, or at least goes further, it means something is going wrong in libstdc++. Where? No idea, but I think after that I will leave the problem to more qualified C++ people who know more about compilers & libstdc++ than me!

Edit: rebuilding without static libgcc and libstdc++ did run way further. Still some issues, but here the driver is more involved, so it is a completely different set of problems.

Edit2: I suppose that type of problem is less common on other archs, as the libc/C++ is more stable, but I just found this:
Possible libstdc++ incompatibility on Linux · Issue #388 · conda-forge/conda-forge.github.io · GitHub

Not exactly the same problem, but really similar, meaning a static libstdc++ is a bad idea in general. Oh well, good to know.

Thanks for your feedback. [ how are you able to ascertain that from the backtrace, if you don’t mind me asking?]

I don’t see any feedback from Mesa in the terminal; where should it be visible?

These bits:

The fact that the software rasteriser is being called (the MCJIT), and that other parts of the trace cite gallivm or llvmpipe, shows that the software rasteriser has been properly loaded and just crashes somewhere else, which in this case is likely because it hasn’t been ported/optimised for RISC-V. See this for example; RV is not (yet) supported:

Draft: llvmpipe: add a new jit engine based on llvm orcjit, also add in riscv support (!17801) · Merge requests · Mesa / mesa · GitLab

@cordlandwehr Sorry for the ping but do you know how to get vk_khr_surface working with the mesa patches you use for yocto?

My meson flags: