Adventure in Text on the UART

segaloco · December 5, 2022, 5:17am

So in my general lack of success of actually getting ddrinit to restore and recover correctly, I’ve turned to some bare metal programming, sending binaries over the debug UART since I can’t really seem to get anywhere else. I’m taking it as a challenge rather than defeat. In any case, spent yesterday evening preparing an environment for and successfully developing and running a simple exercise in text transmission. I was typing up some notes and realized it’d probably be valuable to someone trying to learn similar things to share the details and experience.

First, for my environment. My primary workstation these days is a Raspberry Pi 400 running a stripped out version of the stock RPi Linux distro (I run TWM for instance…don’t ask…). In any case, I’ve installed the distro standard riscv64-linux-gnu binutils and GCC, although my experiments haven’t featured C yet except prototyping and -S to learn my RISC-V better. I’m connecting to the device over a trivial USB<->TTY. I’m currently using the jh7100-recover tool and just sending my own stuff instead of the recovery loader. Seems to work just fine. I’m working on an sx-ish tool at present, something that can be called from a TTY dialer but also mitigates whatever XMODEM-CRC bug is present in the SiFive recovery shell. With those powers and some source code combined, anything soon becomes possible. I want to write a more traditional tool to integrate this process with utilities such as cu and screen which as I recall did work with the ddrinit, etc. over UART0. Details details, onto the hardware.

Tools assembled, I set out to find the necessary references to write text out on the TX line of the debug header (UART3), as this is the only consistent entrypoint into the device I have right now. I first took a look at how the bootloader recovery tool does it, and it all boils down to a series of character writes to a transmit register on the UART. Additionally, there is a status register with a bit to check, just to verify we can actually write to the UART at the moment. While I haven’t worked with a UART before, this is much like the data transmission register on the TI TMS9918 and descendants, better known as the VDPs of early Sega hardware. You just write a word, wait for clear (or hope you’re slow enough) and write the next. The device scoops it up and sends it along, hopefully faster than you can put another, but if not, it usually gives you a way to know to wait if you’re careful enough to check. Granted, that IC also had DMA, can you even DMA a UART…

The similarity is likely no coincidence, this UART on the JH7100 is a TI PC16550D. There is a reference somewhere in the bootloader recovery to the manual, SNLS378C. The manual explains all of the necessary programming interfaces of the UART, although all we’re really interested in are the transmit holding and line status registers. Page 17 indicates that the transmit holding register, THR, is at index 0, and the line status register, LSR, is at index 5.

Unfortunately, knowing what registers on the device to use, even with the manual, only shows an IC in isolation. The next step was to find the UART’s location in the JH7100 memory map. This information is in the JH7100 datasheet: STARFIVE-KH-SOC-PD-JH7100-2-01-V01.01.04-EN. Page 30 shows the base address for UART3 as 0x12440000. This, coupled with the specifics of the UART above, provides a transmission register at 0x12440000 and line status register at 0x12440005.
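The address arithmetic above can be written down as a small C helper. This is only a sketch: the base address and register indices are the ones quoted in the post, while the `shift` parameter is an assumption added for illustration. Some SoC integrations space 16550-style registers more than one byte apart (for example one register per 32-bit word); whether that applies to the JH7100 is not established here, and `shift = 0` reproduces the byte-packed layout the post assumes.

```c
#include <assert.h>
#include <stdint.h>

#define UART3_BASE    0x12440000u  /* UART3 base, per the JH7100 datasheet cited above */
#define PC16550D_THR  0u           /* register indices, per SNLS378C */
#define PC16550D_LSR  5u

/* Address of register `index` when registers sit 2^shift bytes apart.
 * `shift` is a hypothetical knob; 0 means byte-packed registers. */
static inline uintptr_t uart_reg_addr(uintptr_t base, unsigned index, unsigned shift)
{
    assert(shift < 8);
    return base + ((uintptr_t)index << shift);
}
```

With `shift = 0` this yields 0x12440000 for THR and 0x12440005 for LSR, matching the addresses used in the rest of the post.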

Now the task of writing code to perform this task in an efficient but cautious way. I opted for assembly to understand the primitives involved, although I imagine the process described herein could be applied purely in C. This code is simple. It only issues one character, but it does so in a predictable manner. That’s all one needs to then build string functions which then open up feedback mechanisms.

First, simply, we have an entry point. The name doesn’t truly matter, the linker might just bark at you later depending on the name. The SiFive loader doesn’t see nor care what the name is.

    .text
    .globl _start
_start:
    addi    sp, sp, -8
    sd      ra, 0(sp)

Gotta save the return address, RISC-V clobbers it on every function call. Convention is to grab a stack entry and stash it there. A micro-optimization might be to drop this on the leaf nodes of call stacks. Does your favorite compiler do this? Who knows…

Next, descent into the actual UART write of a character:

    li      a0, 'A'
    call    _putchar

That does the real magic, explained in a moment.

After that’s done, that return address comes in handy:

    ld      ra, 0(sp)
    addi    sp, sp, 8
    ret

Easy stuff, just take back the return address and give back the stack. This little application should result in the emission of one single ‘A’ on the UART TX line then a return to the SiFive loader. Now for the write itself:

    .text
    .globl _putchar
_putchar:
    addi    sp, sp, -8
    sd      ra, 0(sp)

You do the hokey pokey, you put your return address in and shake it all about.

    li      t0, UART3_LSR /* 0x12440005 */
1:
    lbu     t1, (t0)
    andi    t1, t1, PC16550D_LSR_THRE /* 0b00100000 */
    bne     t1, zero, 1b

This loads up the address of the line status register then enters a tight loop where I grab the latest status, check it for the transfer blocked bit, and repeat until it’s clear. This is “optional” if you want to assume the UART TX is always open. Perhaps it recovers gracefully from an attempt to write when it isn’t available, perhaps it doesn’t. When in doubt, most details are in a register somewhere.

Now the moment we’ve all been waiting for! Let’s write that byte:

    li      t0, UART3_THR /* 0x12440000 */
    sb      a0, (t0)

That’s it, all that work for two lines of code. I could probably put this right after _start along with loading an immediate character to a0 and get something on the UART, but a little structure helps. Finally, this was predictable, but I didn’t actually have to do any of this here:

    ld      ra, 0(sp)
    addi    sp, sp, 8
    ret

Never altered that return address, but if I ever want to adjust how putchar works, it’s that much easier to skip the boilerplate.

So there you have it! That’s the code to print an arbitrary character over the debug UART on the JH7100 and return to the loader. However, assembly code is not a running binary image on the system. There’s still the matter of building the binary to send with jh7100-recover. This will entail a makefile and a linker script. First, the linker script. The only purpose of this script is to set an entry address, but access to the various mapping features of the linker may prove valuable when using complex memory layouts.

MEMORY {
    intRAM0 (rx) : org = 0x18000000, len = 128k
}

A very simple memory layout. I’m only running PIC code and have no data, so one executable segment suffices. I haven’t experimented with where all the SiFive loader will load from, but 0x18000000 (the base of intRAM0 per page 28 of the JH7100 datasheet) is where the recovery loader is sent to and run from, so why not.

SECTIONS {
    .text : {
        *(.text)
    } > intRAM0
    .data : {
        *(.data)
    } > intRAM0
    .bss : {
        *(.bss)
    } > intRAM0
}

About as generic as it gets. Just puts the memory segments sequentially in intRAM0 at the location we’re going to enter. One unfortunate quirk of the RISC-V implementation of GNU ld is it inexplicably doesn’t support “OUTPUT_FORMAT(binary)” or the equivalent command-line option, complaining that it can’t be done. As will be shown in the makefile, another binutils tool does this easily. Is this the UNIX philosophy at work? Different tools for different jobs but the authors themselves can’t even make them talk to each other.

On to the makefile. This makefile is pretty generic, like the above linker script. It’ll be brief:

main.bin: main.elf
	objcopy -O binary main.elf main.bin

main.elf: src/main.o src/putchar.o
	ld -T link.ld -o main.elf src/main.o src/putchar.o

This, of course, would only work if you’re running on a riscv64. Otherwise, prefixes will be necessary on the utility names. In any case, an effective makefile uses variables to represent most repeated text, but this is just to demonstrate the bare essentials. The assumption here is that the two functions described above are in their own assembly files, src/main.s and src/putchar.s. I’m using GNU make, so mileage may vary on relying on built-in rules elsewhere. POSIX doesn’t require .s.o, so I’m cheating a little bit.

As mentioned above, I handle the lack of OUTPUT_FORMAT(binary) support by using objcopy to map the typical elf generated by ld into a binary address space and flatten it out. It’s a shame this is necessary, the same is required when producing flat aarch64 binaries for similar purposes. This directive has worked just fine for m68k and sh targets in the past.

So with all of that, I get main.bin, which I then send via

jh7100-recover -D /dev/ttyUSB0 -r main.bin

and I get:

Don’t take my word for it. Recreate this and try for yourself.

I’m currently working up a little text library for this programming scenario as well as a better XMODEM-CRC tool to allow better interactive use of the debug port via cu. Whenever there’s meaningful progress on either I’ll be sure to share those either in this thread or a new one depending on the passage of time.
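A string routine of the kind such a text library might start with can be sketched in C on top of a single-character writer. Everything named here (`putc_fn`, `uart_puts`, `capture_putc`) is hypothetical and not taken from the author's library; on the board the sink would be the asm putchar routine, while the capture sink below just records output on a host machine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical character sink; on hardware this would wrap the asm putchar. */
typedef void (*putc_fn)(int c);

/* Emit s followed by a newline, like C puts(). Returns characters emitted. */
static size_t uart_puts(putc_fn put, const char *s)
{
    size_t n = 0;
    assert(put != NULL && s != NULL);
    while (*s != '\0') {
        put((unsigned char)*s++);
        n++;
    }
    put('\n');
    return n + 1;
}

/* Host-side stand-in that stores characters instead of touching a UART. */
static char capture[64];
static size_t capture_len;

static void capture_putc(int c)
{
    if (capture_len < sizeof capture - 1)
        capture[capture_len++] = (char)c;
}
```

Passing the sink as a function pointer keeps the string logic testable without hardware; a bare metal build would more likely call the character routine directly.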

segaloco · December 5, 2022, 9:06am

And I already have an update. Got that sx-like tool working as well as a bit of an expanded example using the UART3 for text. Both can be found here: Matthew Gilmore / riscv-bits · GitLab

The tool sxj is like sx in that you can call it from exec !! in screen among other places, but it actually works for the debug UART. I had moderate success with cu as well, although I’m finding that cu can’t very easily provide stdin to the application, only receive stdout from it, so I have to manually type the ‘C’ on my end to initiate the XMODEM-CRC as well as provide a non-NAK character every packet if using cu. It otherwise works, indicating no NAK failures at least. The program is strictly ANSI C, so should work anywhere you can hook up both ends of it to a TTY. Happy to answer any questions.
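For readers unfamiliar with the protocol mentioned here: in XMODEM-CRC the receiver sends 'C' to start, and each packet carries 128 data bytes protected by a 16-bit CRC. The following is a sketch of the standard packet layout and CRC, not code from sxj, and it makes no attempt to work around the recovery shell quirk mentioned earlier. The function names and the Ctrl-Z padding choice are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define XMODEM_SOH   0x01
#define XMODEM_DATA  128
#define XMODEM_PAD   0x1A   /* conventional padding byte (Ctrl-Z), an assumption here */

/* CRC-16/XMODEM: polynomial 0x1021, initial value 0, no reflection. */
static uint16_t xmodem_crc16(const uint8_t *buf, size_t len)
{
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)buf[i] << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Build one packet: SOH, block, ~block, 128 data bytes, CRC high, CRC low. */
static void xmodem_build_packet(uint8_t out[XMODEM_DATA + 5], uint8_t block,
                                const uint8_t *data, size_t len)
{
    assert(data != NULL && len <= XMODEM_DATA);
    out[0] = XMODEM_SOH;
    out[1] = block;
    out[2] = (uint8_t)~block;
    memset(out + 3, XMODEM_PAD, XMODEM_DATA);   /* pad short final blocks */
    memcpy(out + 3, data, len);
    uint16_t crc = xmodem_crc16(out + 3, XMODEM_DATA);
    out[3 + XMODEM_DATA] = (uint8_t)(crc >> 8);
    out[4 + XMODEM_DATA] = (uint8_t)(crc & 0xFF);
}
```

The receiver answers each packet with ACK or NAK, which is why a terminal that cannot feed the remote side's replies back to the sender forces the manual keystrokes described above.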

segaloco · December 19, 2022, 2:38am

Just chiming in with some updates. I’ve refocused to using screen exclusively for now, cu doesn’t seem to have a consistent interface for redirecting both stdin and stdout to a local application (sxj in this case) so screen will have to do.

Ran into a snag when loading multiple binaries successively, hopefully this helps someone else in a similar situation. So I found that sometimes I would send a new copy of my binary and try to enter it again only to get either the same result or a somewhat confusing result. Sometimes the code would behave as if the instructions hadn’t changed but the data had or vice versa. Went and read up on the privileged ISA and execution environment a bit and decided to throw in a ‘fence.i’ on the start of my embedded project. Now I’m not seeing this at all, every execution seems to be a new copy of code at least.

However, my new current snag is some inexplicable behavior that I really, really, REALLY want to solve before I start sinking the time into learning virtual memory. If anyone has seen this doing bare metal programming, please let me know, because I’m not turning up any explanation…

So I’ve got this start:

_start:
    fence.i
    addi    sp, sp, -8
    sd      ra, (sp)

    call    main

    ld      ra, (sp)
    addi    sp, sp, 8
    ret

And this main

const char teststr[] = "test";

int main(void)
{
    int i;
    char c = 'a';

    _puts("arbitrary string");

    for (i = 0; i < 10; i++) {
        _puts(teststr);
        _putchar(c++);
    }

    _puts("arbitrary string 2");

    return 0;
}

Where _puts and _putchar are equivalent to their C counterparts but are asm routines hitting UART directly. In any case, this should print the first string, followed by 10 iterations of the test string and then start the next line with a char. I went with something weird that I could compare with. However, this is what I’m getting

# do 0x18000000
arbitrary string
test
atest
btest
ctest
dtest
etest
ftest
#

So I’m not getting the entirety of what I’ve programmed to get back. I would’ve expected a crash to prevent it from completing, but instead, it’s almost as if it somehow bails and knows how to get back to the prompt, because I can then issue further load/do combinations without fail. I’m incredibly stumped by this, because it should never be getting back if it’s crashing somewhere, but if it’s able to get all the way back to my “ret” in _start, how the heck is it just arbitrarily bailing out of the for loop at random? Well, it’s not completely random, it always stops at that point, but regardless, there’s nothing I can think of that would directly explain this, it seems to just kinda bail from the loop of its own accord. As I was typing this, I then changed the main to this

const char teststr[] = "test";

int main(void)
{
    int i;
    char c = 'a';

    for (i = 0; i < 10; i++) {
        _puts(teststr);
        _putchar(c++);
    }

    return 0;
}

So simply removed the directives to print those two longer strings. Should drop that from the output right? Well…it kinda does, but now I get:

# do 0x18000000                            ��������                                   (                                          a(             
b(
c(
d(
e(
f(
g(
h(
i(
j����
#

That’s not much better…now it’s just spitting out characters that aren’t even in my code. To make matters worse, if I power cycle it, then load and do the exact same binary:

# do 0x18000000
test
atest
btest
ctest
dtest
etest
ftest
gtest
htest
it
#

Now it actually gets through alllllmost all of the iterations, but still just kinda randomly bails out YET also gets back to the prompt. I’m at a loss. It seems to be behaving completely randomly, which is the exact opposite of what a computer should be. It’s a bummer because I really want to start chipping away at a virtual memory system, but I don’t even have the confidence that simple loop of text is going to print, much less that all the heavy lifting of configuring a virtual memory system is going to even be worth my time since it might work perfectly one day and completely bomb the next, all dependent on the order of operations, whether I’m looking at the board or not, if it’s Tuesday, raining, the moon is half full, etc.

Long story short, I can find no rhyme nor reason to the behaviors the system is exhibiting in even the most basic of operations as printing text…configuring a virtual memory system is going to be an uphill battle and if I can’t even confidently print two strings in a row without something completely unexpected happening, that doesn’t bode well for any of the other things I’m trying to learn…hopefully I’m just missing something obvious because it’s not like this is my first time writing assembly language or running in the highest privilege mode on a CPU, usually it’s unforgiving and does EXACTLY what you say…but this is doing things I didn’t explicitly put into code and I really can’t explain what is going on, how, where, why, etc. I’m going to drop back to using just asm and see if I can at least get consistent text handling there. I really don’t want to be locked to just asm for now, but if things are going to get funky when I pull C in, then I’m better off being in asm at least until I get to user mode.

segaloco · December 19, 2022, 2:45am

Actually, I wonder, one of the things I’ve been wondering is if I’m running up against the data/bss segments of whatever loader binary is sitting on the debug UART when you start with the boot button pressed.

Is that information available somewhere? Is there a chunk of RAM that is guaranteed safe to use for a stack pointer? What I’m afraid of is that I’m gobbling up stack data from the loader and maybe that’s somehow why it’s not being consistent. I can’t really find another explanation for this…it’s completely bewildering.

mzs · December 19, 2022, 2:56am

Just thinking out loud here, but what happens if you add a delay at the end of the program before it exits, an insanely long duration of say 1 second, maybe even 2 seconds. The serial interface can only print out information so fast, and if your program terminates before everything in the buffer has fully been flushed why would the remainder be processed at all and not just discarded.

EDIT: Or maybe there is a way to poll the serial interface and find out when it has finished transmitting output and only terminate the program after that has occurred.
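The polling idea in the EDIT can be sketched in C. On a 16550-style UART the line status register has a "transmitter empty" bit (0x40) that is set once both the holding and shift registers have drained. The function name, the poll limit and the byte-indexed register layout are assumptions for illustration; the pointer would be the UART base on hardware, or an ordinary array when trying this on a host.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PC16550D_LSR       5u
#define PC16550D_LSR_TEMT  0x40u  /* holding and shift registers both empty */

/* Poll LSR until TEMT is set or max_polls reads have been made.
 * Returns true if the transmitter reported empty, false on giving up. */
static bool uart_tx_drain(volatile const uint8_t *uart, unsigned long max_polls)
{
    assert(uart != 0);
    while (max_polls-- > 0) {
        if (uart[PC16550D_LSR] & PC16550D_LSR_TEMT)
            return true;
    }
    return false;
}
```

Calling something like this just before returning to the loader would rule out the "program exits before the output is flushed" theory, and the bounded loop avoids a hang if the bit never appears.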

segaloco · December 19, 2022, 5:06am

So fixed a few bugs in my last post and the little bit of text I did get went away. Basically the PC16550D has a bit to indicate whether the transmission holding register (the register a character is written out on) is clear to accept a character. If I ignore this, I get UART output but buggy, in the way described, I get a certain amount of characters in a row then it nosedives, but eventually returns, so must be doing the loop and letting the UART fail.

If I await the bit that says this register is empty, I never get anywhere at all. I can’t even write the first character. I just tried checking the empty bit which is set when both the holding and shift registers are clear. That doesn’t work either, awaiting a 1 in either just hangs. There is state involved that I’m not aware of. The UART operates in a regular or FIFO mode. In one, the holding empty bit is set when a character is moved out of the register, presumably by the UART as the spec says the line status register is read only. In FIFO mode, however, the bit is set when the FIFO is completely empty. I’m going to see if I can suss out the state because this information is definitely part of the equation.

Failing that, I may just have to see how much I can get away with in a certain number of characters per burst. In the recovery_bootloader project on Github, there is a commented out wait cycle in the _putc method of the UART driver, otherwise this is where I first saw that it was a matter of register bits to set the properties of the UART. I wonder if the author of that code also needed to just wait out time between UART writes too…

segaloco · December 19, 2022, 5:21am

I leave you with this, this is what I’ll be contemplating for a while:

putchar:
    addi    sp, sp, -16
    sd      t0, (sp)
    sd      t1, 8(sp)

    mv      t0, zero
    li      t1, UART3

1:
    lbu     t0, PC16550D_LSR(t1)
    andi    t0, t0, PC16550D_LSR_THRE
    beqz    t0, 1b

    sb      a0, PC16550D_THR(t1)

    ld      t1, 8(sp)
    ld      t0, (sp)
    addi    sp, sp, 16
    ret

This should just await that holding register empty bit to be set then spit out a character in a0 on the holding register. Without the three instruction loop there it prints a character just fine, it only poops out if I try to do too many in a row. Above, UART3 = 0x12440000, PC16550D_LSR = 5, PC16550D_THR = 0, PC16550D_LSR_THRE = 0x20, and a0 is a character to print. I can verify t1 is set right in that a character from a0 is written without the catch. That line status register bit just never seems to be set. Neither that nor the empty bit 0x40.
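For comparison, the same routine can be sketched in C with a bounded wait, so that a status bit that never appears shows up as a failure return rather than a hang. The constants are the ones listed above; the function name and `max_polls` are hypothetical, and the byte-indexed register layout is the same assumption the asm makes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PC16550D_THR       0u
#define PC16550D_LSR       5u
#define PC16550D_LSR_THRE  0x20u  /* transmit holding register empty */

/* Wait (up to max_polls extra reads) for THRE, then write one character.
 * Returns false without writing if THRE never showed up. */
static bool uart_putc_bounded(volatile uint8_t *uart, uint8_t c,
                              unsigned long max_polls)
{
    assert(uart != 0);
    while ((uart[PC16550D_LSR] & PC16550D_LSR_THRE) == 0) {
        if (max_polls-- == 0)
            return false;   /* report the stuck bit instead of spinning forever */
    }
    uart[PC16550D_THR] = c;
    return true;
}
```

A caller could fall back to an unconditional write when this returns false, which would at least make the "bit never sets" case visible in the output.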

RiscVFan · December 23, 2022, 4:49am

The Tx fifo is small, around 16 bytes. If you have it enabled and write bytes without. The LSR only tells you if the txfifo is FULL (you need to stop writing to it NOW) or if it’s empty (you can write 15 bytes into it) and there’s really not a way you can know if it’s safe to drop characters in between these two points. (Well, there is and at one time, I was listed on the patent for it…) On these parts, it’s normal to let the tx empty interrupt trigger to start your writes then you fill until you run out of data or until LSR bit 5 THRE says it’s full.

If you really insist on polling it and letting the host software control the pacing, you’ll pretty much have to busy-wait on tx-empty, LSR bit 6.

The dirty icache thing stinks. Those are nasty to find. Sorry.

segaloco · December 23, 2022, 5:28am

Well the good news is I’ve fully confirmed that all the weirdness I was seeing in my early tests came from not using the UART entirely correctly. It sounds like there’s some sort of interrupt I can attach that’ll keep characters flowing, the expectation probably being that program code loads a buffer and that buffer is then constantly dumped to the UART as it comes free.

In any case, I’ve moved on to privilege modes so another thread incoming on experiences had there.

RiscVFan · December 23, 2022, 5:48am

What you’re describing (write() goes to a kernel buffer. kernel buffer check if tx fifo is empty and starts loading characters into tx fifo until it’s full or the buffer is empty; return; another tx empty interrupt eventually happens) is pretty much the way the transmit half of the driver for all this class of UARTs work in every OS. Alternately, if you need simplicity over performance (e.g. a kernel debugger port) you just busy-wait the UART and don’t return until it’s left the FIFO, if not the shift register.
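The transmit path described above can be sketched as a software ring buffer plus a handler that runs on each TX-empty interrupt. This is a host-testable illustration only: the names, the 64-byte ring and the `thr_write` callback (standing in for a store to the transmit holding register) are assumptions, the FIFO depth of 16 is the approximate figure quoted in this thread, and real driver code would also need interrupt masking around the shared indices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TX_FIFO_DEPTH 16u   /* approximate hardware FIFO size quoted in the thread */
#define TX_RING_SIZE  64u   /* hypothetical software buffer size (power of two) */

struct tx_ring {
    uint8_t  data[TX_RING_SIZE];
    unsigned head;   /* count of bytes ever queued */
    unsigned tail;   /* count of bytes ever sent */
};

/* write() side: queue as many bytes as fit; returns the number accepted. */
static size_t tx_enqueue(struct tx_ring *r, const uint8_t *buf, size_t len)
{
    size_t n = 0;
    assert(r != NULL && buf != NULL);
    while (n < len && r->head - r->tail < TX_RING_SIZE)
        r->data[r->head++ % TX_RING_SIZE] = buf[n++];
    return n;
}

/* TX-empty interrupt side: the FIFO is known empty, so move up to one
 * FIFO's worth of queued bytes into the UART. Returns bytes written. */
static unsigned tx_empty_isr(struct tx_ring *r, void (*thr_write)(uint8_t))
{
    unsigned sent = 0;
    assert(r != NULL && thr_write != NULL);
    while (sent < TX_FIFO_DEPTH && r->tail != r->head) {
        thr_write(r->data[r->tail++ % TX_RING_SIZE]);
        sent++;
    }
    return sent;
}

/* Host-side stand-in for the hardware register: just counts bytes. */
static unsigned wire_count;
static void fake_thr_write(uint8_t c) { (void)c; wire_count++; }
```

Queuing 40 bytes and then invoking the handler repeatedly drains them in bursts of 16, 16 and 8, which is the refill pattern the interrupt would drive on real hardware.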

I used to do comms drivers professionally, including for this part, though long ago. I may not recall all the pitfalls of this part (and notice that the above falls apart if you have a UART with a stinking brain that can DMA from memory and/or do flow-control natively) but that block diagram is the right starting place.

It’s cool/horrifying that you’re (ab)using jh-7100-recover in this way. I’d really hoped to be able to tftpboot custom OS binary images in an automated way, but have never gotten that to really work right, even on 7100.

Good luck, @segaloco !

segaloco · December 23, 2022, 6:55am

Worth mentioning, I actually wrote a new tool, jh-7100-recover is very specific to loading those recovery binaries, so is a pain trying to do anything interactive with. I instead wrote this util/sxj.c · master · Matthew Gilmore / riscv-bits · GitLab which acts as a drop in ‘sx’ XMODEM replacement, that way it can be used with any TTY workflow like GNU screen or even old school uucp ‘cu’ (with some caveats).

At this point what I can do with the UART works, I just need to drop a string here and there to verify I got to a particular line, I’m not sure if there’s an easy way to do actual breakpoint debugging on this thing in this mode. Losing the bootloader means I can’t setup gdb remote under Linux or something, but that also wouldn’t help with M-mode anyway, I don’t think.

My background in this level of stuff is all 90s game consoles and such, so I’m learning quite a bit about the current state of bare metal dev, it’s kinda thrilling when it’s not frustrating…

RiscVFan · December 23, 2022, 7:51am

Cool!

I’ll only add that figuring out JTAG the first time is a pain, but it’s So Awesome to be able to step through boot and VM setup and interrupt handlers and others where puts(__func__ __LINE__); will fail you.