Privilege Modes Experiences

So after deciding futzing with the UART wasn’t something I felt like putting much more time into, I decided to turn my efforts to understanding the privilege and exception system as it pertains to things like a supervisor and system calls.

The relevant literature is here: https://github.com/riscv/riscv-isa-manual/releases/download/Priv-v1.12/riscv-privileged-20211203.pdf

Where I’ve landed is a set of primitives in my codebase at Matthew Gilmore / riscv-bits · GitLab that, while built in C and not the kind of thing for production workflows, seem to result in a consistent entry into supervisor, then user mode, complete with the necessary configuration to then use the ecall instruction for traditional UNIX-style syscalls. Of course, in a more sophisticated project, most of this environmental setup would be done in assembly for efficiency’s sake. I’ve got functions nesting 2-3 calls deep just to hit one opcode for crying out loud! But it does make it easier to read for educational reasons. The other thing to keep in mind throughout my description here is this is done via the debug interface on the JH7100 and is not necessarily indicative of what would need to be done from the RESET vector on any old RISC-V. Once I get a bit further along with this I do mean to nab an emulator and at least work out some handlers for lack of S/U support or presence of H.

Anywho, at the time the debug console enters a binary, the CPU is in machine mode with the MMU enabled in Sv39 mode. I haven’t bothered to look up the memory protection state at this point, but it does need to be adjusted from whatever the startup state is.

Upon entry, I disable interrupts so as to cut down on potential unexpected state changes. This is done by clearing both the mip and mie registers.

Next, I clear any memory protection and turn off the MMU. The former is done by setting all bits of pmpaddr0 (basically writing -1) and then setting pmpcfg0 to 0x0F. I also set pmpcfg2 to 0 for good measure. Basically pmpcfg0 and pmpcfg2 amount to 16 packed 1-byte registers governing adjustable memory ranges which can have different protection levels (e.g. r - read, w - write, x - execute). I only want one memory range, the entirety of the memory map, and I want full, normal permissions.

The PMP handler establishes ranges of memory in pmpaddr0, pmpaddr1, etc., each of which governs from the address written down to the previous one according to the corresponding pmpcfg entry. That said, pmpaddr0 has no previous register, so the configuration in the lowest byte of pmpcfg0 is considered to cover 0x0 up to the address in pmpaddr0. So what the above register writes result in is that memory from 0 to the highest possible address is the first and only PMP group, and the entire range is rwx, so full access.

So just to review, because this tripped me up a bit, pmpaddr0 is the top of the map, so the theoretical pmpaddr-1 is 0. Then pmpcfg0’s lowest byte is 0x0F: the low three bits grant read, write, and execute, and the 0x08 bit selects the top-of-range matching used here, which essentially means for pmpaddr0, allow read, write, and execute from this address all the way down to the next lowest pmpaddr register. This is pmpaddr0, so the “lowest” is 0.

To put a different spin on it, if I wanted to allow access only to addresses 0x18000000-0x19000000, pmpaddr0 would be 0x18000000 and pmpaddr1 would be 0x19000000 (both shifted right 2, 4 byte granularity). Then the lowest byte of pmpcfg0 would have cleared rwx bits and the next byte would have these bits set, essentially saying from 0 - 0x18000000, no access, then from 0x18000000 to 0x19000000, rwx access. I’m not sure what happens above 0x19000000 in this scenario; the spec probably says.
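As a rough sketch of the encoding described above, the two PMP values can be computed with plain bit arithmetic. The helper names below are invented for illustration and are not taken from riscv-bits; the CSR writes are guarded so the arithmetic can be checked on any host, and the no-access entry is assumed to use top-of-range matching as well.

```c
#include <stdint.h>

/* Bits of one 8-bit pmpcfg entry, per the privileged spec. */
#define PMP_R   0x01u   /* read */
#define PMP_W   0x02u   /* write */
#define PMP_X   0x04u   /* execute */
#define PMP_TOR 0x08u   /* A field = 1: top-of-range matching */

/* pmpaddr registers hold the physical address shifted right by 2. */
static inline uint64_t pmp_addr_encode(uint64_t phys_addr)
{
    return phys_addr >> 2;
}

/* Place one entry config into byte slot 0..7 of a 64-bit pmpcfg value. */
static inline uint64_t pmp_cfg_set(uint64_t cfg, unsigned slot, uint8_t entry)
{
    unsigned shift = slot * 8u;

    cfg &= ~((uint64_t)0xFF << shift);
    return cfg | ((uint64_t)entry << shift);
}

#if defined(__riscv) && (__riscv_xlen == 64)
/* Hypothetical helper: one rwx range covering the whole address space. */
static inline void pmp_open_all(void)
{
    uint64_t top = ~(uint64_t)0;
    uint64_t cfg = pmp_cfg_set(0, 0, PMP_TOR | PMP_R | PMP_W | PMP_X);

    __asm__ volatile ("csrw pmpaddr0, %0" :: "r"(top));
    __asm__ volatile ("csrw pmpcfg0, %0" :: "r"(cfg));
}
#endif
```

For the 0x18000000-0x19000000 example, pmp_addr_encode gives 0x06000000 and 0x06400000 for pmpaddr0 and pmpaddr1, and setting slot 0 to PMP_TOR and slot 1 to PMP_TOR | PMP_R | PMP_W | PMP_X gives a pmpcfg0 value of 0x0F08.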

With PMP essentially disabled, next the MMU should be turned off. This is done by writing 0 to the satp register. Setting the highest bit instead selects Sv39 again, but I won’t be doing that until I get to virtual memory, which, now that I’ve figured out this privilege switching, is second on my list after formalizing a better understanding of PMP.
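To make the satp values concrete, here is a small sketch of how the register is laid out on RV64: the mode lives in the top four bits and the physical page number of the root page table in the low 44 bits. The function name is made up for this example, and the ASID field is simply left at zero.

```c
#include <stdint.h>

#define SATP_MODE_SHIFT 60
#define SATP_MODE_BARE  0ull               /* no translation */
#define SATP_MODE_SV39  8ull               /* 0b1000: only the top bit set */
#define SATP_PPN_MASK   ((1ull << 44) - 1)

/* Build a satp value from a mode and the physical address of the root table. */
static inline uint64_t satp_make(uint64_t mode, uint64_t root_table_phys)
{
    uint64_t ppn = (root_table_phys >> 12) & SATP_PPN_MASK;

    return (mode << SATP_MODE_SHIFT) | ppn;
}
```

With this, satp_make(SATP_MODE_BARE, 0) is the plain 0 written above, and satp_make(SATP_MODE_SV39, 0) is 0x8000000000000000, which is the "highest bit" case.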

So now interrupts are disabled and memory is completely exposed and unmapped. This is the typical environment that a UNIX-like kernel would expect to start up in.

At this point I run a fence.i for good measure, just to ensure nothing is amiss. I’m not sure if this matters in this memory configuration since it’s stripped down to physical addressing. This may have some influence on cache, but it’s hard to say. Most of my experience is with M68k and back, so I’m cutting my teeth on many things here: RISC, caches, memory protection, VM, etc.

In any case, the CPU is finally in a controlled state which will allow a supervisor application to operate. Now the CPU simply needs to be told a few things about this supervisor environment and it can be entered. The supervisor is going to need to be able to catch calls from user mode and service them. By default all exceptions and interrupts elevate to machine mode, and must be explicitly delegated to lower privilege modes. The various bits in the medeleg and mideleg registers can be used to reroute various exceptions and interrupts. In this case, bit 8 (0x100) in medeleg is the ecall-from-user-mode bit, which is precisely what is needed here. So this bit is set and thus ecalls in user mode will call into the supervisor exception vector. Speaking of which, the supervisor exception vector is then written to stvec. This is the address any exceptions and interrupts delegated to supervisor mode will branch to. The mcause and scause registers then contain the code for the reason the vector was called. The high bit is set for an interrupt and clear for an exception, and the remaining bits are an enumeration of the possible causes. Checking this cause is crucial to determining if a particular entry to the vector is due to an ecall or some other reason.
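The delegation bit and the cause check can be sketched with a few helpers. The cause numbers (8 for an ecall from U mode, 9 for an ecall from S mode) come from the privileged spec; the function names are hypothetical and only meant to show the bit layout on a 64-bit hart.

```c
#include <stdbool.h>
#include <stdint.h>

#define CAUSE_INTERRUPT_BIT (1ull << 63)  /* top bit of mcause/scause on RV64 */
#define CAUSE_ECALL_FROM_U  8u
#define CAUSE_ECALL_FROM_S  9u

/* Bit to set in medeleg to hand a given exception cause to S mode. */
static inline uint64_t deleg_bit(unsigned cause)
{
    return 1ull << cause;
}

/* True when the trap was an interrupt rather than an exception. */
static inline bool cause_is_interrupt(uint64_t cause)
{
    return (cause & CAUSE_INTERRUPT_BIT) != 0;
}

/* The cause number with the interrupt flag stripped off. */
static inline uint64_t cause_code(uint64_t cause)
{
    return cause & ~CAUSE_INTERRUPT_BIT;
}

/* True only for an exception whose code is "ecall from U mode". */
static inline bool cause_is_user_ecall(uint64_t cause)
{
    return !cause_is_interrupt(cause) && cause_code(cause) == CAUSE_ECALL_FROM_U;
}
```

deleg_bit(CAUSE_ECALL_FROM_U) is the 0x100 mentioned above. Checking the interrupt flag first matters because an interrupt can share a code number with an unrelated exception.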

After the exception delegation is set up, the supervisor environment can finally be entered. The way I’ve come to understand execution on CPUs is that you’re technically always in an exception vector. Even the RESET vector could be seen as an exception; it simply is the first place you go, so there isn’t a stack frame or other semantics to say what to do when a RESET exception is done. In any case, this general understanding lends itself well to the steps involved in descending a privilege level. Unlike typical code execution where a return address (in RISC-V, the ra register) is branched to after the completion of execution, switching between privilege levels requires a little bit of extra information since there are multiple levels. In this case, the CPU needs to know what privilege mode it is going to be jumping into and where, but the places these are set may not be immediately obvious. Thinking of the current thread of execution in machine mode as an exception, if the exception’s goal was to complete and return to supervisor mode, then it stands to reason that the last mode recorded would be supervisor mode and the place to return from the exception would be the place to enter in S mode.

Well, what this boils down to is filling the MPP field in the mstatus register with the desired “previous” mode, S mode in this case, and then setting the mepc register to the entrypoint of the supervisor. The reason this works is the MPP field represents the last privilege mode before the machine mode exception was entered, and mepc represents the program counter of the instruction to return to. So this is really saying this program starts in a machine exception that, upon completion, should return to the beginning of the kernel.

Since this is now the state of the CPU, simply issuing an mret instruction enters the kernel. In a roundabout way, this return is actually an entry.
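A sketch of that sequence: MPP is the two-bit field at bits 12:11 of mstatus, with 0 for U, 1 for S, and 3 for M. The names here are invented for the example, and the part that touches CSRs is guarded so the field arithmetic can be checked on any machine. It assumes the supervisor entry point is happy to keep using the current stack.

```c
#include <stdint.h>

#define MSTATUS_MPP_SHIFT 11
#define MSTATUS_MPP_MASK  (3ull << MSTATUS_MPP_SHIFT)

enum priv_mode { PRIV_U = 0, PRIV_S = 1, PRIV_M = 3 };

/* Return mstatus with the "previous privilege" field replaced. */
static inline uint64_t mstatus_with_mpp(uint64_t mstatus, enum priv_mode mode)
{
    mstatus &= ~MSTATUS_MPP_MASK;
    return mstatus | ((uint64_t)mode << MSTATUS_MPP_SHIFT);
}

#if defined(__riscv) && (__riscv_xlen == 64)
/* Hypothetical helper: "return" from M mode into S mode at entry. */
static inline void enter_supervisor(void (*entry)(void))
{
    uint64_t ms;

    __asm__ volatile ("csrr %0, mstatus" : "=r"(ms));
    ms = mstatus_with_mpp(ms, PRIV_S);
    __asm__ volatile ("csrw mstatus, %0" :: "r"(ms));
    __asm__ volatile ("csrw mepc, %0" :: "r"((uint64_t)(uintptr_t)entry));
    __asm__ volatile ("mret");   /* pc = mepc, privilege = MPP */
    __builtin_unreachable();
}
#endif
```

Setting MPP to S on an otherwise zero mstatus gives 0x800.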

At this point, execution is now taking place in supervisor mode as opposed to machine mode. As memory was completely unrestricted in machine mode, the lower privilege modes aren’t really that restricted, but for the purposes of demonstration they do exhibit the traits that matter. Supervisor mode, however, isn’t the end of the story.

Supervisor mode, like machine mode, has a number of registers controlling similar aspects of the environment’s state. In fact, the supervisor registers are simply the machine registers with certain bits masked off and/or made irrelevant. Given that, I can now disable interrupts in supervisor mode by setting the sip and sie registers to zero. These are the analogues to the mip and mie registers in M mode above. Given that these were cleared in M mode already, I don’t know if the kernel entry in this context would need that too, but if the kernel entry was expected to be entered from an unknown M mode executive, then creating a controlled environment would ensure consistent operation.

In any case, the configuration of memory was already performed in M mode and is applicable to all modes. Still, just for consistency, I issue an sfence.vma just to be absolutely sure the CPU’s view of cache and memory and such is all coherent. As an aside, I haven’t touched the second core or the 32-bit RISC-V that’s buried in the JH7100, so all of this is in a single-threaded context, essentially single CPU. Much of this would probably be done differently with a RESET vector immediately starting multiple cores. At the very least, any core where mhartid != 0 would probably just want to spin until it can be put in a sane, cooperative state with the 0 hart.

Luckily S mode upfront setup is a lot less since I’m not even thinking about memory protection and management yet, so I don’t have to do anything with the MMU. That said, it’s at this stage most of that business would be done, setting up page tables, kicking off paging processes, and assigning page fault handlers. The most I’ve studied here thus far is that page faults just require setting further bits in the medeleg so the supervisor trap vector will catch them, then you can do your MMU operations. Should be typical MMU with a TLB and all that jazz. Catch faults, find pages, kick out old ones, shoot down other cores and fence when necessary.

So supervisor is configured the way needed for this flat mapped test, so it is time to descend to user mode. This process is virtually identical to the process used to drop from M mode to S mode. All that is needed is to set the previous mode in the SPP field of the sstatus register to 0, U mode. Next, the sepc is set to the entry of the user routine. Finally, an sret is issued, checking the SPP field, seeing user mode, and navigating the sepc pointer to the entrypoint.
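The S-to-U drop differs from the M-to-S one mainly in that SPP is a single bit (bit 8 of sstatus), since the only modes below S are S itself and U. A sketch, with made-up names and the CSR part guarded as before:

```c
#include <stdint.h>

#define SSTATUS_SPP (1ull << 8)   /* 0 = came from U mode, 1 = came from S mode */

/* Return sstatus with the previous mode set to U. */
static inline uint64_t sstatus_for_user(uint64_t sstatus)
{
    return sstatus & ~SSTATUS_SPP;
}

#if defined(__riscv) && (__riscv_xlen == 64)
/* Hypothetical helper: "return" from S mode into U mode at entry. */
static inline void enter_user(void (*entry)(void))
{
    uint64_t ss;

    __asm__ volatile ("csrr %0, sstatus" : "=r"(ss));
    ss = sstatus_for_user(ss);
    __asm__ volatile ("csrw sstatus, %0" :: "r"(ss));
    __asm__ volatile ("csrw sepc, %0" :: "r"((uint64_t)(uintptr_t)entry));
    __asm__ volatile ("sret");    /* pc = sepc, privilege = SPP */
    __builtin_unreachable();
}
#endif
```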

At this point, the CPU is now operating in user mode. No memory protection is enabled, but user mode can be verified by attempting to interact with a privileged register like sstatus or mepc.

However, user mode is hardly where a computer is going to stay. User mode needs to eventually be able to get back into the supervisor somehow. Well, up above when setting medeleg to pass on ecall exceptions to the supervisor and providing an stvec value, I inform the CPU precisely what to do when encountering ecalls. As such, the ecall instruction can then be used to branch to the stvec pointer and raise the CPU privilege level to S. Within the body of this function, the CPU is once again in the supervisor privilege level and can interact with the system in ways the user process cannot. While I haven’t done this in my own testing yet, a further ecall could then be configured to branch back into machine mode. However, this would require setting the mtvec register to the location of the handler for this, like the stvec for entry into the supervisor mode. This can be used to elevate to even higher capabilities from the supervisor kernel. While this can probably be used for some sort of hypervisor functionality, there is also the fact that the RISC-V specification defines a hypervisor mode, so machine mode may not be perfectly suited to this.

So finally when all is said and done, the stvec routine completes the syscall operation and needs to return to user mode. You’ll recall that the process for descending privileges is what I’ve already used to enter those privilege modes for the first time. In this case, the sepc register has been set in the ecall process to the address of the ecall itself. This can be a bit confusing, because this means that completing the vector and issuing an sret will actually result in an infinite loop! The reason is that sepc gets set to the offset of the ecall itself, not the offset immediately following, so the sret would just go back to the ecall, back to the vector, back to the ecall, etc. The reason the sepc is not automatically incremented to the next address is due to the nature of exception handling. The ecall instruction is treated like an exception but it is…well…rather exceptional for one. Most exceptions indicate a legitimate problem with something the CPU just tried to do. In other words, an exceptional circumstance the CPU cannot navigate itself. These include illegal instructions, page faults, privilege violations, and, on many other architectures, divide by zero (RISC-V itself does not trap on that one). Well, an ecall is probably the only exception where it is always and forever acceptable to not “handle” a problem but just move on. With all other exceptional circumstances, the usual approach is to rectify the problem that led to the exception, then return to the instruction that caused it and retry it. This is how page faults work: execution is redirected to the trap vector, the page is pulled into memory and inserted in the TLB, and then control returns to the instruction that triggered the page fault because now it should be able to access the data.

So all of that in mind, a caveat on returning from syscalls is that the sepc (and mepc for machine interrupts) must be incremented by one machine word. For the JH7100, a 64-bit processor, this is 8 bytes. When in doubt, sizeof (void *) should return the size of a machine word on a given platform, so if operating in C, the *epc value is just the current value + sizeof (void *). By adding this one machine word, the sepc is incremented to the following instruction after the ecall, and execution proceeds. Seeing as there is no c.ecall instruction, and this is a RISC processor, it should be a safe assumption that an ecall trap into the supervisor is always going to return one machine word after. Be careful applying this same sort of reasoning on other CPUs, as this is making the assumption that the calling opcode fits in precisely one word, a standard in RISC CPUs, but not so in CISC. That said, CISC instruction sets with privilege levels are likely to have different mechanisms from what a RISC system uses.

So with that adjusted sepc, an sret then returns to execution in user mode. That is really all that is needed to implement an operating system. Even memory protection and virtual memory aren’t crucial, although they make the life of a kernel much, much easier by taking the brunt of complex memory handling. Using the code in my repository up there as a library, this should result in the process described herein:

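/* Runs in U mode: trap into the supervisor, then carry on after the ecall. */
void user(void)
{
    rv_syscall();
    puts("full circle");
}

/* Supervisor trap handler: finish the syscall and return to user mode. */
void kernelvec(void)
{
    rv_syscall_ret();
}

/* Runs in S mode after the mret from main(). */
void kernel(void)
{
    rv_sint_disable();      /* clear sie and sip */
    rv_smem_coherent();     /* sfence.vma */
    rv_user_enter(user);    /* SPP = U, sepc = user, sret */
}

/* Entered in M mode from the debug console. */
void main(void)
{
    rv_mint_disable();                    /* clear mie and mip */
    rv_mem_unprotected();                 /* one rwx PMP range over everything */
    rv_mem_unmapped();                    /* satp = 0 */
    rv_mmem_coherent();                   /* fence.i */
    rv_supervisor_delegate(_kernelvec);   /* medeleg ecall bit and stvec */
    rv_supervisor_enter(kernel);          /* MPP = S, mepc = kernel, mret */
}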

I haven’t pushed this to my repo as I keep the main file there clean for scratch pad stuff, just using it as I assemble a library around it. In any case, implementing a switch on the syscall number in kernelvec up there would be the only remaining hurdle to creating a syscall list that user code could then call. The usual RISC-V convention, the one Linux follows, is to pass the syscall number in a7 and the arguments in a0 through a5, with the result coming back in a0. However, this is just the suggested ABI; in reality anything is possible. The only caveat is any reputable projects out there are going to use the standard ABI, as will compilers, so crafting a custom ABI outside of a strictly-asm project of one’s own design is an exercise in futility.
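A minimal sketch of such a switch, assuming the Linux-style convention of the syscall number in a7 and arguments in a0 onward. It also assumes a trap entry stub has already saved the a registers into a frame; the frame layout, the syscall numbers, and the error value are all invented for illustration.

```c
#include <stdint.h>

/* Hypothetical saved-register frame: a[0]..a[7] hold a0..a7 at the ecall. */
struct trap_frame {
    uint64_t a[8];
};

/* Made-up syscall numbers for the example. */
enum {
    SYS_NOP = 0,
    SYS_ADD = 1
};

#define SYS_ERR_NOSYS ((uint64_t)-1)   /* invented "no such syscall" result */

/* Dispatch on a7 and leave the result in a0 for the trap stub to restore. */
void syscall_dispatch(struct trap_frame *tf)
{
    switch (tf->a[7]) {
    case SYS_NOP:
        tf->a[0] = 0;
        break;
    case SYS_ADD:
        tf->a[0] = tf->a[0] + tf->a[1];
        break;
    default:
        tf->a[0] = SYS_ERR_NOSYS;
        break;
    }
}
```

Working on a saved frame rather than live registers keeps the handler ordinary C, since the compiler is free to use the a registers itself once the function body starts.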

One correction worth noting: the mepc and sepc must be incremented by 4, not 8, when returning from an ecall. I was mistaken under the assumption that riscv64 opcodes are 64-bit; they’re still 32-bit. Discovered this while building a clock interrupt, if that hints at the next topic I’ll be posting a lengthy exposition of :wink:
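In code, the corrected adjustment is a fixed 4-byte step. This is only a sketch with invented names; the CSR access is guarded so the arithmetic can be checked anywhere.

```c
#include <stdint.h>

#define ECALL_INSN_BYTES 4u   /* ecall is always a 32-bit instruction */

/* Address to resume at after servicing an ecall that trapped at epc. */
static inline uint64_t epc_after_ecall(uint64_t epc)
{
    return epc + ECALL_INSN_BYTES;
}

#if defined(__riscv) && (__riscv_xlen == 64)
/* Hypothetical helper: step sepc past the ecall before issuing sret. */
static inline void sepc_skip_ecall(void)
{
    uint64_t epc;

    __asm__ volatile ("csrr %0, sepc" : "=r"(epc));
    epc = epc_after_ecall(epc);
    __asm__ volatile ("csrw sepc, %0" :: "r"(epc));
}
#endif
```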


There’s some nuance here.

*syscall is only ever a 32-bit opcode. There is no *syscall.c, even on a device that supports compressed opcodes.

An operating system (or a hypervisor, or a handler for emulating opcodes like unaligned accesses, floating point instructions if you have no FPU, or pretending to have new extensions/opcodes that aren’t there in hardware) has to look at mcause to determine WHY you’re in an exception and be prepared to handle interrupts vs. exceptions. For interrupts, you write the EPC back that you got and carry on. For exceptions, it depends on the exception. If you’re emulating an opcode, you have to disassemble the opcode that got you here, mock up the registers, look at the compressed flag and restart at the next opcode, which may be 2 or 4 bytes further on. If you’re doctoring up an alignment or fault, you similarly have to decode the opcode so you can advance an appropriate amount after you’ve done your magic. An environment call is one of the rare times you DON’T have to disassemble it and can advance a constant amount.
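A sketch of that decision, assuming only the standard 16-bit and 32-bit encodings are in play: an instruction whose two lowest bits are both 1 is 32 bits long, anything else is a compressed 16-bit one. The names and the retry/skip split are invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* What the handler decided to do with the trapping instruction. */
enum trap_action {
    TRAP_RETRY,   /* e.g. page fault fixed up: run the instruction again */
    TRAP_SKIP     /* e.g. emulated opcode or ecall: move past it */
};

/* Length in bytes, judged from the low 16 bits of the instruction. */
static inline unsigned insn_length(uint16_t low_halfword)
{
    return ((low_halfword & 0x3u) == 0x3u) ? 4u : 2u;
}

/* Address to write back to the EPC register before returning. */
static inline uint64_t resume_pc(bool is_interrupt, enum trap_action action,
                                 uint64_t epc, uint16_t low_halfword)
{
    /* Interrupts and retried instructions resume exactly where they trapped. */
    if (is_interrupt || action == TRAP_RETRY)
        return epc;

    return epc + insn_length(low_halfword);
}
```

The low halfword of ecall is 0x0073, so insn_length returns 4 for it without any further decoding, which is why the constant step is safe in that one case.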

For security and stability, you want to be sure to keep your exceptions (ecall, page fault, interrupts) carefully separated and not carelessly “forget” or rerun the opcode following your return. Those can be beasts to track down, especially if you don’t have functional JTAG. RISC-V, like MIPS before it, makes it easy to think exceptions, interrupts, and software interrupts are the same, but they’re subtly different and all too easy to get almost right.

Love the writing, BTW. Thank you.
