Towards OS9 operating system on P2?

Christof Eb. · 2022-08-27 11:05

Hi forum members,
P2 has 512k memory and can do multitasking with its 8 cogs, so I am thinking back to the times when this was plenty of memory.
A realtime operating system that works with 512k, has plenty of software available, and can do multitasking would be OS9, written for the 6809 CPU and used on Tandy Coco3 computers, which led to a rather large amount of software. (I had been thinking of CPM/68k, but there is not much software for that platform and it cannot do multitasking. Os9/68k is not available.)
Access to the 512k is done via an MMU (TCC1014) in the Coco3.

So the idea is to emulate at least the 6809, the MMU, the main timer, and 2 ACIA 6522 (for which there are drivers in OS9/Nitros9). In an additional step the video system, which is designed for multiple windows, could be emulated. I think it should be possible to boot directly into OS9 from an SD card, without a Coco3 ROM. As far as I understand, there is no BIOS. Hardware is handled by device drivers.

Starting points are:
A PC emulator, written in C, part of: https://github.com/VCCE/VCC/releases
The "Ease of Use" project of Nitros9, which is a free version of OS9 and comes with a predefined harddisk image. lcurtisboyle.com/nitros9/nitros9.html
Plenty of information: https://colorcomputerarchive.com/repo/
https://lomont.org/software/misc/coco/Lomont_CoCoHardware.pdf

Current status is that I was able to extract the 6809 and MMU emulators from Vcc and compile them with FlexProp. A tiny mod in the emulator to split the case construct into 4 parts makes FlexProp generate a jump table. (Big ups for FlexProp!) Hardware port registers are in the range >$ff00, where a routine or a second cog could grab the values from the ACIA for serial transmit. Interrupts are polled before each operation of the 6809. This core of an emulator can load and execute a minimal 3 instruction loop.

Unfortunately a NOP instruction, which would be 2 cycles on a real 6809, needs 714 P2 cycles without MMU and 938 P2 cycles with MMU. So a KISS board @200MHz is only about 1/4 the speed of the original (1.8MHz) with this code. I have been staring at the source code and the listings, but there is nothing obvious to me that would make NOP faster. (The condition flags seem to be in an array, so there is perhaps some speed gain possible, but NOP does not touch these of course.)
INCB needs 1083 P2 cycles with MMU instead of 2. That is only about 150 more than NOP, so the time seems to be lost outside of the actual emulation of the instruction.
I think, the emulated machine should/must be at least as fast as the original to be fun or useful.
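
For reference, the arithmetic behind these speed figures can be written down as a small sketch. The function and parameter names are made up for illustration and are not part of Vcc; the only inputs are the P2 clock and the cycle counts quoted above.

```c
#include <stdint.h>

/* Effective clock (in kHz) of the emulated CPU.
 *   p2_hz      : P2 system clock in Hz (200 MHz on the board used here)
 *   emu_cycles : 6809 cycles the code would take on real hardware
 *   p2_cycles  : P2 clocks the emulator spent on that same code
 * All names are illustrative assumptions, not Vcc identifiers. */
static uint32_t effective_khz(uint32_t p2_hz, uint32_t emu_cycles, uint32_t p2_cycles)
{
    if (p2_cycles == 0)
        return 0;                       /* avoid division by zero */
    /* 64-bit intermediate so p2_hz * emu_cycles cannot overflow */
    return (uint32_t)(((uint64_t)p2_hz * emu_cycles) / ((uint64_t)p2_cycles * 1000u));
}
```

With 200 MHz, a 2-cycle NOP and 938 P2 cycles this gives about 426 kHz, which is roughly a quarter of 1.8 MHz.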

Any thoughts or comments?
Christof

Wuerfel_21 · 2022-08-27 11:45

If you post something that flexspin can compile as-is, I could look at where bottlenecks are...

But the actual solution to that sort of thing is to write it from scratch in assembly, making use of EXECF and hand-tuned assembly. Check macca's 6502 core (or my SPCcog Sony SPC700 core) for similar 8-bit CPU implementations. I just calculated it: SPCcog's SPC700 takes up to 66 cycles to run a NOP from RAM (whose implementation is a single line _ret_ sub sp_cycles,#1), a lot of which is spent in the memory handler reading the opcode. The SPC700 NOP is also 2-cycle, so that's the equivalent of a 6MHz processor on a 200MHz P2 (which turns out to be enough leeway to run it at 1MHz alongside the S-DSP emulation in the same cog).

Christof Eb. · 2022-08-27 15:23

@Wuerfel_21 said:
If you post something that flexspin can compile as-is, I could look at where bottlenecks are...

But the actual solution to that sort of thing is to write it from scratch in assembly, making use of EXECF and hand-tuned assembly. Check macca's 6502 core (or my SPCcog Sony SPC700 core) for similar 8-bit CPU implementations. I just calculated it: SPCcog's SPC700 takes up to 66 cycles to run a NOP from RAM (whose implementation is a single line _ret_ sub sp_cycles,#1), a lot of which is spent in the memory handler reading the opcode. The SPC700 NOP is also 2-cycle, so that's the equivalent of a 6MHz processor on a 200MHz P2 (which turns out to be enough leeway to run it at 1MHz alongside the S-DSP emulation in the same cog).

Well, I already have had a look at Macca's emulators. I would need HUGE time to do something like this....
Here is my attempt with Vcc so far, if you would like to have a look at it. Main code is P2Vcc.c
Perhaps there is some way to split MC6809Exec() into tasks for more than one cog?
Is there a way to define the variables in a way, that they can be accessed via inline assembler with their names? I think, that the MMU access Memwrite8() and MemRead8() could be done via inline assembler perhaps? (I have yet to learn to use inline assembler with FlexC...)

ke4pjw · 2022-08-27 15:44

I just woke up and this still feels like a dream. Do you have the SAM or the MMU part of the GIME working? Either way WOW! WOW! WOW!

I am down for this. Two years ago now, I started working on adding a P2 to a real CoCo3. Looking to augment the CoCo SDC (not replace) with an I/O board that would add a 2nd monitor, SID chip, 6551 ACIAs, PC keyboard, and Hi-Res joystick adapter via a PC mouse, and have an expansion to add click boards.

Being a fanboy of OS9/NitrOS9 and the CoCo, I want to help.

Wuerfel_21 · 2022-08-27 15:57

@"Christof Eb." said:
Here is my attempt with Vcc so far, if you would like to have a look at it. Main code is P2Vcc.c

Main issue is that all the registers are global variables (-> hub ram). That generates a lot of add/access/sub sequences. Worse the hub layout ends up so that they are very far into the data section, so it will actually do augs/add/access/augs/sub every time. AFAIK there isn't a way to put a global into cog ram... @ersmith ?

Well, I already have had a look at Macca's emulators. I would need HUGE time to do something like this....

Took me like, a long weekend for both the Z80 and SPC700. It's not that bad.

Christof Eb. · 2022-08-27 16:36

Added a link in the first post.

Christof Eb. · 2022-08-27 17:29

Thanks, Ada, for your fast response!
(I am not a professional programmer, so all of this is something like a foreign language....)
What I don't understand, is where the time for NOP gets lost apart from MMU.
If I start MC6809Exec(140), then it is 43033 P2 cycles for 10 loops (140 cycles), which would be 0.65MHz with MMU.
Just calling MC6809Exec() seems to need 300 cycles??

Hi Terry,
well, this is not my code as you can see. I only use vcc with some tiny modification. And perhaps this has to be restarted from scratch using assembler, because it might be at a dead end using C. Though I still wonder, if the code could be split up to use more than 1 cog and using some inline assembler....

Wuerfel_21 · 2022-08-27 17:55

@"Christof Eb." said:
Thanks, Ada, for your fast response!
(I am not a professional programmer, so all of this is something like a foreign language....)
What I don't understand, is where the time for NOP gets lost apart from MMU.
If I start MC6809Exec(140), then it is 43033 P2cycles for 10 loops (140 cycles) that would be 0,65MHz with MMU.
Just calling MC6809Exec() seems to need 300 cycles??

I just told you - all the variables are in hub, thus something as simple as incrementing the PC register will compile into a 7-instruction sequence like

add     ptr__dat__, ##12345
rdword  temp,ptr__dat__
add     temp,#1
wrword  temp,ptr__dat__
sub     ptr__dat__, ##12345

instead of smth like

add     pc,#1
zerox   pc,#15

Then there's overhead from pushing/popping stack frames when calling non-leaf functions.
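
In C, one way to get closer to the short form is to copy the registers into locals, work on those, and write them back once. A minimal sketch, assuming the compiler keeps plain locals in cog registers (the struct and function names here are hypothetical stand-ins, not the Vcc originals):

```c
#include <stdint.h>

/* Stand-in for the emulator's global register file (lives in hub RAM). */
typedef struct {
    uint16_t pc;
    uint8_t  b;
} cpu_regs_t;

cpu_regs_t g_regs;            /* hypothetical global, like the Vcc originals */

/* Runs `count` dummy "instructions". The registers are copied into locals
 * once, updated there, and written back once at the end, so the compiler
 * is free to keep them in registers inside the loop. */
void run_with_local_copies(uint32_t count)
{
    uint16_t pc = g_regs.pc;
    uint8_t  b  = g_regs.b;

    while (count--) {
        pc = (uint16_t)(pc + 1);   /* fetch: advance PC, wraps at 16 bits */
        b  = (uint8_t)(b + 1);     /* stand-in for an INCB */
    }

    g_regs.pc = pc;                /* single write-back to hub */
    g_regs.b  = b;
}
```

Whether this actually helps should be checked in the generated ASM listing.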

You should move the memory array such that mc6809P2.c is included before the MMU include and the memory array:

#include <stdio.h>
#include <propeller.h>
#include "mc6809P2.c"
//#include "directMem.c"
#include "tcc1014mmu.c"

char memo[2*65536];

Note how that immediately makes the binary ~7K smaller. You could make similar reorderings inside the MMU file, too - move the page array after all the scalars.

Unrelatedly, there's some inefficiency in the MMU code that flexspin's CSE isn't good enough to deal with. Note the repeated use of the MmuRegisters[MmuState][address>>13] subexpression.

unsigned char MemRead8( unsigned short address)
{
    if (address<0xFE00)
    {
        if (MemPageOffsets[MmuRegisters[MmuState][address>>13]]==1)
            return(MemPages[MmuRegisters[MmuState][address>>13]][address & 0x1FFF]);
        //return( PackMem8Read( MemPageOffsets[MmuRegisters[MmuState][address>>13]] + (address & 0x1FFF) ));
    }
    //if (address>0xFEFF)
        //return (port_read(address));
    if (RamVectors) //Address must be $FE00 - $FEFF
        return(memory[(0x2000*VectorMask[CurrentRamConfig])|(address & 0x1FFF)]); 
    if (MemPageOffsets[MmuRegisters[MmuState][address>>13]]==1)
        return(MemPages[MmuRegisters[MmuState][address>>13]][address & 0x1FFF]);
    //return( PackMem8Read( MemPageOffsets[MmuRegisters[MmuState][address>>13]] + (address & 0x1FFF) ));
}
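
A sketch of the same lookup with the repeated subexpression hoisted into a local by hand. This is a simplified stand-in that only covers the plain RAM path (no vector page, no ports, no MemPageOffsets check); the table sizes are shrunk and the function name is invented for the example.

```c
#include <stdint.h>

#define PAGE_SIZE 0x2000u

static uint8_t  ram[4][PAGE_SIZE];      /* tiny stand-in for emulated RAM */
static uint8_t *MemPages[4] = { ram[0], ram[1], ram[2], ram[3] };
static uint16_t MmuRegisters[4][8];     /* [task][8K slot] -> page number */
static uint8_t  MmuState;

uint8_t MemRead8_hoisted(uint16_t address)
{
    /* Look the page number up once and reuse it, instead of repeating
     * MmuRegisters[MmuState][address>>13] in every expression. */
    uint16_t page = MmuRegisters[MmuState][address >> 13] & 3u; /* 4 pages in this sketch */

    return MemPages[page][address & 0x1FFF];
}
```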

Christof Eb. · 2022-08-27 18:53

Ok, so I could try to start the 6809 with local variables in separate cog. Apart from reset, only for debugging the variables have to be global, I think.
(A riddle: why the order of includes changes the results. Always learning....)
Thanks!

Wuerfel_21 · 2022-08-27 18:57

@"Christof Eb." said:
(A riddle: why the order of includes changes the results. Always learning....)

What is defined first goes into the data section first. To access the data section requires an add/sub sequence. Data that is not in the first 511 bytes of its module's data section requires AUGS instructions to express the offset.
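
As a rough analogy only (flexspin's data section is not literally a C struct), the effect of definition order on offsets can be shown with offsetof. The struct and helper names are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Big array first: every scalar behind it ends up far beyond offset 511. */
struct array_first {
    char     memory[2 * 65536];
    uint16_t pc;
    uint8_t  cc;
};

/* Scalars first: they stay within the first 511 bytes. */
struct scalars_first {
    uint16_t pc;
    uint8_t  cc;
    char     memory[2 * 65536];
};

/* 1 if an offset fits a 9-bit immediate (0..511), i.e. no AUGS needed. */
static int fits_short_offset(size_t offset)
{
    return offset <= 511;
}
```

Here offsetof(struct scalars_first, pc) fits the short form, while offsetof(struct array_first, pc) does not.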

Christof Eb. · 2022-08-27 19:08

@Wuerfel_21 said:

@"Christof Eb." said:
(A riddle: why the order of includes changes the results. Always learning....)

What is defined first goes into the data section first. To access the data section requires an add/sub sequence. Data that is not in the first 511 bytes of its module's data section requires AUGS instructions to express the offset.

Aha, got it. Thanks!

Christof Eb. · 2022-08-27 19:13

Would a local pointer to a global variable make sense?

Wuerfel_21 · 2022-08-27 19:18

@"Christof Eb." said:
Would a local pointer to a global variable make sense?

Sometimes yes. Examine ASM output before/after.

TonyB_ · 2022-08-27 20:35

@Wuerfel_21 said:

@"Christof Eb." said:
Thanks, Ada, for your fast response!
(I am not a professional programmer, so all of this is something like a foreign language....)
What I don't understand, is where the time for NOP gets lost apart from MMU.
If I start MC6809Exec(140), then it is 43033 P2cycles for 10 loops (140 cycles) that would be 0,65MHz with MMU.
Just calling MC6809Exec() seems to need 300 cycles??

I just told you - all the variables are in hub, thus something as simple as incrementing the PC register will compile into a 7-instruction sequence like
add     ptr__dat__, ##12345
rdword  temp,ptr__dat__
add     temp,#1
wrword  temp,ptr__dat__
sub     ptr__dat__, ##12345
instead of smth like
add     pc,#1
zerox   pc,#15

I know nothing about the 6809, but a P2 Z80 or 8086 emulator doesn't need a PC or IP variable at all with XBYTE. Use GETPTR instead and only when required.

EDIT:
Added "with XBYTE"

Wuerfel_21 · 2022-08-27 21:04

@TonyB_ said:
I know nothing about the 6809, but a P2 Z80 or 8086 emulator doesn't need a PC or IP variable at all. Use GETPTR instead and only when required.

None of the CPU emulators for P2 use the FIFO. Doesn't work with SMC or non-linear memory and prevents hubexec. We've been over this.

TonyB_ · 2022-08-27 21:20

@Wuerfel_21 said:

@TonyB_ said:
I know nothing about the 6809, but a P2 Z80 or 8086 emulator doesn't need a PC or IP variable at all. Use GETPTR instead and only when required.

None of the CPU emulators for P2 use the FIFO. Doesn't work with SMC or non-linear memory and prevents hubexec. We've been over this.

My 8086 and some versions of my Z80 emulators use the FIFO as part of XBYTE, which I should have mentioned (previous post corrected).

cgracey · 2022-08-27 21:39

@Wuerfel_21 said:

@TonyB_ said:
I know nothing about the 6809, but a P2 Z80 or 8086 emulator doesn't need a PC or IP variable at all. Use GETPTR instead and only when required.

None of the CPU emulators for P2 use the FIFO. Doesn't work with SMC or non-linear memory and prevents hubexec. We've been over this.

The 8080 emulator that Baggers and I worked on uses XBYTE/FIFO. The emulator code wound up being very compact and fit into the LUT. The XBYTE table was 256 longs and the emulator code was 160 longs. To handle the video for complete console emulation, we would read in full lines of pixels using SETQ+RDLONG, which didn't interfere with the FIFO.

Here is the 8080 emulator code:

Wuerfel_21 · 2022-08-27 22:01

@cgracey said:

@Wuerfel_21 said:

@TonyB_ said:
I know nothing about the 6809, but a P2 Z80 or 8086 emulator doesn't need a PC or IP variable at all. Use GETPTR instead and only when required.

None of the CPU emulators for P2 use the FIFO. Doesn't work with SMC or non-linear memory and prevents hubexec. We've been over this.

The 8080 emulator that Baggers and I worked on uses XBYTE/FIFO. The emulator code wound up being very compact and fit into the LUT. The XBYTE table was 256 longs and the emulator code was 160 longs. To handle the video for complete console emulation, we would read in full lines of pixels using SETQ+RDLONG, which didn't interfere with the FIFO.

Here is the 8080 emulator code:

Well, I stand slightly corrected. Only checking IRQs on branch is pretty clever as a HLE-style optimization.
But point was that once you add any sort of complex memory handling (write-sensitive I/O, memory banking, etc) the code starts to overflow the cog RAM and you can't use the FIFO anymore (without obnoxiously saving/restoring its state). (My Z80 emulation barely fits as is...). And the aforementioned self-modification / bank crossing issues.

Then again, maybe I'm just a stickler.

@TonyB_ said:
My 8086 and some versions of my Z80 emulators use the FIFO as part of XBYTE, which I should have mentioned (previous post corrected).

Isn't your x86 vaporware?

Christof Eb. · 2022-08-28 10:50

OK, so now mc6809exec() is running freely in its own cog, with the other order of includes and with local
register cpuregister pc,x,y,u,s,dp,d;
(and with a huge amount of things not working anymore, commented out.)
with MMU: 0.77MHz; without MMU: 4.6MHz (!!!!) I don't know if I should trust this.....
If this is real, then it is doable again! Ada, what shall I say!!!

( Will be not able to work on this for 2 weeks. )

Christof Eb. · 2022-09-15 07:18

Some update. I have renewed my efforts with this project. The emulator of VCC written by Joseph Forgione in C is usable I think with just minor changes. I really like, that it is so very well readable for me. And it is proven to work. :-) At the moment the MC6809-registers are still global variables in hub ram. I only split the switch-case and transferred temp8, temp16, temp32 to local and the CycleCounter (do we need it at all?) to the cog variable _PR0.
Without MMU my little assembler loop was running at 3.6MHz (P2@200MHz), which I think, could be good enough.
So now I am struggling with the MMU. Transferring the variable MmuState to _PR7 gave a performance of 1.36MHz. (Goal is >1.79) This MMU is a real bottleneck....
Regarding the MMU, we have 3 worlds here.
Coco-3 hardware has 128k or 512k fully usable ram. 2*8 MMU registers with 6 bits. OS9 Level II does not run with 128k.

PC VCC can have up to 8Meg fully usable ram. 4(!)*8 MMU registers with 16 bits. (!) It uses 2 additional tables *MemPages and MemPageOffsets.

P2 has 512k, but only ???k will be usable. At the moment, I think, it will have 4*8 registers with 8 bits. At the moment I hope, that I can get rid of these additional tables somehow.

Perhaps someone can answer some questions? Terry?
1. Will OS9 Level II run with 256k of Ram?
2. How can we tell OS9, that only xxx k of 512k is available?
3. There must be some table somewhere, that says, which pages are in use? Perhaps we can have a startup file that marks the pages, which cannot be used?
4. In what order will OS9 use the Ram at startup?
5. Am I right, when I think, that after the initial reading of a bootsector, no ROM is used anymore?

Thanks for any hints! Christof

Christof Eb. · 2022-09-15 07:57

@ke4pjw said:
I just woke up and this still feels like a dream. Do you have the SAM or the MMU part of the GIME working? Either way WOW! WOW! WOW!

I am down for this. Two years ago now, I started working on adding a P2 to a real CoCo3. Looking to augment the CoCo SDC (not replace) with an I/O board that would add a 2nd monitor, SID chip, 6551 ACIAs, PC keyboard, and Hi-Res joystick adapter via a PC mouse, and have an expansion to add click boards.

Being a fanboy of OS9/NitrOS9 and the CoCo, I want to help.

Hi Terry,
you offered some help here. Thank you! This project is just started and I am not sure, if I am capable of doing it and willing to invest all the time. So I don't feel able to propose a kind of sharing the work at this moment.
The plan is for now:
1. Get the MC6809-emulator working and fast enough - seems to be done.
2. Get the Mmu working and fast enough - active
3. Get the main timer working. I think that is needed for OS9.
4. Get P2Vcc to boot OS9 from an image file on SD card. This shall work with no ROM and no Video. Blink some diode. Or do something else, that can be detected.
5. Have a shell on a serial 6522 line using the original 6522 drivers.

Perhaps you could answer some questions in the post above or give some comments for steps 4 and 5?
Christof

Christof Eb. · 2022-09-15 16:17

1.81MHz! MMU registers in cog ram.

Christof Eb. · 2022-09-16 10:19

Oh, I could have thought of this earlier :
Caching the last MMU-page-offset and using it, if still relevant:
MC6809 running at >2.6MHz for my little loop.
And if everything is in the same page: 3MHz
(It is only capable of reading ram and ports though)
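
The idea, as a simplified sketch (read-only, tiny RAM, invented names; not the actual P2Vcc code): remember the last 8K slot and its base pointer, and only redo the MMU lookup when the slot changes or an MMU register is written.

```c
#include <stdint.h>

#define SLOT_SHIFT 13u                 /* 64K address space in 8K slots */
#define SLOT_MASK  0x1FFFu
#define NO_SLOT    0xFFFFFFFFu

static uint8_t  ram[8 * 0x2000];       /* small stand-in for emulated RAM */
static uint8_t  mmu_regs[8];           /* slot -> physical 8K page number */

static unsigned cached_slot = NO_SLOT; /* nothing cached yet */
static uint8_t *cached_base;
static unsigned slow_lookups;          /* counts full translations, for illustration */

/* Must be called whenever an MMU register is written. */
void mmu_write_reg(unsigned slot, uint8_t page)
{
    mmu_regs[slot & 7] = page & 7;     /* only 8 pages exist in this sketch */
    cached_slot = NO_SLOT;             /* invalidate the cached translation */
}

uint8_t mem_read8_cached(uint16_t address)
{
    unsigned slot = address >> SLOT_SHIFT;

    if (slot != cached_slot) {         /* miss: do the full lookup once */
        cached_base = &ram[(unsigned)mmu_regs[slot] << SLOT_SHIFT];
        cached_slot = slot;
        slow_lookups++;
    }
    return cached_base[address & SLOT_MASK];
}
```

Consecutive reads from the same 8K slot then skip the table lookup, which matches the observation that code and data in the same page run fastest.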

Christof Eb. · 2022-09-17 06:51

Would it be worth, switching over from MC6809 to HC6309?
In Vcc there is the source for this, and the modifications seem to be doable in about 2 days or even less? Of course it will only be more powerful, if there is code or tools for it.
Tempting. But I should now tackle the main timer with interrupts....
Are there tools for HC6309 in OS9?

hinv · 2022-09-18 09:47

According to Wikipedia, NitrOS-9 takes advantage of the features of the 6309 if available. Otherwise I think the 6309 can run in compatible mode.
https://en.wikipedia.org/wiki/Hitachi_6309 says that one of the advantages is 32-bit math, which could possibly cost less on the Propeller than 8-bit math because it is closer to the native instructions.

ke4pjw · 2022-09-19 01:53

@"Christof Eb." said:
Hi Terry,
you offered some help here. Thank you! This project is just started and I am not sure, if I am capable of doing it and willing to invest all the time. So I don't feel able to propose a kind of sharing the work at this moment.
The plan is for now:
1. Get the MC6809-emulator working and fast enough - seems to be done.
2. Get the Mmu working and fast enough - active
3. Get the main timer working. I think that is needed for OS9.
4. Get P2Vcc to boot OS9 from an image file on SD card. This shall work with no ROM and no Video. Blink some diode. Or do something else, that can be detected.
5. Have a shell on a serial 6522 line using the original 6522 drivers.

Perhaps you could answer some questions in the post above or give some comments for steps 3 and 4?
Christof

Hey Christof, I have been on vacation and haven't had an opportunity to respond.

So let me begin by saying I know nothing of the mechanics of emulators. I can offer up what I think should be the basic building blocks.

Yes, a 6809/6309 core up and running would be good. In native mode, the 6309 has several op codes that execute faster than the 6809 and also has the hardware multiply and divide. NitrOS9 and several of the newer games will take advantage of this. All of my real CoCo3's have 6309's.
You want more than MMU. You will want to fully emulate the GIME chip. It is both the MMU and graphics chip. It is based on the Motorola MC6847 (video) and SN74LS785 (MMU). Getting the 32x16 text graphics up would be the first step. You would be able to boot Color Basic, which would be a first step toward OS9. That is a lot less hardware to emulate. However, since the code is reentrant, you could just boot directly into OS9, but the problem is that all of the utils needed to do anything useful will require a disk or have to be loaded into a RAM disk.
To get OS9 to boot the "normal" way, you have to have all of the emulated drive hardware. Typically you boot to RSDOS (Disk Extended Color Basic) and type DOS. It reads the boot kernel from Track 34 of drive 0 into memory at $2600. There are $1200 bytes loaded into RAM. It then JMPs to $2602. The first two bytes are ASCII O and S.
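
A minimal sketch of what the DOS command does, as described above, if the disk is a raw image already in memory. It assumes the usual single-sided RS-DOS layout of 18 sectors of 256 bytes per track stored track after track; the function name and the raw-image format are assumptions for illustration, not CoCo SDC or Vcc code.

```c
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE       256u    /* assumed RS-DOS geometry */
#define SECTORS_PER_TRACK 18u
#define BOOT_TRACK        34u
#define BOOT_LOAD_ADDR    0x2600u
#define BOOT_SIZE         0x1200u /* 18 * 256 bytes, i.e. exactly one track */
#define BOOT_ENTRY        0x2602u

/* Copies the boot track from a raw disk image into emulated memory.
 * Returns the entry address, or 0 if the image is too small or the
 * "OS" signature is missing. */
uint16_t load_os9_boot_track(const uint8_t *image, size_t image_size, uint8_t *mem64k)
{
    size_t offset = (size_t)BOOT_TRACK * SECTORS_PER_TRACK * SECTOR_SIZE;

    if (image_size < offset + BOOT_SIZE)
        return 0;
    if (image[offset] != 'O' || image[offset + 1] != 'S')
        return 0;

    memcpy(&mem64k[BOOT_LOAD_ADDR], &image[offset], BOOT_SIZE);
    return BOOT_ENTRY;            /* the emulated PC would be set to this */
}
```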

We would probably want to boot one of Curtis Boyle's Ease of Use distros. They are like Ragù, "it's in there"

I would look at the CoCo SDC code for dealing with virtual disks. We would most likely want to emulate it exactly as it is. Curtis already has OS9 disks specifically for the CoCo SDC.
I have a friend that wrote TRXMon. Instead of just throwing TSMon out on the serial ports, it would actually prompt for user login, though I can't recall if it required his special OS9P3 module or not. Just getting a shell via serial would be cool!

Christof Eb. · 2022-09-20 06:22

@"Christof Eb." said:

Will OS9 Level II run with 256k of Ram?

How can we tell OS9, that only xxx k of 512k is available?

There must be some table somewhere, that says, which pages are in use? Perhaps we can have a startup file that marks the pages, which cannot be used?

In what order will OS9 use the Ram at startup?

So after digging into several sources, and having a closer look at what OS9 does, for the time being I have a very ugly workaround:
The most significant bit is ignored by MMU. OS9 takes code and data pages from the bottom and video buffers from the top. So as long as these don't collide....
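
As a sketch, and assuming "most significant bit" means bit 5 of the 6-bit MMU page number (so that 32 pages of 8K, i.e. 256K, remain), the workaround amounts to a mask in the address translation. The function name is invented for the example.

```c
#include <stdint.h>

#define PAGE_SHIFT     13u    /* 8K pages */
#define PHYS_PAGE_MASK 0x1Fu  /* drop bit 5: 32 pages * 8K = 256K */

/* Translates a 6-bit MMU page number plus a 13-bit offset into an index
 * into a 256K buffer. Pages 0x20..0x3F alias onto 0x00..0x1F, so pages
 * taken from the top and from the bottom must never overlap. */
uint32_t phys_index_256k(uint8_t mmu_page, uint16_t offset)
{
    uint32_t page = mmu_page & PHYS_PAGE_MASK;

    return (page << PAGE_SHIFT) | (offset & 0x1FFFu);
}
```

Page $3F (top of a 512K machine) and page $1F land on the same physical 8K block here, which is exactly the collision risk mentioned above.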

Edit: Minimum 152k for Os9 without graphics.

ke4pjw · 2022-09-21 04:57

OS9 Level 2 will boot in 128K. I believe it is determined at boot time. I do not recall having to do anything special when I upgraded from 128K to 512k on my CoCo. It will boot into the 32x16 text video mode.

Christof Eb. · 2022-09-23 10:01

So the last days have been rather deflating:
1. There are really a lot of information sources (books, manuals, ...) about the Coco-3, Os9 or Nitros-9, but if you want clear in-depth information about some internals, it becomes really difficult. For example, I have not found clear information on how the 60Hz timer works. (At the moment I believe it works through INT, not NMI, not FIRQ. And it seems to use the vertical blank interrupt, not HSYNC and not the hardware timer.)
2. I have not found any way to tell OS-9 that not 128k and not 512k but 256k are available. I assume, that you have to patch the kernel of OS-9 somehow. 256k should be sufficient but not comfortable.
3. It is more complex to get the harddisk access working, as I had guessed.
4. I had been happy to achieve 2.3MHz for the MC6809 by caching the last memory index of the MMU. But when I introduced the next layer, port access, I fell back to well below 1MHz. After 2 days I am back at 1.3MHz (data in a different page than code but not in ports, which is worse). As I have now used global locals for the flag variables with rather little effect, I don't have any good ideas left to regain the speed. FlexProp is of course the marvelous tool that enables everything. But it then does something and you are back to xyz. Vcc on my PC running an i7-8550U @ 3.5GHz says that it can do mc6809 @ 89MHz. The slow hub access gets in the way very much. Well, yes, one could rewrite everything in assembler to execute much of it in cog.....
5. I have learnt a lot, for example that not the operations but the calculation of flags are the time consuming parts.
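
To illustrate point 5 with INCB: the operation itself is one addition, the rest of the work is the N, Z and V flags (C is left alone by INC on the 6809). This is a standalone sketch with invented names, not the Vcc implementation.

```c
#include <stdint.h>

/* 6809 condition code register bits used here */
#define CC_C 0x01u
#define CC_V 0x02u
#define CC_Z 0x04u
#define CC_N 0x08u

/* Emulates INCB: returns the new B and updates N, Z, V in *cc. */
uint8_t incb(uint8_t b, uint8_t *cc)
{
    uint8_t result = (uint8_t)(b + 1);                 /* the operation itself */

    /* everything below is flag bookkeeping */
    uint8_t flags = *cc & (uint8_t)~(CC_N | CC_Z | CC_V);
    if (result & 0x80) flags |= CC_N;                  /* negative */
    if (result == 0)   flags |= CC_Z;                  /* zero */
    if (b == 0x7F)     flags |= CC_V;                  /* signed overflow */
    *cc = flags;

    return result;
}
```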
So likely I will abandon this project.

Thanks to all, who have given help here!!!!!!
Christof

evanh · 2022-09-23 10:08

Expect NMI never to be used. Any system that uses it is a broken design. It's a wasted pin.

ke4pjw · 2022-09-23 20:28

As I recall, the default clock module in OS-9 uses the h-sync to generate the ticks. The h-sync and v-sync timers are exposed in the GIME. I remember having to cobble together a new boot disk when I got my DISTO supercontroller with RTC, because the clock module needed to be updated to use the RTC. As I recall that module is the heart of the multitasking, and on each 60Hz "tic" it would scan for virtual IRQs in a table to service the various tasks. It has been 30 years since I last thought about any of that. If you emulate the h-sync and v-sync counters, I don't think it would be a problem.
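
A rough sketch of how an emulator could derive such a 60Hz tick from a free-running 32-bit cycle counter and poll it before each emulated instruction. The names are hypothetical and this does not model the GIME registers; the 200MHz clock is taken from earlier in the thread.

```c
#include <stdint.h>

#define P2_CLOCK_HZ 200000000u
#define TICK_HZ     60u
#define TICK_PERIOD (P2_CLOCK_HZ / TICK_HZ)   /* P2 clocks per 60Hz tick */

typedef struct {
    uint32_t next_tick;   /* counter value at which the next tick is due */
} tick_timer_t;

void tick_init(tick_timer_t *t, uint32_t now)
{
    t->next_tick = now + TICK_PERIOD;
}

/* Returns 1 when a tick is due (the caller would then raise the emulated
 * interrupt), else 0. The signed difference keeps the comparison correct
 * when the 32-bit counter wraps around. */
int tick_poll(tick_timer_t *t, uint32_t now)
{
    if ((int32_t)(now - t->next_tick) >= 0) {
        t->next_tick += TICK_PERIOD;
        return 1;
    }
    return 0;
}
```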

Yeah, I don't think the P2 could emulate things fast enough using spin. You can't use the streamer FIFO with interpreted Spin/Prop Tool, and that latency is gonna kill your performance.

If I ever get time, I need to learn the overall architecture of these emulators. Seems like a lot of fun.

Thanks for your efforts Christof!
