Towards OS9 operating system on P2?
Hi forum members,
P2 has 512k memory and can do multitasking with its 8 cogs, so I am thinking about times when this was plenty of memory.
A realtime operating system, that works with 512k, has plenty of software available, can do multitasking, would be OS9, written for 6809 CPU and used for Tandy Coco3 computers, which led to a rather large amount of software. (I had been thinking of CPM/68k, but there is not much software for this platform and it cannot do multitasking. Os9/68k is not available.)
Access to the 512k is done via the MMU in the TCC1014 chip in the Coco3.
So the idea is to emulate at least the 6809, the MMU, the main timer, and 2 ACIA 6522 (for which there are drivers in OS9/Nitros9). In an additional step the video system, which is designed for multiple windows, could be emulated. I think it should be possible to directly boot into OS9 on a SD card, without a Coco3 Rom. As far as I understand, there is no bios. Hardware is handled by device drivers.
Starting points are:
A PC emulator, written in C, part of: https://github.com/VCCE/VCC/releases
The "Ease of Use" project of Nitros9, which is a free version of OS9 and comes with a predefined harddisk image. lcurtisboyle.com/nitros9/nitros9.html
Plenty of information: https://colorcomputerarchive.com/repo/
https://lomont.org/software/misc/coco/Lomont_CoCoHardware.pdf
Current status is that I was able to extract the 6809 and MMU emulators from vcc and compile them with FlexProp. A tiny mod in the emulator to split the case construct into 4 parts makes FlexProp generate a jump table. (Big ups for FlexProp!) Hardware port registers are in the range >$ff00, where a routine or a second cog could grab the values from the ACIA for serial transmit. Interrupts are polled before each operation of the 6809. This core of an emulator can load and execute a minimal 3 instruction loop.
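A minimal sketch of how writes to the port range could be separated from ordinary RAM writes so that a second cog can pick them up. This is not the Vcc code: `bus_write8`, `port_latch`, `port_dirty` and the flat 64 KiB array (used here instead of the MMU, for brevity) are all hypothetical names and assumptions.

```c
#include <stdint.h>

uint8_t flat_mem[0x10000];          /* assumed: flat 64 KiB, no MMU, for brevity */
volatile uint8_t port_latch[256];   /* hypothetical mailbox a second cog could poll */
volatile uint8_t port_dirty[256];   /* set on write, to be cleared by the consumer */

/* Route one emulated 8-bit write either to the port mailbox or to RAM. */
void bus_write8(uint16_t address, uint8_t value)
{
    if (address >= 0xFF00u) {                  /* hardware port range */
        port_latch[address & 0xFFu] = value;   /* keep the last value written */
        port_dirty[address & 0xFFu] = 1;       /* signal: new byte available */
    } else {
        flat_mem[address] = value;             /* ordinary RAM write */
    }
}
```

The range check costs one compare on every write, which is why it matters where it sits relative to the MMU lookup.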
Unfortunately a NOP instruction, which would be 2 cycles, needs 714 cycles without MMU and 938 cycles with MMU. So a kiss board @200MHz is only about 1/4 the speed of the original (1.8MHz) with this code. I have been staring at the source code and the listings, but there is nothing obvious for me to make NOP faster. (The condition flags seem to be in an array, so there is perhaps some speed gain possible, but NOP does not touch these of course.)
INCB needs 1083 cycles with MMU instead of 2. 1083 is only about 150 more than NOP, so the time seems to be lost outside of the actual emulation of the instruction.
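The arithmetic behind these figures can be sketched as follows. The helper name `effective_khz` is hypothetical and only illustrates the calculation; it is not part of Vcc.

```c
#include <stdint.h>

/* Effective emulated clock in kHz: how many 6809 cycles per second the
 * emulator delivers, given the P2 clock and the measured number of P2
 * cycles that one emulated instruction takes. */
uint32_t effective_khz(uint32_t p2_hz, uint32_t p2_cycles_per_instr,
                       uint32_t m6809_cycles_per_instr)
{
    /* instructions per second times 6809 cycles per instruction */
    uint64_t hz = (uint64_t)p2_hz * m6809_cycles_per_instr / p2_cycles_per_instr;
    return (uint32_t)(hz / 1000u);   /* truncate to whole kHz */
}
```

With the numbers above, a 2-cycle NOP that takes 938 P2 cycles at 200MHz comes out at roughly 0.43MHz, which is about a quarter of the original's speed; without the MMU (714 cycles) it is roughly 0.56MHz.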
I think, the emulated machine should/must be at least as fast as the original to be fun or useful.
Any thoughts or comments?
Christof
Comments
If you post something that flexspin can compile as-is, I could look at where bottlenecks are...
But the actual solution to that sort of thing is to write it from scratch in assembly, making use of EXECF and hand-tuned assembly. Check macca's 6502 core or my SPCcog (Sony SPC700 core) for similar 8-bit CPU implementations. I just calculated it: SPCcog's SPC700 takes up to 66 cycles to run a NOP from RAM (whose implementation is a single line,
_ret_ sub sp_cycles,#1
), a lot of which is spent in the memory handler reading the opcode. The SPC700 NOP is also 2-cycle, so that's the equiv. of a 6MHz processor on a 200MHz P2 (which turns out is enough leeway to run it at 1MHz alongside the S-DSP emulation in the same cog).
Well, I already have had a look at Macca's emulators. I would need HUGE time to do something like this....
Here is my attempt with Vcc so far, if you would like to have a look at it. Main code is P2Vcc.c
Perhaps there is some way to split MC6809Exec() into tasks for more than one cog?
Is there a way to define the variables in a way, that they can be accessed via inline assembler with their names? I think, that the MMU access Memwrite8() and MemRead8() could be done via inline assembler perhaps? (I have yet to learn to use inline assembler with FlexC...)
I just woke up and this still feels like a dream. Do you have the SAM or the MMU part of the GIME working? Either way WOW! WOW! WOW!
I am down for this. Two years ago now, I started working on adding a P2 to a real CoCo3. Looking to augment the CoCo SDC (not replace) with an I/O board that would add a 2nd monitor, SID chip, 6551 ACIAs, PC Keyboard, and Hi-Res joystick adapter via a PC mouse, and have an expansion to add click boards.
Being a fanboy of OS9/NitrOS9 and the CoCo, I want to help.
Main issue is that all the registers are global variables (-> hub ram). That generates a lot of add/access/sub sequences. Worse the hub layout ends up so that they are very far into the data section, so it will actually do augs/add/access/augs/sub every time. AFAIK there isn't a way to put a global into cog ram... @ersmith ?
Took me like, a long weekend for both the Z80 and SPC700. It's not that bad.
Added a link in the first post.
Thanks, Ada, for your fast response!
(I am not a professional programmer, so all of this is something like a foreign language....)
What I don't understand, is where the time for NOP gets lost apart from MMU.
If I start MC6809Exec(140), then it is 43033 P2 cycles for 10 loops (140 cycles); that would be 0.65MHz with MMU.
Just calling MC6809Exec() seems to need 300 cycles??
Hi Terry,
well, this is not my code as you can see. I only use vcc with some tiny modification. And perhaps this has to be restarted from scratch using assembler, because it might be at a dead end using C. Though I still wonder, if the code could be split up to use more than 1 cog and using some inline assembler....
I just told you - all the variables are in hub, thus something as simple as incrementing the PC register will compile into a 7-instruction sequence like
instead of smth like
Then there's overhead from pushing/popping stack frames when calling non-leaf functions.
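A minimal sketch of the idea of working on local copies instead of globals in the hot loop. The names `g_pc`, `g_b` and `run_with_locals` are hypothetical and the loop body is a stand-in for real instruction emulation. Whether flexspin actually keeps the locals in cog registers should be checked in the ASM listing.

```c
#include <stdint.h>

/* Emulated CPU state kept as globals (these end up in hub RAM). */
uint16_t g_pc;   /* hypothetical name for the 6809 program counter */
uint8_t  g_b;    /* hypothetical name for accumulator B */

/* Run n INCB-like steps on local copies. Locals are candidates for cog
 * registers, so the loop avoids a hub read-modify-write per update. */
void run_with_locals(unsigned n)
{
    uint16_t pc = g_pc;   /* load the globals once */
    uint8_t  b  = g_b;

    while (n--) {
        pc++;             /* "fetch": step over a one-byte opcode */
        b++;              /* "execute": INCB without flag handling */
    }

    g_pc = pc;            /* write back once, e.g. for debugging or reset */
    g_b  = b;
}
```

The cost of this pattern is that anything outside the loop (a debugger, another cog) only sees the state at the write-back points.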
You should move the memory array such that mc6809P2.c is included before the MMU include and the memory array:
Note how that immediately makes the binary ~7K smaller. You could make similar reorderings inside the MMU file, too - move the page array after all the scalars.
Unrelatedly, there's some inefficiency in the MMU code that flexspin's CSE isn't good enough to deal with. Note the repeated use of the
MmuRegisters[MmuState][address>>13]
subexpression.
Ok, so I could try to start the 6809 with local variables in a separate cog. Apart from reset, the variables only have to be global for debugging, I think.
(A riddle: why the order of includes changes the results. Always learning....)
Thanks!
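Regarding the repeated `MmuRegisters[MmuState][address>>13]` subexpression mentioned above: a minimal sketch of hoisting it into a local by hand. Only the names `MmuRegisters` and `MmuState` come from the thread; the array sizes, `ram`, `page_flags` and `mem_read8_hoisted` are assumptions for illustration and do not reproduce the Vcc source.

```c
#include <stdint.h>

uint8_t  ram[64u * 8192u];        /* assumed: 512 KiB of emulated RAM */
uint8_t  page_flags[64];          /* hypothetical: nonzero = page not backed by RAM */
uint16_t MmuRegisters[4][8];      /* name from the thread; dimensions assumed */
uint8_t  MmuState;                /* name from the thread */

uint8_t mem_read8_hoisted(uint16_t address)
{
    /* Evaluate the double array lookup once and keep the result local. */
    uint32_t page = MmuRegisters[MmuState][address >> 13] & 0x3Fu;

    if (page_flags[page])                             /* first use */
        return 0xFFu;
    return ram[(page << 13) | (address & 0x1FFFu)];   /* second use */
}
```

Comparing the ASM output before and after shows whether the manual hoisting actually saves hub accesses.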
What is defined first goes into the data section first. To access the data section requires an add/sub sequence. Data that is not in the first 511 bytes of its module's data section requires AUGS instructions to express the offset.
Aha, got it. Thanks!
Would a local pointer to a global variable make sense?
Sometimes yes. Examine ASM output before/after.
I know nothing about the 6809, but a P2 Z80 or 8086 emulator doesn't need a PC or IP variable at all with XBYTE. Use GETPTR instead and only when required.
EDIT:
Added "with XBYTE"
None of the CPU emulators for P2 use the FIFO. Doesn't work with SMC or non-linear memory and prevents hubexec. We've been over this.
My 8086 and some versions of my Z80 emulators use the FIFO as part of XBYTE, which I should have mentioned (previous post corrected).
The 8080 emulator that Baggers and I worked on uses XBYTE/FIFO. The emulator code wound up being very compact and fit into the LUT. The XBYTE table was 256 longs and the emulator code was 160 longs. To handle the video for complete console emulation, we would read in full lines of pixels using SETQ+RDLONG, which didn't interfere with the FIFO.
Here is the 8080 emulator code:
Well, I stand slightly corrected. Only checking IRQs on branch is pretty clever as a HLE-style optimization.
But point was that once you add any sort of complex memory handling (write-sensitive I/O, memory banking, etc) the code starts to overflow the cog RAM and you can't use the FIFO anymore (without obnoxiously saving/restoring its state). (My Z80 emulation barely fits as is...). And the aforementioned self-modification / bank crossing issues.
Then again, maybe I'm just a stickler.
Isn't your x86 vaporware?
Ok, so now mc6809exec() is running freely in its own cog, with the other order of includes and with local
register cpuregister pc,x,y,u,s,dp,d;
(and with a huge amount of things not working anymore, commented out.)
with MMU: 0.77MHz; without MMU 4.6MHz (!!!!) I don't know if I should trust this.....
If this is real, then it is doable again! Ada, what shall I say!!!
( Will be not able to work on this for 2 weeks. )
Some update. I have renewed my efforts with this project. The emulator of VCC written by Joseph Forgione in C is usable I think with just minor changes. I really like, that it is so very well readable for me. And it is proven to work. :-) At the moment the MC6809-registers are still global variables in hub ram. I only split the switch-case and transferred temp8, temp16, temp32 to local and the CycleCounter (do we need it at all?) to the cog variable _PR0.
Without MMU my little assembler loop was running at 3.6MHz (P2@200MHz), which I think, could be good enough.
So now I am struggling with the MMU. Transferring the variable MmuState to _PR7 gave a performance of 1.36MHz. (Goal is >1.79) This MMU is a real bottleneck....
Regarding the MMU, we have 3 worlds here.
Coco-3 hardware has 128k or 512k fully usable ram. 2*8 MMU registers with 6 bits. OS9 Level II does not run with 128k.
PC VCC can have up to 8Meg fully usable ram. 4(!)*8 MMU registers with 16 bits. (!) It uses 2 additional tables *MemPages and MemPageOffsets.
P2 has 512k, but only ???k will be usable. At the moment, I think, it will have 4*8 registers with 8 bits. At the moment I hope, that I can get rid of these additional tables somehow.
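For the Coco-3 case described above (2*8 registers with 6 bits), the address translation can be sketched like this, assuming 8 KiB pages (64 KiB divided into 8 slots). The type and function names are hypothetical, and the special handling of the I/O area at the top of the address space is left out.

```c
#include <stdint.h>

/* CoCo 3 style MMU: 2 sets of 8 registers, 6 significant bits each.
 * Each register selects one 8 KiB page, so 64 pages cover 512 KiB. */
typedef struct {
    uint8_t reg[2][8];   /* page number per slot, for each register set */
    uint8_t task;        /* which register set is active: 0 or 1 */
} coco3_mmu;

/* Translate a 16-bit CPU address into a 19-bit physical address. */
uint32_t mmu_translate(const coco3_mmu *m, uint16_t cpu_addr)
{
    uint32_t slot = cpu_addr >> 13;                       /* one of 8 slots */
    uint32_t page = m->reg[m->task & 1u][slot] & 0x3Fu;   /* 6-bit page */
    return (page << 13) | (cpu_addr & 0x1FFFu);           /* page + offset */
}
```

With 6-bit registers no further table is needed: the register value itself is the page index into a flat 512 KiB array.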
Perhaps someone can answer some questions? Terry?
1. Will OS9 Level II run with 256k of Ram?
2. How can we tell OS9, that only xxx k of 512k is available?
3. There must be some table somewhere, that says, which pages are in use? Perhaps we can have a startup file that marks the pages, which cannot be used?
4. In what order will OS9 use the Ram at startup?
5. Am I right, when I think, that after the initial reading of a bootsector, no ROM is used anymore?
Thanks for any hints! Christof
Hi Terry,
you offered some help here. Thank you! This project is just started and I am not sure, if I am capable of doing it and willing to invest all the time. So I don't feel able to propose a kind of sharing the work at this moment.
The plan is for now:
1. Get the MC6809-emulator working and fast enough - seems to be done.
2. Get the Mmu working and fast enough - active
3. Get the main timer working. I think that is needed for OS9.
3. Get P2Vcc to boot OS9 from an image file on SD card. This shall work with no ROM and no Video. Blink some diode. Or do something else, that can be detected.
4. Have a shell on a serial 6522 line using the original 6522 drivers.
Perhaps you could answer some questions in the post above or give some comments for steps 3 and 4?
Christof
1.81MHz! MMU registers in cog ram.
Oh, I could have thought of this earlier:
Caching the last MMU-page-offset and using it, if still relevant:
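A minimal sketch of such a last-page cache, under assumptions: a single active set of 8 MMU registers, 8 KiB pages and a flat 512 KiB backing array. All names here are hypothetical; this illustrates the idea and is not the code used in the project.

```c
#include <stdint.h>

uint8_t emu_ram[64u * 8192u];   /* assumed: 512 KiB backing store */
uint8_t mmu_reg[8];             /* assumed: the active MMU register set */

static uint32_t cached_slot = 0xFFFFFFFFu;   /* invalid, forces first lookup */
static uint32_t cached_base;                 /* physical base of cached slot */

/* Must be called whenever an MMU register or the task select changes. */
void mmu_invalidate_cache(void)
{
    cached_slot = 0xFFFFFFFFu;
}

uint8_t mem_read8_cached(uint16_t address)
{
    uint32_t slot = address >> 13;

    if (slot != cached_slot) {                    /* miss: redo the lookup */
        cached_base = (uint32_t)(mmu_reg[slot] & 0x3Fu) << 13;
        cached_slot = slot;
    }
    return emu_ram[cached_base + (address & 0x1FFFu)];   /* hit: offset only */
}
```

A single cache entry only pays off while code and data stay in the same page, which matches the observation that the loop runs fastest when everything is in one page. Separate entries for opcode fetches and data accesses would be one possible refinement.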
MC6809 running at >2.6MHz for my little loop.
And if everything is in the same page: 3MHz
(It is only capable of reading ram and ports though)
Would it be worth switching over from MC6809 to HD6309?
In Vcc there is the source for this, and the modifications seem to be doable in about 2 days or even less? Of course it will only be more powerful, if there is code or tools for it.
Tempting. But I should now tackle the main timer with interrupts....
Are there tools for HD6309 in OS9?
According to Wikipedia NitrOS-9 takes advantages of the features of the 6309 if available. Otherwise I think the 6309 can be in compatible mode.
https://en.wikipedia.org/wiki/Hitachi_6309 says that one of the advantages is 32bit math..which can be good at possibly lower cost on the propeller than 8-bit math because it could be closer to native instructions.
Hey Christof, I have been on vacation and haven't had an opportunity to respond.
So let me begin by saying I know nothing of the mechanics of emulators. I can offer up what I think should be the basic building blocks.
We would probably want to boot one of Curtis Boyle's Ease of Use distros. They are like Ragù, "it's in there"
I would look at the CoCo SDC code for dealing with virtual disks. We would most likely want to emulate it exactly as it is. Curtis already has OS9 disks specifically for the CoCo SDC.
I have a friend that wrote TRXMon. Instead of just throwing TSMon out on the serial ports, it would actually prompt for user login, though I can't recall if it required his special OS9P3 module or not. Just getting a shell via serial would be cool!
So after digging into several sources, and having a closer look at what OS9 does, for the time being I have a very ugly workaround:
The most significant bit is ignored by MMU. OS9 takes code and data pages from the bottom and video buffers from the top. So as long as these don't collide....
Edit: Minimum 152k for Os9 without graphics.
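One way to read this workaround, as a sketch only: if the page number has 6 bits and the top bit is dropped, 32 pages of 8 KiB remain, i.e. 256 KiB of backing store, and pages 0x20..0x3F alias pages 0x00..0x1F. The 256 KiB figure, the 6-bit page width and both function names are assumptions, not taken from the project code.

```c
#include <stdint.h>

/* Map a 6-bit page number onto a 256 KiB store by ignoring its top bit. */
uint32_t page_to_phys_256k(uint8_t page6, uint16_t offset13)
{
    uint32_t page5 = page6 & 0x1Fu;               /* drop the most significant bit */
    return (page5 << 13) | (offset13 & 0x1FFFu);  /* 18-bit physical address */
}

/* Two different page numbers collide if they differ only in the dropped bit. */
int pages_collide(uint8_t page_a, uint8_t page_b)
{
    return page_a != page_b && ((page_a ^ page_b) & 0x1Fu) == 0;
}
```

Under this reading, pages handed out from the top (0x3F downwards) land at the top of the 256 KiB store, and trouble starts only when the allocations from the bottom reach the same physical pages.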
OS9 Level 2 will boot in 128K. I believe it is determined at boot time. I do not recall having to do anything special when I upgraded from 128K to 512k on my CoCo. It will boot into the 32x16 text video mode.
So the last days have been rather deflating:
1. There are really a lot of information sources (books, manuals, ...) about Coco-3 or Os9 or Nitros-9, but if you want clear in-depth information about some internals, then it becomes really difficult. For example, I have not found clear information on how the 60Hz timer works. (At the moment I believe it works through INT, not NMI, not FIRQ. And it seems to use the vertical blank interrupt, not HSYNC and not the hardware timer.)
2. I have not found any way to tell OS-9 that not 128k and not 512k but 256k are available. I assume that you have to patch the kernel of OS-9 somehow. 256k should be sufficient but not comfortable.
3. It is more complex to get the harddisk access working than I had guessed.
4. I had been happy to achieve 2.3MHz for MC6809 by caching the last memory index of the MMU. But when I introduced the next layer of port access, I fell back a lot, to <1MHz. After 2 days I am back at 1.3MHz (data in a different page than code but not in ports, which is worse). As I have now used global locals for the flag variables with rather little effect, I don't have any good ideas left to regain the speed. FlexProp is of course the marvelous tool that enables everything. But it then does something and you are back to xyz. Vcc on my PC running an i7-8550U @ 3.5GHz says that it can do mc6809 @ 89MHz. The slow hub access does get in the way very much. Well, yes, one could rewrite everything in assembler to execute much in cog.....
5. I have learnt a lot, for example that it is not the operations but the calculation of flags that is the time-consuming part.
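To illustrate the last point, here is a sketch of an 8-bit ADD with the 6809 condition codes it affects (H, N, Z, V, C). The function name `add8` is hypothetical and the code is written from the 6809 flag definitions, not copied from Vcc.

```c
#include <stdint.h>

/* Bit positions in the 6809 condition code register. */
enum { CC_C = 0x01, CC_V = 0x02, CC_Z = 0x04, CC_N = 0x08, CC_H = 0x20 };

/* 8-bit ADD: the sum itself is one operation, but five flags follow. */
uint8_t add8(uint8_t a, uint8_t b, uint8_t *cc)
{
    uint16_t sum = (uint16_t)(a + b);
    uint8_t  r   = (uint8_t)sum;
    uint8_t  f   = *cc & (uint8_t)~(CC_C | CC_V | CC_Z | CC_N | CC_H);

    if (sum & 0x100u)                 f |= CC_C;   /* carry out of bit 7 */
    if ((a ^ b ^ r) & 0x10u)          f |= CC_H;   /* carry out of bit 3 */
    if ((~(a ^ b) & (a ^ r)) & 0x80u) f |= CC_V;   /* signed overflow */
    if (r == 0)                       f |= CC_Z;   /* result is zero */
    if (r & 0x80u)                    f |= CC_N;   /* result is negative */

    *cc = f;   /* bits not affected by ADD are preserved */
    return r;
}
```

If each flag lives in its own hub variable, every one of these updates becomes a separate hub write, which fits the observation that the flags cost more than the operation.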
So likely I will abandon this project.
Thanks to all, who have given help here!!!!!!
Christof
Expect NMI never to be used. Any system that uses it is a broken design. It's a wasted pin.
As I recall, the default clock module in OS-9 uses the h-sync to generate the ticks. The h-sync and v-sync timers are exposed in the GIME. I remember having to cobble together a new boot disk when I got my DISTO supercontroller with RTC, because the clock module needed to be updated to use the RTC. As I recall, that module is the heart of the multitasking, and on each 60Hz "tic" it would scan for virtual IRQs in a table to service the various tasks. It has been 30 years since I last thought about any of that. If you emulate the h-sync and v-sync counters, I don't think it would be a problem.
Yeah, I don't think the P2 could emulate things fast enough using spin. You can't use the streamer FIFO with interpreted Spin/Prop Tool, and that latency is gonna kill your performance.
If I ever get time, I need to learn the overall architecture of these emulators. Seems like a lot of fun.
Thanks for your efforts Christof!