Towards a P2 Virtual Machine using XBYTE and Subsystems
This is an experiment exploring the possibilities of the P2.
The overall goal is a self-contained system like Taqoz with an SD card, using a PC with Teraterm as the terminal.
One aspect is about the virtual machine: I asked myself, what would be the most "natural way" to implement a dual stack machine (Yes, my implementation shall run Forth, once more) on P2? So, yes we have two index registers PTRA + PTRB for the stacks, but how about the program counter?
Well, why not try XBYTE, which has its hidden PC, promises to give a fast interpreter and uses the microcache of the streamer.
The downside is that you can only use XBYTE if you dedicate its cog to pure assembler.
XBYTE is new for me and I will have to learn much here. I will start a link-list in the next post about XBYTE.
So, if we want to use libraries from FlexProp or write some code in C, we have to have "intelligent" Subsystems in other Cogs. The question is whether this can be turned into a benefit for overall throughput. After all, it has not been too easy for me to make good use of additional COGs in the past. So at the moment, the idea is something like this:
COG0: Compiler, runs C code, compiles Forth to Bytecode
COG1: Serial buffered input
COG2: Console buffered output (?? process decimal output, string output ??)
COG3: SD Driver (?? with read ahead and write caching ??)
COG4: XBYTE Machine executing Bytecode
COG5, COG6: Spare XBYTE Machines executing Bytecode
COG7: VGA tile driver
I would be interested in your thoughts about such "intelligent subsystems". What is really helpful?
Christof
Comments
Link List about XBYTE
Propeller P2 assembler instructions:
https://docs.google.com/spreadsheets/d/1_vJk-Ad569UMwgXTKTdfJkHYHpc1rZwxB-DcIiAZNdk/edit#gid=0
Incomplete Assembler Manual:
https://docs.google.com/document/d/1rQ_2FPLebT9GL6cZxgwHFbqxO9Kh9SfO6yn9gGTT6EE/edit
P2 Docu:
https://docs.google.com/document/d/1gn6oaT5Ib7CytvlZHacmrSbVBJsD9t_-kmvjd7nUR6o/edit
Obex:
https://obex.parallax.com/
@Wuerfel_21 's P2 Instructions Docu:
https://p2docs.github.io/
About XBYTE Zylin ZPU:
https://forums.parallax.com/discussion/166563/zpu-and-riscv-emulation-for-p2-now-with-xbyte
https://github.com/totalspectrum/zog/blob/master/zog_p2.spin
https://github.com/zylin/zpu/blob/master/zpu/docs/zpu_arch.html
Uses execf but not XBYTE:
https://forums.parallax.com/discussion/173342/6502-cpu-emulator/p1
https://forums.parallax.com/discussion/174344/p2-native-avr-cpu-emulation-with-external-memory-xbyte-etc/p1
Chip's XBYTE machine:
https://forums.parallax.com/discussion/172813/p2-space-invaders-8080-emulation/p1
There must be some first discussion of XBYTE somewhere?
Other interesting links?
Do you need XBYTE? Catalina already has everything you need including the plugins you listed, apart from the Forth stuff. If you really do want Forth, there are Forth compilers written in ANSI C - e.g. https://gist.github.com/lbruder/10007431
Seriously, though - I have been giving some thought to adding an XBYTE back end for LCC (this would be similar to Catalina's COMPACT mode, except that COMPACT mode would still be useful because it also works on the P1). Fairly easy to do, and you then get all the other Catalina stuff thrown in for nothing.
If there would be any interest in such a thing, let me know. I've pretty much run out of things to add to Catalina, and Winter is Coming!
Thanks, Ross, for your comment. Though there seems to be some misunderstanding. I already have a Forth, written in C. https://forums.parallax.com/discussion/176167/p2ccforth-another-forth-written-in-c#latest And lbruder's version was one of its sources of inspiration. A downside of P2CCForth is its low speed in comparison to Taqoz and its low code density.
So a Virtual Forth Machine written in PASM and using XBYTE might combine some good features?
Christof
You could try these AUGS instructions to use any stack pointer for your data stack. The compiler needs to add AUGS before opcodes and in some cases add DUP/DROP before/after ... This should be the fastest while still compact. Use the normal return stack for the 2nd stack. You can play with _RET to produce tight code.
No worries. I don't know or use Forth, but you triggered my interest by mentioning XBYTE. I looked at XBYTE a while ago and ruled it out for Catalina, but now I can't recall exactly why. I'll have another look.
Ross.
I've used XBYTE a lot.
Pros:
GETPTR when PC required.
Cons:
Thanks. How do you use XBYTE for 16 bit bytecodes? Also, could you mix 8 and 16 bit bytecodes?
Ross.
You don't. You just handle the 2nd byte separately.
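To illustrate "handle the 2nd byte separately", here is a minimal sketch in plain C rather than PASM. It only models the idea: the dispatcher looks at one byte (as XBYTE would), and the handler for a two-byte code reads its own operand byte (as RFBYTE would on the FIFO). The opcodes and the function name are invented for this sketch and are not taken from any of the interpreters mentioned in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes, invented for this sketch. */
enum { OP_HALT = 0x00, OP_LIT8 = 0x01, OP_ADD = 0x02 };

/* Runs bytecode until OP_HALT (or an unknown code) and returns the top of
   stack. 'pc' plays the role of the FIFO read position. */
int32_t run_bytecode(const uint8_t *code, size_t len)
{
    int32_t stack[16];
    size_t sp = 0, pc = 0;

    while (pc < len) {
        uint8_t op = code[pc++];        /* the byte the dispatcher decodes */
        switch (op) {
        case OP_LIT8:
            assert(pc < len && sp < 16);
            stack[sp++] = code[pc++];   /* handler consumes the 2nd byte itself */
            break;
        case OP_ADD:
            assert(sp >= 2);
            stack[sp - 2] += stack[sp - 1];
            sp--;
            break;
        default:                        /* OP_HALT or unknown code */
            return sp ? stack[sp - 1] : 0;
        }
    }
    return sp ? stack[sp - 1] : 0;
}
```

After the handler has taken its operand, the next dispatch simply continues with the byte that follows, so one-byte and two-byte codes can be mixed freely.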
Another important limitation of XBYTE vs a regular interpreter loop is that you do not get to run any code in-between bytecode impls (unless you manually insert a lot of calls). This has made it not very useful for emulating real CPUs, but it shouldn't matter for a purpose-built bytecode interpreter.
Thanks for your inputs!
The project has taken a direction, which is a little bit shifted.
It turns out that it is indeed possible to mix a (small) XBYTE machine with normal compiled code. My assumption had been that this would not be possible. The machine, including the code table, is permanently loaded into LUT from $200 to $300, and when it encounters a TRAP code it returns to the calling code, which executes the TRAP code and then restarts the XBYTE machine to go on. Of course this switching of modes is not very attractive for speed, so the more instructions you squeeze into the XBYTE machine, the better.
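The control flow of this TRAP scheme can be sketched in C as a fast inner loop that returns to its caller whenever it meets a code it cannot handle. This is only a model under assumptions: the opcode values, the trap number and the names vm_run and vm_execute are hypothetical, and on the real machine the inner loop would be the XBYTE code in LUT while the outer loop would be the compiled C code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bytecodes for this sketch. */
enum { OP_PUSH1 = 0x01, OP_ADD = 0x02, OP_TRAP = 0xFE, OP_END = 0xFF };

typedef struct {
    const uint8_t *pc;   /* bytecode pointer, kept across restarts */
    int32_t tos;         /* top of stack held in a "register" */
    int32_t stack[16];   /* rest of the data stack (hub RAM on the real machine) */
    int sp;              /* number of items in stack[] */
} vm_t;

/* Fast path: stands in for the XBYTE machine. Returns the trap number
   (>= 0) when it meets OP_TRAP, or -1 at OP_END or an unknown code. */
int vm_run(vm_t *vm)
{
    for (;;) {
        uint8_t op = *vm->pc++;
        switch (op) {
        case OP_PUSH1:
            assert(vm->sp < 16);
            vm->stack[vm->sp++] = vm->tos;
            vm->tos = 1;
            break;
        case OP_ADD:
            assert(vm->sp > 0);
            vm->tos += vm->stack[--vm->sp];
            break;
        case OP_TRAP:
            return *vm->pc++;            /* trap number follows the opcode */
        default:
            return -1;
        }
    }
}

/* Slow path: stands in for the calling code. It services each trap and
   then restarts the fast machine where it left off. */
int32_t vm_execute(const uint8_t *code)
{
    vm_t vm = { .pc = code, .tos = 0, .sp = 0 };
    int trap;

    while ((trap = vm_run(&vm)) >= 0) {
        if (trap == 0)
            vm.tos *= 10;                /* hypothetical trap 0: TOS times ten */
    }
    return vm.tos;
}
```

Every trap costs a return and a restart, which is why moving frequently used words into the fast machine pays off.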
Both stacks are in HUB, which makes task switching simple and fast.
The XBYTE machine and a routine to load it are in a SPIN2 file; the calling code shall be in C, and I hope that I can recycle a lot of code from P2CCForth. At the moment, though, the machine holds only TOS (top of stack) in a register, while in P2CC NOS (next to TOS) is also a register. Taqoz holds the top 4 in registers.
Looking forward to getting a Fibonacci(46) running for speed comparison....
Ok, I could do that. Do you just use RFBYTE to consume the 2nd byte and so the next XBYTE will process the byte after that?
Yes, this is more of an issue. But Catalina's COMPACT mode routinely has instances of quite a few 16 bit instructions between each 32 bit instruction (typically a call or jump instruction). So there may be scope for a decent speed improvement here. However, what happens to the FIFO if you need to execute a random hub read or write to implement the bytecode? Do you have to explicitly restart the XBYTE processing again afterwards? **
** It is probably becoming clear here that I have little or no idea what I'm talking about!
Yes.
FIFO and regular RDLONG/WRLONG-type operations don't logically interfere with each other (except when they do!!!)
I've written XBYTE interpreters for the 8086 and the Z80 and they work very well. In fact, using XBYTE results in the fastest possible emulation. I'm not quite sure what limitation you're referring to.
As the name suggests, XBYTE is designed for bytes but can handle 16-bit or wider codes provided the most important byte for decoding is the low byte and the byte order is little-endian. The current version of XBYTE is not really suitable for emulating the 68000, but a future version could be if it had an option for big-endian 16-bit or even 32-bit codes.
No. The FIFO and random hub reads or writes don't conflict. When the FIFO is reloaded by RDFAST, the FIFO continues to fill after the RDFAST instruction ends and a hub read/write could be stalled if it executes soon after RDFAST (within ~10 cycles from memory, possibly less).
P.S. Good to have the forum up and running again.
Then you haven't implemented interrupt handling
It needs to check for an interrupt condition between each instruction.
I have implemented interrupt handling and it works!
Overhead is only two cycles for the check at the end of an emulated instruction if no actual interrupt:
_ret_ tjf INT_enable,INT_addr 'start new XBYTE if no hw int
INT_enable must be FFFF_FFFF to jump to the interrupt routine at [INT_addr]. Emulation code for DI/EI or CLI/STI clears/sets the low word of INT_enable, and a P2 interrupt sets the high word for maskable interrupts or the whole long for NMIs. _ret_ starts a new XBYTE if INT_enable < FFFF_FFFF.
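The flag logic described here can be modelled in a few lines of C. This is a sketch of the bookkeeping only, not of the emulator's actual code: the function names are invented, and the comparison against all ones stands in for what TJF tests in a single instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of the INT_enable register: the low word is owned by the emulated
   EI/DI (or STI/CLI), the high word by the P2 interrupt. */
uint32_t int_enable;

void emu_ei(void) { int_enable |= 0x0000FFFFu; }   /* emulated EI / STI */
void emu_di(void) { int_enable &= 0xFFFF0000u; }   /* emulated DI / CLI */
void hw_irq(void) { int_enable |= 0xFFFF0000u; }   /* maskable request  */
void hw_nmi(void) { int_enable  = 0xFFFFFFFFu; }   /* non-maskable      */

/* The end-of-instruction check: jump to the interrupt routine only when
   every bit is set, otherwise carry on with the next bytecode. */
bool take_interrupt(void) { return int_enable == 0xFFFFFFFFu; }

/* Small walk-through of the cases described in the post. */
void interrupt_selftest(void)
{
    int_enable = 0;
    hw_irq();
    assert(!take_interrupt());   /* requested, but interrupts disabled  */
    emu_ei();
    assert(take_interrupt());    /* requested and enabled               */

    int_enable = 0;
    hw_nmi();
    assert(take_interrupt());    /* NMI ignores the enable state        */
}
```

A request that arrives while interrupts are disabled stays pending in the high word and is taken at the first check after the emulated EI.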
I spent some time re-examining this, to refresh my memory. The limitation is not implicit to XBYTE, it is implicit in the reduced instruction set I use in Catalina's COMPACT mode. In my defence, this mode was designed in the days of the P1, before XBYTE was even a twinkle in Chip's eye.
When I first saw XBYTE (and specifically the RFVAR instruction) I thought - "Great! I can use that" - but it turns out to be not nearly so easy. In brief, encoding the instructions I use in my COMPACT mode would require 7 bits if I used RFVAR @, which does not leave enough bits in a 2 byte instruction (7 bits) or in a 4 byte instruction (22 bits) to encode all the other information I need (my own encoding scheme leaves 10 bits in a 2 byte instruction and 24 bits in a 4 byte instruction). The result is that if I use RFVAR @ pretty nearly all my 2 byte instructions have to become 3 bytes, and there are some 4 byte instructions that I would have to drop altogether and instead use two instructions to implement (so effectively making them take 6, 7 or 8 bytes instead of 4). And to use fewer than 7 bits to encode the instruction, I would have to drop about 20 instructions (or about 25% of the instruction set) which would also make the code larger and slower.
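The bit budget quoted above can be checked with a small C sketch. It assumes the RFVAR format is 7 data bits per byte with a continuation flag in bit 7, least significant group first, and all 8 bits of a fourth byte (29 bits in total); that assumption is consistent with the 7 and 22 bits quoted above but should be verified against the P2 documentation. The function names are invented for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Payload bits carried by an RFVAR-style value of nbytes (1..4), under the
   assumed format: 7 bits from each of the first three bytes, 8 from a fourth. */
int rfvar_bits(int nbytes)
{
    assert(nbytes >= 1 && nbytes <= 4);
    return nbytes < 4 ? 7 * nbytes : 29;
}

/* Bits left for operands once the opcode field has been taken out. */
int operand_bits(int nbytes, int opcode_bits)
{
    return rfvar_bits(nbytes) - opcode_bits;
}

/* Decodes one value in the assumed format and reports how many bytes were
   consumed. The caller must provide enough bytes for the encoded value. */
uint32_t rfvar_decode(const uint8_t *p, int *used)
{
    uint32_t v = 0;
    int i;

    for (i = 0; i < 3; i++) {
        v |= (uint32_t)(p[i] & 0x7F) << (7 * i);
        if (!(p[i] & 0x80)) {            /* no continuation flag: done */
            *used = i + 1;
            return v;
        }
    }
    v |= (uint32_t)p[3] << 21;           /* fourth byte contributes 8 bits */
    *used = 4;
    return v;
}
```

With a 7 bit opcode field this leaves 7 operand bits in a 2 byte value and 22 in a 4 byte value, against the 10 and 24 bits of the existing COMPACT encoding.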
So to use XBYTE I would have to redesign and rewrite almost everything from the ground up. Even then, while it might execute faster (or it might not - I'd have to do some tests to find out @@) - it would not encode as compactly as my current COMPACT mode, which makes the whole exercise a bit pointless.
EDIT: Maybe take this to the Catalina thread for any further discussion - it is getting off-topic for this thread!
@ should have said RFVAR here, not XBYTE - now fixed.
@@ a quick test shows that just using the FIFO (i.e. RDFAST and RFLONG) instead of RDLONG makes the code execution slightly faster, but not enough to offset the less efficient encoding (that XBYTE would need) over the current COMPACT mode
EDIT2: Using RFLONG is slightly faster than using RDLONG, not slightly slower as I originally posted.
That's pretty clever! I've avoided using the native P2 interrupts. Not 100% sure why.
Yeah, I found that polling the "interrupt" status between each instruction ideally wanted a custom emulator loop (non-XBYTE) for my own Z80 emulator.
That has me wondering though, what does a P2 interrupt do to an XBYTE processing loop? I guess it could still be used to do something useful, although it could interrupt part way through the emulated instruction, which is not particularly nice unless the code can somehow go off to process the interrupt and can always restart the interrupted instruction without a problem. That is, it would need to support instruction-level re-entrancy, which may not always be possible; it depends on how things are being emulated and on which operations are irreversible if an interrupt occurs in some critical section.
If it was possible, maybe through disabling P2 interrupts at key times, then the other COGs could simply ATN the XBYTE emulator COG to trigger a P2 interrupt and try to process async interrupts/events that way.
EDIT: just read the earlier posts that happened while I was composing this. Got ninja'd a little bit there.
I think that these difficulties with interrupts (executing hub code) together with any streamer action are the reason that FlexProp does not support them. There is also probably a bug in Taqoz when it is using the streamer for FILL, which could eventually be interrupted.
About my Forth XBYTE machine, which is now able to execute hand-compiled code: enough for the Fibonacci(46). Funnily, it is almost exactly as fast as Taqoz, just a tiny bit faster, with the TOS in a register but the stacks in HUB. This is about 4 times faster than P2CCForth! Without these hub accesses there should be some nice speedup possible against Taqoz. (But not for my actual mixed machine, because there is no room for the stacks in LUT. And how would you access a stack in LUT in C?)
ROT is awful, 4 hub accesses:
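A minimal C model of why ROT needs four hub accesses when only TOS is cached in a register. It is a sketch, not the PASM of the machine: the struct, the helper names and the access counter are all invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int32_t tos;          /* cached top of stack (a cog register on the P2) */
    int32_t hub[16];      /* rest of the stack; hub[sp - 1] is NOS */
    int sp;               /* number of items held in hub[] */
    int hub_accesses;     /* counts simulated hub reads and writes */
} dstack_t;

int32_t hub_read(dstack_t *s, int i)
{
    s->hub_accesses++;
    return s->hub[i];
}

void hub_write(dstack_t *s, int i, int32_t v)
{
    s->hub_accesses++;
    s->hub[i] = v;
}

/* ROT ( a b c -- b c a ), with c in tos, b in NOS and a below it. */
void op_rot(dstack_t *s)
{
    assert(s->sp >= 2);
    int32_t b = hub_read(s, s->sp - 1);   /* read NOS          */
    int32_t a = hub_read(s, s->sp - 2);   /* read third item   */
    hub_write(s, s->sp - 2, b);           /* b moves down      */
    hub_write(s, s->sp - 1, s->tos);      /* c becomes NOS     */
    s->tos = a;                           /* a becomes TOS     */
}
```

If NOS were also held in a register, as in P2CCForth, the same operation would need only one read and one write in this model.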
What I really don't like about XBYTE is this skip encoding in the table, which completely alters the behaviour of the code at some other place, totally obscuring things at both places. (Huh, and there are people who say that Forth has bad readability....)
Using the streamer, maybe - but AFAIK there are no problems just using interrupts with hub execution. This is how Catalina implements multi-threading on the P2.
The limitations you refer to seem to be due to RFVAR. Perhaps you could try XBYTE + RFBYTE / RFWORD / RFLONG instead? TBH, I don't find RFVAR useful enough to actually use.
One, two or all three of INT1/2/3 can be used to set the high word of INT_enable. If there are multiple interrupt sources, the single interrupt routine at [INT_addr] must be able to differentiate them. In my 8086 emulator, I simulate a simplified 8259 by using a third interrupt-related register called INT_NMI_IRQ which has bits for IRQ7-IRQ0 and NMI. A P2 timer interrupt, e.g. for 18.2Hz system clock, could look something like this:
The interrupt routine at [INT_addr] resets bits in INT_NMI_IRQ and the high word of INT_enable (the latter only if not single-stepping, which I also support). My aim when emulating interrupts was to execute the interrupt routine only when really needed (e.g. maskable interrupt requested and interrupts enabled) and to use as few cycles as possible for the interrupt check when there is no interrupt.
Yes, this is true. But I have tried just using RFLONG, and while this is slightly faster (not slightly slower, as I originally posted above) it is not enough faster to justify also changing over to XBYTE, which would require less efficient instruction encoding.
If speed and not space were the only criteria it might be worth it, but COMPACT mode is primarily intended to save space, and an XBYTE program would not save as much. And it would not be as fast as NATIVE (Hub Execution) mode. So it would be a lot of work, but not add much benefit.
Ross.
I think a well designed bytecode using XBYTE could be both smaller and faster than the current COMPACT mode. For example, the Dhrystone Proc_1 function is 146 bytes in COMPACT mode. The same function compiled with flexspin to bytecode is 92 bytes, and compiled with gcc to ZOG ZPU bytecode is 136 bytes. The 3 underlying compilers are quite different, of course, so some of this may be due to other factors, but bytecode can certainly be compact.
I did notice that the COMPACT mode has about 12 bytes of padding due to "alignl" inserted in various places. For P2 that shouldn't be necessary, RDLONG doesn't have to be aligned on P2.
Yes, the padding is not necessary on the P2. I added it to replicate how COMPACT code is encoded on the P1, where the alignment of longs is done automatically. Partly because I did not want the P2 to be different to the P1. It is currently identical, which makes development and maintenance much easier. But I may revisit this.
As for whether XBYTE can be smaller than COMPACT, yes it can in specific cases - but (probably!) not in the general case when it has to implement all the required Propeller functionality. Even now, Catalina has to temporarily exit COMPACT mode to implement some of the more arcane propeller functionality (particularly some of the new functionality offered by the P2) and doing so means those sections of code are even larger than the equivalent NATIVE code would be.
Finally, COMPACT mode was originally designed to support memory sizes up to 16Mb even on the P1 - which never got implemented because the P2 arrived and pretty much ended the search for expanded memory solutions for the P1. The P2's 512Kb seemed like it would be enough for anyone ... but just like Bill Gates was about DOS and 640kb, we were wrong!
Ross.
I will once again take the blame for bringing the 96MB memory expander upon this realm.
May you rot in heck!
No, bytecode is still smaller in the general case, at least if the bytecode instructions are well designed. Chip's Spin2 bytecode interpreter is an example -- it has all of the P2 functionality in there, and the Spin2 compiler produces pretty small binaries. flexspin's nucode mode does too. nucode is a bit of an odd duck because the bytecode interpreter is customized to the application (the compiler generates a unique interpreter for each build, that compresses the most common instructions in that compile). But even a traditional bytecode machine like the Spin2 one can compress the code quite well.
Don't get me wrong -- Catalina has many virtues, and I'm not knocking it in general. But code density, alas, is not among those virtues, and I'm sure you could improve upon it. Heck, even getting rid of the alignment padding in the current COMPACT mode would give something like a 10% density improvement. This is trivial for P2, and could even be done on P1 by synthesizing a RDLONG with two RDWORDs. But writing an XBYTE interpreter is a fun challenge.
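Synthesizing a long read from two word reads looks roughly like this in C. It is a sketch that assumes little-endian memory and that the long sits on a 2 byte boundary (as a 16 bit instruction stream would guarantee); read_word stands in for RDWORD and both function names are invented.

```c
#include <assert.h>
#include <stdint.h>

/* Models a word-aligned 16-bit little-endian read (the RDWORD stand-in). */
uint16_t read_word(const uint8_t *mem, uint32_t addr)
{
    assert((addr & 1u) == 0);            /* words must be 2-byte aligned */
    return (uint16_t)(mem[addr] | (mem[addr + 1] << 8));
}

/* Reads a 32-bit value that is only word-aligned, using two word reads. */
uint32_t read_long_unaligned(const uint8_t *mem, uint32_t addr)
{
    uint32_t lo = read_word(mem, addr);
    uint32_t hi = read_word(mem, addr + 2);
    return lo | (hi << 16);
}
```

On the P1 this replaces one hub access with two plus a shift and an OR, so it trades speed for the bytes saved on padding.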
The discussion is not about byte encoding in general, it is about the encoding required to use XBYTE. I don't know anything about nucode, but it sounds impressive. Does it use XBYTE?
As for Spin2 (which I've never used either, but I assume it DOES use XBYTE), AFAIK it only supports 512kb, so addresses only need to be 19 bits. If I chopped 5 bits off all Catalina's COMPACT addresses then it could be smaller too.
Yes, the padding can be removed on the P2 at very little cost. Now that COMPACT mode is so stable (it hasn't changed at all for years now) it is no longer important to keep the P1 and P2 versions identical, so I will look at removing the padding on the P2. But not on the P1 - having to use two RDWORDs in place of one RDLONG (because the P1 cannot read a long that is not aligned on a long boundary) would more than double the program execution times, so it is definitely not worth it.
I agree XBYTE would be fun, but it is probably not worth implementing an XBYTE back-end for LCC. However, I'm now looking at using it instead to build a bytecode interpreter for Lua. If I can do that it would make Lua absolutely fly on the P2, and make all other P2 languages look positively neanderthal!
Ross.
I make it 146 bytes with 16 bytes of "alignl" padding, so it will end up as 130 bytes. So better than gcc but not as good as flexspin.