I'd like to see something that uses external parallel RAM. And yes, if I go with a 16-bitter, that would be expensive in terms of GPIO lines. 2M parallel RAM (1M x 16) would involve 41 lines. That would use 20 address lines, 16 data lines, and 5 control lines. Once you get past a byte in width, RAM adds a control line per byte.
So if one does that, they would certainly need to multiplex lines.
As for multiple cores, it might be an option to use different types of cores for this. Like what if you have your own custom core, a Z80, a 6502, etc?
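Silly suggestion: Write a "Matter*" node device. Throw in some Zigbee hardware, and off you go... Perhaps I should try writing one myself.
https://www.theverge.com/23390726/matter-smart-home-faq-questions-answers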
Another idea comes to mind. Why not the ability to take a word and assign each bit an opcode, and maybe have a list of lists? Setup could be hard, but once you set it up, you could read a number of instructions in an entire memory cycle. Maybe leave 1 bit as a supervisor placeholder to go back to individual opcodes when done. So things are done in the order of the bits and only the things set to 1 are run. So one would have to only use that mode when you have instructions you use in a certain order a lot.
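isn't that what SKIPF does?
Mike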
I meant more like assigning opcodes to individual bits as an operating mode. I'd probably leave 1 bit as an exit pattern bit. So it goes down the list and, for each bit that is enabled, it runs the instruction in that slot. It stays in that mode, so you can reuse the same 1-bit opcode template with a different mask and run a different subset of those instructions. And that just might work well with serial RAM, since it could process between bits. So no need for spinlocks when you can structure the useful code so that the RAM has enough time.
So you are assigning an opcode to each bit in advance and every position means a certain operation. So you are choosing to run or not run each predefined opcode. And there could be multiple ways to map the bits to different opcodes. You could do a "butterfly" where you have various opcodes in one order, then in the reverse order, and maybe have 1 that is used only once, and one for pattern breaking. Like 0-6 being in one order, 7 being a single opcode, 8-14 being 0-6 in reverse, and 15 being whether to exit this mode or not. Or do it where half is opcodes and half is the address of a different list, thus staying in that mode and choosing a new opcode set.
But yes, would the overhead be worth it? And if you don't utilize it well, it would functionally be no better than a bunch of NOPs/Skips. I am trying to think of a "compression" scheme for dealing with low bandwidth memory, and one that doesn't require a large dictionary or large packet sizes.
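A minimal C sketch of that mode, assuming a 16-bit mask word whose bit 15 is the exit bit and whose other bits gate a table of predefined handlers (all handler names here are hypothetical):

```c
#include <stdint.h>
#include <stdio.h>

typedef void (*op_fn)(void);

/* Hypothetical handlers, one per bit position 0..14. */
static void op_inc(void)  { puts("inc");  }
static void op_load(void) { puts("load"); }
static void op_add(void)  { puts("add");  }

static op_fn op_table[15] = { op_inc, op_load, op_add /* rest unused */ };

#define EXIT_BIT 0x8000u   /* bit 15: leave bit-mask mode when set */

/* Run one 16-bit mask word; returns nonzero when the mode should end.
   Note the bits set to 0 still cost a test here, which is exactly the
   NOP-like overhead discussed above. */
static int run_mask(uint16_t mask)
{
    for (int bit = 0; bit < 15; bit++)
        if (((mask >> bit) & 1u) && op_table[bit])
            op_table[bit]();
    return (mask & EXIT_BIT) != 0;
}

int main(void)
{
    uint16_t program[] = { 0x0005, 0x8003 };  /* second mask has the exit bit set */
    for (size_t i = 0; i < sizeof program / sizeof program[0]; i++)
        if (run_mask(program[i]))
            break;
    return 0;
}
```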
You could make such 'set-of-opcodes' compression/mapping compile-time dynamic, rather than fixed in an opcode?
You would need a small number of tables that the compiler would fill with useful opcode groups, like a cache lock. Maybe users also need a means to lock time critical tables, for stability.
That would map well onto a P2 resource, and the P2 SKIPF could run that table.
Yeah, I'm still trying to hash it out to see if it would be feasible. The downside of making each bit mean something in a unary fashion is that the 0's would function like NOPs rather than not be used. I don't know how one could do that with jumplists. If that can't be done, then one is stuck with polling.
And for getting such a list in, I don't know if it should be a series of immediates, or a memory location in the form of a table.
And then, what instructions would be eligible? Those with operands likely wouldn't be, unless you use only 1 (or intend to use it multiple times), like 8 bits for the opcode group mask, and 8 for an operand. And then loops, well, maybe only if it jumps to the start of the list of instructions and uses the same mask. Similar with conditionals, with only branching to the beginning of the list mask or aborting.
Hi,
(I have to admit that I am rather lost as to what the goal of your project is. As far as I understand, you want to create something new, but I do not know the direction or purpose.)
Still, to find out which operations are really useful, it might be a good idea to count how often each operation is used. This will give a sorted list of the words/instructions in a file:
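From: https://wingsoftechnology.com/unix-calculate-frequency-each-word-text-file/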
You have to sort out the words that come from comments...
It seems that ret, mov, jmp and CALL are used very often in assembler.
In Taqoz Forth: DUP, =, DROP, EXIT, C@, SWAP
Of course this depends on the task and the writer.
From this it seems not very useful to have 32 instructions coded into a long so that they can only be done in a certain order. What I have seen is 5-bit codes packed into 32-bit longs in a Forth stack processor, which has implicit operands only; that gives you the 32 most-used operations in any order. As said before, Peter optimised his Taqoz coding over the years and found that using 16-bit codes throughout is best, which corresponds to the ARM Thumb instruction set, which was invented after the 32-bit instruction set as an optimisation.
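For reference, the top 10 mnemonics in my emulators (lower code only):
MegaYume:
1. MOV
2. JMP
3. TESTB
4. RET (including _RET_)
5. ADD
6. CALL
7. AND
8. CMP
9. SHR
10. TEST
NeoYume:
1. MOV
2. JMP
3. TESTB
4. ADD
5. RET (including _RET_)
6. CALL
7. RDLUT
8. AND
9. CMP
10. GETNIB
(The occurrence of GETNIB and RDLUT is due to some unrolled blitting sequences...)
For some average flexspin compiled code:
MOV
ADD
SUB
RDLONG
JMP
CALL
WRLONG
CMP
RDBYTE
WRBYTE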
@Wuerfel_21 said:
For some average flexspin compiled code:
It seems there are way too many hub access instructions there. I encountered the same in compiled ARM (RPi) code: LDR, STR, LDR, STR... Proper assembly code gave me a 20x speed gain over the compiler by simply placing the variables in registers where possible. An LDR is 3 ns on a Pi3, where the CPU can do several instructions per ~0.7 ns clock. The same here: ~13 clocks per RD, 2 clocks per normal instruction.
Yea. When a function takes the address of a local variable, compiler goes dumdum and forces all locals onto the stack, emitting loads of RD/WR. This is necessary for Spin, but it's pretty bad for C.
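To illustrate the general phenomenon (plain C, not actual flexspin output; a compiler that does no escape analysis must give the address-taken local a memory slot):

```c
#include <stdio.h>

void read_into(int *dst) { *dst = 42; }

int sum_kept_in_registers(int a, int b)
{
    int s = a + b;   /* s can live entirely in a register */
    return s;
}

int sum_spilled_to_stack(int a, int b)
{
    int s = a + b;
    read_into(&s);   /* &s escapes: s needs a stack slot, so the
                        compiler emits loads/stores (RD/WR on P2) */
    return s;
}

int main(void)
{
    printf("%d %d\n", sum_kept_in_registers(1, 2), sum_spilled_to_stack(1, 2));
    return 0;
}
```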
I am not driven by goals per se. I tend to need to do things on a whim and not be pushed in any direction. Right now, discussion (as a part of learning and preparation) is the focus. Then I can sort through things and go from there when I am ready. I tend to like to start with the pure and move to the applied.
All that said, I just want a new ISA with different ideas and an overall machine that is retro-like. And because a P2 is the intended target, it would be nice to have a set that works with the underlying machine. I'd like to go mostly 16-bit, though longer and shorter could be left as options. I still want to try to do what I said about parallel SRAM, though I know that will be a monster to work out. For the ISA I am after, I want to pretty much enforce alignment and go with word addressing. And that can result in some challenges when working with ASCII, so some byte instructions would be needed.
I would not create a 32-bit ISA. The idea I proposed was having a special mode so that each bit of the word would represent an opcode, and it is either run or not.
At this point, my earlier ideas might be better, such as paired opcodes and things like that. I am curious what the longest but most common usages of instructions are.
It is neat, for instance, that the Gigatron has a few specialized instructions to improve bit-banging efficiency. The X register has an auto-increment mode, and you can modify things on read. So you can do an AND/OR of memory, specify the Out port as the destination, and use auto-increment. The AND and OR are used to toggle the sync bits, leaving the lower 6 bits untouched, sending them to the monitor, and incrementing the X register, all in a single cycle. I don't think the 6502 had auto-increment. That does come in handy with arrays accessed in loops and things like that. It's better than indexed memory modes in such a use case.
Speaking of coding in higher languages, yeah, you certainly want to learn some assembly and learn to read list files. I remember a mystery I had in Quick Basic code I wrote (x86 real mode). When I had compound IF statements, while they looked smaller, they were about 16 bytes longer than nested IF statements. After studying the difference, I avoided using logical terms in IF statements and used only nested ones where possible. What it was doing was evaluating the cases as arithmetic. It would evaluate both parts, assign numbers, then evaluate the numbers and branch based on that. So after that, I nested the statements to where the most likely to fail or the simplest to execute would be tested first. So give precedence to integers before strings, before floats, before doubles, and if things are roughly equal, give precedence to things most likely to fail. After all, a half-truth is still a lie.
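A small C illustration of that ordering principle (in C, && already short-circuits, but the nesting makes the evaluation order explicit, as nested IFs did for that QuickBasic compiler; the checks are made up):

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical checks, nested so the cheapest / most-likely-to-fail
   test runs first and the expensive one runs least often. */
static int should_process(int id, double score, const char *name)
{
    if (id > 0) {                         /* cheap integer test first */
        if (score > 0.5) {                /* float compare second     */
            if (strcmp(name, "ok") == 0)  /* string compare last      */
                return 1;
        }
    }
    return 0;
}

int main(void)
{
    printf("%d\n", should_process(7, 0.9, "ok"));   /* prints 1 */
    return 0;
}
```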
@PurpleGirl said:
So you are assigning an opcode to each bit in advance and every position means a certain operation. So you are choosing to run or not run each predefined opcode. And there could be multiple ways to map the bits to different opcodes. You could do a "butterfly" where you have various opcodes in one order, then in the reverse order, and maybe have 1 that is used only once, and one for pattern breaking. Like 0-6 being in one order, 7 being a single opcode, 8-14 being 0-6 in reverse, and 15 being whether to exit this mode or not. Or do it where half is opcodes and half is the address of a different list, thus staying in that mode and choosing a new opcode set.
Interesting idea.
I think there is a matter of practicality: any program large enough to need external program memory would be compiled from a high-level language. Personally, I am far too lazy to add a new instruction set to a compiler.
A 16-bit PSRAM interface at sysclock/2 can nearly feed 1 cog: 16 PSRAM clock cycles for a random rdlong, so 32 propeller clock cycles. Maybe 40 cycles for a PSRAM jump, compared to 13-20 cycles for a hubexec jump.
Much of the work in designing modern processors goes into the cache. I've put some thought into a low-latency PSRAM controller. It would take about 100 clocks to do a random read with the PSRAM operating at sysclock/4. I'm trying not to use the streamer, because that only writes to hub RAM. The emulator would communicate with the PSRAM/cache controller via smart pin long repositories or shared LUT, which would save at least 7 cycles of waiting for the hub. The PSRAM controller would run in parallel with the emulator, trying to have the next instruction available for immediate use. Memory writes should not stall the emulator: just hand off the write request if the write path is not busy already.
In my Linux experiment, I had great success with a 16 long instruction cache. Most instructions were read from the cache.
I've had my head in riscv emulators lately, mostly from @ersmith. A lot of cycles are used because the riscv instruction set has 32 registers, which means 5 bits to specify a register. If we simplify it to 4 bits, then we can use getnib instead. Some immediate values have a funny bit ordering as well. 16 registers are enough, right? Basically we would corrupt the riscv compressed instruction set to make the P2 emulator/JIT compiler run as quickly as possible.
EDIT: Looks like the ARM instruction set meets most of these considerations. But not Thumb2. Darn 3 bit register addressing.
I was wondering if the 18 bits of register addresses in P2 code could be compacted, or better, expanded somehow in a fast way. Have 2 bits which say this is a 16-bit code of type A, B, or C (or a 32-bit original P2 opcode). Depending on those, fill up the 32-bit P2 opcode with a fixed accumulator and another address.
I don't know if the P2 instructions are mapped to codes in a way that you could easily use a subset of the most used instructions to get down from 14 bits to maybe 8 bits.
One other, independent, idea is whether two or more cogs could be combined somehow to build a more powerful core. They could read the same code in parallel. Perhaps one cog could do stack handling, for example.
I've been pondering that sort of thing. I've wondered, for instance, whether one cog could be dedicated to interrupt handling for a retro-like CPU. And I guess one could add a blocking mode if that is needed to emulate older behavior, or in case there could be software races. There would be no "hardware" races for sure.
And on the x86 emulator project, it appears only one cog is used for the 8086. I don't see why it couldn't be split across 2 cogs for better throughput on the slower instructions. I mean, what would happen if it was split out closer to how Intel split it? That way, have the BIU in one cog and the EU in the other, preferably in neighboring cogs so that the LUTs can be shared.
I've also thought about width. For instance, if you read the same opcodes as you said, one could implement a 64-bit core. However, an inherent issue would be that it won't be 100% symmetric due to how the middle carry would be implemented. So I don't know if that would be a feasible use for 2 cogs since while both cogs could do math on part of the result, the lower one would need to send a carry flag, and the upper one would need to add it. So that is a bit of a bottleneck.
Wide multiplication could be more of a bottleneck even when dealing with 2 cogs since you would need to do 4x the multiplication when you double the bit sizes, and then a minimum of 2 additions (in the gross sense as one may need to be broken down further). When emulating wider, the strategy I think of is much like the algebraic FOIL (First, Outer, Inner, Last) Method. With the highest and lowest order multiplications of the 4, you can just work and place those into the result register into opposite halves (when using real hardware, this is quite an optimization since you can do that simultaneously, and you don't have to add those since they don't overlap anyway). Then you work the 2 "middle" multiplications. They are of the same order of magnitude, so you just add their results together into a temporary register. Then you add that to the upper 3/4 of the result, treating it as an accumulator. So the largest addition for 64/64/128 multiplication on 32-bit hardware is only 96 bits. The lowest half of the lowest-order partial multiplication remains from the beginning to the end unchanged.
The above is how you do it if your hardware has some multiplication, just not wide enough for your application. So you do 5-6 times the work that you'd have to do if you had dedicated hardware for the size you need. And I included a couple of strategy optimizations in my description. One was working the end numbers as I described as 2 pieces of hardware can do that simultaneously, since you don't have to add those to each other, merely place them. The 2 middle partials would be added 1/4 the way up with the lowest partial being untouched, thus needing a smaller adder (or adder chain).
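Here is that placement written out in C, as a sketch of 64x64->128 built from four 32x32->64 partial products:

```c
#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

/* 64x64 -> 128 multiply from four 32x32 -> 64 partial products,
   following the FOIL placement described above. */
void mul64to128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t last  = a_lo * b_lo;   /* -> bits   0..63  of the result  */
    uint64_t first = a_hi * b_hi;   /* -> bits  64..127 of the result  */
    uint64_t inner = a_hi * b_lo;   /* the two middle partials, both   */
    uint64_t outer = a_lo * b_hi;   /* weighted by 2^32                */

    /* Add the middle partials 1/4 of the way up; the low 32 bits of
       'last' are never touched again. */
    uint64_t mid = (last >> 32) + (uint32_t)inner + (uint32_t)outer;
    *lo = (mid << 32) | (uint32_t)last;
    *hi = first + (inner >> 32) + (outer >> 32) + (mid >> 32);
}

int main(void)
{
    uint64_t hi, lo;
    mul64to128(UINT64_MAX, UINT64_MAX, &hi, &lo);
    /* (2^64-1)^2 = 0xFFFFFFFFFFFFFFFE_0000000000000001 */
    printf("0x%016" PRIx64 "%016" PRIx64 "\n", hi, lo);
    return 0;
}
```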
Otherwise, one would have to test all the 1's in the 2nd number and shift and add the top number that many times. So for an 8-8-16 multiplication, you'd have up to 7 different shift operations, and 7 different additions. And there are likely other strategies, and the idea is to find a strategy that could be split amongst multiple cogs (or even multiple P2s, though I'd avoid that if possible due to potential timing issues you wouldn't have in a single P2).
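And the shift-and-add fallback for hardware with no multiplier at all, as a minimal C sketch:

```c
#include <stdint.h>

/* Plain shift-and-add 8x8 -> 16 multiply: for each set bit of b,
   add a correspondingly shifted copy of a. */
uint16_t mul8x8(uint8_t a, uint8_t b)
{
    uint16_t acc = 0, sh = a;
    while (b) {
        if (b & 1)
            acc += sh;   /* one addition per set bit of b  */
        sh <<= 1;        /* one shift per bit position     */
        b >>= 1;
    }
    return acc;
}
```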
@PurpleGirl said:
And on the x86 emulator project, it appears only one cog is used for the 8086. I don't see why it couldn't be split across 2 cogs for better throughput on the slower instructions. I mean, what would happen if it was split out closer to how Intel split it? That way, have the BIU in one cog and the EU in the other, preferably in neighboring cogs so that the LUTs can be shared.
Two cogs aren't needed to do an 8086 and would be slower than one cog using XBYTE as the FIFO emulates the BIU instruction fetching and queue very well.
But any other ways to get it to perform better? I just threw that out there and didn't know if it could help or not if one could use 2 cogs and pipeline things a bit and help it that way. So have one working with the hub and the other doing the EU. So it won't. Okay.
I think BIU emulation wouldn't benefit from being in a separate cog. The slowest instructions are ADD/ADC/SUB/SBB with register source and destination. ADD/SUB takes only two cycles and ADC/SBB two more but updating all the flags needs an extra 15 instructions or 30 cycles! A second ALU cog could speed things up in theory at least. 8086 + 8087 emulation would be faster with two cogs than one, although a single cog should cope if all the 8087 code is in hub RAM (where it would have to be).
What about BIU and decoding being in a separate cog and maybe using the LUT to pipeline it? It seems the goal in this case is to keep the hub memory utilized as much as possible.
And what about a flexible PFQ? I mean, sure, keep the 4-6 bytes as the default limit and let branches empty it as usual, etc., but for longer instructions, let it fill even longer. So 4-6 bytes are the limit at which it can stall things, but let it get larger during long instructions.
And how can the flags be updated in fewer cycles? And it seems like that should be simple, just apply an appropriate AND/OR mask that updates them all at once in a single instruction.
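For illustration, a C sketch of that masked update using a precomputed flag table, a common emulator trick (AF/OF omitted for brevity; standard x86 flag bit positions):

```c
#include <stdint.h>
#include <stdio.h>

enum { CF = 1 << 0, PF = 1 << 2, ZF = 1 << 6, SF = 1 << 7 };

static uint8_t szp_table[256];   /* precomputed SF/ZF/PF per result byte */

void init_flag_table(void)
{
    for (int v = 0; v < 256; v++) {
        uint8_t f = 0;
        int p = v ^ (v >> 4);    /* fold bits to compute parity */
        p ^= p >> 2;
        p ^= p >> 1;
        if (v == 0)    f |= ZF;
        if (v & 0x80)  f |= SF;
        if (!(p & 1))  f |= PF;  /* PF is set on even parity */
        szp_table[v] = f;
    }
}

/* 8-bit ADD: one table lookup plus one AND/OR merge updates
   SF/ZF/PF/CF together. */
uint16_t add8_flags(uint16_t flags, uint8_t a, uint8_t b, uint8_t *result)
{
    unsigned r = (unsigned)a + b;
    uint16_t newf = szp_table[r & 0xFF] | ((r & 0x100) ? CF : 0);
    const uint16_t mask = SF | ZF | PF | CF;   /* flags this op defines */
    *result = (uint8_t)r;
    return (flags & ~mask) | newf;
}

int main(void)
{
    uint8_t r;
    init_flag_table();
    uint16_t fl = add8_flags(0, 0xFF, 1, &r);  /* 0xFF + 1 = 0x00, carry out */
    printf("r=%02X flags=%04X\n", r, fl);      /* ZF, PF and CF set: 0x0045  */
    return 0;
}
```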
As for 8087, I saw someone mention the hub's CORDIC instructions. I don't think that is efficient for most 8087 instructions. I think most can be done using cog instructions in fewer than 58 cycles. Although, TBH, not many used an 8087 and there isn't much software for it.
I sorta debated the ReactOS team over their Fast486 library. That way, since it is emulated, if they ever get their x64 build of ReactOS working, you'd be able to run 16-bit code. That is one disadvantage of an x64 OS: you can't run older PC games, since x64 can run only 32-bit and 64-bit code, unlike the 32-bit Intel CPUs that could also run 16-bit legacy code.
Anyway, I suggested that they put the x87 emulator in another thread to take advantage of multi-core and allow some concurrency. In that case, they would need to fully emulate the FWAIT instruction for software race prevention. I mean, if you code for x86 and x87 in assembly, and you know how long the slowest FPU for the given CPU took, you could structure your code to allow concurrency without using FWAIT or having software races. FWAIT acts as a spinlock so that the CPU cannot continue until the FPU is finished. And I mentioned assembly because in higher languages, the libraries would likely always use FWAIT, since they are designed for a general audience with different configurations, and their goal would be reliability more than performance. But if you are coding at a low level, you should know what you can get by with and literally let the FPU be a coprocessor. So an emulator that allows concurrency would give more performance in cases where the code is structured so that the CPU is busy with unrelated work while waiting for the result.
And interestingly, in most x87 libraries included in software, FWAIT is only decoded with an immediate return or far jump. That makes sense: if you are in a single-threaded environment and have no x87, there is no way races can occur, since things are forced to be sequential. Since the same core is doing the FPU work, nothing can get out of sequence, even with no FWAITs used. FWAIT is only useful if you have an FPU and need to use the result immediately.
Hi, have you already stumbled over: https://forums.parallax.com/discussion/170040/interpreter-toolkit-for-p2
@ersmith has explored optimal ways of interpreting on the P2. The approach is:
1. compile into a string of P2 instructions
2. store them in cog RAM
3. optimize them
4. execute them
The benefit comes from the probability that a loop is fully cached in cog RAM. The string of instructions must be long enough. The compiling/interpreting is done only once, there is no hub access, and the pipeline is not emptied by too many jumps.
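As a loose C analogy of steps 1-4 before the measured results below (a real JIT emits actual P2 instructions into cog RAM; function pointers merely stand in for the "translate once, then execute without per-step decoding" idea):

```c
#include <stdio.h>

typedef void (*op_fn)(int *acc);

static void op_inc(int *acc) { (*acc)++; }
static void op_dbl(int *acc) { *acc *= 2; }
static void op_prn(int *acc) { printf("%d\n", *acc); }

int main(void)
{
    /* steps 1+2: "compile" the source once into a buffer */
    op_fn compiled[] = { op_inc, op_dbl, op_inc, op_prn };
    int acc = 0;

    /* step 4: run the cached translation repeatedly, no decode per step */
    for (int pass = 0; pass < 3; pass++)
        for (size_t i = 0; i < sizeof compiled / sizeof compiled[0]; i++)
            compiled[i](&acc);
    return 0;
}
```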
plain interpreter: 336_000_321 cycles
xbyte interpreter: 160_000_129 cycles
JIT w. HUB cache: 168_001_865 cycles
JIT w. LUT cache: 120_002_177 cycles
optimized JIT w. HUB: 64_002_363 cycles
optimized JIT w. LUT: 48_002_425 cycles
The gain between the plain interpreter and the JIT HUB cache seems to correspond to the findings I made recently when I looked at a Forth compiler that produces P2 instructions instead of Forth wordcodes. I am still astonished that it is so much better, even though many more bytes of code are handled this way.
As far as I understand, an interpreter for a language compiler's virtual machine can be much faster than an emulator of a real processor, because it does not need to calculate all those flags, which are very likely never used.
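Thanks for finding this thread!
I think the latest JIT version is here https://github.com/totalspectrum/riscvp2
Older JIT and Non-JIT riscv versions here https://github.com/totalspectrum/riscvemu
With enough pain we could have an 8086 JIT.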
I wonder if a JIT compiler emulating a CPU could somehow find out if a flag is needed. Perhaps by looking forward a short number of instructions to see if the flag is overwritten before being used.
This is from the MC6809 Coco3 emulator, ZTEST(), NTEST() are macros.
Yes, the additional load of calculating the flags is significant. (Although, in this case MemRead8 is non-trivial, since it has to do the work of the MMU and check whether RAM, ROM or IO is meant.)
The above makes sense. If one is emulating, then analyzing and converting (recompiling) the code in advance of executing it would speed things up. And you wouldn't need to test for all flags for underlying instructions, just those that are used. So no need to update 6 flags if you only need 1. And if the wrong opcodes were used initially (like ADC for ADD) then such analysis would catch that and not manipulate the flags at all.
And I think multiple interpreters could be used based on CPU modes. I mean, for instance, interrupt emulation and handling. So if you turn off the interrupts, you could return from that interpreted instruction by jumping to the interpreter handler loop that doesn't poll for them at all. And when they are turned back on, the handler from that would return to the interpreter handler loop that does poll for them.
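A sketch of that two-loop idea in C (all names hypothetical): the interrupts-off loop pays no per-instruction polling cost, and EI/DI just swap which loop runs next:

```c
#include <stdio.h>

typedef struct Cpu Cpu;
typedef void (*loop_fn)(Cpu *);

struct Cpu {
    loop_fn loop;     /* currently active dispatch loop */
    int     running;
    int     fuel;     /* stand-in for a real program    */
};

static int  irq_pending(Cpu *c) { (void)c; return 0; }
static void take_irq(Cpu *c)    { (void)c; }

static void loop_irq_on(Cpu *c);
static void loop_irq_off(Cpu *c);

/* One fetch/decode/execute step; a DI opcode would set c->loop to
   loop_irq_off, an EI opcode would set it back to loop_irq_on. */
static void step(Cpu *c)
{
    if (--c->fuel <= 0)
        c->running = 0;
}

static void loop_irq_on(Cpu *c)
{
    while (c->running && c->loop == loop_irq_on) {
        if (irq_pending(c))
            take_irq(c);
        step(c);
    }
}

static void loop_irq_off(Cpu *c)
{
    while (c->running && c->loop == loop_irq_off)
        step(c);     /* no interrupt poll at all in this mode */
}

int main(void)
{
    Cpu c = { loop_irq_on, 1, 1000 };
    while (c.running)
        c.loop(&c);  /* re-enter whichever mode is active */
    puts("halted");
    return 0;
}
```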
I wonder how feasible that self-modifying code is on the P2. That can help save space and in some cases, time.
Because of the longer pipeline, we have to change the instruction at least 3 instructions before it is executed, instead of 1 on the P1.
The solution can be to use ALT instructions, which change the opcode and arguments already inside the pipeline.
Somewhat OT, but still within the target of a new architecture and emulating it. Let's suppose one wants to run 16-bit SRAM and have 20 address lines (i.e., 2 MB of SRAM). Now, what is the correct strategy to access that? In the example I gave, that takes 41 pins (20 address, 16 data, and 5 control lines). In what order do the lines need to be accessed? And do any of the lines need to be strobed?
Whether signals are edge, level, or strobe sensitive generally comes from the datasheet of the selected memory. Level sensitivity for Chip Enable, Address, and Data Lines are common.
First thing to consider is whether this is going to be a pure memory interface to only one chip.
If so, and power reduction isn’t a requirement, you might be able to tie the Chip Enable pins and save a control pin.
Secondly, some SRAMs allow Write Enable to override Output Enable, while some may not. If you select one that does then you can tie Output Enable and that can save another control pin.
Thirdly, the individual byte control lines only require separate pins if you plan to allow byte level control of the interface. If you are always reading and writing 16 bits they can be tied to the enabled state, saving another 2 control pins.
So, a dedicated memory interface to a single permanently enabled 16-bit SRAM that allows WE to override OE, using 16-bit access only, could consume as few as 20 + 16 + 1 pins (WE), i.e. 37 pins.
As far as order of pin control, again the datasheet is the best source of information, but generally it is:
Chip Enable
Address
Control and Data if writing
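A hypothetical bit-banged sketch in C of that sequence, assuming the single permanently enabled 16-bit SRAM above (WE overriding OE); the helper names and the timing values are placeholders, and the real numbers must come from the datasheet:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical GPIO hooks, stubbed so this compiles; on real hardware
   they would drive the actual pins. */
static void gpio_set_address(uint32_t a) { (void)a; }
static void gpio_set_data(uint16_t v)    { (void)v; }
static void gpio_data_input(void)        { }              /* release data bus */
static uint16_t gpio_read_data(void)     { return 0; }
static void gpio_we(int level)           { (void)level; } /* WE, active low   */
static void delay_ns(int ns)             { (void)ns; }

/* Write cycle, CE/OE and byte lanes tied off as described above. */
static void sram_write(uint32_t addr, uint16_t value)
{
    gpio_set_address(addr);   /* address first, let it settle       */
    gpio_set_data(value);
    gpio_we(0);               /* strobe WE low for the datasheet's  */
    delay_ns(50);             /* write pulse width (tWP)            */
    gpio_we(1);               /* data is latched on the rising edge */
    gpio_data_input();
}

static uint16_t sram_read(uint32_t addr)
{
    gpio_data_input();        /* make sure we aren't driving the bus */
    gpio_set_address(addr);
    delay_ns(55);             /* address access time (tAA)           */
    return gpio_read_data();
}

int main(void)
{
    sram_write(0x12345, 0xBEEF);
    printf("0x%04X\n", sram_read(0x12345));  /* stub returns 0 */
    return 0;
}
```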
@"Christof Eb." said:
I wonder if a JIT compiler emulating a CPU could somehow find out if a flag is needed. Perhaps by looking forward a short number of instructions to see if the flag is overwritten before being used.
This is from the MC6809 Coco3 emulator, ZTEST(), NTEST() are macros.
Yes, the additional load of calculating the flags is significant. (Although, in this case MemRead8 is non-trivial, since it has to do the work of the MMU and check whether RAM, ROM or IO is meant.)
I think it just needs to keep track of the last register written. Then if there is an instruction that uses the flag, insert the instructions to evaluate it. Here https://github.com/totalspectrum/riscvp2/blob/master/riscvtrace_p2.spin there is a feature OPTIMIZE_CMP_ZERO that does some of this. Riscv isn't that flag intensive.
A cycle counter might be interesting. But I'd really hate to lose up to 66% performance by incrementing a 64-bit counter every instruction. Maybe just keep track of the instructions at compile time and update the counters before branching or reading the counter.
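That register-tracking idea is essentially lazy flag evaluation; a minimal C sketch (only Z and N shown; a JIT can go further and statically omit the flag code whenever the translated stream overwrites the flag first):

```c
#include <stdint.h>
#include <stdio.h>

/* Lazy flags: every ALU op records its result; a flag is only
   computed when an instruction actually reads it. */
typedef struct {
    uint32_t last_result;
} LazyFlags;

static void alu_add(LazyFlags *f, uint32_t *dst, uint32_t src)
{
    *dst += src;
    f->last_result = *dst;   /* one store instead of N flag updates */
}

static int flag_z(const LazyFlags *f) { return f->last_result == 0; }
static int flag_n(const LazyFlags *f) { return (int32_t)f->last_result < 0; }

int main(void)
{
    LazyFlags f;
    uint32_t r = 0xFFFFFFFFu;
    alu_add(&f, &r, 1);                           /* r wraps to 0 */
    printf("Z=%d N=%d\n", flag_z(&f), flag_n(&f)); /* Z=1 N=0 */
    return 0;
}
```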
The above stuff might be good to do with multiple P2s. Maybe have one just for dynamic recompilation of multiple CPUs, and one to actually run the code, provide the peripherals, etc.
Managing storage could be tricky. With a multi-P2 arrangement like that, you may need some redundancy. For instance, give the chip with the recompiling cores a FAT loader (or whatever FS you will use) so the main P2 can tell it to load whatever; then it can send the recompiled code to the main chip. But for random storage reads/writes, have a cog on the main one handle all of that. I was trying to work out how to handle video in that case, but really, that's no big deal, since the translator chip does not need to be Turing complete and doesn't need to transmit anything other than the recompiled data or error codes. So except for command/control lines, data will flow mostly in one direction. If you have a menu on the main one, it can read the directory of the media, the user makes a selection, and the SPI bus or whatever is handed over while the main one displays a throbber (Loading: | / -- \ ). The translator chip loads and recompiles what it is told to, then transmits it to the main one, which executes the code and then masters the SPI bus for data reads/writes, user data, etc. The main one handles all of the video tasks.
And I've wondered about more massive multi-P2 projects. Like what if you have 4 wired into a "baseball diamond" except maybe an extra bus between "Home" and "Second Base?" So "Home" could be the main CPU group. "First" could be a complete GPU solution, "Second" could be a general-purpose coprocessor, and "Third" can be a massive audio solution with PSGs, FM synthesis, PCM, percussion engine, MIDI, etc. Third can operate as a slave of either Home or Second, or be programmed to operate autonomously. Both Home and Second can have their own SPI bus, making it possible to have a program card and a media card, and for both to operate simultaneously. So have "elevator music" or soundtracks during loads, have fonts and textures, new waveforms, or whatever. So Home is in control and Second mostly helps First and Third, though Home is always the boss. I wonder what could be done with 32 cogs? Hmm.
And overlay management features could be added. Like what do you do with games that can only display so much of the user map at once (like Postal 2, for instance)? On that, it would be nice to have a cog on Home and/or Second to load the new map in advance when the player gets close to the transition point and hold it on standby. So once you pass the threshold, you instantly have the new map loaded, and not the dreaded disk drive symbol or "Loading" coming up. That would even help with side-scroller games. And of course, "overlays" also refer to programs that are too large to fit in memory and you have an overlay manager running in the background acting as a mini-OS to decide what part of the program needs to be loaded at any point and to instantly swap in new data as necessary.
Of course, if one wants to do "retro" in a simple way, one could have a BASIC interpreter (or some other language like QBASIC, which can compile on use), and a core to read a "cartridge" (or SD or CF card) and run binary programs designed with a cross-compiler. That cog can either be a legacy core, a homebrew core, or even raw P2 code. So if all you want to do is play games and code in BASIC, that could be a rather simple retro-like build. Even if you use an 8-bit core like a 6502 for running binary code, you'd still have a BASIC core written in P2 code that provides better BASIC performance than the old stuff. So BASIC with an effective 160 MHz in places, and a bottleneck in the 14-20 MHz range of real-world speed at best in other places.