Extending P1V cogs' memory and opcode space
Ale
Posts: 2,363
I started to think again about extending the cog's memory in two ways: in size, so we have say 4 or 8 kbytes (1024 or 2048 opcodes), and in being able to address that extra space directly.
One way would be bank switching; I don't want to go this route.
Another would be to address that extra space directly, which means we need either wider opcodes or another encoding with the same width. Both have the drawback of losing compatibility with existing tools.
Let's talk about making wider opcodes, and why I thought about that...
Many FPGAs have 9 or 10 bit "bytes" in their block RAMs (Spartan2 and such don't), which means that we are just losing some space here.
Using 9 bit/byte block RAMs we could get 4 extra bits per long, making it 36 bits wide and providing 2 extra bits each for the S & D fields. (How to load that from HUB is another matter to be considered carefully.)
Using CV 10 bit/byte block RAMs we could get 8 extra bits per long (40 bits wide), i.e. 4 extra bits per field.
That way one could get a nice longer address space...
The idea was to get a COG that could fit, for example, a Z80 emulator in COG RAM.
The opcodes would be coded like this:
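36-bit opcode (9 bit/byte block RAM):

 35       30  29  28  27  26  25     22  21         11  10          0
+-----------+---+---+---+---+----------+--------------+--------------+
|  Opcode   | Z | C | R | I |   CCCC   |   D field    |   S field    |
+-----------+---+---+---+---+----------+--------------+--------------+
   6 bits     1   1   1   1    4 bits      11 bits        11 bits

40-bit opcode (10 bit/byte block RAM):

 39       34  33  32  31  30  29     26  25         13  12          0
+-----------+---+---+---+---+----------+--------------+--------------+
|  Opcode   | Z | C | R | I |   CCCC   |   D field    |   S field    |
+-----------+---+---+---+---+----------+--------------+--------------+
   6 bits     1   1   1   1    4 bits      13 bits        13 bits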
With 11 bits you get 2048 ops (2 kops) and with 13 bits 8192 ops (8 kops).
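To make the two layouts concrete, here is a minimal Python sketch that packs and unpacks the fields at the bit positions listed above. The names `FIELDS_36`, `FIELDS_40`, `encode` and `decode` are made up for illustration; nothing here comes from the P1V sources or any existing tool.

```python
# Hypothetical field tables: name -> (lowest bit position, width in bits).
# Bit positions follow the 36-bit and 40-bit layouts given in the post.
FIELDS_36 = {
    "opcode": (30, 6), "z": (29, 1), "c": (28, 1), "r": (27, 1),
    "i": (26, 1), "cond": (22, 4), "d": (11, 11), "s": (0, 11),
}
FIELDS_40 = {
    "opcode": (34, 6), "z": (33, 1), "c": (32, 1), "r": (31, 1),
    "i": (30, 1), "cond": (26, 4), "d": (13, 13), "s": (0, 13),
}


def encode(fields, layout):
    """Build one wide opcode word from a dict of field values (missing fields are 0)."""
    word = 0
    for name, (shift, width) in layout.items():
        value = fields.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
    return word


def decode(word, layout):
    """Split a wide opcode word back into its fields."""
    return {name: (word >> shift) & ((1 << width) - 1)
            for name, (shift, width) in layout.items()}


# Addressable longs for each S/D field width.
REACH_36 = 1 << FIELDS_36["s"][1]   # 11-bit fields
REACH_40 = 1 << FIELDS_40["s"][1]   # 13-bit fields
```

The same `encode`/`decode` pair serves both widths because only the table changes, which is roughly what a revised assembler would have to parameterise.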
How to load the thing is.... a bit more difficult now because you need 4 1/2 or 5 byte "longs"... the easiest would be to pack them and pad at the end as needed, I think.
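The "pack them and pad at the end" idea could look roughly like the sketch below. It assumes the wide longs are stored back to back, least significant bits first, with zero bits padding the final byte; that bit order is my assumption, not something fixed by the post, and `pack_words`/`unpack_words` are hypothetical names.

```python
def pack_words(words, width=36):
    """Pack `width`-bit words back to back (LSB first), zero-padding the last byte."""
    acc = 0        # bit accumulator
    nbits = 0      # number of valid bits currently held in acc
    out = bytearray()
    for w in words:
        if not 0 <= w < (1 << width):
            raise ValueError(f"word {w:#x} does not fit in {width} bits")
        acc |= w << nbits
        nbits += width
        while nbits >= 8:          # flush every complete byte
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:                      # pad the trailing partial byte with zeros
        out.append(acc & 0xFF)
    return bytes(out)


def unpack_words(data, count, width=36):
    """Recover `count` words of `width` bits from a packed byte string."""
    acc = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(acc >> (i * width)) & mask for i in range(count)]
```

With 36-bit words, two words fill exactly 9 bytes, so only an odd word count needs padding; with 40-bit words every word is exactly 5 bytes and no padding is needed.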
The impact on execution speed may not be that big, I think. A couple of fields are wider but the idea is that the ALU and as much as possible remains fixed at 32 bits.
Comments
How to load the thing is.... a bit more difficult now because you need 4 1/2 or 5 byte "longs"... the easiest would be to pack them and pad at the end as needed, I think.
The HUB could also be wider, making the word moves wide too.
Mostly the upper bits are ignored, except for your extended reach opcodes.
That shifts the problem to the loader, which has to be able to fill the HUB memory.
Another simpler/interim approach would be to add local COG memory as indirect only.
That would mean no opcode changes, and a small change to memory decode to map some spare addresses to local COG memory.
The registers would still be only 512 max.
I had the above working using AUGD/AUGS/AUGDS instructions instead until I broke something. There is a thread for this somewhere.
010, which only leaves a single 32 bit register to contain PAR, a HUB address, a new flag & a Cog ID.
My suggestion is to use the I flag in the opcode to select between COGINIT and the other HUBOPS and the R flag to select between COGNEW (Result = COGID) and RELOAD (which is the same as COGINIT( COGID ) ). Yes, this means you can't COGINIT a specific COG, but I'm not certain how common that is as a requirement.
The upshot is now the HUB address (SRC) and PAR value (DEST) are a full 32 bits. (The HUB address would be masked to a long address, same as RD/WRLONG.)
I'd done a bunch of earlier P1V work in the hopes to combine it with Cluso's AUGDS stuff and get the >2kB COG barrier overcome. IIRC I believe I had a JMPX instruction working that could do jumps beyond 2kB. The tricky bit is getting the tools to interwork with this.
You might find some related info here:
http://forums.parallax.com/discussion/157423/new-p1v-capabilities-added-a-cogram-stack/p1
http://forums.parallax.com/discussion/157520/new-enhanced-ldptr-stptr-operations-for-direct-cogram-stack-access/p1
It uses two sequential Longs for each instruction. First long has the 6bit instruction, CZ flags, 3 Literal Flags for S1 and S2, 5 condition bits and a 16 bit destination field. Second long has two 16 bit Source1 and Source2 fields. This gives each core a 64K memory range. With the 3 literal flags you can use S1 and/or S2 as 16 bit literal values or combine them into one 32 bit literal. With the extra condition bit I added some additional comparisons such as S1 > S2 etc.
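A rough Python sketch of that two-long format follows. The post gives the field widths (6 + 2 + 3 + 5 + 16 = 32 bits in the first long, 16 + 16 in the second) but not the bit order, so the positions below, and the names `encode_pair`/`decode_pair`, are assumptions for illustration only.

```python
def _check(name, value, width):
    """Reject values that do not fit in the given field width."""
    if not 0 <= value < (1 << width):
        raise ValueError(f"{name}={value} does not fit in {width} bits")


def encode_pair(instr, c, z, lit, cond, dest, s1, s2):
    """Return (first_long, second_long) for one instruction.

    Assumed layout, most significant field first:
      first long : instr[6] | C | Z | literal flags[3] | condition[5] | dest[16]
      second long: s1[16] | s2[16]
    """
    for name, value, width in (("instr", instr, 6), ("c", c, 1), ("z", z, 1),
                               ("lit", lit, 3), ("cond", cond, 5),
                               ("dest", dest, 16), ("s1", s1, 16), ("s2", s2, 16)):
        _check(name, value, width)
    first = (instr << 26) | (c << 25) | (z << 24) | (lit << 21) | (cond << 16) | dest
    second = (s1 << 16) | s2
    return first, second


def decode_pair(first, second):
    """Inverse of encode_pair under the same assumed layout."""
    return {
        "instr": (first >> 26) & 0x3F,
        "c": (first >> 25) & 1,
        "z": (first >> 24) & 1,
        "lit": (first >> 21) & 0x7,
        "cond": (first >> 16) & 0x1F,
        "dest": first & 0xFFFF,
        "s1": (second >> 16) & 0xFFFF,
        "s2": second & 0xFFFF,
    }
```

The 16-bit destination and source fields are what give each core its 64K range, and treating `s1` and `s2` together as the whole second long is how a single 32-bit literal would fit.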
Using the Altera Dual Port memory, it executes instructions in 4 clocks.
1) Read Instruction Longs (2 reads)
2) Read Source Long Values (2 reads)
3) Evaluate
4) Write Result
I have seen reference to a Dual Read with Write Thru Altera memory layout which could drop that to 3 clocks per instruction but have not tried it.
When I get my BeMicroCVA9 I will be able to try 16 Cores with 64K ram each + a shared 64K hub ram.
Also added in a set of COMM commands where each Core has 32 longs that can be read by the other cores for Core-Core communications. Similar to how on the Prop they can all see INA and OUTA. This allows for quick data sharing without the hub delays.
The existing logic would catch the opcode=0 and/or the condition=if_never and won't do anything, so it will give you a lot of extra opcodes.
===Jac
Wow, quite a morph there.
Is there any reason for the 64K HUB size?
See here also
http://forums.parallax.com/discussion/157748/expanding-the-p1v-s-instruction-set-a-poor-mans-p2
Cluso's idea of using a relative jump is probably the least disruptive; keeping all registers (including the HW registers) in the lower 512 longs is maybe not a great loss.
First, opcode space. There are lots and lots and lots of encodings that make no sense in the present encoding space. For most things, harvesting the senseless encodings is pretty simple and won't have any effect on compatibility, because senseless encodings aren't used by sensible programs. This has an impact, of course, on the tools, which should now flag senseless encodings as illegal and generate new instructions as defined.
Second, there's the issue of increasing the addressability of the processor cores. This can't really be done easily or compatibly. However, the present encoding of the 32-bit instructions is so wasteful that it's not much trouble to figure out how to free two bits, which allows the S and D fields to then be extended by one bit each. There are a number of ways to do this that are sufficiently "culturally compatible" that revised tools can rebuild almost all source successfully and flag almost all errors in other source code. Self-modifying code is problematic, but there's very, very little of that. This is a major change to the tools.
Finally, there's the issue of hub RAM addressability. The best solution here is to reduce the addressable ROM to 2KB, for the bootstrap and interpreter only, and increase the addressable RAM from 32K to 62K. This is completely compatible except for the tiny bit of code that makes use of the sine, log and character generator tables in the ROM; those applications that make use of those tables can simply include the tables as source and reference them from RAM. This is a trivial change to the tools.
Without that feature a lot of looped code would have to be written as sequential code, which would result in less efficient use of register space.
Now on self-modifying code: we are talking about the Propeller here, and yes, it needs self-modifying code. Every call/ret uses it; movi, movs and movd were designed exactly for that... Programs reside in RAM, so they can be altered at will. There are no pointer registers, and that forces you to modify your running program; registers are in RAM.
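For anyone who hasn't looked at what movs/movd/movi actually touch, here is a small Python model of the standard 32-bit P1 encoding (S in bits 8..0, D in bits 17..9, the instruction field in bits 31..23). The function names mirror the PASM mnemonics, but this is only a sketch of the bit manipulation, not a cog emulator.

```python
MASK32 = 0xFFFFFFFF
FIELD = 0x1FF          # each of the three patchable fields is 9 bits wide


def movs(instr, value):
    """Replace the S field (bits 8..0) of an instruction long."""
    return (instr & ~FIELD & MASK32) | (value & FIELD)


def movd(instr, value):
    """Replace the D field (bits 17..9) of an instruction long."""
    return (instr & ~(FIELD << 9) & MASK32) | ((value & FIELD) << 9)


def movi(instr, value):
    """Replace the instruction field (bits 31..23) of an instruction long."""
    return (instr & ~(FIELD << 23) & MASK32) | ((value & FIELD) << 23)


def walk_table(instr, start, count):
    """Model a loop that re-points the S field at successive table entries."""
    patched = []
    for offset in range(count):
        instr = movs(instr, start + offset)
        patched.append(instr)
    return patched
```

This is also why widening S and D breaks existing code: every program that patches its own instructions assumes these exact field positions and the 9-bit width.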