Extending P1V cogs's memory and opcode space

Ale · 2015-07-31 07:52

I started to think again about extending the cog's memory in two ways: in size, so we have say 4 or 8 kbytes (1024 or 2048 opcodes), and in being able to address that extra space directly. One way would be bank switching; I don't want to go this route. Another would be to address that extra space directly, which means we need either wider opcodes or another encoding with the same width. Both have the drawback of losing compatibility with existing tools. Let's talk about making wider opcodes, and why I thought about that... Many FPGAs have 9 or 10 bit "bytes" in their block RAMs (Spartan2 and such don't), which means that we are just losing some space here.
Using 9 bit/byte block RAMs we could get 4 extra bits per long, making it 36 bits wide and providing 2 extra bits each for the S & D fields. (How to load that from HUB is another matter to be considered carefully.) Using CV 10 bit/byte we could get 8 extra bits per long, that is 4 extra bits per field.
That way one could get a nice longer address space...
The idea was to get a COG that could fit, for example, a Z80 emulator in COG RAM.
The opcodes would be coded like this:

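36-bit opcode (9 bit/byte block RAM):

 35    30 29  28  27  26  25  22 21       11 10        0
+--------+---+---+---+---+------+-----------+-----------+
| Opcode | Z | C | R | I | CCCC |  D field  |  S field  |
+--------+---+---+---+---+------+-----------+-----------+
 6 bits    1   1   1   1  4 bits   11 bits     11 bits

40-bit opcode (10 bit/byte block RAM):

 39    34 33  32  31  30  29  26 25       13 12        0
+--------+---+---+---+---+------+-----------+-----------+
| Opcode | Z | C | R | I | CCCC |  D field  |  S field  |
+--------+---+---+---+---+------+-----------+-----------+
 6 bits    1   1   1   1  4 bits   13 bits     13 bits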

With 11 bits you get 2048 ops (2 kops) and with 13 bits 8192 ops (8 kops).
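To make the two layouts concrete, here is a minimal Python sketch that packs and unpacks an instruction for any S/D field width. The `Layout` class and its method names are made up for illustration; only the bit positions come from the diagrams above (field width 9 reproduces the stock 32-bit P1 encoding).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical helper: instruction layout with a configurable S/D width."""
    field_bits: int  # width of each of the D and S fields

    @property
    def width(self) -> int:
        # 6 opcode bits + Z, C, R, I flags + 4 condition bits + D + S
        return 6 + 4 + 4 + 2 * self.field_bits

    @property
    def addressable(self) -> int:
        # number of cog locations a D or S field can reach
        return 1 << self.field_bits

    def pack(self, opcode: int, zcri: int, cond: int, d: int, s: int) -> int:
        fb = self.field_bits
        assert 0 <= opcode < 64 and 0 <= zcri < 16 and 0 <= cond < 16
        assert 0 <= d < (1 << fb) and 0 <= s < (1 << fb)
        return ((opcode << (2 * fb + 8)) | (zcri << (2 * fb + 4))
                | (cond << (2 * fb)) | (d << fb) | s)

    def unpack(self, word: int):
        fb = self.field_bits
        mask = (1 << fb) - 1
        return ((word >> (2 * fb + 8)) & 0x3F,  # opcode
                (word >> (2 * fb + 4)) & 0xF,   # Z, C, R, I
                (word >> (2 * fb)) & 0xF,       # condition
                (word >> fb) & mask,            # D field
                word & mask)                    # S field


P1 = Layout(9)      # stock 32-bit encoding
EXT36 = Layout(11)  # 36-bit proposal
EXT40 = Layout(13)  # 40-bit proposal
```

`EXT36.addressable` is 2048 and `EXT40.addressable` is 8192, matching the numbers above, and `unpack(pack(...))` returns the original fields for all three layouts.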
How to load the thing is... a bit more difficult now because you need 4 1/2 or 5 byte "longs". The easiest would be to pack them and pad at the end as needed, I think.
The impact on execution speed may not be that big, I think. A couple of fields are wider, but the idea is that the ALU and as much else as possible remain fixed at 32 bits.
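The "pack them and pad at the end" idea could look roughly like the sketch below. The function names and the LSB-first, little-endian bit order are assumptions made for illustration; a real loader could just as well pick another order.

```python
def pack_words(words, width):
    """Concatenate width-bit words LSB-first; zero-pad the final byte."""
    acc = 0
    nbits = 0
    for w in words:
        assert 0 <= w < (1 << width)
        acc |= w << nbits
        nbits += width
    return acc.to_bytes((nbits + 7) // 8, "little")


def unpack_words(data, width, count):
    """Inverse of pack_words: recover `count` words of `width` bits each."""
    acc = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(acc >> (i * width)) & mask for i in range(count)]
```

With this scheme two 36-bit opcodes take exactly 9 bytes, three take 14 bytes (108 bits plus 4 bits of padding), and 40-bit opcodes always take 5 bytes each.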

jmg · 2015-07-31 08:07

Ale wrote: »

How to load the thing is.... a bit more difficult now because you need 4 1/2 or 5 byte "longs"... the easiest would be to pack them and pad at the end as needed, I think.

The HUB could also be wider, making the word moves wide too. Mostly the upper bits are ignored, except for your extended reach opcodes. That shifts the problem to the loader, which has to be able to fill the HUB memory.
Another simpler/interim approach would be to add local COG memory as indirect only. That would mean no opcode changes, and a small change to memory decode to map some spare addresses to local COG memory.

Cluso99 · 2015-07-31 10:29

I thought the easiest way was to either change the JMPRET instruction to be relative +/- 256, or add a new JMPRETR instruction to be relative.
The registers would still be only 512 max.

I had the above working using AUGD/AUGS/AUGDS instructions instead until I broke something. There is a thread for this somewhere.
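A rough Python sketch of the relative-jump arithmetic: the 9-bit S field is treated as a signed offset, so the opcode format stays the same while the program counter can be wider. The function names, the 11-bit PC width and the choice to add the offset to the current PC (rather than PC+1) are assumptions for illustration, not taken from any P1V source.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def relative_target(pc, s_field, pc_bits=11):
    """Jump target for a hypothetical relative JMPRET with a 9-bit offset."""
    offset = sign_extend(s_field & 0x1FF, 9)  # -256 .. +255
    return (pc + offset) & ((1 << pc_bits) - 1)
```

For example, `relative_target(1000, 0x1FF)` gives 999 (offset -1) and `relative_target(600, 0x0FF)` gives 855 (offset +255), both beyond the 512-long register range.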

ericball · 2015-07-31 12:36

What about HUB RAM address space? The big change would be to modify COGINIT, and I think it can be done fairly easily. The current instruction set encodes COGINIT as instruction 000011 with a source of 010, which only leaves a single 32 bit register to contain PAR, a HUB address, a new flag & a Cog ID.
My suggestion is to use the I flag in the opcode to select between COGINIT and the other HUBOPS and the R flag to select between COGNEW (Result = COGID) and RELOAD (which is the same as COGINIT( COGID ) ). Yes, this means you can't COGINIT a specific COG, but I'm not certain how common that is as a requirement.
The upshot is now the HUB address (SRC) and PAR value (DEST) are a full 32 bits. (The HUB address would be masked to a long address, same as RD/WRLONG.)

rogloh · 2015-07-31 13:30

@Ale,
I'd done a bunch of earlier P1V work in the hopes to combine it with Cluso's AUGDS stuff and get the >2kB COG barrier overcome. IIRC I believe I had a JMPX instruction working that could do jumps beyond 2kB. The tricky bit is getting the tools to interwork with this.

You might find some related info here:
http://forums.parallax.com/discussion/157423/new-p1v-capabilities-added-a-cogram-stack/p1
http://forums.parallax.com/discussion/157520/new-enhanced-ldptr-stptr-operations-for-direct-cogram-stack-access/p1

Kerry S · 2015-07-31 13:47

I am playing with a design I call the "RISCy Prop" sort of a hybrid in the design concepts.

It uses two sequential Longs for each instruction. First long has the 6bit instruction, CZ flags, 3 Literal Flags for S1 and S2, 5 condition bits and a 16 bit destination field. Second long has two 16 bit Source1 and Source2 fields. This gives each core a 64K memory range. With the 3 literal flags you can use S1 and/or S2 as 16 bit literal values or combine them into one 32 bit literal. With the extra condition bit I added some additional comparisons such as S1 > S2 etc.
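As an illustration of the two-long format, here is a small Python sketch. The field widths are the ones listed above (6 + 2 + 3 + 5 + 16 = 32 bits in the first long, 16 + 16 in the second); the ordering of the fields inside each long, and which source holds the high half of a 32 bit literal, are guesses made only for the example.

```python
def encode_riscy(instr, cz, lit, cond, dest, s1, s2):
    """Pack one instruction into two 32-bit longs (field order is assumed)."""
    assert 0 <= instr < (1 << 6) and 0 <= cz < (1 << 2)
    assert 0 <= lit < (1 << 3) and 0 <= cond < (1 << 5)
    assert all(0 <= v < (1 << 16) for v in (dest, s1, s2))
    first = (instr << 26) | (cz << 24) | (lit << 21) | (cond << 16) | dest
    second = (s1 << 16) | s2
    return first, second


def literal32(second):
    """Read the second long as one 32-bit literal (S1 assumed to be the high half)."""
    return second & 0xFFFFFFFF
```

The 16 bit destination and source fields are what give each core its 64K range.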

Using the Altera Dual Port memory it executes instructions in 4 clocks.

1) Read Instruction Longs (2 reads)
2) Read Source Long Values (2 reads)
3) Evaluate
4) Write Result

I have seen reference to a Dual Read with Write Thru Altera memory layout which could drop that to 3 clocks per instruction but have not tried it.

When I get my BeMicroCVA9 I will be able to try 16 Cores with 64K ram each + a shared 64K hub ram.

Also added in a set of COMM commands where each Core has 32 longs that can be read by the other cores for Core-Core communications. Similar to how on the Prop they can all see INA and OUTA. This allows for quick data sharing without the hub delays.

Cluso99 · 2015-07-31 14:36

The way I tested out extended cog was using AUGxx as DECODE (or any of the 4 unimplemented opcode instructions) and compiling in separate blocks of 2KB. PropTool supports this and it works.

jac_goudsmit · 2015-07-31 20:00

I've been thinking of various ways to expand opcode space. For example you could use all the nonzero opcodes for NOP as extra opcodes and you don't even have to go by the usual IIIIIIZCRICCCCDDDDDDDDDSSSSSSSSS scheme. Of course, to prevent problems with existing code, you would probably use one of the registered bits in a video register or a write to CNT or something, to enable the feature first.

The existing logic would catch the opcode=0 and/or the condition=if_never and won't do anything, so it will give you a lot of extra opcodes.

===Jac
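To put a number on "a lot of extra opcodes": in the stock encoding the condition field sits in bits 21..18 and 0000 means if_never, so every word with those four bits clear is a NOP today. That leaves 28 bits free for new meaning. A small Python sketch (the function names are invented for illustration):

```python
COND_SHIFT = 18   # condition field occupies bits 21..18 of a P1 instruction
COND_MASK = 0xF


def never_executes(instr):
    """True if the condition field is 0000 (if_never), i.e. a NOP on a stock P1."""
    return (instr >> COND_SHIFT) & COND_MASK == 0


def spare_payload(instr):
    """Collect the 28 bits around the condition field of an if_never word."""
    assert never_executes(instr)
    upper = (instr >> 22) & 0x3FF   # opcode + Z, C, R, I: 10 bits
    lower = instr & 0x3FFFF         # D and S fields: 18 bits
    return (upper << 18) | lower
```

That is up to 2**28 distinct encodings, with the all-zero word kept as the real NOP.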

jmg · 2015-08-01 00:02

Kerry S wrote: »

I am playing with a design I call the "RISCy Prop" sort of a hybrid in the design concepts.

Wow, quite a morph there.

Kerry S wrote: »

When I get my BeMicroCVA9 I will be able to try 16 Cores with 64K ram each + a shared 64K hub ram.

Is there any reason for the 64K HUB size?

ozpropdev · 2015-08-01 01:38

jac_goudsmit wrote: »

I've been thinking of various ways to expand opcode space. For example you could use all the nonzero opcodes for NOP as extra opcodes and you don't even have to go by the usual IIIIIIZCRICCCCDDDDDDDDDSSSSSSSSS scheme. Of course, to prevent problems with existing code, you would probably use one of the registered bits in a video register or a write to CNT or something, to enable the feature first.

The existing logic would catch the opcode=0 and/or the condition=if_never and won't do anything, so it will give you a lot of extra opcodes.

===Jac

See here also

http://forums.parallax.com/discussion/157748/expanding-the-p1v-s-instruction-set-a-poor-mans-p2

Ale · 2015-08-01 05:14

I thought I was going to get a facepalm or something like this, but I see that there are many ideas circulating... very exciting! And very different ideas indeed!
Cluso's idea using a relative jump is probably the least disruptive; having all registers in the lower 512 longs (also the HW registers) is maybe not a great loss.

ksltd · 2015-08-06 19:16

There are several different things all being conflated here.

First, opcode space. There are lots and lots and lots of encodings that make no sense in the present encoding space. For most things, harvesting the senseless encodings is pretty simple and won't affect compatibility because senseless encodings aren't used by sensible programs. This has an impact, of course, on the tools, which should now flag senseless encodings as illegal and generate new instructions as defined.

Second, there's the issue of increasing the addressability of the processor cores. This can't really be done easily or compatibly. However the present encoding of the 32-bit instructions is so wasteful that it's not much trouble to figure out how to free two bits, which allows the S and D fields to then be extended by one bit each. There are a number of ways to do this that are sufficiently "culturally compatible" that revised tools can rebuild almost all source successfully and flag almost all errors in other source code. Self modifying code is problematic, but there's very, very little of that. This is a major change to the tools.

Finally, there's the issue of hub ram addressability. The best solution here is to reduce the addressable ROM to 2KB for the bootstrap and interpreter only and increase the addressable RAM from 32K to 62K. This is completely compatible except for the tiny bit of code that makes use of the sine, log and character generator tables in the ROM; those applications that make use of those tables can simply include the tables as source and reference them from RAM. This is a trivial change to the tools.

Cluso99 · 2015-08-06 23:48

Actually, self-modifying code is quite common. That is how we address tables etc.

ksltd · 2015-08-07 02:37

No, it's not quite common; it's relatively rare. There are only a few places it's used in all of the OBEX.

Cluso99 · 2015-08-07 04:06

ksltd wrote: »

No, it's not quite common; it's relatively rare. There are only a few places it's used in all of the OBEX.

It's also actually used in the Spin Interpreter which means every spin program effectively uses it, even though it is not exposed to the user.

ozpropdev · 2015-08-07 04:21

Every PASM program I write uses self modifying code to index cog registers.
Without that feature a lot of looped code would have to be written as sequential code, which would result in less efficient use of register space.

ksltd · 2015-08-07 22:26

Ugh ... these conversations are so worthless. Dynamic frequency of events is not a consideration when one is anticipating incompatible changes to macro architecture from the perspective of source code maintenance. What matters is the static frequency of the offending construct. In all cases, the static frequency of self modifying code is close to zero.

Ale · 2015-08-11 05:16

ksltd: If this conversation is worthless for you, you can skip it, not a problem, honest.
Now on self-modifying code: we talk about the Propeller here, and yes, it needs self-modifying code. Every call/ret uses it; movi, movs and movd were designed exactly for that... Programs reside in RAM, so they can be altered at will. There are no pointer registers, and that forces you to modify your running program; registers are in RAM.
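What movs and movd do to an instruction word can be modelled in a few lines of Python. On the stock P1, MOVS replaces bits 8..0 and MOVD replaces bits 17..9. The `field_bits` parameter is a hypothetical addition showing how the same operations would have to follow the wider fields in an extended encoding, which is where self-modifying code and any new layout meet.

```python
def movs(instr, value, field_bits=9):
    """Replace the S field (low field_bits bits) of an instruction word."""
    mask = (1 << field_bits) - 1
    return (instr & ~mask) | (value & mask)


def movd(instr, value, field_bits=9):
    """Replace the D field (the field_bits bits just above S)."""
    mask = ((1 << field_bits) - 1) << field_bits
    return (instr & ~mask) | ((value << field_bits) & mask)
```

Indexing a table is then a loop that rewrites the D or S field of one instruction on each pass, for example `movd(instr, base + i)`, instead of repeating the instruction once per table entry.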
