4KB Cog(s) - Pasm requirements

Cluso99 Posts: 18,069
edited 2014-08-27 23:11 in Propeller 1
As you may have seen, I am implementing an AUGDS instruction somewhat similar to the P2. While I have coded it, I have not had time to test it yet.
The other requirement for addressing above 2KB of cog RAM is to have a JMP/CALL/RET instruction where the jump-to or return address is in the lower 2KB, plus of course an increased bit size of the PC (program counter).

While I will require a larger PC for hubexec, 10 bits will be fine for a 4KB Cog.

The current JMPRET (JMP/CALL/RET) instruction doesn't really make sensible use of the WC or WZ bits. I can use these bits to expand the JMPRET instruction to 10 bits for both S and D cog addresses.
So, I am going to try this method first.
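
Roughly, the decode side of that change might look something like the sketch below. This is illustrative only; the module and signal names are made up rather than the identifiers in the actual P1V Verilog:

// Sketch only, not the real P1V decode. Assumes the standard P1 layout:
// opcode [31:26], ZCRI [25:22], cond [21:18], D [17:9], S [8:0].
module jmpret_addr10_sketch (
    input  wire [31:0] ir,       // current instruction word
    output wire [9:0]  d_addr,   // 10-bit destination cog address
    output wire [9:0]  s_addr    // 10-bit source cog address
);
    wire is_jmpret = (ir[31:26] == 6'b010111);  // JMPRET/JMP/CALL/RET opcode
    wire wz_bit    = ir[25];                    // normally "write Z"
    wire wc_bit    = ir[24];                    // normally "write C"

    // For JMPRET, repurpose WZ as bit 9 of D and WC as bit 9 of S, giving
    // 10-bit cog addressing (1024 longs = 4KB). Other instructions would
    // get their upper address bits from the AUGDS prefix instead.
    assign d_addr = {is_jmpret & wz_bit, ir[17:9]};
    assign s_addr = {is_jmpret & wc_bit, ir[8:0]};
endmodule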

Comments

  • David Betz Posts: 14,516
    edited 2014-08-25 17:21
    Cluso99 wrote: »
    The other requirement for addressing above 2KB of cog RAM is to have a JMP/CALL/RET instruction where the jump-to or return address is in the lower 2KB, plus of course an increased bit size of the PC (program counter).
    Is there a bit that can be repurposed in the JMPRET instruction, maybe WC or WZ? What it would do is store the entire 10-bit (or wider) PC into the destination register. That would, of course, mean that you couldn't use RET to return from such a call, but you could use an indirect JMP. For example:
    lr      LONG    0
    
    ...
    
    ' some address above 2K
    
    main
            ' some interesting code
    
            ' the following instruction expands into your new AUGDS instruction
            ' followed by the funky JMPRET that stores the entire PC in D.
            CALLR #my_function
    
            ' more interesting code
            ' etc.
    
    my_function
            ' some interesting PASM code
    
            ' the following instruction would have to also have the repurposed bit
            ' set so that it would load the full expanded PC rather than the low 9 bits only.
            JMP lr
    
    The "lr" variable would, of course, have to be in the first 2K so it can be addressed by a 9 bit D field but the CALLR instruction could address all of the expanded COG memory.
  • rogloh Posts: 5,794
    edited 2014-08-25 17:52
    Interesting one, David. If the LR was hard-defined to be at some fixed address (e.g. 0 or $1ef), then maybe you could encode the big COG addresses directly in the instruction, using the D field as the expansion.

    e.g. your CALLR macro would expand to this...

    JMPRET #(my_function>>9), #(my_function&$1ff) WC

    The D field now holds the top nine bits of the called function's address, and S holds the low nine bits. WC is used as a modifier to the original JMPRET behaviour, which now means: interpret the D field as an extension of S. No need for an additional preceding AUGDS instruction either.

    But of course you'd need expanded compiler support for generating code like that. The P1 syntax doesn't support constants for the D field today (unlike what we had for the previous P2). However, this method would allow up to 18-bit program counters in a COG!
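
    As a rough sketch (again with made-up names, not the real P1V source), the decoder could simply concatenate the two immediates when WC marks the long form:

    // Sketch only: reassemble a wide jump target from the D and S fields.
    module wide_target_sketch (
        input  wire [31:0] ir,          // instruction word
        output wire [17:0] jump_target  // up to 18-bit cog address
    );
        wire       long_form = ir[24];    // WC repurposed: "D extends S"
        wire [8:0] d_field   = ir[17:9];  // #(my_function >> 9)
        wire [8:0] s_field   = ir[8:0];   // #(my_function & $1ff)

        assign jump_target = long_form ? {d_field, s_field}
                                       : {9'b0, s_field};
    endmodule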

    Note. You'd also need to push the LR somewhere if you had nested or recursive calls.

    The JMP behaviour could be expanded in the same way with the WC flag and the (currently unused) D field.

    Roger.
  • David Betz Posts: 14,516
    edited 2014-08-25 18:05
    rogloh wrote: »
    Interesting one, David. If the LR was hard-defined to be at some fixed address (e.g. 0 or $1ef), then maybe you could encode the big COG addresses directly in the instruction, using the D field as the expansion.

    e.g. your CALLR macro would expand to this...

    JMPRET #(my_function>>9), #(my_function&$1ff) WC

    The D field now holds the top nine bits of the called function's address, and S holds the low nine bits. WC is used as a modifier to the original JMPRET behaviour, which now means: interpret the D field as an extension of S. No need for an additional preceding AUGDS instruction either.

    But of course you'd need expanded compiler support for generating code like that. The P1 syntax doesn't support constants for the D field today (unlike what we had for the previous P2). However, this method would allow up to 18-bit program counters in a COG!

    Note. You'd also need to push the LR somewhere if you had nested or recursive calls.

    The JMP behaviour could be expanded in the same way with the WC flag and the (currently unused) D field.

    Roger.
    Yes, that would probably be a better solution if you are willing to be limited to 1MB for COG RAM. :-)

    One advantage of leaving the D field alone, though, is that you can have different LR registers and implement a sort of static stack for handling nested calls. That wouldn't work with recursive calls, though. The idea would be to have a bunch of locations reserved for return addresses: LR would be used by leaf functions (that don't call any other functions), LR2 would be used by functions that call only leaf functions, LR3 would be used by functions that call LR2 functions, etc. Then you don't have to move the return address from the fixed LR onto a stack or into some other location.

    Having said that, I think I like your approach better because it doesn't require 64 bits for a call.
  • David Betz Posts: 14,516
    edited 2014-08-25 18:08
    Cluso99 wrote: »
    The current JMPRET (JMP/CALL/RET) instruction doesn't really make sensible use of the WC or WZ bits. I can use these bits to expand the JMPRET instruction to 10 bits for both S and D cog addresses.
    So, I am going to try this method first.
    This is the more Propeller-like solution, although it maxes out at 4K of cog memory and also involves self-modifying code, which I never really liked. Mostly, though, I think a solution that has a hard limit at 4K isn't much better than the current solution that has a hard limit at 2K. I'd suggest trying one of the other approaches.
  • Cluso99 Posts: 18,069
    edited 2014-08-26 00:59
    David,
    I thought I would try the method I detailed above first. This allows me to test out memory above 2KB. Hopefully that will give me some more insight before I tackle the harder solution with a new jmpret for hubexec, which will also work for cog >2KB and for cog >4KB.

    For any P1 variants, a 1MB hub restriction is fine as far as I am concerned. Even the currently targeted P2 only has 512KB, and the hot P2 was only going to have 128KB.

    Anticipate getting some testing time tomorrow :)
  • David Betz Posts: 14,516
    edited 2014-08-26 01:38
    Cluso99 wrote: »
    David,
    I thought I would try the method I detailed above first. This allows me to test out memory above 2KB. Hopefully that will give me some more insight before I tackle the harder solution with a new jmpret for hubexec, which will also work for cog >2KB and for cog >4KB.

    For any P1 variants, a 1MB hub restriction is fine as far as I am concerned. Even the currently targeted P2 only has 512KB, and the hot P2 was only going to have 128KB.

    Anticipate getting some testing time tomorrow :)
    I guess your scheme could be an interesting first step and who am I to object? I haven't even written a single line of Verilog to enhance the P1v yet! :-)
  • Cluso99 Posts: 18,069
    edited 2014-08-26 03:09
    David,
    Back home again. It's a bit of a pain to type on my Xoom and worse on a train ;)

    Anyway, it seemed an easier way to test that the >2KB cog is really working, and also to get a small program running above the 2KB limit. The same logic can almost equally apply to cog vs hub above 2KB. And it also permits me to try the AUGDS as well, although initially I can test the AUGDS without any jmpret mods.

    You really do need to give this Verilog a try. My biggest problem is getting the syntax correct, as that part is not explained very well in the docs I have seen. A lot is by trial and error. I have found Quartus complains about a line I added, and when I finally comment out that line (effectively giving up on that line's syntax) I find the error still persists. Then I realise I have left a comma out on the previous line, or forgot to remove a semicolon, or something equally stupid. But the error report was misleading me.

    Anyway, it's great to get some of these simple things done. With 44KB of usable hub RAM and 4KB of hub as ROM (interpreter, booter and runner), there is a lot we can try on the DE0. Then it's a simple matter of increasing the hub RAM to a full 60KB of RAM plus 4KB of RAM/ROM.
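
    As a rough illustration only (a generic parameterised RAM, not the actual P1V hub memory, which is organised differently), growing the size is essentially just a parameter change, provided the FPGA has the block RAM to spare:

    // Sketch only: RAM_KB = 44 for the current DE0 build, 60 as the target.
    module hub_ram_sketch #(
        parameter RAM_KB = 44
    ) (
        input  wire        clk,
        input  wire        we,
        input  wire [15:0] addr,    // byte address within the 64KB hub space
        input  wire [7:0]  wdata,
        output reg  [7:0]  rdata
    );
        localparam DEPTH = RAM_KB * 1024;
        reg [7:0] mem [0:DEPTH-1];

        always @(posedge clk) begin
            if (we && addr < DEPTH)
                mem[addr] <= wdata;
            rdata <= (addr < DEPTH) ? mem[addr] : 8'h00;
        end
    endmodule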

    Looking forward to tomorrow ;)
  • thoth Posts: 75
    edited 2014-08-26 18:58
    Cluso99 wrote: »
    David,

    Anyway, it's great to get some of these simple things done. With 44KB of usable hub RAM and 4KB of hub as ROM (interpreter, booter and runner), there is a lot we can try on the DE0. Then it's a simple matter of increasing the hub RAM to a full 60KB of RAM plus 4KB of RAM/ROM.

    Looking forward to tomorrow ;)

    One thing that 4K buys is the possibility of a tremendously augmented Spin interpreter. How many hours did Chip spend beating against a 2K limit?
  • Dave Hein Posts: 6,347
    edited 2014-08-27 07:16
    Based on some tests I did with an unrolled Spin interpreter using a jump table, I think we should expect close to a 2X speed up with 4K of cog RAM.
  • thoth Posts: 75
    edited 2014-08-27 08:12
    Dave Hein wrote: »
    Based on some tests I did with an unrolled Spin interpreter using a jump table, I think we should expect close to a 2X speed up with 4K of cog RAM.

    I love the idea of Spin being twice as fast! Lots of interest here in C and all that, but for what it is I think Spin is a marvelous little interpreter that's always "ready to hand" for implementing ideas. I mainly use the Propeller and the Prop Tool as a computer idea board with an easy path from high level to assembly language. Spin is about as easy a language as imaginable. I think it's underappreciated just because it looks a little odd. Spin is Zen simple. That would actually be a great name for it.

    And with overclocking it's possible that vSpin can be nearly 4x faster than Spin on a stock P1.

    I like it!!!!!

    How much memory did the "unrolled" interpreter use?
  • Dave Hein Posts: 6,347
    edited 2014-08-27 09:52
    The full implementation of an unrolled interpreter is about 1500 cog locations with a 768-byte jump table in hub RAM. This was implemented on the P2 FPGA image from last October, and is posted at http://forums.parallax.com/showthread.php/154460-p1spin?highlight=p1spin . The most frequently used bytecodes were executed from cog RAM, with the remaining bytecodes running from hub RAM using P2's hubexec mode. I believe the 1500 longs could be reduced to 1000 longs without too much impact on performance.

    The estimate of 2X performance improvement came from some work I did on trimspin, which was a Spin interpreter that only implemented the bytecodes used by a program. Trimspin would first analyze a binary Spin file to determine which bytecodes were used, and then custom build a Spin interpreter using only the portion of the unrolled interpreter that was needed. In most cases the resulting Spin interpreter would fit into the current cog memory.
  • Cluso99 Posts: 18,069
    edited 2014-08-27 13:13
    I certainly did not see these figures with my faster Spin and the decode table in hub. I was expecting something like a 25% improvement, although the maths operations were significantly improved. While placing the decoder in cog will only save, say, another 8 clocks on average, every bit helps. Placing the stack in cog would give the biggest improvement of all, even with bounds checking added.

    BTW I was unsuccessful in getting AUGDS working yesterday. I need a rethink as to how I can check that parts of it work, as currently I am running blind if it doesn't work.
  • msrobots Posts: 3,709
    edited 2014-08-27 23:11
    Dave Hein wrote: »
    The full implementation of an unrolled interpreter is about 1500 cog locations with a 768-byte jump table in hub RAM. This was implemented on the P2 FPGA image from last October, and is posted at http://forums.parallax.com/showthread.php/154460-p1spin?highlight=p1spin . The most frequently used bytecodes were executed from cog RAM, with the remaining bytecodes running from hub RAM using P2's hubexec mode. I believe the 1500 longs could be reduced to 1000 longs without too much impact on performance.

    The estimate of 2X performance improvement came from some work I did on trimspin, which was a Spin interpreter that only implemented the bytecodes used by a program. Trimspin would first analyze a binary Spin file to determine which bytecodes were used, and then custom build a Spin interpreter using only the portion of the unrolled interpreter that was needed. In most cases the resulting Spin interpreter would fit into the current cog memory.

    I remember reading about your trimspin test. Brilliant way to do it. I guess most Spin programs do not use all the bytecodes.

    Since on the P2 (or the P1v) the Spin interpreter is loadable and not fixed, this is quite a valuable option: create a Spin interpreter based on the bytecodes USED in your Spin source.

    Enjoy!

    Mike.