4KB Cog(s) - Pasm requirements
Cluso99
As you may have seen, I am implementing an AUGDS instruction somewhat similar to the P2's. While I have coded it, I have not had time to test it yet.
The other requirement for addressing above 2KB of cog RAM is to have a JMP/CALL/RET instruction where the jump-to or return address is in the lower 2KB, plus of course an increased bit size for the PC (program counter).
While I will require a larger PC for hubexec, 10 bits will be fine for a 4KB cog.
The current JMPRET (JMP/CALL/RET) instruction doesn't really make sensible use of the WC or WZ bits. I can use these bits to expand the JMPRET instruction to 10 bits for both S and D cog addresses.
So, I am going to try this method first.
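A minimal sketch of how the two flag bits could widen the address fields, written in Python purely as a model. The field positions are the standard P1 instruction layout. Which flag extends which field is an assumption made for this sketch (Z bit as bit 9 of D, C bit as bit 9 of S); the post does not specify it, and the function names are hypothetical.

```python
# Sketch only: model a widened JMPRET on the 32-bit P1 instruction word.
# P1 layout: opcode[31:26] Z[25] C[24] R[23] I[22] cond[21:18] D[17:9] S[8:0]
# ASSUMPTION: the Z bit becomes bit 9 of D and the C bit becomes bit 9 of S.

JMPRET_OPCODE = 0b010111  # JMPRET opcode on the P1


def encode_jmpret10(d_addr, s_addr, cond=0b1111, immediate=True, write_result=True):
    """Pack 10-bit D and S cog addresses into one JMPRET instruction word."""
    if not (0 <= d_addr < 1024 and 0 <= s_addr < 1024):
        raise ValueError("cog addresses must fit in 10 bits (0..1023)")
    word = JMPRET_OPCODE << 26
    word |= ((d_addr >> 9) & 1) << 25   # D bit 9 goes in the former WZ bit (assumed)
    word |= ((s_addr >> 9) & 1) << 24   # S bit 9 goes in the former WC bit (assumed)
    word |= int(write_result) << 23     # R bit, unchanged
    word |= int(immediate) << 22        # I bit, unchanged
    word |= (cond & 0xF) << 18          # condition field, unchanged
    word |= (d_addr & 0x1FF) << 9       # low 9 bits of D
    word |= s_addr & 0x1FF              # low 9 bits of S
    return word


def decode_jmpret10(word):
    """Recover the 10-bit (D, S) addresses from an instruction word."""
    d = ((word >> 9) & 0x1FF) | (((word >> 25) & 1) << 9)
    s = (word & 0x1FF) | (((word >> 24) & 1) << 9)
    return d, s
```

With this mapping, any address below 512 leaves both flag bits clear, so code that stays in the lower 2KB would encode exactly as it does today.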
Comments
eg. your CALLR example would expand the CALLR macro to this...
JMPRET #(my_function>>9), #(my_function&$1ff) WC
The D field now holds the top nine bits of the called function address, S is low 9 bits. WC is used as an instruction modifier to the original JMPRET behavior which now means interpret D field as part of S. No need for an additional preceding AUGDS instruction either.
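The address split described above can be sketched as follows, in Python as a model rather than assembler output. The helper names are hypothetical; the only thing taken from the post is that D carries the top nine bits and S the low nine bits of the target.

```python
# Sketch only: split a target address into the D and S fields of the
# proposed "JMPRET #hi, #lo WC" form, and rejoin them as the PC would.

def split_target(addr):
    """Return (d_field, s_field) for an address of up to 18 bits."""
    if not 0 <= addr < (1 << 18):
        raise ValueError("address must fit in 18 bits")
    return addr >> 9, addr & 0x1FF


def join_target(d_field, s_field):
    """Rebuild the 18-bit address from the two 9-bit fields."""
    return ((d_field & 0x1FF) << 9) | (s_field & 0x1FF)
```

For example, `split_target(0x2A5)` gives `(1, 0xA5)`, and `join_target(1, 0xA5)` gives back `0x2A5`.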
But of course you'd need expanded compiler support for generating code like that. The P1 syntax doesn't support constants for D field today (unlike what we had for the previous P2). However this method would allow up to 18 bit Program counters in a COG!
Note. You'd also need to push the LR somewhere if you had nested or recursive calls.
The JMP behaviour could be expanded in the same way with the WC flag and the (currently unused) D field.
Roger.
One advantage of leaving the D field alone though is that you can have different LR registers and implement a sort of static stack for handling nested calls. That wouldn't work with recursive calls though. The idea would be to have a bunch of locations reserved for return addresses. LR would be used by leaf functions (that don't call any other functions), LR2 would be used for functions that call only leaf functions, LR3 would be used by functions that call LR2 functions, etc. Then you don't have to move the return address from the fixed LR onto a stack or into some other location.
Having said that, I think I like your approach better because it doesn't require 64 bits for a call.
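The static LR idea can be sketched as a small pass over a call graph. This is only an illustration in Python: the call-graph format and the function name are assumptions, and level 1 stands for the plain LR used by leaf functions.

```python
# Sketch only: pick a return-address register for each function so that
# nested calls never overwrite a caller's return address.

def assign_lr_levels(call_graph):
    """Map each function to an LR index: 1 for leaves, else 1 + deepest callee."""
    levels = {}
    in_progress = set()

    def level(fn):
        if fn in levels:
            return levels[fn]
        if fn in in_progress:
            # A cycle means recursion, which fixed return slots cannot support.
            raise ValueError(f"recursive call through {fn!r}")
        in_progress.add(fn)
        callees = call_graph.get(fn, [])
        levels[fn] = 1 + max((level(c) for c in callees), default=0)
        in_progress.discard(fn)
        return levels[fn]

    for fn in call_graph:
        level(fn)
    return levels
```

For a graph such as `{"main": ["draw", "putc"], "draw": ["putc"], "putc": []}`, `putc` gets level 1 (LR), `draw` level 2 (LR2) and `main` level 3 (LR3).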
I thought I would try the method I detailed above first. This allows me to test out memory above 2KB. Hopefully that will give me some more insight before I tackle the harder solution with a new jmpret for hubexec, which will also work for cog >2KB and also for cog >4KB.
For any P1 variants, a 1MB hub restriction is fine as far as I am concerned. Even the currently targeted P2 only has 512KB, and the hot P2 was only going to have 128KB.
Anticipate getting some testing time tomorrow
Back home again. It's a bit of a pain to type on my Xoom and worse on a train
Anyway, it seemed an easier way to test that the >2KB cog is really working, and also to get a small program running above the 2KB limit. The same logic can almost equally apply to cog vs hub above 2KB. And it also permits me to try the AUGDS as well, although initially I can test the AUGDS without any jmpret mods.
You really do need to give this Verilog a try. My biggest problem is getting the syntax correct, as that part is not explained very well in the docs I have seen. A lot is by trial and error. I have found Quartus complains about a line I added, and when I finally comment out that line (effectively giving up on the line's syntax) I find the error still persists. Then I realise I have left a comma out on the previous line, or forgot to remove a semicolon, or something equally stupid. But the error report is misleading me.
Anyway, it's great to get some of these simple things done. With 44KB of usable hub RAM and 4KB of hub as ROM (interpreter, booter and runner), there is a lot we can try on the DE0. Then it's a simple matter of increasing the hub RAM to a full 60KB of RAM and 4KB of RAM/ROM.
Looking forward to tomorrow
One thing that 4K buys is the possibility of a tremendously augmented Spin interpreter. How many hours did Chip spend beating against a 2K limit?
I love the idea of Spin being twice as fast! Lots of interest here in C and all that, but for what it is I think Spin is a marvelous little interpreter that's always "ready to hand" for implementing ideas. I mainly use the Propeller and the Prop Tool as a computer idea board with an easy path from high level to assembly language. Spin is about as easy a language as imaginable. I think it's underappreciated just because it looks a little odd. Spin is Zen simple. That would actually be a great name for it.
And with overclocking it's possible that vSpin can be nearly 4x faster than Spin on a stock P1.
I like it!!!!!
How much memory did the "unrolled" interpreter use?
The estimate of 2X performance improvement came from some work I did on trimspin, which was a Spin interpreter that only implemented the bytecodes used by a program. Trimspin would first analyze a binary Spin file to determine which bytecodes were used, and then custom build a Spin interpreter using only the portion of the unrolled interpreter that was needed. In most cases the resulting Spin interpreter would fit into the current cog memory.
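The analyze-then-build step described above might look roughly like this. It is a toy model in Python, not trimspin itself: `TOY_OPCODES` is a made-up table of opcodes and operand lengths for illustration and is not the real Spin bytecode set, and the function names are hypothetical.

```python
# Sketch only. TOY_OPCODES maps opcode -> (handler name, operand byte count).
# It is an invented table for illustration, NOT the real Spin bytecodes.
TOY_OPCODES = {
    0x00: ("push_byte", 1),
    0x01: ("push_word", 2),
    0x02: ("add", 0),
    0x03: ("mul", 0),
    0x04: ("jump", 2),
    0x05: ("ret", 0),
}


def used_handlers(program):
    """Walk the bytecode once and collect the handlers it actually needs."""
    used = set()
    pc = 0
    while pc < len(program):
        name, operand_len = TOY_OPCODES[program[pc]]
        used.add(name)
        pc += 1 + operand_len  # skip operands so data bytes are not read as opcodes
    return used


def build_trimmed_interpreter(program):
    """Return the handler names to keep, in table order."""
    needed = used_handlers(program)
    return [name for name, _ in TOY_OPCODES.values() if name in needed]
```

A program that only pushes two bytes, adds them and returns would keep three of the six handlers, which is the kind of saving that lets a trimmed interpreter fit in the existing cog memory.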
BTW I was unsuccessful in getting AUGDS working yesterday. I need a rethink as to how I can check that parts of it work, as currently I am running blind if it doesn't work.
I remember reading about your trimspin test. Brilliant way to do it. I guess most of the Spin programs do not use all byte codes.
Since on the P2 (or the P1v) the Spin interpreter is loadable and not fixed, this is quite a valuable option: create a Spin interpreter based on the bytecodes USED in your Spin source.
Enjoy!
Mike.