P2 COG and HUB exec now ~100% Binary Compatible!
jmg
I think this milestone post deserves its own thread:
I got the new memory and branching model working. I also got REP working in hub exec.
There's full binary compatibility now between cog/lut code and hub code that use relative addressing.
In cog code now, we are back to the good old 1:1 addressing - no more 4x'd register addresses. What a relief!
Here's the new map for code execution:
00000..001FF = cog
00200..003FF = lut
00400..FFFFF = hub
Downloaded programs start at $400.
When in the cog, all registers are long, with their addresses being contiguous integers. The PC steps by 1.
When in the hub, instructions take 4 bytes. The PC steps by 4.
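As a rough illustration of the map and the two PC step sizes, here is a small Python model. The names `exec_region` and `next_pc` are made up for this sketch, and it assumes lut exec steps by 1 like cog exec, which the 512-address lut range suggests but the post does not state outright.

```python
# Region boundaries copied from the map in the post; names are illustrative only.
COG_END = 0x001FF
LUT_END = 0x003FF
HUB_END = 0xFFFFF


def exec_region(pc):
    """Return 'cog', 'lut' or 'hub' for a 20-bit execution address."""
    if not 0 <= pc <= HUB_END:
        raise ValueError("address outside the 20-bit execution map")
    if pc <= COG_END:
        return "cog"
    if pc <= LUT_END:
        return "lut"
    return "hub"


def next_pc(pc):
    """Advance the PC by one instruction: +1 in cog/lut (assumed for lut), +4 in hub."""
    step = 4 if exec_region(pc) == "hub" else 1
    return pc + step


print(exec_region(0x400), hex(next_pc(0x400)))
print(exec_region(0x010), hex(next_pc(0x010)))
```

What happens when the PC runs off the end of a region is not covered by the post, so the sketch makes no claim about it.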
To bridge the two contexts, there are two simple things done:
The 9-bit-constant relative branches DJNZ/DJZ/TJZ/... encode the -256..+255 instruction range into their S field. When in cog exec, that value is sign-extended and added to the PC. When in hub exec, it is shifted left two bits and used the same way. This way, both cog and hub contexts get the max use out of these instructions and maintain binary compatibility.
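The short-branch rule can be sketched in Python as follows. This is only a model of the arithmetic described above; the function names are hypothetical, and `pc` stands for whatever PC value the hardware adds the offset to, which the post does not pin down.

```python
def sign_extend_9(s_field):
    """Interpret the 9-bit S field as a signed value in -256..+255."""
    s_field &= 0x1FF
    return s_field - 0x200 if s_field & 0x100 else s_field


def short_branch_target(pc, s_field, hub_exec):
    """Compute the target of a DJNZ-style relative branch.

    The S field always counts instructions. In cog exec it is added to the
    PC as-is; in hub exec it is shifted left two bits first, because hub
    instructions are 4 bytes apart.
    """
    offset = sign_extend_9(s_field)
    if hub_exec:
        offset <<= 2
    return pc + offset


# The same encoded field (-2 instructions) works in both contexts.
field = 0x1FE
print(hex(short_branch_target(0x010, field, hub_exec=False)))
print(hex(short_branch_target(0x1000, field, hub_exec=True)))
```

The point of the sketch is that one S-field value means "two instructions back" in either context, which is what keeps the encoding binary compatible.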
The 20-bit-constant relative branches JMP/CALL/CALLA/... are encoded for hub exec as you imagine they would be, where they track byte offset. When the cog uses these branches, it shifts them right two bits to get cog-relative values. They are assembled pre-4x'd in cog code that way. So, these instructions are now binary compatible between cog/lut and hub code.
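A matching Python sketch for the 20-bit branches is below. It assumes the 20-bit field is a signed byte offset (the post only calls it "relative"), and the helper names are invented for illustration.

```python
def sign_extend_20(field):
    """Interpret a 20-bit field as a signed byte offset (assumed signed)."""
    field &= 0xFFFFF
    return field - 0x100000 if field & 0x80000 else field


def encode_cog_offset(instr_offset):
    """Assemble a cog-relative instruction offset 'pre-4x'd', as the post puts it."""
    return (instr_offset * 4) & 0xFFFFF


def long_branch_target(pc, field, hub_exec):
    """Compute the target of a JMP/CALL-style relative branch.

    In hub exec the field is used directly as a byte offset. In cog exec it
    is shifted right two bits to turn it back into a register offset.
    """
    offset = sign_extend_20(field)
    if not hub_exec:
        offset >>= 2  # Python's >> is arithmetic, so the sign is preserved
    return pc + offset


# "Two instructions back" is the same bit pattern in both contexts.
field = encode_cog_offset(-2)
print(hex(field))
print(hex(long_branch_target(0x010, field, hub_exec=False)))
print(hex(long_branch_target(0x1000, field, hub_exec=True)))
```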
REP now works in hub exec by forcing a jump during the last instruction in the repeat block. It didn't take much logic to implement and it works just as you'd expect. Even though it's slow in hub exec, because of the branching on each iteration, it is a convenient instruction to have for doing simple loops.
The assembler generates the same code for relative branches and REP in both cog/lut exec and hub exec contexts.
I will have updated FPGA files done tomorrow. I just finished the Prop123-A7 compile and now I need to make the DE2-115 version.
Here's what the all_cogs_blink program looks like now. Note the ORGH and the REP:

dat
                orgh    $400

' launch 15 cogs (cog 0 falls through and runs 'blink', too)
' any cogs missing from the FPGA won't blink

                loc     x,@blink
                rep     @repend,#15
                coginit #16,x
repend

blink           cogid   x               'which cog am I?
                setb    dirb,x          'make that pin an output
                notb    outb,x          'flip its output state
                add     x,#16           'add to my id
                shl     x,#18           'shift up to make it big
                waitx   x               'wait that many clocks
                jmp     @blink          'do it again

                org
x               res     1               'variable at cog register 8
Comments
It saves a shipload of explaining and admin, and means discussion can focus on features, not on caveats and gotchas.
It also allows one mindset in code development, and late-in-design-flow choices on what code will run where.
HUB Exec, including in Assembler, now has all the features of COG exec, and users can craft a design to use COG exec where it really matters.
P2 has now gained some things in common with PC-level, higher-end MPUs: in those, you have a local cache that is limited but very fast, and a much larger SDRAM that is less deterministic.
(and usually an OS as well, to make things increasingly less deterministic)
Add cache-lock thinking, where a design can lock small code into that faster memory area, and the P2 tracks that mindset, but adds the feature that such fastest, deterministic code also gets its own core.
It is easy to explain to new users, and the potential for hard real-time use is obvious.
The bigger limit here seems to be that COG (LUT?) memory isn't byte-addressable?
Alignment for portability is more of a tool setting issue.
I'll paste Chip's code snippet here, as it is small and shows the points
One could argue that placing casual string constants into COG is not a good idea, and send_string is not likely to be the sort of hard-real-time code that is cog-destined.
Because COG can call HUB anytime, users have some choices here: place this code in HUB, or place the strings in HUB with a prefix call.
Also, it's fairly trivial to make your code work in both by avoiding things like Chip's string example, especially in generated code such as gcc output.
I assume Chip's assembler already complains if you try to compile his example in COG space.
Anyway, I think we are in a good place on all this stuff.
- but yes, there are memory area and opcode reach caveats that mean mixed code and data can have issues, those are not so much related to opcode execution, but more data mapping. (I've changed the title to ~100% Binary Compatible)
As you say, the tools should report data mapping errors.
Make hubexec long-aligned.
Then the last simplification will be possible....
All instructions can be "longs" everywhere. PC will always be +1.
Only as the instruction is fetched from hub will 2 LSBs of %00 be appended.
Then we will ultimately have a simple programming model equally applied to hub/cog/lut.
That would mean:
cog exec $00000..$007FF
lut exec $00800..$00FFF
hub exec $01000..$FFFFF
A jump to $01000 would read a long from $04000.
Say you had a code label for instruction $01000. How would you get an address from that label, in order to do a RDBYTE? What would that look like?
cog exec $00000..$001FF
lut exec $00200..$003FF
hub exec $00400..$3FFFF (it is in longs; <<2 and append %00 for byte addresses for data)
For hub addresses, they will be addresses as bytes for all data accesses. However, for DJxx/TJxx/JMP/CALLx/RETx operands, hub addresses will always be shifted >>2 and masked to 18-bits by the compiler. The PC will always treat instructions as being longs and long-aligned, so the PC will always increment by +1 and be 18 bits. When fetching instructions from hub, the PC will append %00 which makes a long-aligned byte address.
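To make the proposed conversion concrete, here is a minimal Python sketch of the two directions: PC to byte address on instruction fetch, and byte address to branch operand at assembly time. It models the proposal in this comment only, not the current hardware. The function names are hypothetical, and rejecting misaligned targets is an assumption about how a tool might behave.

```python
PC_MASK = 0x3FFFF  # the proposal's 18-bit, long-addressed PC


def fetch_byte_address(pc):
    """Byte address of the hub long fetched for a given PC: append %00."""
    return (pc & PC_MASK) << 2


def branch_operand(byte_address):
    """Operand the assembler would emit for a branch to a hub byte address.

    The proposal shifts hub addresses >>2 and masks to 18 bits. Raising on
    a target that is not long-aligned is an assumption of this sketch.
    """
    if byte_address & 3:
        raise ValueError("branch target is not long-aligned")
    return (byte_address >> 2) & PC_MASK


# Hub exec starts at long address $00400, i.e. byte address $01000.
print(hex(fetch_byte_address(0x00400)))
print(hex(branch_operand(0x01000)))
```

The same shift explains the earlier remark that, under a long-addressed PC, a jump to $01000 would read a long from byte address $04000.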
Now it can be explained simply to the user that all instruction addresses are long-aligned long addresses, for hub/cog/lut.
And we have reduced the instruction address requirements to 18-bits. As a side benefit, it simplifies and frees up some opcode space in the process.
I'm not following - there already is "a simple programming model equally applied to hub/cog/lut", -
Chip now has hardware manage the differences, so the binary code is identical, and can be copied in blocks from HUB to COG or LUT, (RJMPs assumed) with only data access caveats.
Those caveats are more related to BYTE pointers than to alignment.
The user model is not improved by forcing Hubexec long aligned, but you do remove some more compact code options.
If some MHz gain were achieved by forcing align, that becomes a different question.
Here is an example
If you want to intermix BYTE and code, I think you are forced to use packers, if code is forced long-aligned.
That's not drop-dead, but it is wasteful.
I'm not seeing a compelling case for long alignment.
IIRC you said there was minimal speed impact? And what is there now works.
What I was getting at was how you reconcile code and data addresses via labels. How do you take a code label and get a long address out of it that you can use with RDLONG, in order to, say, load code into lut?
Okay, but what about using a code label as both a JMP address and a RDLONG address?
I'm wondering how you handle the 4x difference from PC to hub address. How do you bridge the two address schemes?
LOC of an immediate 20 bit address would shift it left two bits when not prefixed by ALTDS
long addr_label
would explicitly get the two 0 lsb's
Does this answer your question?
I am not really sure of all of LOC's usage.