P2 COG and HUB exec now ~100% Binary Compatible!

jmg · 2015-10-03 21:40

I think this milestone post deserves its own thread:

cgracey wrote: »
I got the new memory and branching model working. I also got REP working in hub exec.

There's full binary compatibility now between cog/lut code and hub code that use relative addressing.

In cog code now, we are back to the good old 1:1 addressing - no more 4x'd register addresses. What a relief!

Here's the new map for code execution :

00000..001FF = cog
00200..003FF = lut
00400..FFFFF = hub

Downloaded programs start at $400.

When in the cog, all registers are long, with their addresses being contiguous integers. The PC steps by 1.

When in the hub, instructions take 4 bytes. The PC steps by 4.
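
As an editorial illustration of the map and PC stepping described above, here is a minimal Python sketch. The function names `exec_region` and `next_pc` are hypothetical, and the model covers only the address ranges and step sizes stated in the post. It does not claim to describe what the hardware does when the PC crosses from one region into another.

```python
# Region boundaries taken from the execution map in the post.
COG_END = 0x001FF
LUT_END = 0x003FF
HUB_END = 0xFFFFF

def exec_region(pc):
    """Return which memory a 20-bit PC value executes from (hypothetical helper)."""
    if not 0 <= pc <= HUB_END:
        raise ValueError("PC outside the 20-bit execution map")
    if pc <= COG_END:
        return "cog"
    if pc <= LUT_END:
        return "lut"
    return "hub"

def next_pc(pc):
    """Step the PC by 1 in cog/lut (long addresses) or by 4 in hub (byte addresses)."""
    return pc + (4 if exec_region(pc) == "hub" else 1)
```

For example, `next_pc(0x010)` gives `0x011` in cog exec, while `next_pc(0x400)` gives `0x404` in hub exec.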

To bridge the two contexts, there are two simple things done:

The 9-bit-constant relative branches DJNZ/DJZ/TJZ/... encode the -256..+255 instruction range into their S field. When in cog exec, that value is sign-extended and added to the PC. When in hub exec, it is shifted left two bits and used the same way. This way, both cog and hub contexts get the max use out of these instructions and maintain binary compatibility.

The 20-bit-constant relative branches JMP/CALL/CALLA/... are encoded for hub exec as you imagine they would be, where they track byte offset. When the cog uses these branches, it shifts them right two bits to get cog-relative values. They are assembled pre-4x'd in cog code that way. So, these instructions are now binary compatible between cog/lut and hub code.
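
The two scaling rules can be sketched in Python as follows. This is an editorial model, not Chip's logic: the function names are hypothetical, `pc` stands for whatever base value the hardware adds the offset to (the post does not say whether that is the branch's own address or the next instruction's), and results are not masked to 20 bits.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

def short_branch_target(pc, s_field, in_hub):
    """9-bit relative branches: the field counts instructions (-256..+255).
    Cog exec adds it directly; hub exec shifts it left two bits first."""
    offset = sign_extend(s_field, 9)
    return pc + (offset << 2 if in_hub else offset)

def long_branch_target(pc, field20, in_hub):
    """20-bit relative branches: the field counts bytes.
    Hub exec adds it directly; cog exec shifts it right two bits first."""
    offset = sign_extend(field20, 20)
    return pc + (offset if in_hub else offset >> 2)
```

The same encoded field therefore means "three instructions back" in both contexts: a 9-bit field of `0x1FD` moves a cog PC by -3 and a hub PC by -12.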

REP now works in hub exec by forcing a jump during the last instruction in the repeat block. It didn't take much logic to implement and it works just as you'd expect. Even though it's slow in hub exec, because of the branching on each iteration, it is a convenient instruction to have for doing simple loops.

The assembler generates the same code for relative branches and REP in both cog/lut exec and hub exec contexts.

I will have updated FPGA files done tomorrow. I just finished the Prop123-A7 compile and now I need to make the DE2-115 version.

Here's what the all_cogs_blink program looks like now. Note the ORGH and the REP:
dat
	orgh	$400

' launch 15 cogs (cog 0 falls through and runs 'blink', too)
' any cogs missing from the FPGA won't blink

	loc	x,@blink

	rep	@repend,#15
	coginit	#16,x
repend

blink	cogid	x		'which cog am I?
	setb	dirb,x		'make that pin an output
	notb	outb,x		'flip its output state
	add	x,#16		'add to my id
	shl	x,#18		'shift up to make it big
	waitx	x		'wait that many clocks
	jmp	@blink		'do it again

	org
x	res	1		'variable at cog register 8

jmg · 2015-10-03 21:40

This new feature is related:

cgracey wrote: »
Here's an example of the JMPREL instruction. It works in both cog and hub. There needed to be some mechanism like this that automatically scales the branch offset (<<2 for hub), so that variable-relative branches could be realized in binary-compatible code.
dat
	orgh	$400


pgm	mov	dira,#$FF	'entry
	mov	x,#0

loop	shl	x,#1		'loop
	call	@spread
	shr	x,#1
	incmod	x,#7
	jmp	@loop


spread	jmprel	x

	notb	outa,#0
	ret

	notb	outa,#1
	ret

	notb	outa,#2
	ret

	notb	outa,#3
	ret

	notb	outa,#4
	ret

	notb	outa,#5
	ret

	notb	outa,#6
	ret

	notb	outa,#7
	ret


	org
x	res	1
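
To see why the same JMPREL table works from either memory, here is a small Python model of the dispatch. It is an editorial sketch under assumptions: `jmprel_target` and `table_entry` are hypothetical names, and the offset is taken relative to the instruction after JMPREL, which is inferred from the listing (an offset of 0 has to land on the first NOTB).

```python
def jmprel_target(pc_next, d, in_hub):
    """Offset d counts instructions; hub exec scales it by 4 (<<2) to get bytes."""
    return pc_next + (d << 2 if in_hub else d)

def table_entry(pc_next, target, in_hub):
    """Which two-instruction NOTB/RET pair a target address lands on."""
    bytes_or_longs_per_entry = 2 * (4 if in_hub else 1)
    return (target - pc_next) // bytes_or_longs_per_entry

# The loop shifts x left by one before the call, so d = 2 * x selects entry x.
x = 3
d = x << 1
cog_hit = table_entry(0x020, jmprel_target(0x020, d, False), False)
hub_hit = table_entry(0x420, jmprel_target(0x420, d, True), True)
```

Both `cog_hit` and `hub_hit` come out as entry 3, even though the cog target is 6 longs ahead and the hub target is 24 bytes ahead. The base addresses `0x020` and `0x420` are arbitrary values chosen for the illustration.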

jmg · 2015-10-03 21:53

Why does this matter?
It saves a shipload of explaining and admin, and means discussion can focus on features, not on caveats and gotchas.
It also allows one mindset in code development, and late-in-design-flow choices on what code will run where.

HUB Exec, including in Assembler, now has all the features of COG exec, and users can craft a design to use COG exec where it really matters.

P2 has now gained some things in common with PC-level, higher-end MPUs: those have a local cache that is limited but very fast, and a much larger SDRAM that is less deterministic.
(and usually an OS as well, to make things increasingly less deterministic)

Add cache-lock thinking, where a design can lock small code into that faster memory area, and the P2 tracks that mindset, but adds the feature that such fastest, deterministic code also gets its own core.

It is easy to explain to new users, and the potential for hard real-time use is obvious.

David Betz · 2015-10-03 22:05

Well, they're not entirely compatible. You still can't have byte-aligned instructions in COG memory which means that you can't use the trick that Chip mentioned about putting inline byte data in your code if you intend to run it in COG memory. Requiring instructions to be long-aligned even in hub memory would improve that a bit but you still wouldn't be able to run Chip's code because COG memory isn't byte addressable. So, while it's true that any code that will run in COG memory will also run in hub memory, the reverse is not necessarily true unless you follow some restrictions.

jmg · 2015-10-03 22:12

David Betz wrote: »

Well, they're not entirely compatible. You still can't have byte-aligned instructions in COG memory which means that you can't use the trick that Chip mentioned about putting inline byte data in your code if you intend to run it in COG memory. Requiring instructions to be long-aligned even in hub memory would improve that a bit but you still wouldn't be able to run Chip's code because COG memory isn't byte addressable. So, while it's true that any code that will run in COG memory will also run in hub memory, the reverse is not necessarily true unless you follow some restrictions.

The bigger limit here seems to be that COG (LUT?) memory isn't byte addressable?
Alignment for portability is more of a tool setting issue.

I'll paste Chip's code snippet here, as it is small and shows the points

cgracey wrote: »
Here is an example of why unaligned hub code is important:
	call	@send_string
	byte	13,13,"The time is ",0
	mov	val,hours
	call	@send_decimal2
	call	@send_string
	byte	':',0
	mov	val,minutes
	call	@send_decimal2
	call	@send_string
	byte	" and the date is ",0
	...
You can do things like that, which is way better than having to get pointers to data located elsewhere.
Edit: changed 'db' to 'byte'

One could argue that placing casual string constants into COG is not a good idea, and send_string is not likely to be the sort of hard-real-time code that is cog-destined.
Because COG can call HUB anytime, users have some choices here: place this code in HUB, or place the strings in HUB with a prefix call.
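
A quick way to see the alignment issue in Chip's snippet is to lay out its first three lines and check where each instruction lands. The Python below is an editorial sketch: `layout` is a hypothetical helper, each instruction is taken as 4 bytes as stated earlier in the thread, and the origin `0x400` is only an assumed starting point (the download address Chip mentions).

```python
INSTRUCTION_BYTES = 4

def layout(items, origin):
    """Assign a byte address to each item; instructions take 4 bytes, data its own length."""
    addr = origin
    placed = []
    for kind, payload in items:
        placed.append((addr, kind))
        addr += INSTRUCTION_BYTES if kind == "ins" else len(payload)
    return placed

snippet = [
    ("ins", "call @send_string"),
    ("byte", bytes([13, 13]) + b"The time is " + b"\0"),  # 15 bytes of inline data
    ("ins", "mov val,hours"),
]

placed = layout(snippet, 0x400)
# Instructions whose byte address is not a multiple of 4.
misaligned = [addr for addr, kind in placed if kind == "ins" and addr % 4]
```

With these assumptions the MOV lands at `0x413`, which is not long-aligned. Byte-addressed hub exec can fetch it from there, while cog memory, addressed only in whole longs, has no way to express that position.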

Roy Eltham · 2015-10-03 22:37

David beat me to it. It's not 100% binary compatible, but at least all the instructions work the same no matter where they are run from.

Also, it's fairly trivial to make your code work in both by avoiding things like Chip's string example. Especially in generated code like from gcc.

I assume Chip's assembler already complains if you try to compile his example in COG space.

Anyway, I think we are in a good place on all this stuff.

jmg · 2015-10-04 00:00

'Binary compatible' usually means that 'all the instructions work the same no matter where they are run from'.
But yes, there are memory-area and opcode-reach caveats that mean mixed code and data can have issues. Those relate less to opcode execution and more to data mapping. (I've changed the title to ~100% Binary Compatible)

As you say, the tools should report data mapping errors.

Cluso99 · 2015-10-05 03:32

Just one more baby step required...

Make hubexec long-aligned.
Then the last simplification will be possible....

All instructions can be "longs" everywhere. PC will always be +1.
Only as the instruction is fetched from hub will 2 LSBs of %00 be appended.

Then we will ultimately have a simple programming model equally applied to hub/cog/lut.

potatohead · 2015-10-05 03:34

I don't want it aligned. Being able to inline data is too sweet.

potatohead · 2015-10-05 03:41

Besides, when we make SPIN 2, it will have inline PASM; that, together with byte code and native PASM options, will make for nice, dense programs.

cgracey · 2015-10-05 03:45

Cluso99 wrote: »

Just one more baby step required...

Make hubexec long-aligned.
Then the last simplification will be possible....

All instructions can be "longs" everywhere. PC will always be +1.
Only as the instruction is fetched from hub will 2 LSBs of %00 be appended.

Then we will ultimately have a simple programming model equally applied to hub/cog/lut.

That would mean:

cog exec $00000..$007FF
lut exec $00800..$00FFF
hub exec $01000..$FFFFF

A jump to $01000 would read a long from $04000.

Say you had a code label for instruction $01000. How would you get an address from that label, in order to do a RDBYTE? What would that look like?

Cluso99 · 2015-10-05 04:02

cgracey wrote: »

Cluso99 wrote: »

Just one more baby step required...

Make hubexec long-aligned.
Then the last simplification will be possible....

All instructions can be "longs" everywhere. PC will always be +1.
Only as the instruction is fetched from hub will 2 LSBs of %00 be appended.

Then we will ultimately have a simple programming model equally applied to hub/cog/lut.

That would mean:

cog exec $00000..$007FF
lut exec $00800..$00FFF
hub exec $01000..$FFFFF

A jump to $01000 would read a long from $04000.

Say you had a code label for instruction $01000. How would you get an address from that label, in order to do a RDBYTE? What would that look like?

No!

cog exec $00000..$001FF
lut exec $00200..$003FF
hub exec $00400..$3FFFF (it is in longs; <<2 and append %00 for byte addresses for data)

For hub addresses, they will be addresses as bytes for all data accesses. However, for DJxx/TJxx/JMP/CALLx/RETx operands, hub addresses will always be shifted >>2 and masked to 18-bits by the compiler. The PC will always treat instructions as being longs and long-aligned, so the PC will always increment by +1 and be 18 bits. When fetching instructions from hub, the PC will append %00 which makes a long-aligned byte address.

Now it can be explained simply to the user that all instruction addresses are long-aligned long addresses, for hub/cog/lut.
And we have reduced the instruction address requirements to 18-bits. As a side benefit, it simplifies and frees up some opcode space in the process.
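
The address conversion in this proposal can be sketched in Python as below. This models only the long-aligned scheme proposed in this post, not the design Chip has implemented, and the names `branch_operand` and `fetch_address` are hypothetical.

```python
PC_BITS = 18
PC_MASK = (1 << PC_BITS) - 1

def branch_operand(byte_addr):
    """What the assembler would place in a branch: the hub byte address >>2, masked to 18 bits."""
    if byte_addr % 4:
        raise ValueError("instruction label is not long-aligned")
    return (byte_addr >> 2) & PC_MASK

def fetch_address(pc):
    """Hub byte address used for the instruction fetch: the long PC with %00 appended."""
    return (pc & PC_MASK) << 2
```

Under this model `branch_operand(0x4000)` is `0x1000` and `fetch_address(0x1000)` is `0x4000`, which is the 4x gap between PC values and data addresses that Chip's question is about.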

jmg · 2015-10-05 04:12

Cluso99 wrote: »

Just one more baby step required...

Make hubexec long-aligned.
Then the last simplification will be possible....

All instructions can be "longs" everywhere. PC will always be +1.
Only as the instruction is fetched from hub will 2 LSBs of %00 be appended.

Then we will ultimately have a simple programming model equally applied to hub/cog/lut.

I'm not following - there already is "a simple programming model equally applied to hub/cog/lut".

Chip now has the hardware manage the differences, so the binary code is identical and can be copied in blocks from HUB to COG or LUT (relative jumps assumed), with only data access caveats.
Those caveats are more related to BYTE pointers than to alignment.

The user model is not improved by forcing Hubexec long aligned, but you do remove some more compact code options.

If some MHz gain were achieved by forcing align, that becomes a different question.

Cluso99 · 2015-10-05 04:15

from Chip
Say you had a code label for instruction $01000. How would you get an address from that label, in order to do a RDBYTE? What would that look like?

Here is an example

dat
              orgh              $04000                  'hub byte addresses
              mov               y,x                     '<-- compiler forces long-alignment (or errors out)
              loc               hubptr,datab            ' set hub byte address for "datab" label 
              rdbyte            y,hubptr
              loc               hubptr,dataw            ' set hub byte address for "dataw" label 
              rdword            y,hubptr
              loc               hubptr,datal            ' set hub byte address for "datal" label 
              rdlong            y,hubptr

                orgh            $08000                  'hub byte addresses

datab           byte            "A"
                byte            0                       'filler
dataw           word            $1234
datal           long            $1234_4321


              org               $0100                   'cog long addresses
x             long              $10
y             res               1
hubptr        res               1

jmg · 2015-10-05 04:26

cgracey wrote: »

Say you had a code label for instruction $01000. How would you get an address from that label, in order to do a RDBYTE? What would that look like?

If you want to intermix BYTE and code, I think you are forced to use packers, if code is forced long-aligned.

That's not drop-dead, but it is wasteful.

I'm not seeing a compelling case for long alignment.
IIRC you said there was minimal speed impact? And what is there now works.

cgracey · 2015-10-05 04:43

Cluso99 wrote: »

from Chip
Say you had a code label for instruction $01000. How would you get an address from that label, in order to do a RDBYTE? What would that look like?

Here is an example

dat
              orgh              $04000                  'hub byte addresses
              mov               y,x                     '<-- compiler forces long-alignment (or errors out)
              loc               hubptr,datab            ' set hub byte address for "datab" label 
              rdbyte            y,hubptr
              loc               hubptr,dataw            ' set hub byte address for "dataw" label 
              rdword            y,hubptr
              loc               hubptr,datal            ' set hub byte address for "datal" label 
              rdlong            y,hubptr

                orgh            $08000                  'hub byte addresses

datab           byte            "A"
                byte            0                       'filler
dataw           word            $1234
datal           long            $1234_4321


              org               $0100                   'cog long addresses
x             long              $10
y             res               1
hubptr        res               1

What I was getting at was: how do you reconcile code and data addresses via labels? How do you take a code label and get a long address out of it that you can use with RDLONG, in order to, say, load code into lut?

Cluso99 · 2015-10-05 07:16

Here is how I think it should work...

DAT
                orgh    $04000

entry
                loc     hptr,#code                      ' hub byte address
                setq    #(x-begin)                      ' count in longs
                rdlong  begin,hptr                      ' "begin" is cog (long) reg/addr; hptr points to hub (byte) addr

                jmp     #begin

code            .....                                   'code to be loaded from hub to cog
                .....

                org     $008

begin           ......

data            long    0
hptr            res     1
x               res     1

cgracey · 2015-10-05 13:15

Cluso99 wrote: »

Here is how I think it should work...

DAT
                orgh    $04000

entry
                loc     hptr,#code                      ' hub byte address
                setq    #(x-begin)                      ' count in longs
                rdlong  begin,hptr                      ' "begin" is cog (long) reg/addr; hptr points to hub (byte) addr

                jmp     #begin

code            .....                                   'code to be loaded from hub to cog
                .....

                org     $008

begin           ......

data            long    0
hptr            res     1
x               res     1

Okay, but what about using a code label as both a JMP address and a RDLONG address?

I'm wondering how you handle the 4x difference from PC to hub address. How do you bridge the two address schemes?

Bill Henning · 2015-10-05 18:53

9 and 20 bit embedded addresses have implied '00' extension, managed by the assembler when not prefixed by ALTDS

LOC of an immediate 20 bit address would shift it left two bits when not prefixed by ALTDS

long addr_label

would explicitly get the two 0 lsb's

Cluso99 · 2015-10-05 22:02

Chip,
Does this answer your question?

                orgh    $04000

entry
                loc     hptr,#code                      ' hub byte address
                setq    #(x-begin)                      ' count in longs
                rdlong  begin,hptr                      ' "begin" is cog (long) reg/addr; hptr points to hub (byte) addr

'some silly code to show jmp #abs in hubexec
                .....
                calla   #hub_sub                        ' assembler inserts hub long addr (hub_sub>>2) into call instr
                .....
                jmp     #now_begin                      ' assembler inserts hub long addr (now_begin>>2) into jmp instr 


now_begin       .....
                jmp     #begin

'some subroutine
hub_sub         .....
                .....
                reta

code            .....                                   'code to be loaded from hub to cog
                .....

                org     $008

begin           ......

data            long    0
hptr            res     1
x               res     1

Cluso99 · 2015-10-05 22:06

Is there any case where LOC is used to get an instruction address?
I am not really sure of all of LOC's uses.
