The addressing conundrum

evanh · 2015-10-02 22:21

For HubExec it's no big deal but REP for use in CogExec is a beautiful thing. I can probably think of other features to forgo before ditching that just to save a little logic.

jmg · 2015-10-02 22:37

Seairth wrote: »

* Timing is not the same in cog and hub exec modes, except for very small snippets.
* For code that's good enough to run in hub exec mode, it seems highly unlikely that there'd be a reason to run it in a cog instead.
....

All of this is correct, but ignores source code maintenance, and the concept of least surprises - which is why Chip is now trying hard to make REP work in HUB mode.

Also "good enough to run in hub exec mode" varies with application, one COG may be fine with a slower LIB call, and have other higher priority tasks packed into the COG, whilst another COG in the same design, may want the lower jitter that COG-run buys.
There is no optimal single target for code, it varies COG by COG.

COG code is also likely to be lower power and better protected than HUB, so even more reasons to move code around exist.
Put all those designer-choice items into the mix, and it is hard not to want binary compatible operation.

evanh · 2015-10-02 23:23

HLL sources don't know anything about REP. REP is optional for HubExec. If it is easy to fix then sure, let's have it. But if not, then it can stay as is.

msrobots · 2015-10-03 00:37

I am not sure why binary compatibility is so important.

If you write in PASM you need to assemble it. You will start with a source file and the Assembler can take care of it.
If you write in C you need to compile it. You will start with source files and the Compiler takes care of that.
If you write in SPIN or a future PropBasic you will also start from a source file.

So why do we need binary compatibility at all?

Neither of the languages on the prop use dynamic linked modules in binary form. Where should they load from? Even PropGCC uses just static linking, so I think there should be no need to support linking precompiled binaries.

And even if somebody wants to do that it would just make sense for a large program, bound to run in HubExec anyways.

Parallax also made a huge effort to go into the Open Source thing. Even the OBEX for the P1 is Open Source somehow.

Do we really need to put that closed source, just provided as binary, into this? And if so, why?

If you run code in a COG in C you can do that by providing -mcog or something like that. Works already on the P1.

Like @Chip said, I do not see the same code running on the same P2 sometimes in HubExec and sometimes in CogExec in the same program. Makes no sense to me.

So it can be a compile time decision where it runs. And should be.

Please enlighten the stupid me.

Enjoy!

Mike

jmg · 2015-10-03 00:51

msrobots wrote: »

Neither of the languages on the prop use dynamic linked modules in binary form.

Not yet, but they could easily do this.
You have closed a lot of doors, and constrained usage, and complicated libraries, with what you seek to impose above.

Seairth · 2015-10-03 01:02

jmg wrote: »

Seairth wrote: »

* Timing is not the same in cog and hub exec modes, except for very small snippets.
* For code that's good enough to run in hub exec mode, it seems highly unlikely that there'd be a reason to run it in a cog instead.
....

All of this is correct, but ignores source code maintenance, and the concept of least surprises - which is why Chip is now trying hard to make REP work in HUB mode.

Also "good enough to run in hub exec mode" varies with application, one COG may be fine with a slower LIB call, and have other higher priority tasks packed into the COG, whilst another COG in the same design, may want the lower jitter that COG-run buys.
There is no optimal single target for code, it varies COG by COG.

COG code is also likely to be lower power and better protected than HUB, so even more reasons to move code around exist.
Put all those designer-choice items into the mix, and it is hard not to want binary compatible operation.

I was only discussing binary compatibility. What you are talking about is primarily a matter of source compatibility. Practically speaking, if you want a bit of code to run specifically in hub or cog, you will most likely do so when writing the code, not at run time. Yes, it is possible to wait until run time to make that decision, but I think that will be a rare design approach. It will not be a regularly used feature and therefore I believe it should not be dictating changes to the language and/or architecture.

Cluso99 · 2015-10-03 01:02

Binary compatibility for hubexec and cogexec just makes sense.
No need to worry about differences.
No need to supply two versions, one for hub and one for cog.

Just think about where the P2 is headed...
Chip wants an on-chip development platform. (we actually have that capability on P1 now)
So, users will not want to compile everything. Binaries can be held as files on SD cards.

It's just another unnecessary restriction which makes hubexec seem like a kludge.

IMHO it all adds up to why I am so adamant that the "Instruction Model" should be "Longs" and "Long Aligned".
It is just plain simple, imposes no unnecessary restrictions, and easy to understand.

FWIW non xxxx aligned instructions are abnormal in the computer era. There is a reason for this - simplicity.

msrobots · 2015-10-03 01:03

jmg wrote: »

msrobots wrote: »

Neither of the languages on the prop use dynamic linked modules in binary form.

Not yet, but they could easily do this.
You have closed a lot of doors, and constrained usage, and complicated libraries, with what you seek to impose above.

Why is that so? And why can you not compile the library from source into your program?

We have no support for a memory manager or virtual addresses. So your binary would either need to be fully relocatable or would need to be compiled from source anyway.

If located in a COG the function/subroutine/whatever is anyways single use for that single cog. No sense in sharing at all. Doesn't work. So no dynamic part anyways.

If located in HUB the function/subroutine/whatever is always shared between all cogs. They can call it if needed.

I still do not see the need for some DLLs on the P2.

Enjoy!

Mike

jmg · 2015-10-03 01:13

Seairth wrote: »

I was only discussing binary compatibility. What you are talking about is primarily a matter of source compatibility. Practically speaking, if you want a bit of code to run specifically in hub or cog, you will most likely do so when writing the code, not at run time. Yes, it is possible to wait until run time to make that decision, but I think that will be a rare design approach. It will not be a regularly used feature and therefore I believe it should not be dictating changes to the language and/or architecture.

If REP cannot work in HUB, yes, users could live with that, but care is needed with dismissing binary compatible as unimportant.
Once you are incompatible, the questions become 'where, and by how much?' and the users and tools have to tag and track the variances, and that is non-trivial. It is also simply unexpected in MCU land, and so fails the 'no surprises' model.

I think Chip has now made AJMP/RJMP compatible and that is a significant gain, it would be nice if REP can have a form that is compatible, but I could live with a build-compatible version of REP that uses DJNZ.
It sounds like that can use a single source master.

evanh · 2015-10-03 01:34

Cluso99 wrote: »

IMHO it all adds up to why I am so adamant that the "Instruction Model" should be "Longs" and "Long Aligned".

I think Chip is doing this. Or at very least re-pondering the possibilities of which combination works the best.

msrobots · 2015-10-03 01:43

Yes, for libraries the same source aspect sounds good. But even with LutExec a COG is a small resource, and basically the LUT is not there just to be executed. It can be now, but the reason for its invention was somehow different from extending Cog code space.

It is made to lookup/buffer/transfer DATA with some neat pointers as far as I remember.

So we basically have still quite small COGs if we use the lut as -hmm - lut.

But lots of them.

So stuff to run in a cog will be created to do so and stuff to run in the hub also.

Like with SPIN and PASM or with PropGCC as done on the P1. If it fits it can be compiled with -mcog, if not, not.

With 16 Cogs you will mostly not be running all of them in HubExec. And drivers for hardware should 'sort of' run in a COG to be mixed and combined like LEGO.

So I still think that the only form of a dynamic linkable unit making sense on a propeller is some object running on a cog like them coglets from @dr_acula.

I do like SPIN and know that I will have to wait a while until SPIN runs on a P2. But it will!

Enjoy!

Mike

Electrodude · 2015-10-03 03:31

Binary compatibility is important for when you have a function that is called rarely in some parts of your code and very often in others. Normally, you can call the function out of hubram, but when you need it often and want to load it into cog/lutram, a simple setq+rdlong is all that's necessary.

In fact, instead of calling commonly juggled functions directly, code that can't be sure if a juggleable function is currently in cog/lutram could instead call a jmp in, say, lutram, that jumps to the cog/lutram version if the function is currently in cog/lutram, and otherwise jumps to the hubram version. This jump would have to be manually updated to point to the appropriate place whenever the function is loaded into or unloaded from cog/lutram.
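
The bookkeeping behind that trampoline idea can be sketched in Python rather than PASM. This is only a rough model: the class name, method names, and both addresses are hypothetical, and it shows nothing more than callers always going through one indirection that gets repointed whenever the function moves.

```python
class Trampoline:
    """Models the single jmp slot that callers branch through (hypothetical)."""

    def __init__(self, hub_addr):
        self.hub_addr = hub_addr   # permanent copy of the function in hub RAM
        self.target = hub_addr     # where the jmp currently points

    def load_into_cog(self, cog_addr):
        # After the code has been copied into cog/lut RAM, repoint the jmp.
        self.target = cog_addr

    def unload(self):
        # The cog/lut copy is gone, so fall back to the hub copy.
        self.target = self.hub_addr


# Hypothetical addresses: hub copy at $1000, cog copy at $100.
tramp = Trampoline(hub_addr=0x1000)
before = tramp.target
tramp.load_into_cog(0x100)
during = tramp.target
tramp.unload()
after = tramp.target
```

The cost of this scheme is one extra jump per call, plus the discipline of updating the slot on every load and unload.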

Heater. · 2015-10-03 07:18

Do we now have three processor architectures in one machine: HUB, COG and LUT?

Colour me confused but that sounds like the road to madness.

Currently we have COG and LMM. Which are not binary compatible. They are not source compatible at the assembler level. They are source compatible in C/C++ at least.

Roy Eltham · 2015-10-03 07:53

Heater,
One way to think about the COG/LUT is like this: It's 1024 longs of instructions space, with the first 512 as direct access registers and the second 512 as optionally available as indirect access data memory.

HUBEXEC is just hardware accelerated LMM.

Heater. · 2015-10-03 08:22

All right Roy. So can I assume that the HUBEXEC is a different CPU architecture in the same way that LMM is on the P1? You certainly cannot run LMM code in COG or vice versa.

jmg · 2015-10-03 08:30

Heater. wrote: »

So can I assume that the HUBEXEC is a different CPU architecture in the same way that LMM is on the P1? You certainly cannot run LMM code in COG or vice versa.

If HUB is running LMM then yes, that applies exactly.

However, P2 Hubexec can run ASM opcodes which is not an option on P1, and there, the P1 metaphor must diverge.

Heater. · 2015-10-03 08:56

jmg,

But, but, running from HUB and running from COG are not the same. And cannot ever be as far as I can tell.

Unless things have changed beyond all recognition (sorry I have not been following all the P2 details for a long while) then:

1) In COG I can jump to any other COG location with a single 32 bit JMP instruction. Clearly in HUB I need a lot more bits to get to all HUB locations. That sounds like a fundamentally different CPU architecture right there.

2) If all else were bit for bit the same HUB code will be slower and it will suffer the jitters as it competes for HUB access with all other COGS.

Seems to me that any coder, and by extension compiler, will need to be aware of what code runs where, HUB or COG.

Roy Eltham · 2015-10-03 08:57

Another way to think about it is that it's one CPU architecture that is able to read instructions from all 3 of its memories.

The instructions all work the same no matter where they come from aside from hub being slower because of stalls refilling the streamer on branches (and currently, the REP instruction, but hopefully that gets fixed).

Roy Eltham · 2015-10-03 09:01

Heater,
There are branch instructions with 20 bits of target in them (still 32bit total), and there are relative branch instructions with smaller numbers of bits for how far to jump + or -.

When you branch (from anywhere) to a cog/lut address it will start executing from cog memory there, if you branch to a hub address, then it will start executing from there.

Chip has made it simple.

Heater. · 2015-10-03 09:09

Roy,

Sounds great.

What happened to JMPRET? As used in CALL/RET. Which used to need two 9 bit addresses.

Roy Eltham · 2015-10-03 09:19

There are a whole slew of CALLx/RETx instructions that use various stacks. One internal to the COG that's like 8 deep or so, and the others are in HUB via PTRA/B registers. I think it's also possible to use LUT memory for a stack.

The old style JMPRET that used self modifying to achieve the return is gone.

jmg · 2015-10-03 09:30

Heater. wrote: »

But, but, running from HUB and running from COG are not the same. And cannot ever be as far as I can tell.

2) If all else were bit for bit the same HUB code will be slower and it will suffer the jitters as it competes for HUB access with all other COGS.

True in the second part, for the first part, see below

Heater. wrote: »

Unless things have changed beyond all recognition (sorry I have not been following all the P2 details for a long while) then:

1) In COG I can jump to any other COG location with a single 32 bit JMP instruction. Clearly in HUB I need a lot more bits to get to all HUB locations. That sounds like a fundamentally different CPU architecture right there.

Things have changed - a COG can call into HUBEXEC code from anywhere, (and simply roll-into LUT) and HUBEXEC can call COG code.
All of this code can be ASM, if you like.

Heater. wrote: »

Seems to me that any coder, and by extension compiler, will need to be aware of what code runs where, HUB or COG.

Correct, but that placement decision can be made late in a design, and vary from COG to COG.

I believe Chip now has AJMP and RJMP binary compatible, which means only REP (for now) is truly COG-centric. The difference has all but melted away.

Heater. · 2015-10-03 10:02

Oh boy, I have a lot of catching up to do. I'm a bit stuck if it won't run on my DE0-Nano though.

David Betz · 2015-10-03 11:07

Heater. wrote: »

Oh boy, I have a lot of catching up to do. I'm a bit stuck if it won't run on my DE0-Nano though.

I think Dave Hein has updated his P2 simulator to handle the new instruction and architecture. You might try that until a DE0-Nano FPGA image is available.

cgracey · 2015-10-03 11:24

I got the new memory and branching model working. I also got REP working in hub exec.

There's full binary compatibility now between cog/lut code and hub code that use relative addressing.

In cog code now, we are back to the good old 1:1 addressing - no more 4x'd register addresses. What a relief!

Here's the new map for code execution :

00000..001FF = cog
00200..003FF = lut
00400..FFFFF = hub

Downloaded programs start at $400.

When in the cog, all registers are long, with their addresses being contiguous integers. The PC steps by 1.

When in the hub, instructions take 4 bytes. The PC steps by 4.
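
As a rough illustration of the map and the two PC step sizes, here is a small Python model. The function names are invented for this sketch, and the step of 1 in the lut range is an inference from the 512-entry address range rather than something stated above.

```python
def exec_region(addr):
    """Return (region, pc_step) for a 20-bit execution address."""
    if not 0x00000 <= addr <= 0xFFFFF:
        raise ValueError("address outside the 20-bit execution map")
    if addr <= 0x001FF:
        return ("cog", 1)   # registers are longs, PC steps by 1
    if addr <= 0x003FF:
        return ("lut", 1)   # assumed to step like cog (inferred, not stated)
    return ("hub", 4)       # instructions take 4 bytes, PC steps by 4


def next_pc(addr):
    """Address of the instruction that follows addr, with no branch taken."""
    _, step = exec_region(addr)
    return addr + step
```

With this model, `exec_region(0x400)` reports hub execution with a step of 4, which matches downloaded programs starting at $400.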

To bridge the two contexts, there are two simple things done:

The 9-bit-constant relative branches DJNZ/DJZ/TJZ/... encode the -256..+255 instruction range into their S field. When in cog exec, that value is sign-extended and added to the PC. When in hub exec, it is shifted left two bits and used the same way. This way, both cog and hub contexts get the max use out of these instructions and maintain binary compatibility.

The 20-bit-constant relative branches JMP/CALL/CALLA/... are encoded for hub exec as you imagine they would be, where they track byte offset. When the cog uses these branches, it shifts them right two bits to get cog-relative values. They are assembled pre-4x'd in cog code that way. So, these instructions are now binary compatible between cog/lut and hub code.
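
The two rules can be sketched in Python. This models only the displacement that gets applied to the PC, not the base it is added to, which is not spelled out here. Treating the 20-bit field as a signed two's-complement value is an assumption, and the function names are made up for illustration.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def short_branch_delta(s_field, hub_exec):
    """9-bit relative branches (DJNZ/DJZ/TJZ/...): field counts instructions."""
    n = sign_extend(s_field, 9)
    return n << 2 if hub_exec else n          # hub: scale to bytes


def long_branch_delta(field20, hub_exec):
    """20-bit relative branches (JMP/CALL/...): field counts bytes."""
    byte_offset = sign_extend(field20, 20)    # signedness is an assumption
    return byte_offset if hub_exec else byte_offset >> 2   # cog: bytes to longs


# A branch back by three instructions, encoded once, used in both contexts.
short_cog = short_branch_delta(0x1FD, hub_exec=False)
short_hub = short_branch_delta(0x1FD, hub_exec=True)
long_cog = long_branch_delta(0xFFFF4, hub_exec=False)
long_hub = long_branch_delta(0xFFFF4, hub_exec=True)
```

In both encodings the cog result is -3 registers and the hub result is -12 bytes, which is the same three instructions. That equality is what makes the same binary usable in either place.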

REP now works in hub exec by forcing a jump during the last instruction in the repeat block. It didn't take much logic to implement and it works just as you'd expect. Even though it's slow in hub exec, because of the branching on each iteration, it is a convenient instruction to have for doing simple loops.

The assembler generates the same code for relative branches and REP in both cog/lut exec and hub exec contexts.

I will have updated FPGA files done tomorrow. I just finished the Prop123-A7 compile and now I need to make the DE2-115 version.

Here's what the all_cogs_blink program looks like now. Note the ORGH and the REP:

dat
	orgh	$400

' launch 15 cogs (cog 0 falls through and runs 'blink', too)
' any cogs missing from the FPGA won't blink

	loc	x,@blink

	rep	@repend,#15
	coginit	#16,x
repend

blink	cogid	x		'which cog am I?
	setb	dirb,x		'make that pin an output
	notb	outb,x		'flip its output state
	add	x,#16		'add to my id
	shl	x,#18		'shift up to make it big
	waitx	x		'wait that many clocks
	jmp	@blink		'do it again

	org
x	res	1		'variable at cog register 8

Rayman · 2015-10-03 11:37

Sounds perfect! Glad to hear you solved the binary compatibility puzzle.

cgracey · 2015-10-03 12:00

Here's an example of the JMPREL instruction. It works in both cog and hub. There needed to be some mechanism like this that automatically scales the branch offset (<<2 for hub), so that variable-relative branches could be realized in binary-compatible code.

dat
	orgh	$400


pgm	mov	dira,#$FF	'entry
	mov	x,#0

loop	shl	x,#1		'loop
	call	@spread
	shr	x,#1
	incmod	x,#7
	jmp	@loop


spread	jmprel	x

	notb	outa,#0
	ret

	notb	outa,#1
	ret

	notb	outa,#2
	ret

	notb	outa,#3
	ret

	notb	outa,#4
	ret

	notb	outa,#5
	ret

	notb	outa,#6
	ret

	notb	outa,#7
	ret


	org
x	res	1
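
The <<2 scaling can be checked with a small Python model of the spread table above. It assumes JMPREL skips forward by its operand, counted in instructions, from the instruction that follows it, which is what the example implies (x = 0 reaches the first NOTB). The names are hypothetical.

```python
# The sixteen instructions that follow "spread jmprel x" in the listing.
TABLE = [op for pin in range(8) for op in (f"notb outa,#{pin}", "ret")]


def jmprel_delta(d, hub_exec):
    """Amount added to the PC: instructions in cog, bytes (<<2) in hub."""
    return d << 2 if hub_exec else d


def landing_instruction(d, hub_exec):
    """Which table entry the jump reaches, given the PC step of the context."""
    step = 4 if hub_exec else 1
    return TABLE[jmprel_delta(d, hub_exec) // step]


# The loop doubles x before the call because each case is two instructions.
cog_hits = [landing_instruction(i << 1, hub_exec=False) for i in range(8)]
hub_hits = [landing_instruction(i << 1, hub_exec=True) for i in range(8)]
```

Both lists come out as the eight NOTB instructions in pin order, so the same `shl x,#1` works whether the table sits in cog or in hub.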

evanh · 2015-10-03 12:03

Ah, took me a bit to twig. The LOC instruction is there because COGINIT can't take more than a 9-bit immediate start address. And x is conveniently placed on one of the special registers that LOC can use.

evanh · 2015-10-03 12:08

I think I have a plan for my own asm code from now on - use multiple labels for locations that have reused purposes.

David Betz · 2015-10-03 12:32

cgracey wrote: »
I got the new memory and branching model working. I also got REP working in hub exec.

There's full binary compatibility now between cog/lut code and hub code that use relative addressing.

In cog code now, we are back to the good old 1:1 addressing - no more 4x'd register addresses. What a relief!

Here's the new map for code execution :

00000..001FF = cog
00200..003FF = lut
00400..FFFFF = hub

Downloaded programs start at $400.

When in the cog, all registers are long, with their addresses being contiguous integers. The PC steps by 1.

When in the hub, instructions take 4 bytes. The PC steps by 4.

To bridge the two contexts, there are two simple things done:

The 9-bit-constant relative branches DJNZ/DJZ/TJZ/... encode the -256..+255 instruction range into their S field. When in cog exec, that value is sign-extended and added to the PC. When in hub exec, it is shifted left two bits and used the same way. This way, both cog and hub contexts get the max use out of these instructions and maintain binary compatibility.

The 20-bit-constant relative branches JMP/CALL/CALLA/... are encoded for hub exec as you imagine they would be, where they track byte offset. When the cog uses these branches, it shifts them right two bits to get cog-relative values. They are assembled pre-4x'd in cog code that way. So, these instructions are now binary compatible between cog/lut and hub code.

REP now works in hub exec by forcing a jump during the last instruction in the repeat block. It didn't take much logic to implement and it works just as you'd expect. Even though it's slow in hub exec, because of the branching on each iteration, it is a convenient instruction to have for doing simple loops.

The assembler generates the same code for relative branches and REP in both cog/lut exec and hub exec contexts.

I will have updated FPGA files done tomorrow. I just finished the Prop123-A7 compile and now I need to make the DE2-115 version.

Here's what the all_cogs_blink program looks like now. Note the ORGH and the REP:
dat
	orgh	$400

' launch 15 cogs (cog 0 falls through and runs 'blink', too)
' any cogs missing from the FPGA won't blink

	loc	x,@blink

	rep	@repend,#15
	coginit	#16,x
repend

blink	cogid	x		'which cog am I?
	setb	dirb,x		'make that pin an output
	notb	outb,x		'flip its output state
	add	x,#16		'add to my id
	shl	x,#18		'shift up to make it big
	waitx	x		'wait that many clocks
	jmp	@blink		'do it again

	org
x	res	1		'variable at cog register 8

Looks good except I find it odd that the 9 bit immediate addresses get treated as long addresses but the 20 bit addresses get treated as byte addresses. Seems like it would be better if they were both shifted left by 2 to get hub addresses for consistency. Then all immediate address fields are treated as long addresses or long offsets.
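
To put numbers on that comparison, here is a quick Python sketch of the reach of each field under both interpretations. It assumes both fields are signed two's-complement values (stated above for the 9-bit case, assumed for the 20-bit case), and the helper names are made up.

```python
def signed_range(bits):
    """Smallest and largest value of a two's-complement field."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def reach_in_bytes(bits, unit_bytes):
    """Byte reach of a signed field whose unit is unit_bytes wide."""
    lo, hi = signed_range(bits)
    return lo * unit_bytes, hi * unit_bytes


short_as_longs = reach_in_bytes(9, 4)    # 9-bit field, long offsets (as described)
long_as_bytes = reach_in_bytes(20, 1)    # 20-bit field, byte offsets (as described)
long_as_longs = reach_in_bytes(20, 4)    # 20-bit field, long offsets (the suggestion)
HUB_SPAN = 0xFFFFF + 1                   # size of the 00000..FFFFF map in bytes
```

Under these assumptions a byte-granular 20-bit field spans exactly the 1 MB map, while a long-granular one would span four times that.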
