The addressing conundrum

Heater. · 2015-10-01 18:53

potatohead,

Lots of check, read, modify, write cycles are needed to work only with longs.

Yep. It's even worse than that. No amount of checking in such a situation will help unless you wrap locks around them.

Which clobbers performance. Imagine four COGs all trying to update four different bytes that happen to be in the same long. Now they all have to use locks and wait for each other. Even if whatever they are doing is totally unrelated.

It creates an accidental dependency between processes.

And doesn't LONG-only access bugger up string processing performance?

I have not really followed all the arguments here but it seems to me that all memory should be byte addressed. Instructions should be LONG aligned as is normal in some machines.

rod1963 · 2015-10-01 18:55

Looks like 3 memory models. Cog, LUT and Hub.

The question for me is how will it play out for those who want to code in C or BASIC? Are there going to be multiple libraries, and can GCC even handle this architecture?

Or will we just be relegated to SPIN and PASM only because of the complexity?

David Betz · 2015-10-01 18:57

Hub addressed in longs? Is that Chip's current proposal? It's hard to keep up with this thread.

Rayman · 2015-10-01 19:01

I saw that hub will be byte addressable (and executable) but cog will be only longs...
Hope it stays this way, sounded good to me!

David Betz · 2015-10-01 19:08

Rayman wrote: »

I saw that hub will be byte addressable (and executable) but cog will be only longs...
Hope it stays this way, sounded good to me!

Yes, that seems okay.

Seairth · 2015-10-01 19:10

Yeah, I see where the issue with my idea comes in with writing to hub. Ah well. It was worth a shot.

Roy Eltham · 2015-10-01 19:20

David Betz wrote: »

Roy Eltham wrote: »

David Betz,
Even if Chip changed things with addressing to make hub and cog spaces more compatible, there's still REP that doesn't work in hub space. I think there might be some other things too.

It would be nice if the P2 gcc could support a function level designation for the targets, instead of compilation unit level. That way you can have some functions in cog space and the rest in hub space and they can call each other directly. This wasn't really an option on P1, but it is on P2.

The problem is with the libraries. Any library functions called by the COG code would have to be compiled differently than the ones called from HUB code. That means if both call memcpy for example then there would have to be two copies of memcpy linked with the program. That results in a multiply defined symbol. I guess you could add a prefix to every library function internally (_COG_memcpy and _HUB_memcpy) to disambiguate but I don't think GCC is set up to do that.

Not sure why you think this? memcpy could just be hubexec code, and you can call it from cog or hub, or vice versa.

I guess if you want to have modes where the code all gets pushed into one place or the other, then you'd need two variants of lib functions. Perhaps a solution would be libs that have a single version (say hubexec), but have fixup tables to convert things to cogexec versions (I think this will be possible and fairly easy).

Doesn't the existing GCC for P1 only have one version of lib functions? How does it deal with cog target? are lib functions just not available then?

Roy Eltham · 2015-10-01 19:24

Heater. wrote: »

Hmm...doesn't making all access to hub LONG have dire consequences when different COGs are updating adjacent bytes?

I mean to update a byte the COG has to:

1) Read the long that contains the byte.
2) Update the bits in question (presumably there is a PUTBYTE to go with that GETBYTE)
3) Write the long back to HUB.

But that clobbers three other bytes that may have been updated by other COGs whilst all that was going on!

Sounds like chaos.

Heater brings up probably the most important reason not to force hub data access to long only. We MUST have atomic read/write of bytes/words to/from hub. Anything that breaks that brings complete chaos and hell to multi-cog coding.
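
A minimal sketch of the hazard described above, in Python rather than PASM. It models one hub long shared by two cogs and replays one unlucky interleaving by hand; `put_byte` and the variable names are illustrative only, not P2 instructions.

```python
def put_byte(long_value, index, byte):
    """Return long_value with byte `index` (0 = least significant) replaced."""
    shift = 8 * index
    cleared = long_value & ~(0xFF << shift) & 0xFFFFFFFF
    return cleared | ((byte & 0xFF) << shift)

# Long-only access: every byte update is read-modify-write of the whole long.
hub_long = 0x00000000
cog_a_copy = hub_long                        # cog A reads the long
cog_b_copy = hub_long                        # cog B reads before A writes back
cog_a_copy = put_byte(cog_a_copy, 0, 0xAA)   # A updates byte 0 in its copy
cog_b_copy = put_byte(cog_b_copy, 1, 0xBB)   # B updates byte 1 in its copy
hub_long = cog_a_copy                        # A writes the long back
hub_long = cog_b_copy                        # B writes back and wipes A's byte
lost_update = hub_long

# Atomic byte writes: each cog only ever touches its own byte.
hub_bytes = bytearray(4)
hub_bytes[0] = 0xAA                          # cog A
hub_bytes[1] = 0xBB                          # cog B
atomic_result = int.from_bytes(hub_bytes, "little")

print(hex(lost_update), hex(atomic_result))
```

With long-only access, cog A's byte is gone after cog B's write-back; with byte writes both updates survive, and no lock is needed.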

Seairth · 2015-10-01 19:26

cgracey wrote: »

Seairth wrote: »

cgracey wrote: »

I just made the changes to the Verilog. There was no logic growth, just using different bits here and there.

Now, I need to update the assembler and recompile everything to test it.

I've lost track. What changes did you make?

The ones I last proposed.

I've lost track of what that was. But based on a few other posts, I wonder if it was this:

* All hub addressing (instruction and data) is bytes.
* All cog/LUT addressing (instruction and data) is longs.
* PC counts by 1 for COG/LUT and by 4 for HUB
* Hub exec starts at byte address $400 (all addresses below that are COG/LUT long addresses)
* In ORGH offset is in bytes, BYTE, WORD, LONG are packed.
* In ORG offset is in longs, BYTE, WORD, LONG are aligned.
* Labels defined in ORGH are in bytes, even when referenced from an ORG section.
* Labels defined in ORG are in longs, even when referenced from an ORGH section.

jmg · 2015-10-01 19:30

David Betz wrote: »

Roy Eltham wrote: »

David Betz,
Even if Chip changed things with addressing to make hub and cog spaces more compatible, there's still REP that doesn't work in hub space. I think there might be some other things too.

It would be nice if the P2 gcc could support a function level designation for the targets, instead of compilation unit level. That way you can have some functions in cog space and the rest in hub space and they can call each other directly. This wasn't really an option on P1, but it is on P2.

The problem is with the libraries. Any library functions called by the COG code would have to be compiled differently than the ones called from HUB code. That means if both call memcpy for example then there would have to be two copies of memcpy linked with the program. That results in a multiply defined symbol. I guess you could add a prefix to every library function internally (_COG_memcpy and _HUB_memcpy) to disambiguate but I don't think GCC is set up to do that.

Exactly - however this will need to be done for P2, along with the wrappers that group _COG(binary mode) functions for copy into COG (those functions can be any mix of Library and user tagged Functions )

David Betz · 2015-10-01 19:30

Roy Eltham wrote: »

David Betz wrote: »

Roy Eltham wrote: »

David Betz,
Even if Chip changed things with addressing to make hub and cog spaces more compatible, there's still REP that doesn't work in hub space. I think there might be some other things too.

It would be nice if the P2 gcc could support a function level designation for the targets, instead of compilation unit level. That way you can have some functions in cog space and the rest in hub space and they can call each other directly. This wasn't really an option on P1, but it is on P2.

The problem is with the libraries. Any library functions called by the COG code would have to be compiled differently than the ones called from HUB code. That means if both call memcpy for example then there would have to be two copies of memcpy linked with the program. That results in a multiply defined symbol. I guess you could add a prefix to every library function internally (_COG_memcpy and _HUB_memcpy) to disambiguate but I don't think GCC is set up to do that.

Not sure why you think this? memcpy could just be hubexec code, and you can call it from cog or hub, or vice versa.

I guess if you want to have modes where the code all gets pushed into one place or the other, then you'd need two variants of lib functions. Perhaps a solution would be libs that have a single version (say hubexec), but have fixup tables to convert things to cogexec versions (I think this will be possible and fairly easy).

Doesn't the existing GCC for P1 only have one version of lib functions? How does it deal with cog target? are lib functions just not available then?

GCC for P1 has many copies of the libraries. Certainly, it has different libraries for COG, LMM, CMM, and XMM memory models. It also has different libraries depending on whether float support is included and various other compiler options. This is the multilib support in GCC. It is one of the reasons it takes so long to build PropGCC! :-)

cgracey · 2015-10-01 19:44

I've been thinking very hard about what Cluso99 is saying, since he's being so adamant. And I see the ramifications for compiler makers.

I think I have a solution that will please everyone.

Consider that ANY code that is going to run in both cog and hub must use relative jumps within itself, as absolute execution addresses are different between modes.

And here is the whole problem with cog exec vs. hub exec: In cog exec, the PC steps by 1, whereas in hub exec it must step by 4. This creates different relative address encodings which make binaries incompatible between cog exec and hub exec modes.

Well, what if we assembled those 20-bit relative addresses in cog code as shifted left by two bits? This will give them the same expanse as hub code. Then, whenever we are in cog exec, we always shift relative addresses down by two bits before adding them to the PC. Now, the same binary will run in both modes.

Does anyone see a problem with this?
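
A minimal sketch of this proposal, in Python rather than PASM. It assumes relative offsets are taken from the address of the following instruction and that execution at or above $400 is hub exec (as listed earlier in the thread); `branch_target` and `HUB_EXEC_START` are illustrative names only.

```python
HUB_EXEC_START = 0x400  # assumption: PCs below this are cog/LUT long addresses

def branch_target(pc_next, rel_field):
    """Resolve a relative branch whose field is always encoded in bytes.

    pc_next   -- address of the instruction after the branch
    rel_field -- signed offset as stored in the instruction (instruction
                 count shifted left by two when assembled for cog code)
    """
    if pc_next >= HUB_EXEC_START:
        return pc_next + rel_field       # hub exec: byte offset used as is
    return pc_next + (rel_field >> 2)    # cog exec: shift down to longs first

# The same encoded field, "back three instructions", in both modes:
REL_BACK_3 = -3 << 2                     # stored as -12
cog_target = branch_target(0x020, REL_BACK_3)
hub_target = branch_target(0x1000, REL_BACK_3)
print(hex(cog_target), hex(hub_target))
```

The stored field is identical in both cases, which is what would let one binary run in either mode; the cost is that two bits of the field carry no information in cog exec.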

Seairth · 2015-10-01 19:44

Could the PASM be changed slightly? Instead of DAT, could we have a separate HUB and COG? Then, ORG would apply to whichever block it's contained in. I know this seems like a minor detail, but I like the block-level highlighting that the Propeller Tool does, and would like to easily distinguish Hub DAT sections from Cog/LUT DAT sections. Even in the relatively simple bits of code we've been writing over the last couple days, the orgh/org doesn't really do much to help the eye spot the separate sections.

Heater. · 2015-10-01 19:47

I am no compiler writer but if I were I might just say f'it.

All code will be compiled for execution from HUB. But we now have a processor with 512 registers.

If I understand correctly, more registers is always every compiler writer's dream.

Seairth · 2015-10-01 19:51

cgracey wrote: »

I've been thinking very hard about what Cluso99 is saying. I've been trying to get my head around this whole matter.

I think I have a solution that will please everyone.

Consider that ANY code that is going to run in both cog and hub, must use relative jumps within itself, as absolute execution addresses are different between modes.

And here is the whole problem with cog exec vs. hub exec: In cog exec, the PC steps by 1, whereas in hub exec it must step by 4.

This creates different relative address encodings which makes binaries incompatible between cog and hub modes.

Well, what if we assembled those 20-bit relative addresses in cog code as shifted left by two bits? This will give them the same expanse as hub code. Then, whenever we are in cog exec, we always shift relative addresses down by two bits before adding them to the PC. Now, the same binary will run in both modes.

Does anyone see a problem with this?

I suggest doing it the other way: all 20-bit relatives are in terms of "#number of instructions". Then, when in hub exec mode, shift them two bits to the left before adding to PC. This keeps the interpretation of the 20-bit relatives the same as the 9-bit relatives.

cgracey · 2015-10-01 20:03

Seairth wrote: »

cgracey wrote: »

I've been thinking very hard about what Cluso99 is saying. I've been trying to get my head around this whole matter.

I think I have a solution that will please everyone.

Consider that ANY code that is going to run in both cog and hub, must use relative jumps within itself, as absolute execution addresses are different between modes.

And here is the whole problem with cog exec vs. hub exec: In cog exec, the PC steps by 1, whereas in hub exec it must step by 4.

This creates different relative address encodings which makes binaries incompatible between cog and hub modes.

Well, what if we assembled those 20-bit relative addresses in cog code as shifted left by two bits? This will give them the same expanse as hub code. Then, whenever we are in cog exec, we always shift relative addresses down by two bits before adding them to the PC. Now, the same binary will run in both modes.

Does anyone see a problem with this?

I suggest doing it the other way: all 20-bit relatives are in terms of "#number of instructions". Then, when in hub exec mode, shift them two bits to the left before adding to PC. This keeps the interpretation of the 20-bit relatives the same as the 9-bit relatives.

That would mean hub code couldn't branch to a relatively unaligned address. It would make a new rule for hub code.

P.S. I see what you are getting at about the 9-bit relative branches, but they are very short range. The 20 bit addresses should be able to go anywhere.
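
For comparison, a rough Python sketch of the alternative (offsets stored as instruction counts, shifted left by two in hub exec) and of the restriction raised in the reply. It rests on the same assumptions as before: offsets are relative to the next instruction, and $400 is the start of hub exec. All names are illustrative.

```python
HUB_EXEC_START = 0x400  # assumption: PCs below this are cog/LUT long addresses

def branch_target_instr_units(pc_next, rel_instr):
    """Resolve a relative branch whose field counts instructions, not bytes."""
    if pc_next >= HUB_EXEC_START:
        return pc_next + (rel_instr << 2)   # hub exec: scale up to bytes
    return pc_next + rel_instr              # cog exec: used as is

def encodable_in_hub(pc_next, target):
    """A hub target is reachable only if it is a whole number of longs away."""
    return (target - pc_next) % 4 == 0

print(hex(branch_target_instr_units(0x020, -3)))    # cog exec
print(hex(branch_target_instr_units(0x1000, -3)))   # hub exec
print(encodable_in_hub(0x1000, 0x1008))             # long multiple away
print(encodable_in_hub(0x1000, 0x100A))             # relatively unaligned
```

The second check is the "new rule for hub code": a target that sits an odd number of bytes from the branch has no encoding under this scheme.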

cgracey · 2015-10-01 20:07

Seairth wrote: »

Could the PASM be changed slightly? Instead of DAT, could we have a separate HUB and COG? Then, ORG would apply to whichever block it's contained in. I know this seems like a minor detail, but I like the block-level highlighting that the Propeller Tool does, and would like to easily distinguish Hub DAT sections from Cog/LUT DAT sections. Even in the relatively simple bits of code we've been writing over the last couple days, the orgh/org doesn't really do much to help the eye spot the separate sections.

We could, but that would complicate doing things like loading nearby cog code into a cog. It seems to me that you need to easily switch between hub and cog modes when writing assembly code. Maybe we just need better/bigger words for ORG and ORGH.

cgracey · 2015-10-01 20:13

Seairth wrote: »

cgracey wrote: »

Seairth wrote: »

cgracey wrote: »

I just made the changes to the Verilog. There was no logic growth, just using different bits here and there.

Now, I need to update the assembler and recompile everything to test it.

I've lost track. What changes did you make?

The ones I last proposed.

I've lost track of what that was. But based on a few other posts, I wonder if it was this:

* All hub addressing (instruction and data) is bytes.
* All cog/LUT addressing (instruction and data) is longs.
* PC counts by 1 for COG/LUT and by 4 for HUB
* Hub exec starts at byte address $400 (all addresses below that are COG/LUT long addresses)
* In ORGH offset is in bytes, BYTE, WORD, LONG are packed.
* In ORG offset is in longs, BYTE, WORD, LONG are aligned.
* Labels defined in ORGH are in bytes, even when referenced from an ORG section.
* Labels defined in ORG are in longs, even when referenced from an ORGH section.

That's it.
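
The PC rules in that list can be sketched in a few lines of Python. This is only a model of the list as written (step by 1 below $400, step by 4 at or above it); `next_pc` and `is_hub_exec` are made-up names, and what happens when execution runs off the end of LUT is not modelled.

```python
HUB_EXEC_START = 0x400  # from the list: hub exec starts at byte address $400

def is_hub_exec(pc):
    """Addresses below $400 are cog/LUT long addresses; the rest are hub bytes."""
    return pc >= HUB_EXEC_START

def next_pc(pc):
    """Advance to the following instruction: +1 long in cog/LUT, +4 bytes in hub."""
    return pc + 4 if is_hub_exec(pc) else pc + 1

print(hex(next_pc(0x010)))   # cog exec
print(hex(next_pc(0x400)))   # hub exec, long aligned
print(hex(next_pc(0x401)))   # hub exec, unaligned code is allowed
```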

Seairth · 2015-10-01 20:27

cgracey wrote: »

That would mean hub code couldn't branch to a relatively unaligned address. It would make a new rule for hub code.

P.S. I see what you are getting at about the 9-bit relative branches, but they are very short range. The 20 bit addresses should be able to go anywhere.

True. Of course, you have exactly the same problem with the 9-bit relative addresses when used in hub exec mode.

Anyhow, I thought relative addressing was included primarily to support relocatable code. Frankly, it seems odd to me that you would have a relocatable code block that has instructions with different alignments.

Besides, you can always use the 20-bit #immediate form instead, which will work just fine with unaligned instructions.

Edit: in fact, if the code is meant to be relocatable to cog/lut memory, then it can't have unaligned instructions.

Edit Edit: also, for those rare circumstances where someone actually does try to do the relative address of an unaligned instruction, the compiler can catch this and give an error.

jmg · 2015-10-01 20:31

Roy Eltham wrote: »

I guess if you want to have modes where the code all gets pushed into one place or the other, then you'd need two variants of lib functions.

Yes, if you want your request above of "It would be nice if the P2 gcc could support a function level designation for the targets, instead of compilation unit level. That way you can have some functions in cog space and the rest in hub space and they can call each other directly. This wasn't really an option on P1, but it is on P2."
( which is what users will expect to be able to do), then the tools will need to tag and collect {Library functions and user functions}, that will be _COG or _LUT resident into one blob, that is then loaded into COG/LUT before it is run.

I'd expect some overlay system too, as the collection of functions can possibly exceed the total size of COG/LUT; care is just needed not to load it all at once.

jmg · 2015-10-01 20:34

David Betz wrote: »

.... You have to use -mlmm for "hub exec" and -mcog for "cog exec". Any one program has to be one or the other. You can't mix them in the same program. Hence, -mcog programs are compiled separately into binary blobs and linked with a -mlmm program. The blob is then loaded into a COG at runtime.

I presume you use -mlmm loosely here, and that -mhub will be supported with -mcog ?
ie GCC can native compile HUB code, not only -mlmm ?

Will -mlut be needed too, for code that is portable enough to go into LUT, as opposed to code that must go into COG ?

jmg · 2015-10-01 20:37

Heater. wrote: »

Hmm...doesn't making all access to hub LONG have dire consequences when different COGs are updating adjacent bytes?

I mean to update a byte the COG has to:

1) Read the long that contains the byte.
2) Update the bits in question (presumably there is a PUTBYTE to go with that GETBYTE)
3) Write the long back to HUB.

But that clobbers three other bytes that may have been updated by other COGs whilst all that was going on!

Sounds like chaos.

Agreed, byte access needs to be Atomic, which I think is what Chip was saying.

David Betz · 2015-10-01 20:45

Seairth wrote: »

cgracey wrote: »

I've been thinking very hard about what Cluso99 is saying. I've been trying to get my head around this whole matter.

I think I have a solution that will please everyone.

Consider that ANY code that is going to run in both cog and hub, must use relative jumps within itself, as absolute execution addresses are different between modes.

And here is the whole problem with cog exec vs. hub exec: In cog exec, the PC steps by 1, whereas in hub exec it must step by 4.

This creates different relative address encodings which makes binaries incompatible between cog and hub modes.

Well, what if we assembled those 20-bit relative addresses in cog code as shifted left by two bits? This will give them the same expanse as hub code. Then, whenever we are in cog exec, we always shift relative addresses down by two bits before adding them to the PC. Now, the same binary will run in both modes.

Does anyone see a problem with this?

I suggest doing it the other way: all 20-bit relatives are in terms of "#number of instructions". Then, when in hub exec mode, shift them two bits to the left before adding to PC. This keeps the interpretation of the 20-bit relatives the same as the 9-bit relatives.

I agree. Again, I don't see any need to support unaligned hub execution. It doesn't really hurt to have it except when it reduces branch ranges like in Chip's proposal.

David Betz · 2015-10-01 20:48

jmg wrote: »

David Betz wrote: »

.... You have to use -mlmm for "hub exec" and -mcog for "cog exec". Any one program has to be one or the other. You can't mix them in the same program. Hence, -mcog programs are compiled separately into binary blobs and linked with a -mlmm program. The blob is then loaded into a COG at runtime.

I presume you use -mlmm loosely here, and that -mhub will be supported with -mcog ?
ie GCC can native compile HUB code, not only -mlmm ?

Will -mlut be needed too, for code that is portable enough to go into LUT, as opposed to code that must go into COG ?

I just meant that -mlmm is as close to hub exec as we get in P1.

Roy Eltham · 2015-10-01 21:20

Chip,

Is it possible to have all code execution related stuff always be long aligned and addressed as longs, but then keep the rd/wrbyte, etc, all using byte addressing? To me this seems like the best solution.
Can the PC just be internally shifted up 2 when jumping to hub code to get the byte address to start loading into the streamer?

It means the compiler needs to handle labels properly based on how they are used (data access means the label is byte addressed, branch target means the label is long addressed), but that's trivial.

Unless I am missing something, this seems like the most straightforward way to go from the perspective of programming the P2. Then everything is binary compatible everywhere. Also, if this is done, could REP be fixed to work in hubexec mode too? Just have a streamer refill stall on the branch back?
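
A rough Python sketch of the label handling this would ask of the tools, assuming every hub label has a single byte address and the assembler picks the unit from how the label is used. `label_value` and `hub_fetch_address` are hypothetical names, not part of any existing assembler.

```python
def label_value(byte_address, use):
    """Value the assembler would emit for a hub label under this scheme.

    use == "data"   -> byte address, for RDBYTE/WRBYTE style access
    use == "branch" -> long address, since all code would be long aligned
    """
    if use == "data":
        return byte_address
    if use == "branch":
        if byte_address % 4 != 0:
            raise ValueError("branch target must be long aligned")
        return byte_address >> 2
    raise ValueError("use must be 'data' or 'branch'")

def hub_fetch_address(pc_longs):
    """Shift the long-addressed PC up by two to get the byte fetch address."""
    return pc_longs << 2

print(hex(label_value(0x1000, "data")))
print(hex(label_value(0x1000, "branch")))
print(hex(hub_fetch_address(0x400)))
```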

cgracey · 2015-10-01 21:52

David Betz wrote: »

...Again, I don't see any need to support unaligned hub execution...

Here is an example of why unaligned hub code is important:

	call	@send_string
	byte	13,13,"The time is ",0
	mov	val,hours
	call	@send_decimal2
	call	@send_string
	byte	':',0
	mov	val,minutes
	call	@send_decimal2
	call	@send_string
	byte	" and the date is ",0
	...

You can do things like that, which is way better than having to get pointers to data located elsewhere.

Edit: changed 'db' to 'byte'

David Betz · 2015-10-01 21:59

cgracey wrote: »

David Betz wrote: »

...Again, I don't see any need to support unaligned hub execution...

Here is an example of why unaligned hub code is important:
	call	@send_string
	db	13,13,"The time is ",0
	mov	val,hours
	call	@send_decimal2
	call	@send_string
	db	':',0
	mov	val,minutes
	call	@send_decimal2
	call	@send_string
	db	" and the date is ",0
	...
You can do things like that, which is way better than having to get pointers to data located elsewhere.

Yes that would be kind of cool but you could still do this even without byte addressing if you pad the text to the next long boundary.

Roy Eltham · 2015-10-01 22:01

Chip,
In your example, is the function send_string manipulating the return address to be after the string? Couldn't it just do that but to the next long aligned address? Then you just need the db to code transition to auto-pad to long alignment.

You get the same thing with a little bit of padding added (automatically). I think the trade off is worth it.

Edit: I can't imagine ever coding that up that way. I would gather all the strings up into a data area and have labels to them that would get passed to send_string. Makes for simpler code that is easier to change and edit in the future.
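
A small Python model of the two variants under discussion. It assumes `send_string` reads the zero-terminated text at its own return address and then resumes execution just past it; the `align` parameter models the padded version. The memory layout and all names here are invented for illustration.

```python
def send_string(hub, return_address, align=1):
    """Return (text, resume_address) for an inline zero-terminated string.

    hub            -- bytes object standing in for hub RAM
    return_address -- where the string starts (right after the call)
    align          -- 1 for unaligned resume, 4 to pad to the next long
    """
    end = hub.index(0, return_address)               # terminating zero
    text = hub[return_address:end].decode("ascii")
    resume = end + 1                                 # byte after the zero
    resume = (resume + align - 1) // align * align   # round up if padded
    return text, resume

# A 4-byte call at $400, then the inline string ':',0 starting at $404.
hub = bytes(0x404) + b":\x00" + bytes(8)

unaligned = send_string(hub, 0x404)            # next instruction at $406
padded = send_string(hub, 0x404, align=4)      # next instruction at $408
print(unaligned, padded)
```

In this case the padded form costs two extra bytes; the unaligned form needs none, but the next instruction then sits at an address that is not a multiple of four.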

jmg · 2015-10-01 22:10

cgracey wrote: »

Here is an example of why unaligned hub code is important:

Looks good to me.

I think avoiding ORG as a base-control, and using a SEGCOG/SEGLUT?/SEGHUB etc will allow intermix of COG and HUB source (as may make sense to the user) and they can be collected by the tools into collated blocks for download.
ie a Similar coding style to your example.

labels resolve, and any code generation variances are easy, as the tools always know what seg code is in.

Overlay support gets a little trickier, as labels may need resetting, perhaps
SEGCOG
with no param appends to the earlier SEGCOG (starting from the default base) and
SEGCOG Base
forces a user-defined offset.

SEG_COG_OVL would allow collecting of overlay items

cgracey · 2015-10-01 22:14

Roy Eltham wrote: »

Chip,

Is it possible to have all code execution related stuff always be long aligned and addressed as longs, but then keep the rd/wrbyte, etc, all using byte addressing? To me this seems like the best solution.
Can the PC just be internally shifted up 2 when jumping to hub code to get the byte address to start loading into the streamer?

It means the compiler needs to handle labels properly based on how they are used (data access means the label is byte addressed, branch target means the label is long addressed), but that's trivial.

Unless I am missing something, this seems like the most straightforward way to go from the perspective of programming the P2. Then everything is binary compatible everywhere. Also, if this is done, could REP be fixed to work in hubexec mode too? Just have a streamer refill stall on the branch back?

It is possible to make all execution long-aligned, but I don't want to do it that way. I don't think it's necessary to go to that length in order to get binary compatibility between cog and hub code. I really like that anything of any size can be anywhere in the hub.

About making REP work in hub exec: I've thought about it, just for the sake of compatibility, but it looks rather complicated. I will look at it some more, though.
