
The addressing conundrum


Comments

  • cgracey Posts: 14,133
    edited 2015-10-01 08:10
    jmg wrote: »
    cgracey wrote: »
    This is my philosophy, too. Hub and cog/LUT code are going to be two very different animals with different purposes and different structure.
    Which of those "two very different animals" will GCC support?

    I don't know, but I would suspect the hub model in just about every case. You'd need some special reason to execute C from cog/LUT. At least, because of size constraints, you could only fit 1k-instruction programs. They would have to qualify before they could get loaded into cog/LUT.

    I suspect some common C housekeeping code would be always in cog/LUT.
  • gcc for the P1 supports both COG-target native code and LMM (hub) target code in 3-4 flavors.

    I assume gcc for P2 will have targets for cogexec and hubexec, and probably some CMM variant (because it's compact and would allow more code to fit).

  • evanh Posts: 15,270
    jmg wrote: »
    You seem to be rather confusing x86 terminology with the linker/memory segments I was referring to.

    Ah, true. The situation with Hub/Cog compatibility was never going to be any other way though. Hub and Cog modes are different beasts.

    Hehe, just found this little gem - http://stackoverflow.com/questions/14361248/whats-the-difference-of-section-and-segment-in-elf-file-format.
  • Cluso99 Posts: 18,069
    cgracey wrote: »
    I understand the notion of all longs being long-aligned, but like Roy said, it just simplifies a rare use case (binary compatibility) and complicates the Sunday driver case of hub memory having no alignment caveats.
    I couldn't DISAGREE more!!!

    The case is so complex now with byte vs long. And it gets compounded by non-alignment.

    IMHO most code objects will be a mix of cog and lut. There may be some initialisation done using hubexec.

    There is absolutely no reason to have byte code addresses in JMP/CALL addresses. And mixing them between hub and cog/lut is just plain confusing.

    As for the case of everything in hubexec, nothing could be further from the truth. You will lose out on some speed due to JMP/CALL requiring hub delays.

    If every cog uses hubexec then we will have a P2 HOT problem, because all HUB RAM will be enabled/accessed full time (every clock) due to the egg-beater. Not nice!

    And then to add to this, hubexec cannot use some of the instruction sequences such as the "rep" instruction.

    Hubexec programs will not have all their variables in hub. Otherwise, why not just get rid of the cog registers altogether! But wait a minute... all those normal instructions like mov/and/add/cmp/etc only act on the cog registers!

    We also want to be careful what hubexec code is running. Remember, there is no hub memory protection. So we don't want every program running where a bug could corrupt every other program. What a nightmare to debug.

    IMHO, hubexec is there for one or two big programs that run in a couple of cogs. The rest will mainly run in cog & lut.

    Just my 2c and why I believe emphatically that long-alignment and long-addressing for instructions is an absolute necessity. To do otherwise is not simple and going to be difficult to understand - this is where we are at currently and it's not working!

  • evanh Posts: 15,270
    Long code alignment will be the norm just by default.

    Regarding typical coding types, I suspect Chip was just meaning that the average teen learning to code is not going to be crafting Cog code. At best he'll be digging into the Obex to get advanced features that happen to use CogExec.
  • jmg wrote: »
    cgracey wrote: »
    This is my philosophy, too. Hub and cog/LUT code are going to be two very different animals with different purposes and different structure.
    Which of those "two very different animals" will GCC support?
    It will support both as it does now. You have to use -mlmm for "hub exec" and -mcog for "cog exec". Any one program has to be one or the other. You can't mix them in the same program. Hence, -mcog programs are compiled separately into binary blobs and linked with a -mlmm program. The blob is then loaded into a COG at runtime.
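
    Roughly, the hub side of that flow looks like the sketch below. This is a minimal sketch only: the _load_start_blinker_cog symbol name follows the usual propgcc .cogc convention, and the mailbox layout is made up; both are assumptions, not details from the thread.

        /* Hub program, compiled with -mlmm. A separately compiled -mcog blob   */
        /* ("blinker") is linked in alongside it and started in a cog at runtime. */
        #include <propeller.h>

        extern unsigned int _load_start_blinker_cog[];   /* start of the embedded cog blob */

        static volatile unsigned int mailbox[2];         /* shared hub mailbox for the driver */

        int main(void)
        {
            /* copy the blob into a free cog and start it, passing the mailbox address */
            int cog = cognew(_load_start_blinker_cog, (void *)mailbox);
            if (cog < 0)
                return 1;                                /* no free cog available */
            for (;;)
                ;                                        /* hub-exec code carries on here */
        }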

  • Cluso99 wrote: »
    I am thinking of its use as tables, etc. Don't want to set the top bit too.
    IMHO it's a kludge because when we get 1MB of hub, we then use the lower half without the top bit set. Then we have exactly the same problems. May as well deal with them now.

    To be clear, every idea we've thrown on the table is a kludge. The question is: which kludge are we most comfortable with?

    That aside, what are you talking about? If hub memory gets bigger, then you use the reserved space above the existing hub memory. The reserved space below hub memory would be limited to per-cog expansion (e.g. LUT2).
  • cgracey wrote: »
    I just made the changes to the Verilog. There was no logic growth, just using different bits here and there.

    Now, I need to update the assembler and recompile everything to test it.

    I've lost track. What changes did you make?
  • Rayman Posts: 14,029
    edited 2015-10-01 12:27
    I like where Chip is taking this...
    More like P1 sounds good to me.
    I like the non-overlapping memory spaces.

    I'd like to see how you can jump in and out from hubexec to cog exec and vice versa...
    Sounds complicated...

    Regarding speed between hubexec and cog exec.
    I think David Betz has said that GCC hubexec code is branching all the time.
    So, I'm pretty sure that, in general, cog exec is going to be much faster.
  • Chip,

    I'm still struggling to understand why you insist that hub instructions should be able to have any alignment. Is this because it's the only way to execute hub code below $10000? Or is it just because "hub memory is byte accessible, so why restrict it?" Maybe another way of asking this is: is your reasoning more technical or aesthetic?
  • cgracey Posts: 14,133
    Seairth wrote: »
    Chip,

    I'm still struggling to understand why you insist that hub instructions should be able to have any alignment. Is this because it's the only way to execute hub code below $10000? Or is it just because "hub memory is byte accessible, so why restrict it?" Maybe another way of asking this is: is your reasoning more technical or aesthetic?

    I like the idea of having a unified hub address space, as opposed to one in which there are different reckonings for code and data.

    I'm still trying to see if I can get excited about what Cluso99 is proposing, though.
  • cgracey Posts: 14,133
    Rayman wrote: »
    ...I'd like to see how you can jump in and out from hubexec to cog exec and vice versa...

    You just jump, call, or return. There's nothing to it. Even interrupts don't care where their code is or where they return to. It just works.
  • cgracey Posts: 14,133
    Seairth wrote: »
    cgracey wrote: »
    I just made the changes to the Verilog. There was no logic growth, just using different bits here and there.

    Now, I need to update the assembler and recompile everything to test it.

    I've lost track. What changes did you make?

    The ones I last proposed.
  • cgracey Posts: 14,133
    David Betz wrote: »
    jmg wrote: »
    cgracey wrote: »
    This is my philosophy, too. Hub and cog/LUT code are going to be two very different animals with different purposes and different structure.
    Which of those "two very different animals" will GCC support?
    It will support both as it does now. You have to use -mlmm for "hub exec" and -mcog for "cog exec". Any one program has to be one or the other. You can't mix them in the same program. Hence, -mcog programs are compiled separately into binary blobs and linked with a -mlmm program. The blob is then loaded into a COG at runtime.

    So, David, you are saying that GCC will always make separate binaries for hub code and cog/LUT code.

    GCC, then, has no interest in binary compatibility between hub and cog/LUT code.
  • cgracey Posts: 14,133
    edited 2015-10-01 15:07
    evanh wrote: »
    Long code alignment will be the norm just by default.

    Regarding typical coding types, I suspect Chip was just meaning that the average teen learning to code is not going to be crafting Cog code. At best he'll be digging into the Obex to get advanced features that happen to use CogExec.

    Whatever he chooses to do will be simple. Will binary compatibility be important enough to him, though, that we should make him deal with two different addressing systems within hub?

    This seems so simple to me: long addressing for cog/LUT space, and byte addressing for hub space. After all, those are the natures of each.

    There's no technical difficulty in handling unaligned longs and words in hub. The capability is already designed in for other reasons as well. Why impose restrictions at this point?

    It's true that we could take advantage of those 20 bits in the immediate branch opcodes to gain 3x more instruction space for future use, but that clobbers data addressing. What we (will) have seems perfectly balanced to me.
  • cgracey Posts: 14,133
    Cluso99 wrote: »
    cgracey wrote: »
    I understand the notion of all longs being long-aligned, but like Roy said, it just simplifies a rare use case (binary compatibility) and complicates the Sunday driver case of hub memory having no alignment caveats.
    I couldn't DISAGREE more!!!

    The case is so complex now with byte vs long. And it gets compounded by non-alignment.

    IMHO most code objects will be a mix of cog and lut. There may be some initialisation done using hubexec.

    There is absolutely no reason to have byte code addresses in JMP/CALL addresses. And mixing them between hub and cog/lut is just plain confusing.

    As for the case of everything in hubexec, nothing could be further from the truth. You will lose out on some speed due to JMP/CALL requiring hub delays.

    If every cog uses hubexec then we will have a P2 HOT problem, because all HUB RAM will be enabled/accessed full time (every clock) due to the egg-beater. Not nice!

    And then to add to this, hubexec cannot use some of the instruction sequences such as the "rep" instruction.

    Hubexec programs will not have all their variables in hub. Otherwise, why not just get rid of the cog registers altogether! But wait a minute... all those normal instructions like mov/and/add/cmp/etc only act on the cog registers!

    We also want to be careful what hubexec code is running. Remember, there is no hub memory protection. So we don't want every program running where a bug could corrupt every other program. What a nightmare to debug.

    IMHO, hubexec is there for one or two big programs that run in a couple of cogs. The rest will mainly run in cog & lut.

    Just my 2c and why I believe emphatically that long-alignment and long-addressing for instructions is an absolute necessity. To do otherwise is not simple and going to be difficult to understand - this is where we are at currently and it's not working!

    Cluso99, I'm thinking that we're not understanding each other.

    I don't think that where I'm going right now is going to be a problem for you, at all.

    All you guys are suffering from a lack of documentation, at this point. I feel bad about this, but look forward to getting this taken care of soon.

    Just hang on.
  • cgracey wrote: »
    David Betz wrote: »
    jmg wrote: »
    cgracey wrote: »
    This is my philosophy, too. Hub and cog/LUT code are going to be two very different animals with different purposes and different structure.
    Which of those "two very different animals" will GCC support?
    It will support both as it does now. You have to use -mlmm for "hub exec" and -mcog for "cog exec". Any one program has to be one or the other. You can't mix them in the same program. Hence, -mcog programs are compiled separately into binary blobs and linked with a -mlmm program. The blob is then loaded into a COG at runtime.

    So, David, you are saying that GCC will always make separate binaries for hub code and cog/LUT code.

    GCC, then, has no interest in binary compatibility between hub and cog/LUT code.
    Well, I'm not sure I'd put it that way. Because COG and LMM (and CMM) code are different on P1, GCC is already set up to deal with different instruction sets and separate libraries for each. That doesn't mean it wouldn't be nice to have cog-exec and hub-exec code be compatible. It's just that having them be incompatible will not necessarily break GCC.

  • Electrodude Posts: 1,631
    edited 2015-10-01 15:32
    It seems to me that much of the problem HLL compilers will have with binary incompatibility will go away or at least become easier if Parallax decides to use LLVM instead of GCC for the P2. In LLVM, linking typically happens before any machine code is generated and before most optimizations are done. If all libraries are shipped in LLVM IR form, then the P2 LLVM backend would then have access to all code and data before generating any machine code at all and could then decide what code should be in what format as it generates the final PASM code.

    EDIT: Another advantage of LLVM would be that when machine code can be generated, since the compiler has more context, it might be able to do fancy optimizations that require things to be at very specific addresses. For example, if you write a Forth interpreter in C and use an enum to define opcodes, it could set enum values to be pointers into cogram, like how Tachyon works.
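
    As a toy illustration of that enum idea (everything here is hypothetical: the opcode values, the notion that they equal cog RAM addresses, and the cast-based dispatch; nothing is an existing propgcc or LLVM feature):

        #include <stdint.h>

        /* Hypothetical: each opcode's enum value doubles as the cog RAM address of
           its handler, so the inner interpreter dispatches with one indirect jump,
           much like Tachyon's threaded code. The addresses are made up. */
        enum opcode {
            OP_DUP  = 0x1A0,
            OP_DROP = 0x1A8,
            OP_PLUS = 0x1B4,
        };

        typedef void (*handler_t)(void);

        static void dispatch(enum opcode op)
        {
            /* Only meaningful on the target, where the backend has pinned each
               handler at exactly its enum value; on a host this would just crash. */
            ((handler_t)(uintptr_t)op)();
        }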
  • Ok. I have a VERY radical thought (I can already hear the disagreement, so save it for something less off-the-wall): make the hub long-only, just like cog/lut.

    Here's the fallout of such a change (I think):

    * RDLONG/WRLONG would become RDHUB/WRHUB.
    * the xxBYTE and xxWORD opcodes are freed up for other uses.
    * All addressing everywhere would be in terms of longs. Period.
    * Bytes/Words can still be extracted with the new instruction set:
    RDHUB x, addr
    GETBYTE x, x, #3
    
    

    Additionally, you can still support "packed" data by doing two things:

    1. change BYTE, WORD, and LONG to be packed on their natural alignment
    2. Add byte[] and word[] operators, which would give the 2-bit or 1-bit (respectively) value for the label:
            ORGH $400
            RDHUB x, b3
            GETBYTE x, x, #byte[b3]
    
    b1      BYTE 0  ' b1 = $402, byte[b1] = %00
    b2      BYTE 0  ' b2 = $402, byte[b2] = %01
    b3      BYTE 0  ' b3 = $402, byte[b3] = %10
                    ' (unused byte)
    w1      WORD 0  ' w1 = $403, word[w1] = %0
    w2      WORD 0  ' w2 = $403, word[w2] = %1
    

    This has the additional advantage that byte/word extraction in COG/LUT memory works exactly the same way.

    Now, I think *that* is keeping things simple. Yes, it gives up something potentially powerful (one-instruction byte-level manipulation). But that powerful thing is currently causing A LOT of disagreement and confusion. I would gladly give up byte-addressing for this. It's so simple and easy to understand! With 128K 32-bit registers in the hub and 1K 32-bit registers in COG/LUT, it doesn't bother me in the least that byte/word extraction/insertion will be a multi-instruction affair.

    As I said, I know some (all?) of you will immediately disagree with this. As I said, it's radical. Before you immediately fire off an "I STRONGLY DISAGREE" reply, ponder it for a bit. Take a few deep breaths. Note how the simplicity and clarity of the idea washes over you. :)
  • It seems to me that much of the problem HLL compilers will have with binary incompatibility will go away or at least become easier if Parallax decides to use LLVM instead of GCC for the P2. In LLVM, linking typically happens before any machine code is generated and before most optimizations are done. If all libraries are shipped in LLVM IR form, then the P2 LLVM backend would then have access to all code and data before generating any machine code at all and could then decide what code should be in what format as it generates the final PASM code.

    EDIT: Another advantage of LLVM would be that when machine code can be generated, since the compiler has more context, it might be able to do fancy optimizations that require things to be at very specific addresses. For example, if you write a Forth interpreter in C and use an enum to define opcodes, it could set enum values to be pointers into cogram, like how Tachyon works.
    Yes, moving to LLVM would have some advantages. Unless we have a volunteer to do the work, it will likely cost quite a bit to make that switch. Do you have LLVM experience and time to help with this? Do you know someone who does?

  • Electrodude Posts: 1,631
    edited 2015-10-01 15:54
    David Betz wrote: »
    It seems to me that much of the problem HLL compilers will have with binary incompatibility will go away or at least become easier if Parallax decides to use LLVM instead of GCC for the P2. In LLVM, linking typically happens before any machine code is generated and before most optimizations are done. If all libraries are shipped in LLVM IR form, then the P2 LLVM backend would then have access to all code and data before generating any machine code at all and could then decide what code should be in what format as it generates the final PASM code.

    EDIT: Another advantage of LLVM would be that when machine code can be generated, since the compiler has more context, it might be able to do fancy optimizations that require things to be at very specific addresses. For example, if you write a Forth interpreter in C and use an enum to define opcodes, it could set enum values to be pointers into cogram, like how Tachyon works.
    Yes, moving to LLVM would have some advantages. Unless we have a volunteer to do the work, it will likely cost quite a bit to make that switch. Do you have LLVM experience and time to help with this? Do you know someone who does?

    I don't really have any LLVM experience myself. The most I can say is that I wrote a Hello World program in LLVM IR. I do know someone who's very experienced with LLVM (he introduced me to it), but AFAIK he's not into the Propeller, and I'm pretty sure he only wrote a frontend, not a backend. He wrote a Common Lisp -> LLVM compiler for supercomputing artificial protein-like polymer optimization problems; it is awesome.

    I would love to write an LLVM backend for the P2 myself, but I'm too busy with school to have the time. If I ever have an opportunity or excuse to somehow do it for school, as part of a course or as an internship/co-op with Parallax or something, though, I would definitely jump on it. I'm pretty sure I have no clue how much work writing a backend is.
  • cgracey Posts: 14,133
    edited 2015-10-01 16:25
    Seairth wrote: »
    Ok. I have a VERY radical thought (I can already hear the disagreement, so save it for something less off-the-wall): make the hub long-only, just like cog/lut.

    Here's the fallout of such a change (I think):

    * RDLONG/WRLONG would become RDHUB/WRHUB.
    * the xxBYTE and xxWORD opcodes are freed up for other uses.
    * All addressing everywhere would be in terms of longs. Period.
    * Bytes/Words can still be extracted with the new instruction set:
    RDHUB x, addr
    GETBYTE x, x, #3
    
    

    Additionally, you can still support "packed" data by doing two things:

    1. change BYTE, WORD, and LONG to be packed on their natural alignment
    2. Add byte[] and word[] operators, which would give the 2-bit or 1-bit (respectively) value for the label:
            ORGH $400
            RDHUB x, b3
            GETBYTE x, x, #byte[b3]
    
    b1      BYTE 0  ' b1 = $402, byte[b1] = %00
    b2      BYTE 0  ' b2 = $402, byte[b2] = %01
    b3      BYTE 0  ' b3 = $402, byte[b3] = %10
                    ' (unused byte)
    w1      WORD 0  ' w1 = $403, word[w1] = %0
    w2      WORD 0  ' w2 = $403, word[w2] = %1
    

    This has the additional advantage that byte/word extraction in COG/LUT memory works exactly the same way.

    Now, I think *that* is keeping things simple. Yes, it gives up something potentially powerful (one-instruction byte-level manipulation). But that powerful thing is currently causing A LOT of disagreement and confusion. I would gladly give up byte-addressing for this. It's so simple and easy to understand! With 128K 32-bit registers in the hub and 1K 32-bit registers in COG/LUT, it doesn't bother me in the least that byte/word extraction/insertion will be a multi-instruction affair.

    As I said, I know some (all?) of you will immediately disagree with this. As I said, it's radical. Before you immediately fire off an "I STRONGLY DISAGREE" reply, ponder it for a bit. Take a few deep breaths. Note how the simplicity and clarity of the idea washes over you. :)

    I really like this line of thinking - some way to enable simplification of addressing, to where we get everything into long-interaction form. I don't think the extra instruction to handle a byte is a big deal. For that matter, we could even have nibble and bit addressing. Imagine being able to pull out exactly as many bits as you want at a time from a memory stream. That would enable some dense interpreted code.

    What I have a hard time picturing is some scenario like this:

    Right now, we can read in, say, a small bitmap made up of byte pixels. We can write that bitmap back into hub memory at any byte offset along a scan line buffer. It only takes SETQ+RDLONG and SETQ+WRLONG to make this unaligned move happen. We can even do SETQ2+WRLONG to have it not write $FF bytes, allowing transparency. How would we handle something like this? In the current case, there is hardware assist for dealing with the misalignment. In your case, misalignment would have to be dealt with in the cog.

    I keep returning to the notion that getting away from alignment-free hub ram to support cog/LUT-hub code compatibility brings in too many difficulties.
  • David Betz,
    Even if Chip changed things with addressing to make hub and cog spaces more compatible, there's still REP that doesn't work in hub space. I think there might be some other things too.

    It would be nice if the P2 gcc could support a function level designation for the targets, instead of compilation unit level. That way you can have some functions in cog space and the rest in hub space and they can call each other directly. This wasn't really an option on P1, but it is on P2.
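
    A sketch of what that might look like at the source level (the __attribute__((cogexec)) spelling is purely hypothetical, no such attribute exists in propgcc today; it is only meant to show the function-level granularity Roy is describing):

        /* Hypothetical function-level targeting: a hot inner loop kept in cog/LUT,
           called directly from ordinary hub-exec code and returning directly to it. */
        __attribute__((cogexec))                  /* hypothetical attribute */
        static int sum16(const int *p)
        {
            int s = 0;
            for (int i = 0; i < 16; i++)          /* could become a tight REP loop in cog-exec */
                s += p[i];
            return s;
        }

        int total(const int *buf)                 /* ordinary hub-exec function */
        {
            return sum16(buf);                    /* direct call across the boundary */
        }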

  • Rayman Posts: 14,029
    edited 2015-10-01 16:48
    I was just wondering if GCC could support compiling for SDRAM...
    Then, we'd have a full blown PC with hub streamer as L1 cache, cog ram as L2 cache and HUB RAM as L3 cache?

    Probably getting ahead of myself...
  • Rayman wrote: »
    I was just wondering if GCC could support compiling for SDRAM...
    Then, we'd have a full blown PC with hub streamer as L1 cache, cog ram as L2 cache and HUB RAM as L3 cache?

    Probably getting ahead of myself...
    Don't we need page faults and TLBs for that?

  • Roy Eltham wrote: »
    David Betz,
    Even if Chip changed things with addressing to make hub and cog spaces more compatible, there's still REP that doesn't work in hub space. I think there might be some other things too.

    It would be nice if the P2 gcc could support a function level designation for the targets, instead of compilation unit level. That way you can have some functions in cog space and the rest in hub space and they can call each other directly. This wasn't really an option on P1, but it is on P2.
    The problem is with the libraries. Any library functions called by the COG code would have to be compiled differently than the ones called from HUB code. That means if both call memcpy, for example, then there would have to be two copies of memcpy linked with the program. That results in a multiply defined symbol. I guess you could add a prefix to every library function internally (_COG_memcpy and _HUB_memcpy) to disambiguate, but I don't think GCC is set up to do that.
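
    One way that prefixing idea could be spelled with an existing GCC feature is asm labels, which bind a C declaration to a different assembler symbol. This is only a sketch of the concept: the _COG_/_HUB_ names come straight from the post above, nothing like this exists in propgcc, and the libraries themselves would still have to be built to export the prefixed names.

        #include <stddef.h>

        /* what cog-exec code would call: */
        void *memcpy_cog(void *dst, const void *src, size_t n) __asm__("_COG_memcpy");

        /* what hub-exec code would call: */
        void *memcpy_hub(void *dst, const void *src, size_t n) __asm__("_HUB_memcpy");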

  • cgracey wrote: »
    What I have a hard time picturing is some scenario like this:

    Right now, we can read in, say, a small bitmap made up of byte pixels. We can write that bitmap back into hub memory at any byte offset along a scan line buffer. It only takes SETQ+RDLONG and SETQ+WRLONG to make this unaligned move happen. We can even do SETQ2+WRLONG to have it not write $FF bytes, allowing transparency. How would we handle something like this? In the current case, there is hardware assist for dealing with the misalignment. In your case, misalignment would have to be dealt with in the cog.

    That is a good question. Just to make sure I'm understanding this correctly, existing code would look something like this?
            orgh
            
            rdlong count, ##width           ' get the bitmap width (in bytes)
            rdlong _offset, ##line          ' get the bitmap line index to copy
            mul _offset, count              ' calculate the byte offset in bitmap
            
            mov adra, ##bmp                 ' set up adra to point to the byte offset
            add adra, _offset
            
            shr count, #2                   ' adjust count to longs
            
            setq count                      
            rdlong temp, adra               ' copy longs to cog memory
    
            rdlong _offset, ##offset        ' get the offset in the scanline
            
            mov adra, ##scanl               ' set up adra to point to the scanline offset (in bytes)
            add adra, _offset
            
            setq count
            wrlong temp, adra               ' copy the cog line buffer to the scanline
            
    bmp     byte $00, $01, $03,...
    width   long 128                        ' bytes
    line    long 0                          ' current bitmap line
    
    scanl   long 0[$100]                    ' 1024 byte scanline
    offset  long 0                          ' offset of bmp in scanline
    
            org
    _offset res 1                           
    count   res 1                           
    temp    res 256                         ' 1KB line buffer
    

    (note: I don't know if this actually works. Just threw it together to show the gist of what you are talking about.)
  • Heater. Posts: 21,230
    Hmm...doesn't making all access to hub LONG have dire consequences when different COGs are updating adjacent bytes?

    I mean to update a byte the COG has to:

    1) Read the long that contains the byte.
    2) Update the bits in question (presumably there is a PUTBYTE to go with that GETBYTE)
    3) Write the long back to HUB.

    But that clobbers three other bytes that may have been updated by other COGs whilst all that was going on!

    Sounds like chaos.
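
    A plain-C sketch of the read-modify-write sequence being described (host-side illustration only; the "hub" array and byte layout are stand-ins, not P2 APIs):

        #include <stdint.h>
        #include <stdio.h>

        static uint32_t hub[1];                       /* one shared hub long, four packed bytes */

        static void write_byte(int index, uint8_t value)
        {
            uint32_t l = hub[0];                      /* 1) read the whole long            */
            l &= ~(0xFFu << (index * 8));             /* 2) clear the target byte...       */
            l |= (uint32_t)value << (index * 8);      /*    ...and insert the new one      */
            hub[0] = l;                               /* 3) write the long back            */
            /* If another cog rewrote a neighbouring byte between steps 1 and 3,
               that update is silently lost: the whole long gets overwritten. */
        }

        int main(void)
        {
            write_byte(2, 0xAB);
            printf("%08X\n", (unsigned)hub[0]);       /* prints 00AB0000 */
            return 0;
        }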

  • Cluso99 Posts: 18,069
    edited 2015-10-01 17:47
    Seairth,
    Your post makes sense!
    It's simple and easy to understand.

    Now let's just add a couple of helper instructions so that we can also use the hub efficiently for bytes and words. So we add RD/WR BYTE/WORD instructions, which add 2 LSBs to the hub address so longs can be broken down into bytes.

    I am being serious here !!!

    This is the point I am trying (unsuccessfully) to get across here.

    There is a consistent instruction model of everything addressed in LONGS, and a data model where HUB can be accessed in bytes/words/longs using byte addressing, and a data model where COG/LUT is always in LONGS.

    I cannot see why this isn't the simplest way.

    Chip,
    I see you want to have a block of hub bytes that you can read/write to cog/LUT quickly, and at any byte offset. This would still work with the above models because the data model addresses hub in bytes.

    And the instruction model still addresses the hub and cog/LUT as longs. The hubexec instruction-fetch Verilog would just append two "00" LSBs to convert the long address back to a byte address for internal consistency. That is why instructions should be long-aligned. And it retains a consistent programming model between hubexec and cogexec. This also helps the GCC guys.

    Does this help explain where I am coming from?
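
    For reference, the conversion described above is just a two-bit shift; a tiny C illustration (the addresses are chosen arbitrarily):

        #include <stdio.h>

        int main(void)
        {
            unsigned long_addr = 0x402;              /* instruction address, counted in longs */
            unsigned byte_addr = long_addr << 2;     /* hardware appends "00" -> byte address */
            printf("long $%X -> byte $%X\n", long_addr, byte_addr);   /* long $402 -> byte $1008 */
            return 0;
        }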
  • potatohead Posts: 10,255
    edited 2015-10-01 18:27
    I really dislike data access being long only. Addresses for execute? Fine.

    We need byte access for data. How can we have many COGS working on adjacent bytes?

    Edit: just saw Heater's post. Lots of check, read, modify, write cycles are needed to work only with longs.


