Hub Execution Model Thread (split from blog)

David Betz · 2013-12-05 11:25

jmg wrote: »

I would make a larger leap on the basis of Assembler Clarity.
( no change to the binary action, just to what the user 'sees' )

ie If the above opcodes work

ADD reg,#bigconstant & $1FF
BIG #bigconstant >> 9

but what that finally does is 'add bigconstant to reg', then it makes more sense to be able to write this in one ASM line

ADDI32 reg,#big32constant // does what it says

Now the assembler creates two 32 bit values, so you have a 2 word opcode.

if you do want to also support the more obtuse dual opcode in ASM then I'd use EXTend Immediate 32

ADD reg,#bigconstant & $1FF
EXTI32 #bigconstant >> 9

If that second opcode is context dependent on Any instruction having an immediate S or D, then the Assembler should check that, and give an error. ( another reason for the simpler, clearer one line syntax )

A smart assembler could even support this as well

ADD reg,#AnyConstant

and spawn one of two opcode sets (just like many ASMs now do automatically with JMP/CALL)
The LIST file should make it clear when 32 bit promotion occurred.

This is exactly what I proposed. :-)
Well, I didn't suggest the ADDI32 opcode but rather that the assembler notice when immediate operands were bigger than 9 bits and automatically supply the BIG instruction with the remaining 23 bits.
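
A minimal sketch of that assembler behaviour, written in Python rather than PASM so it can be run anywhere. The helper name `encode_immediate` and the tuple output format are hypothetical, made up for illustration; only the 9-bit / 23-bit split and the instruction-then-BIG ordering come from the posts above.

```python
IMM_BITS = 9                      # width of the S/D immediate field
IMM_MASK = (1 << IMM_BITS) - 1    # $1FF

def encode_immediate(mnemonic, reg, value):
    """Return the (mnemonic, reg, immediate) tuples an assembler might emit.

    Hypothetical helper: constants that fit in 9 bits produce one
    instruction; wider constants also get a BIG carrying the upper 23 bits.
    """
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("immediate must fit in 32 bits")
    low = value & IMM_MASK        # stays in the modified instruction
    high = value >> IMM_BITS      # goes into the BIG payload
    ops = [(mnemonic, reg, low)]
    if high:
        ops.append(("BIG", None, high))
    return ops

for op in encode_immediate("ADD", "reg", 0x31240):
    print(op)
```

A list file could flag the second tuple to make the 32 bit promotion visible, as jmg suggests.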

Bill Henning · 2013-12-05 11:32

David Betz wrote: »

I see what you're doing but are you sure you want to waste another COG location with a visible BIG register?

With hubexec, it's not a waste, but a win.

Outside of hubexec, it is just a regular register.

C code has a ton of cases where the result variable is used in the source expression, this would save many hub longs.

David Betz wrote: »

Also, I've already said why I think the low bits should be the 9 bits from the modified instruction. You haven't yet provided an example showing how having the BIG instruction supply the low bits would be useful, and I think it will be more complicated to implement in hardware.

I have provided such examples before, but here are the two most relevant:

Advantage #1: More readable, saves 1 long

RDLONG   reg,#0
BIG      #$31240   'more readable
' do math on reg
WRLONG   reg,$1F1

Saves one long in the hub.

Advantage #2: More readable, saves 2 longs, saves a hub cycle

(assuming 8-long is mapped to $1E0, which I strongly recommend for hubexec usage)

1e0: RDLONG   reg,#0
1e1: BIG      #$31240   ' advantage #1: more readable
1e2: WRLONG   reg,#2
1e3: BIG      #$12540   ' advantage #1: more readable
1e4: RDLONG   reg2, $1E1 ' re-use pointer in BIG
1e5: ADD      reg2, #128
1e6: WRLONG   reg2, $1E1 ' re-use pointer in BIG
1e7: ' do something else

above code saves 2 longs in the hub, and one hub fetch cycle.

If the BIG constant was left justified, we could not save the two longs.

Well worth it, even if the GCC code generator can't make use of it. Spin, other VM's and compilers can.

I believe that should convince you.

Bill Henning · 2013-12-05 11:39

Not quite, jmg is right justifying like I want - note the "bigconstant>>9"

David Betz wrote: »

This is exactly what I proposed. :-)
Well, I didn't suggest the ADDI32 opcode but rather that the assembler notice when immediate operands were bigger than 9 bits and automatically supply the BIG instruction with the remaining 23 bits.

jmg · 2013-12-05 11:39

David Betz wrote: »

This is exactly what I proposed. :-)

Oops, I merely did a quick scan, and only noticed the continued BIG syntax mentioned.

ersmith · 2013-12-05 11:42

Bill Henning wrote: »
Advantage #1: More readable, saves 1 long
RDLONG   reg,#0
BIG      #$31240   'more readable
' do math on reg
WRLONG   reg,$1F1

The readability is moot, since the assembler should handle the addressing. I'd also argue that from a tools perspective it's *less* useful to have the big value in the lower bits, since then you need to handle two different cases (sometimes an immediate value will mean X, sometimes it will mean X<<9). You also need to heavily comment this code, since the casual reader will not know what's in $1F1. It's another "gotcha" for the PASM programmer.

Ultimately it should be whatever's easiest for Chip to design, of course :-).

David Betz · 2013-12-05 11:46

Bill Henning wrote: »
With hubexec, it's not a waste, but a win.

Outside of hubexec, it is just a regular register.

C code has a ton of cases where the result variable is used in the source expression, this would save many hub longs.

I have provided such examples before, but here are the two most relevant:

Advantage #1: More readable, saves 1 long
RDLONG   reg,#0
BIG      #$31240   'more readable
' do math on reg
WRLONG   reg,$1F1
Saves one long in the hub.

Advantage #2: More readable, saves 2 longs, saves a hub cycle

(assuming 8-long is mapped to $1E0, which I strongly recommend for hubexec usage)
1e0: RDLONG   reg,#0
1e1: BIG      #$31240   ' advantage #1: more readable
1e2: WRLONG   reg,#2
1e3: BIG      #$12540   ' advantage #1: more readable
1e4: RDLONG   reg2, $1E1 ' re-use pointer in BIG
1e5: ADD      reg2, #128
1e6: WRLONG   reg2, $1E1 ' re-use pointer in BIG
1e7: ' do something else
above code saves 2 longs in the hub, and one hub fetch cycle.

If the BIG constant was left justified, we could not save the two longs.

Well worth it, even if the GCC code generator can't make use of it. Spin, other VM's and compilers can.

I believe that should convince you.

If you really believe that there should be a user visible BIG register then you could achieve the same goal by having the full 32 bit value be stored in the user visible register. The "readability" argument is not relevant if the assembler builds these instructions automatically.

Bill Henning · 2013-12-05 11:46

Assuming assembly macros exist, I agree the readability is less critical... but should always be a consideration (ie simpler dis-assemblers)

I am not sure I agree, as $1F1 should have a registername like "LASTCONST32" or "LASTADDR", which are fairly easy to understand.

Regarding the potential immediate value confusion, perhaps the dis-assembler ought to present two-long instructions as one instruction then? That would eliminate the end-user confusion.

Good discussion.

ersmith wrote: »

The readability is moot, since the assembler should handle the addressing. I'd also argue that from a tools perspective it's *less* useful to have the big value in the lower bits, since then you need to handle two different cases (sometimes an immediate value will mean X, sometimes it will mean X<<9). You also need to heavily comment this code, since the casual reader will not know what's in $1F1. It's another "gotcha" for the PASM programmer.

Ultimately it should be whatever's easiest for Chip to design, of course :-).

jmg · 2013-12-05 11:47

Bill Henning wrote: »

1e3: BIG #$12540 ' advantage #1: more readable
1e4: RDLONG reg2, $1E1 ' re-use pointer in BIG

Oops, I think those words, "re-use pointer in BIG", are not supported by what Chip has actually done.

chip wrote:

ADD reg,#bigconstant & $1FF
BIG #bigconstant >> 9

That would add bigconstant to reg.

Instead of BIG, we should probably give it a name like AUGI for 'augment immediate'.

Any instruction having an immediate S or D would look for AUGI behind it. If it sees it and it's not cancelled, it extends the immediate value right in the pipeline, before it gets to stage 4. This was a really clever idea you guys came up with, and it turns out that it can be done by using the registers already in the pipeline, so it's almost free!

Notice the fancy footwork in the pipeline here, and the fact the EXTI32 follows the opcode.
Looks like a mono-flop opcode extension to me ? - and making something like this have extended life, has other fish hooks.
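
A rough Python model of that one-shot behaviour, as I read Chip's description: the extension word only affects the instruction directly in front of it and then has no further effect. The function `resolve` and the `(mnemonic, immediate)` stream format are hypothetical, and the real pipeline may well differ.

```python
def resolve(stream):
    """Combine each instruction with an AUGI that directly follows it.

    stream is a list of (mnemonic, immediate) pairs. An AUGI payload
    supplies the upper 23 bits; the instruction supplies the low 9 bits.
    """
    out = []
    i = 0
    while i < len(stream):
        op, imm = stream[i]
        if op == "AUGI":
            raise ValueError("AUGI with no instruction in front of it")
        nxt = stream[i + 1] if i + 1 < len(stream) else None
        if nxt is not None and nxt[0] == "AUGI":
            out.append((op, (nxt[1] << 9) | imm))   # extended immediate
            i += 2                                  # AUGI is consumed
        else:
            out.append((op, imm))                   # plain 9-bit immediate
            i += 1
    return out

print(resolve([("ADD", 0x040), ("AUGI", 0x189), ("SUB", 3)]))
```

In this model the SUB is untouched by the earlier AUGI, which is the "mono-flop" property: nothing persists for a later instruction to re-use.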

ersmith · 2013-12-05 11:48

cgracey wrote: »

It would probably never be used for in-cog code, as it would waste a cycle as the dummy data-payload instruction floated through the pipeline, but it would provide code executing from the hub a way to have 32-bit constants without resorting to complicated means.

The suffix would work great for hardware HUBEXEC. The prefix has an advantage though that if the hardware HUBEXEC doesn't work (or if it's too hard to implement) we could fall back to traditional LMM and still use the prefix form of BIG, as long as the BIG register contents were only cleared after use (rather than after every instruction). We'd have to write the LMM loop to not contain any immediates, but that wouldn't be hard.

Bill Henning · 2013-12-05 11:48

David,

You are not addressing Advantage #2, which saves two longs out of 9 (encoding the above code in 7 longs) precisely because of the right justified constant embedded in BIG.

By the way, the visible register would store the full 32 bit value.

David Betz wrote: »

If you really believe that there should be a user visible BIG register then you could achieve the same goal by having the full 32 bit value be stored in the user visible register. The "readability" argument is not relevant if the assembler builds these instructions automatically.

David Betz · 2013-12-05 11:50

jmg wrote: »

Oops, I think those words re-use pointer in BIG, are not supported by what Chip has actually done

Notice the fancy footwork in the pipeline here, and the fact the EXTI32 follows the opcode.
Looks like a mono-flop opcode extension to me ? - and making something like this have extended life, has other fish hooks.

I'm still afraid that this fancy footwork in the pipeline might get in the way of running other hardware tasks along with the hub memory task. Bill says that isn't necessary and maybe he's right but we should keep in mind that limitation though.

David Betz · 2013-12-05 11:50

Bill Henning wrote: »

David,

You are not addressing Advantage #2, which saves two longs out of 9 (encoding the above code in 7 longs) precisely because of the right justified constant embedded in BIG.

By the way, the visible register would store the full 32 bit value.

If the visible register stores the full 32 bits then it doesn't matter whether the BIG instruction supplies the low bits or the high bits.

Bill Henning · 2013-12-05 11:51

I am taking advantage of knowing where the 8-long is mapped, which would work regardless of pipeline footwork as it would be a cog-addressable register, and a compiler would know where it placed the long (perhaps not gcc, it may be too difficult to implement in its back end architecture).

Sorry, don't understand the mono flop extension comment.

jmg wrote: »

Oops, I think those words re-use pointer in BIG, are not supported by what Chip has actually done

Notice the fancy footwork in the pipeline here, and the fact the EXTI32 follows the opcode.
Looks like a mono-flop opcode extension to me ? - and making something like this have extended life, has other fish hooks.

jmg · 2013-12-05 11:51

Bill Henning wrote: »

Regarding the potential immediate value confusion, perhaps the dis-assembler ought to present two-long instructions as one instruction then? That would eliminate the end-user confusion.

Of course, the user should always see the simplest asm/disasm, and closest to what they intended/did.
Plenty of micros have variable width opcodes.

jmg · 2013-12-05 11:52

David Betz wrote: »

I'm still afraid that this fancy footwork in the pipeline might get in the way of running other hardware tasks along with the hub memory task. Bill says that isn't necessary and maybe he's right but we should keep in mind that limitation though.

Easy to test when Chip has this done ... ?

Bill Henning · 2013-12-05 11:53

prefix/suffix is up to Chip - like all of this

The only case Chip's pipeline suffix may cause issues is running multiple tasks alongside the LMM loop... talk about slowing LMM down more!

Also, I am fine with the prefix form, as long as the constant is right justified (see Advantage#2 to see how it saves memory and a hub cycle)

ersmith wrote: »

The suffix would work great for hardware HUBEXEC. The prefix has an advantage though that if the hardware HUBEXEC doesn't work (or if it's too hard to implement) we could fall back to traditional LMM and still use the prefix form of BIG, as long as the BIG register contents were only cleared after use (rather than after every instruction). We'd have to write the LMM loop to not contain any immediates, but that wouldn't be hard.

Bill Henning · 2013-12-05 11:54

Incorrect.

Left justifying removes the savings of Advantage #2, which you are deliberately not addressing at this point.

David Betz wrote: »

If the visible register stores the full 32 bits then it doesn't matter whether the BIG instruction supplies the low bits or the high bits.

jmg · 2013-12-05 11:58

Bill Henning wrote: »

I am taking advantage of knowing where the 8-long is mapped, which would work regardless of pipeline footwork as it would be a cog-addressable register, and a compiler would know where it placed the long (perhaps not gcc, it may be too difficult to implement in its back end architecture).

Sorry, don't understand the mono flop extension comment.

The way I read Chip's description, it sounds like the work is all done live, in the pipeline, and there is no read of another LONG.
That also means there is a must-be-following caveat on the EXTI32, and then it vanishes, hence the mono-flop (one off / one shot) description ( I could use the English volatile, but some languages have that horribly mangled.)

David Betz · 2013-12-05 11:58

Bill Henning wrote: »

Incorrect.

Left justifying removes the savings of Advantage #2, which you are deliberately not addressing at this point.

Sorry, I guess I don't understand your advantage #2. It looks like all you're doing is using the user visible register to reuse the address computed in the last BIG instruction sequence. This could just as easily be done if the BIG instruction supplies the high bits and the instruction being modified supplies the low bits. As long as all 32 bits are stored in the user visible register then I don't see why your example wouldn't work.

Bill Henning · 2013-12-05 12:11

No worries, I guess you did not differentiate the $1E1 reference to the in-cog BIG instruction, using it as a 23 bit pointer.

My apologies, my original comment must have misled you.

Let's see if I can clear up the misunderstanding.

(assuming 8-long is mapped to $1E0, which I strongly recommend for hubexec usage)

Same hub address as Advantage#2, coded for right justified

#$31240 ' = 0011 0001 0010 0100 0000 in binary

Visible BIG at $1F1 would contain the second address, value irrelevant, as it would overwrite $31240 from the first BIG

1e0: RDLONG   reg,#$040  ' low nine bits of $31240 (0011 0001 0010 0100 0000)
1e1: BIG      #$0000189  ' remaining high bits, $31240 >> 9
1e2: WRLONG   reg,#2
1e3: BIG      #$12540   ' actual value irrelevant to example, clobbers 32 bit big built in first two instructions
1e4: RDLONG   reg2, $1E1 ' CANNOT reuse pointer at $1e1, wrong address if left justified, CAN use if right justified
1e5: ADD      reg2, #128
1e6: WRLONG   reg2, $1E1 ' CANNOT reuse pointer at $1e1, wrong address if left justified, CAN use if right justified
1e7: ' do something else

I hope you now see why left justified does not allow saving space that right justified does. An instruction at $1e7 could even use the pointer from $1e3

Basically, this is a memory savings strategy, to maximize even the 256KB hub.
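
To make the difference concrete, here is a small Python sketch comparing what the BIG long would hold under each scheme. It rests on two assumptions that are mine, not confirmed by Chip: that the BIG payload sits in the low 23 bits of the long, and that a hub access using that long as a pointer ignores the opcode bits above the address. The function names are hypothetical.

```python
PAYLOAD_BITS = 23
PAYLOAD_MASK = (1 << PAYLOAD_BITS) - 1

def big_payload_right_justified(addr):
    """The instruction carries #0 and BIG carries the whole address."""
    return addr & PAYLOAD_MASK

def big_payload_split(addr):
    """The instruction carries addr & $1FF and BIG carries addr >> 9."""
    return addr >> 9

def reusable_as_pointer(payload, addr):
    """A later RDLONG/WRLONG can re-use the BIG long only if it equals addr."""
    return payload == addr

addr = 0x12540
for name, fn in (("right justified", big_payload_right_justified),
                 ("split", big_payload_split)):
    payload = fn(addr)
    print(f"{name}: payload=${payload:X} reusable={reusable_as_pointer(payload, addr)}")
```

With the split encoding the long at $1E3 holds only the upper bits, so code that re-uses it needs a separate pointer long, which is where the two extra longs come from.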

David Betz wrote: »

Sorry, I guess I don't understand your advantage #2. It looks like all you're doing is using the user visible register to reuse the address computed in the last BIG instruction sequence. This could just as easily be done if the BIG instruction supplies the high bits and the instruction being modified supplies the low bits. As long as all 32 bits are stored in the user visible register then I don't see why your example wouldn't work.

David Betz · 2013-12-05 12:16

Bill Henning wrote: »
No worries, I guess you did not differentiate the $1E1 reference to the in-cog BIG instruction, using it as a 23 bit pointer.

My apologies, my original comment must have misled you.

Let's see if I can clear up the misunderstanding.

(assuming 8-long is mapped to $1E0, which I strongly recommend for hubexec usage)

Same hub address as Advantage#2, coded for right justified

#$31240 ' = 0011 0001 0010 0100 0000 in binary

Visible BIG at $1F1 would contain the second address, value irrelevant, as it would overwrite $31240 from the first BIG
1e0: RDLONG   reg,#$040  ' low nine bits of $31240 (0011 0001 0010 0100 0000)
1e1: BIG      #$0000189  ' remaining high bits, $31240 >> 9
1e2: WRLONG   reg,#2
1e3: BIG      #$12540   ' actual value irrelevant to example, clobbers 32 bit big built in first two instructions
1e4: RDLONG   reg2, $1E1 ' CANNOT reuse pointer at $1e1, wrong address if left justified, CAN use if right justified
1e5: ADD      reg2, #128
1e6: WRLONG   reg2, $1E1 ' CANNOT reuse pointer at $1e1, wrong address if left justified, CAN use if right justified
1e7: ' do something else
I hope you now see why left justified does not allow saving space that right justified does. An instruction at $1e7 could even use the pointer from $1e3

Basically, this is a memory savings strategy, to maximize even the 256KB hub.

Okay, I see what you mean now but this seems very obscure. I guess one plays tricks like this to get fast code but I'd hate to have to maintain code written like this.

Bill Henning · 2013-12-05 12:20

Thanks, now I understand.

I was basing my comments on how the four long equivalent was mapped on an earlier FPGA, and if Chip does it the same way, it would also be visible.

Chip will tell us soon I hope.

Heck, knowing what Chip is like, we may be playing with hubexec this weekend!

jmg wrote: »

The way I read Chip's description, it sounds like the work is all done live, in the pipeline, and there is no read of another LONG.
That also means there is a must-be-following caveat on the EXTI32, and then it vanishes, hence the mono-flop (one off / one shot) description ( I could use the English volatile, but some languages have that horribly mangled.)

Bill Henning · 2013-12-05 12:22

The beauty is... only those who need to, will use it - and compilers, once set up for it, can take advantage of such tricks without regular users having to touch it.

It boils down to this: right justifying saves hub space, and macros can make it look good.

David Betz wrote: »

Okay, I see what you mean now but this seems very obscure. I guess one plays tricks like this to get fast code but I'd hate to have to maintain code written like this.

David Betz · 2013-12-05 12:24

Bill Henning wrote: »

The beauty is... only those who need to, will use it - and compilers, once set up for it, can take advantage of such tricks without regular users having to touch it.

It boils down to this: right justifying saves hub space, and macros can make it look good.

I guess we'll see what Chip has to say about handling the S field differently depending on whether there is a BIG instruction in the pipeline or not. Maybe it's trivial and you'll get your wish.

Bill Henning · 2013-12-05 12:42

Of course it is up to Chip.

David, with me you can always bet that whatever suggestion I make has solid technical reasons behind it - and that it aims to make the best P2 within time and process constraints.

At times, I may not express it sufficiently well in an initial posting, or in enough detail, to make it as obvious to others as it is to me. I will try to be better at that. I know at times I come off sounding arrogant, but that is never my intent - and plenty of others have that shortcoming too.

I never mind explaining my technical reasons (time permitting), but I do get frustrated when people don't try to understand my responses. And I can always be convinced with a better technical argument.

If left justifying saved memory or time, I'd be all over it.

I am all over right justifying precisely because it saves precious hub resources, and will make the P2 a little faster.

I've invested far too much time and money in the prop, and I need for Parallax to succeed to recover that investment (from Propeller products that I have made, and will make - I have two more coming over the next few months), but besides that - I like the Prop, Chip and the rest of Parallax, even the vast majority of the forumistas, which is reason enough to want Parallax to succeed.

David Betz wrote: »

I guess we'll see what Chip has to say about handling the S field differently depending on whether there is a BIG instruction in the pipeline or not. Maybe it's trivial and you'll get your wish.

David Betz · 2013-12-05 12:48

Bill Henning wrote: »

Of course it is up to Chip.

David, with me you can always bet that whatever suggestion I make has solid technical reasons behind it - and that it aims to make the best P2 within time and process constraints.

At times, I may not express it sufficiently well in an initial posting, or in enough detail, to make it as obvious to others as it is to me. I will try to be better at that. I know at times I come off sounding arrogant, but that is never my intent - and plenty of others have that shortcoming too.

I never mind explaining my technical reasons (time permitting), but I do get frustrated when people don't try to understand my responses. And I can always be convinced with a better technical argument.

If left justifying saved memory or time, I'd be all over it.

I am all over right justifying precisely because it saves precious hub resources, and will make the P2 a little faster.

I've invested far too much time and money in the prop, and I need for Parallax to succeed to recover that investment (from Propeller products that I have made, and will make - I have two more coming over the next few months), but besides that - I like the Prop, Chip and the rest of Parallax, even the vast majority of the forumistas, which is reason enough to want Parallax to succeed.

Sorry, I misread your example. You used $1F1 as the user visible BIG 32 bit value and $1E1 as a constant taken from the code stream. They looked similar enough that I didn't notice the difference and thought you were always talking about the BIG register. I still think it will be of limited value to be able to directly address instructions in the 8-long window. GCC doesn't even know where instructions are going to end up in memory so it won't be easy to guarantee that they end up on an 8-long boundary and if the linker is forced to align them then the extra NOP instructions it will have to insert will probably negate much of the benefit of your scheme. Of course, you can easily get around that for assembly code. I guess that's where you'll see the most advantage to this hack.
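
A back-of-envelope Python sketch of that trade-off. The 8-long window and the two longs saved come from Bill's Advantage #2; the function names are hypothetical, and the model assumes the linker pads with one NOP per long up to the next window boundary.

```python
WINDOW_LONGS = 8

def nops_to_align(long_index, window=WINDOW_LONGS):
    """NOPs needed so the next instruction starts on a window boundary."""
    return (-long_index) % window

def net_longs_saved(long_index, longs_saved=2, window=WINDOW_LONGS):
    """Longs saved by the trick minus the padding spent to line it up."""
    return longs_saved - nops_to_align(long_index, window)

for start in (16, 14, 13, 9):
    print(start, nops_to_align(start), net_longs_saved(start))
```

Under these assumptions the trick only pays off when the sequence already starts within two longs of a boundary, which hand-placed assembly can arrange and a compiler generally cannot.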

Bill Henning · 2013-12-05 12:54

No worries, I later realized that my original comments on the code were incorrect, and that $1E1 and $1F1 sure look similar :-)

I agree that it may not be worthwhile to support this optimization in PropGCC - especially in the first release!

But as you note, assembly code - which is what most of my published prop code tends to be - can easily make use of it.

I LOVE the idea of not being constrained to <512 long assembly language programs without taking the LMM hit!

Being able to use the AUX stack for hubexec will also help my code (and a lot of other code) be much faster and smaller, even if PropGCC does not use it.

Spin, other VM's, and potentially other compilers can make use of it, if the writer thinks it's worth the effort, and it fits the compiler's architecture.

Besides, an additional peephole generation stage for GCC (for PASM assembly code) could be added later, that could take advantage of pointers embedded in BIG's in the current window.

I think that the hubex instructions, and big, will remove much of the uncertainty for GCC, and it could later be made eight long window aware, avoiding the excess NOP's and allowing better optimization. Of course that need not be done at all, or can be left to the future.

David Betz wrote: »

Sorry, I misread your example. You used $1F1 as the user visible BIG 32 bit value and $1E1 as a constant taken from the code stream. They looked similar enough that I didn't notice the difference and thought you were always talking about the BIG register. I still think it will be of limited value to be able to directly address instructions in the 8-long window. GCC doesn't even know where instructions are going to end up in memory so it won't be easy to guarantee that they end up on an 8-long boundary and if the linker is forced to align them then the extra NOP instructions it will have to insert will probably negate much of the benefit of your scheme. Of course, you can easily get around that for assembly code. I guess that's where you'll see the most advantage to this hack.

David Betz · 2013-12-05 12:57

I'll still be amazed if we actually get this "execute from hub" feature in P2. It seems like a pretty big step. I hope Chip can find a way to do it with a minimum of risk!

Bill Henning · 2013-12-05 13:03

Chip has pulled off much more impressive and complex feats in very little time

Given that the four long RDQUAD was already present, hubexec (IMHO) is not a lot of extra verilog, and it should not be too complex. Chip would know better than me.

Besides, an FPGA implementation will let us test it - and my understanding is due to transistor, dac bus, etc changes, we are ~4 months from the next possible shuttle run, leaving time for testing.

Even if it works in the FPGA, and not in the April shuttle, it can simply be ignored, and I think SETTRACE will make debugging the shuttle run MUCH easier.

Having hubex does not affect the ability to use LMM... and I am betting that Chip will show up with an FPGA image, with hubex, far sooner than any of us expect.

David Betz wrote: »

I'll still be amazed if we actually get this "execute from hub" feature in P2. It seems like a pretty big step. I hope Chip can find a way to do it with a minimum of risk!

Cluso99 · 2013-12-05 14:06

Bill, while I can see the advantages of automatically placing the 32bit immediate result into a register (be it $1F1 or wherever), I think it's more of a kludge that will cause a lot of misunderstanding. Personally, I'd rather not have that feature because of its obscurity.

I have just thought of a potential "gotcha". What if the two instructions (xxx and BIG/AUGI) were over a WIDE break (ie spread over an 8*Long boundary)?
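
An assembler or linker could at least detect that case. A minimal Python sketch, assuming an 8-long window and that the BIG/AUGI sits in the long directly after the instruction it extends; `pair_straddles_window` is a hypothetical name, and whether a straddle is actually a problem depends on how Chip implements the fetch.

```python
def pair_straddles_window(long_index, window=8):
    """True if the instruction at long_index and the extension word at
    long_index + 1 fall into different windows."""
    return long_index // window != (long_index + 1) // window

# Only the last slot of each window puts the pair across a boundary.
print([i for i in range(24) if pair_straddles_window(i)])
```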

Oh, and I love the assembler emitting the pair of instructions automatically if the constant is >9 bits.

This last week has certainly seen some massive advances in P2's abilities. Thanks go to "Thanksgiving Holidays" (and we don't have it here in Oz).
