The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

dMajo · 2014-04-13 05:16

cgracey wrote: »

I realized tonight that it is hard to support hub exec and not get mired back in the complexities of the Prop2.

The problems arise from hub exec needing INDA/INDB-type functionality to overcome the impracticality of self-modifying code in hub memory. INDA/INDB require an effective pipeline stage, unto themselves, in order to take the instruction, recognize INDA/INDB usage, and then substitute the INDA/INDB pointers into the S and D fields of the instruction before reading S and D. The Prop1's simpler architecture just feeds the S and D fields of the instruction straight into the address inputs of the cog RAM to read S and D on the next clock, requiring no extra stage. This pipeline stage needed to support INDA/INDB increases the number of instructions trailing a branch (which will need cancellation) from one to two. That, in turn, has the effect of requiring Prop2-type INDA/INDB backtracking circuits to accommodate various cancellation scenarios. This pipeline stage is also required to make the RDxxxxC cached reads work. To implement hub exec, we're going to be complicating the cog quite a bit. Delayed branches will become more necessary to regain looping performance. I think this is the wrong road, in light of what became of the P2 at 180nm.

There is a way around this, and that is to resolve all the INDA/INDB stuff on the same cycle as the instruction read, tacking the INDA/INDB logic time onto the cog RAM access time. This would have the effect of slowing the clock down by ~30%, I estimate. It would require fewer flops and not introduce a new pipeline stage, but would instantly create what will remain the critical path in the cog.

In summary, if we don't pursue hub exec and INDA/INDB, we're much simpler, which means smaller and faster. There is little cost and no extra pipelining required to implement PTRA/PTRB, though, so that is viable. Same with a hardware LIFO stack for CALL/RET. And 128-bit hub transfers are no problem, either. Same with keeping the four tasks - those make a lot possible and are almost free.

Is hub exec worth slowing the cog down for?

I do not get all the problems here, perhaps due to my ignorance.

We now have self-modifying code in P1 that needs a few instructions/nops between the modifying and the modified instruction. If the issue is due to the 4-long icache, isn't it enough to have a rule that for hubexec we need at least 4/5 nops in between?
Couldn't the problem be solved by adopting two instructions that change the D and/or S for the next instruction, like it was for the P2? Wouldn't that work even in the icached area and even avoid the required nops in between?
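
To make the P1 rule above concrete, here is a minimal Python sketch of why a gap is needed between the modifying and the modified instruction. It assumes a toy machine that fetches the next instruction before the current one executes; the tuple encoding, the `run` helper and the three opcodes are hypothetical illustrations, not the real P1 pipeline.

```python
# Toy model (an assumption, not the real P1 pipeline): the next instruction
# is fetched before the current one executes, so a patch made by the current
# instruction arrives too late for its immediate successor.

NOP = ("nop", 0, 0)

def run(mem, count):
    """Execute `count` instructions from mem[0]; return the (d, s) pairs seen by each mov."""
    seen = []
    pc = 0
    fetched = mem[pc]                      # already sitting in the pipeline
    for _ in range(count):
        current = fetched
        pc += 1
        fetched = mem[pc] if pc < len(mem) else NOP   # fetched before `current` runs
        op, d, s = current
        if op == "movs":                   # patch the S field of the instruction at address d
            target_op, target_d, _ = mem[d]
            mem[d] = (target_op, target_d, s)
        elif op == "mov":
            seen.append((d, s))
    return seen

# Patch immediately followed by the patched instruction: the stale S is used.
back_to_back = run([("movs", 1, 42), ("mov", 7, 0)], 2)
# One instruction in between: the patched S is used.
with_gap = run([("movs", 2, 42), NOP, ("mov", 7, 0)], 3)
print(back_to_back)
print(with_gap)
```

In this toy model a single intervening instruction is enough; how many would be needed with a 4-long icache is exactly the open question in the post.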

If not, then not having the hubexec mode will not be an issue; the old LMM will still work, won't it?
BTW, does the slower speed mean that there will be some 2-clock instructions and some 4-clock ones, or that everything will be slower?
Do not reduce the overall speed; IMHO that is not a good compromise.

EDIT: reading through the posts I've seen that the gurus have already found a solution.

ozpropdev · 2014-04-13 05:35

Heater. wrote: »

ozpropdev,

How is indexing through COG RAM whilst executing from HUB a useful feature? Who would miss it if we could not do it?

In a debugger written for hub-exec, it is quite common to read/modify cog register values.
The bonus of hub-exec is that we now get 495 registers; being able to monitor these is important, to me anyway.
Cheers
Brian

Heater. · 2014-04-13 05:42

OK, a debugger sounds like a use case.

Is it essential, a show stopper? I presume monitoring those registers can be done some other way, perhaps not as fast but doable.

RossH · 2014-04-13 05:45

ozpropdev wrote: »

In a debugger written for hub-exec, it is quite common to read/modify cog register values.
The bonus of hub-exec is that we now get 495 registers; being able to monitor these is important, to me anyway.
Cheers
Brian

I don't see the connection. A debugger can view or modify any cog register already.

Ross.

David Betz · 2014-04-13 05:54

dMajo wrote: »

I do not get all the problems here, perhaps due to my ignorance.

We now have self-modifying code in P1 that needs a few instructions/nops between the modifying and the modified instruction. If the issue is due to the 4-long icache, isn't it enough to have a rule that for hubexec we need at least 4/5 nops in between?
Couldn't the problem be solved by adopting two instructions that change the D and/or S for the next instruction, like it was for the P2? Wouldn't that work even in the icached area and even avoid the required nops in between?

If not, then not having the hubexec mode will not be an issue; the old LMM will still work, won't it?
BTW, does the slower speed mean that there will be some 2-clock instructions and some 4-clock ones, or that everything will be slower?
Do not reduce the overall speed; IMHO that is not a good compromise.

EDIT: reading through the posts I've seen that the gurus have already found a solution.

I would hate to lose hub execution mode just because it's too difficult to solve the problem of self-modifying COG code. I'd rather give that up when running from hub memory or use COG subroutines to accomplish it. Don't throw out the baby with the bath water! :-)

RossH · 2014-04-13 05:59

David Betz wrote: »

I would hate to lose hub execution mode just because it's too difficult to solve the problem of self-modifying COG code. I'd rather give that up when running from hub memory or use COG subroutines to accomplish it. Don't throw out the baby with the bath water! :-)

Since we will still have LMM, I'd rather lose HUB execution mode than slow down COG execution mode by such a massive amount.

However, I think Chip is already on track for a solution that will allow us to keep both.

Ross.

Bill Henning · 2014-04-13 07:15

Hi David,

1-7 are correct

INDA/INDB are needed for cog-only code, small stacks, small FIFOs, and so that hubexec code can index cog memory (as hubexec cannot self-modify without jumping through tons of hoops)

PTRA/PTRB and LOCPTR are needed for efficient indexed hub memory access

AUGS/AUGD are needed for efficient 32 bit constant and address loading

The above allow getting rid of the LMM support routines in the cog, and save a lot of hub memory (and increase instruction throughput)

In conjunction with the single line i & d caches, it allows hubexec to blast through the "lmm speed barrier"

As currently proposed, this leads to a 50MIPS hubexec mode (could be 100MIPS with my slot mapping, or Chip's 256 bit wide bus)

Hope this helps,

Bill

David Betz wrote: »

I don't understand this proposed hubexec model. Can someone please explain it in a single message rather than scattered across the entire thread? I don't believe we need anything other than the following:

1) PC extended to 17 bits to address 512K of hub memory
2) When the high 8 bits of the PC are zero, it points to COG memory
3) When the high 8 bits of the PC are non-zero, it points to hub memory
4) CALL/JMP instructions with 17 bit address fields
5) A CALL instruction that stores its full 17 bit return address in a register
6) A way to load a 32 bit constant (AUGS)
7) A RDLONGC-like facility that allows 128 bits to be fetched at once and used as a one-line i-cache
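
Points 1-3 of this list can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the PC is taken to count longs (2^17 longs x 4 bytes = 512 KB), the low 9 bits are taken to address the 512 cog registers, and `decode_pc` is a hypothetical helper name, not anything specified in the thread.

```python
# Assumptions: the PC counts longs, is 17 bits wide, and the low 9 bits
# address the 512 cog registers (points 1-3 of the list above).

PC_BITS = 17
COG_ADDR_BITS = 9

def decode_pc(pc):
    """Return ("cog", register index) or ("hub", byte address) for a 17-bit PC."""
    if not 0 <= pc < (1 << PC_BITS):
        raise ValueError("PC out of range")
    if pc >> COG_ADDR_BITS == 0:           # high 8 bits are zero
        return ("cog", pc)
    return ("hub", pc * 4)                 # long index -> byte address

print(decode_pc(0x1FF))
print(decode_pc(0x200))
print((1 << PC_BITS) * 4)                  # total bytes reachable by a 17-bit long address
```

Under these assumptions the lowest 512 long addresses always select cog memory, so code could not be executed from the corresponding hub longs.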

What else is needed? In fact, number 7 could be left out in a really simple implementation but I think Bill determined that would only be 25% faster than LMM.

What is all of this about INDA, INDB, ALTD, ALTS, etc? What am I missing here?

Edit: Okay, I think I'm beginning to understand. Chip said INDA-like functionality. I guess the problem is with my points 2 and 3 where the value in PC is sometimes treated as a hub address and sometimes as a COG address. Is that the issue?

Bill Henning · 2014-04-13 07:17

We need INDA/INDB for efficient assembly code; it does not matter if gcc needs them.

If we did not put in things pasm needed, we could not write high efficiency cog drivers, so gcc will still benefit from them indirectly.

David Betz wrote: »

Okay, I understand. That sounds like a cool facility but I don't think it will be needed by PropGCC. As I said, it seems like it's orthogonal to hub execution. Without INDA and INDB we are in no worse shape than we were in with P1 which also had no indirect COG memory instructions.

Bill Henning · 2014-04-13 07:19

David,

I am confused.

- Why would we lose hubexec due to self modifying cog code?

The self modifying code, and INDx, will be used by pure pasm code, so it has nothing whatsoever to do with hubexec.

David Betz wrote: »

I would hate to lose hub execution mode just because it's too difficult to solve the problem of self-modifying COG code. I'd rather give that up when running from hub memory or use COG subroutines to accomplish it. Don't throw out the baby with the bath water! :-)

Bill Henning · 2014-04-13 07:23

Generic plea:

Guys, let's wait for the P1+ fpga image and power analysis before proposing any more cuts, ok?

Features such as INDA/INDB/PTRA/PTRB/AUGS/AUGD take tiny amounts of gates, getting rid of them will have no noticeable effect.

64-80 "smart pins" with 32 bit clkfreq counters etc will use incredibly more power, heck one of those pins will use more gates than ALL of the features on the line above in every cog!

IF we need to cut power (again) - which I doubt - reducing the number of smart pins is the fastest way to do so.

Seairth · 2014-04-13 07:29

cgracey wrote: »

To do INDA/INDB, the cogs, themselves, would either slow down or become very complicated.

Instructions are going to be two clocks, no matter what. INDA/INDB would increase the pipeline depth from two (which is so simple, there's not much to label as 'pipeline') to three (which has all kinds of uglier ramifications).

If you were to remove the increment/decrement functionality of INDx, then I'm guessing the use of INDx would be reduced to a MUX (with S/D) for the load. It would still require additional instructions to manipulate INDx, but that's not the critical function of those registers. Would that resolve the critical path timing issue?

Seairth · 2014-04-13 07:32

cgracey wrote: »

I'm really glad we've got a solution now. What a relief! In some ways, this is better than INDA/INDB.

Hmm. Should have read further. Ignore my last post, then. :P.

Seairth · 2014-04-13 07:35

Cluso99 wrote: »

Just so everyone realises, MOV [ptrd],[ptrs] would still be limited to registers (cog not hub).

And, from what I can tell, take two additional clock cycles per indirect fix-up.

Bill Henning · 2014-04-13 08:20

Hi Chip,

(I did not go back far enough, so I responded to some posts without seeing this. Sorry for the out of sequence comments!)

I think there is no need to support self-modifying code in the hub, or at least I can't come up with a case for it.

I think we need INDA/INDB for faster cog stack/buffer addressing for both hub and cog execution.

PTRA/PTRB take care of indexed hub access for stacks, arrays, FIFOs etc. nicely.

In summary,

PLEASE don't drop INDA/INDB, and please don't worry about self-modifying hubexec code.

Ok, found the post where you explained the complications, and why it would need three cycles.

Thanks,

Bill

cgracey wrote: »

I realized tonight that it is hard to support hub exec and not get mired back in the complexities of the Prop2.

The problems arise from hub exec needing INDA/INDB-type functionality to overcome the impracticality of self-modifying code in hub memory. INDA/INDB require an effective pipeline stage, unto themselves, in order to take the instruction, recognize INDA/INDB usage, and then substitute the INDA/INDB pointers into the S and D fields of the instruction before reading S and D. The Prop1's simpler architecture just feeds the S and D fields of the instruction straight into the address inputs of the cog RAM to read S and D on the next clock, requiring no extra stage. This pipeline stage needed to support INDA/INDB increases the number of instructions trailing a branch (which will need cancellation) from one to two. That, in turn, has the effect of requiring Prop2-type INDA/INDB backtracking circuits to accommodate various cancellation scenarios. This pipeline stage is also required to make the RDxxxxC cached reads work. To implement hub exec, we're going to be complicating the cog quite a bit. Delayed branches will become more necessary to regain looping performance. I think this is the wrong road, in light of what became of the P2 at 180nm.

There is a way around this, and that is to resolve all the INDA/INDB stuff on the same cycle as the instruction read, tacking the INDA/INDB logic time onto the cog RAM access time. This would have the effect of slowing the clock down by ~30%, I estimate. It would require fewer flops and not introduce a new pipeline stage, but would instantly create what will remain the critical path in the cog.

In summary, if we don't pursue hub exec and INDA/INDB, we're much simpler, which means smaller and faster. There is little cost and no extra pipelining required to implement PTRA/PTRB, though, so that is viable. Same with a hardware LIFO stack for CALL/RET. And 128-bit hub transfers are no problem, either. Same with keeping the four tasks - those make a lot possible and are almost free.

Is hub exec worth slowing the cog down for?

Bill Henning · 2014-04-13 08:23

Chip,

I am really confused.

Why do you want to get rid of INDA/INDB?

It is VERY useful for cog only mode, and hubexec could use it for a small in-cog stack, buffers etc.

I really don't see the need for self-modifying hubexec, so I don't understand the problem with INDA/INDB.

Please help me understand!

Ok, I now understand, I skipped two pages of posts this morning - not a good idea!

cgracey wrote: »

Good points. We COULD have hub exec without INDA/INDB - the programmer would just need to have some routine he could call in the register space to do 'indirect' addressing via self-modifying code that lives there. That's not optimal, but it allows everything else to remain fast.

Bill Henning · 2014-04-13 08:32

Ok, now later posts make sense - thanks

cgracey wrote: »

To do INDA/INDB, the cogs, themselves, would either slow down or become very complicated.

Instructions are going to be two clocks, no matter what. INDA/INDB would increase the pipeline depth from two (which is so simple, there's not much to label as 'pipeline') to three (which has all kinds of uglier ramifications).

Bill Henning · 2014-04-13 08:40

I REALLY like this.

Keeps us to two cycles for all the instructions, four if we need one prefix.

Can we get (for low cost) the following for D/# in both ALTD and ALTS:

++D (applied before D is used)
--D

D++ (done after instruction referencing D executes)
D--

+# (relative to contents of S or D specified in instruction)
-#

Example 1:

Because then we get

ALTS ptr++
RDBYTEC ch,ptr

More nicely written as

RDBYTE ch,[ptr]++ ' walk a buffer

Slower than PTRA/B, but we have those where we need more speed.

Example 2:

ALTD buffer++
ALTS ptr++
RDLONGC buffer,ptr

Nicely written as:

RDLONG [buffer]++, [ptr]++

Ok, looks great!
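
The requested pre/post increment and offset variants can be sketched as follows. This is a minimal Python model of one possible reading of the request, not a description of what Chip implemented: the mode names, the `alt_operand` helper and the register numbers are hypothetical, and the sketch indexes cog registers only (it does not model hub reads such as RDBYTE).

```python
MASK32 = 0xFFFFFFFF
FIELD_MASK = 0x1FF                          # 9-bit S or D field

def alt_operand(regs, reg, mode="plain", offset=0):
    """Return the 9-bit field to substitute, updating regs[reg] for the inc/dec modes."""
    if mode == "pre_inc":                   # ++D: update before the value is used
        regs[reg] = (regs[reg] + 1) & MASK32
    elif mode == "pre_dec":                 # --D
        regs[reg] = (regs[reg] - 1) & MASK32
    value = regs[reg]
    if mode == "post_inc":                  # D++: update after the value is used
        regs[reg] = (value + 1) & MASK32
    elif mode == "post_dec":                # D--
        regs[reg] = (value - 1) & MASK32
    elif mode == "offset":                  # +# / -#: pointer register left unchanged
        value = (value + offset) & MASK32
    return value & FIELD_MASK

# Walk a three-register buffer at cog addresses 0x40..0x42 with post-increment.
regs = {0x10: 0x40, 0x40: 11, 0x41: 22, 0x42: 33}
walked = [regs[alt_operand(regs, 0x10, "post_inc")] for _ in range(3)]
print(walked, hex(regs[0x10]))
```

The point of the sketch is only the ordering: pre-modes change the pointer before its value is substituted, post-modes afterwards, and the offset form leaves the pointer untouched.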

cgracey wrote: »

What you said made me realize that we could do something like AUGS/AUGD, but instead of augmenting the next S/D constant, we could alter the S/D field in the next instruction. This is the way to achieve indirection for S and D! This is REALLY simple.

Along with augmenting D and S constants, we could alter D and S registers:

ALTD D/#
ALTS S/#
ALTDS D/#,S/#

This:

ALTS ptr
MOV OUTA,0

Could also be coded as:

MOV OUTA,[ptr]
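
The field substitution described here can be sketched in Python. The sketch assumes the P1 instruction layout (D field in bits 17..9, S field in bits 8..0) carries over unchanged; `alt_s` and `alt_d` are hypothetical helper names, and the thread does not say how the hardware would actually do this.

```python
# P1-style 32-bit instruction layout: D field in bits 17..9, S field in bits 8..0.
# This only models "alter the S/D field in the next instruction".

FIELD_MASK = 0x1FF
D_SHIFT = 9
MASK32 = 0xFFFFFFFF

def alt_s(instr, pointer_value):
    """Return instr with its S field replaced by the low 9 bits of pointer_value."""
    return (instr & ~FIELD_MASK & MASK32) | (pointer_value & FIELD_MASK)

def alt_d(instr, pointer_value):
    """Return instr with its D field replaced by the low 9 bits of pointer_value."""
    cleared = instr & ~(FIELD_MASK << D_SHIFT) & MASK32
    return cleared | ((pointer_value & FIELD_MASK) << D_SHIFT)

OUTA = 0x1F4                                # OUTA register address on the P1
UPPER_BITS = 0xA0BC0000                     # opcode/flag/condition bits; exact value is irrelevant here
mov_outa_0 = UPPER_BITS | (OUTA << D_SHIFT) | 0     # stands in for: MOV OUTA,0

ptr_value = 0x123                           # contents of the register named ptr
patched = alt_s(mov_outa_0, ptr_value)      # stands in for: ALTS ptr / MOV OUTA,0
print(hex(patched & FIELD_MASK), hex((patched >> D_SHIFT) & FIELD_MASK))
```

Only the 9-bit field changes; the opcode, flags, condition and the other operand field pass through untouched, which is what makes the prefix approach so simple.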

David Betz · 2014-04-13 08:54

Bill Henning wrote: »

We need INDA/INDB for efficient assembly code; it does not matter if gcc needs them.

If we did not put in things pasm needed, we could not write high efficiency cog drivers, so gcc will still benefit from them indirectly.

This may be true. I'm not arguing against it. I'm just saying that if INDA and INDB can't be implemented efficiently then there is no need to also throw out hub execution as well.

David Betz · 2014-04-13 08:56

Bill Henning wrote: »

David,

I am confused.

- Why would we lose hubexec due to self modifying cog code?

The self modifying code, and INDx, will be used by pure pasm code, so it has nothing whatsoever to do with hubexec.

I think Chip's original message about this implied that they were all tied up together. I'm just saying that INDA and INDB don't have to be present for hub execution to be useful. They may be useful for other things but they aren't required for hub execution. We can do without self-modifying code.

Bill Henning · 2014-04-13 08:56

There we agree. Sorry, as I stated above my initial posts were without seeing two pages of posts - including the issues Chip is having with INDA/INDB.

His proposed ALTS/ALTD are a good solution if they can do pre-increment/decrement and post-increment/decrement.

Mind you, after Chip sleeps, he may come up with a brilliant way of doing INDA/INDB... things like that have happened often!

And overall, hubexec is extremely important.

David Betz wrote: »

This may be true. I'm not arguing against it. I'm just saying that if INDA and INDB can't be implemented efficiently then there is no need to also throw out hub execution as well.

pjv · 2014-04-13 09:43

David Betz wrote: »

Without INDA and INDB we are in no worse shape than we were in with P1 which also had no indirect COG memory instructions.

But David, for writing fast and tight assembly code, this is a horrible restriction in the P1. While it can be worked around with self-modifying code, it takes a lot of extra cycles to accomplish this. Even the single indirect register in the SX made fast execution a breeze. Every day I lament the fact that the P1 does not have this capability!

I'm really looking forward to being released from that straitjacket by the new chip.

Cheers,

Peter (pjv)

David Betz · 2014-04-13 09:49

pjv wrote: »

But David, for writing fast and tight assembly code, this is a horrible restriction in the P1. While it can be worked around with self-modifying code, it takes a lot of extra cycles to accomplish this. Even the single indirect register in the SX made fast execution a breeze. Every day I lament the fact that the P1 does not have this capability!

I'm really looking forward to being released from that straitjacket by the new chip.

Cheers,

Peter (pjv)

Yes, it will be nice if it can happen. Anything to eliminate the need for self-modifying instructions wins my vote! :-)

Roy Eltham · 2014-04-13 10:15

Guys, you gotta make sure to read all the posts before replying!

Chip posted the issue, and in a matter of minutes (with a flurry of posts) we had an alternate solution for indirection and convinced Chip that hubexec didn't need to die because of inda/indb issues.

Bill,
I like your idea to add the inc/dec stuff to the ALTxx instructions. Would be nice if it's possible, but if not we can still accomplish it with extra instructions, of course.

David Betz · 2014-04-13 10:22

Roy Eltham wrote: »

Guys, you gotta make sure to read all the posts before replying!

I agree in principle but I find it almost impossible to keep up with this thread. Is there an easy way to "show me all of the new posts by cgracey"? :-)

potatohead · 2014-04-13 10:26

http://forums.parallax.com/search.php?do=finduser&userid=41126&contenttype=vBForum_Post&showposts=1

Yes there is.

jazzed · 2014-04-13 10:37

pjv wrote: »

I'm really looking forward to being released from that straitjacket by the new chip.

LOL. Me too with that and in so many other ways as well over the years.

David Betz · 2014-04-13 10:37

potatohead wrote: »

http://forums.parallax.com/search.php?do=finduser&userid=41126&contenttype=vBForum_Post&showposts=1

Yes there is.

Thanks! That's very helpful for someone who doesn't have enough time to follow this discussion in all of its gory detail.

4x5n · 2014-04-13 12:11

RossH wrote: »

By all means do what you can - I'm sure it will get used one way or another, most likely in ways we don't currently anticipate.

But as you say, LMM is already here, and is essentially "free" - so don't get yourself bogged down adding stuff that we just don't need.

Ross.

Agreed, as simple as possible and no simpler!

jmg · 2014-04-13 13:02

Bill Henning wrote: »

I REALLY like this.

Keeps us to two cycles for all the instructions, four if we need one prefix.

Can we get (for low cost) the following for D/# in both ALTD and ALTS:

++D (applied before D is used)
--D

D++ (done after instruction referencing D executes)
D--

+# (relative to contents of S or D specified in instruction)
-#

Example 1:

Because then we get

ALTS ptr++
RDBYTEC ch,ptr

More nicely written as

RDBYTE ch,[ptr]++ ' walk a buffer

Slower than PTRA/B, but we have those where we need more speed.

Example 2:

ALTD buffer++
ALTS ptr++
RDLONGC buffer,ptr

Nicely written as:

RDLONG [buffer]++, [ptr]++

Ok, looks great!

Yes, the auto INC/DEC are obvious extensions that I'm sure Chip will look at once he has the in-line self-modify working.
Cycle- and port-wise, there seems to be room to do this.

Phil Pilgrim (PhiPi) · 2014-04-13 14:24

There was some talk about repurposing the wr flag, since "it wasn't very useful." So I went through all my code to see how I had used wr and nr. Here's what I found:

              cmps      dx,px wc,wr                   
              cmps      dy,py wc,wr                   
              cmps      dx,px wc,wr                   
              cmps      dy,py wc,wr                   
divabx        cmpsub    ra,rx wc,nr                                                       
              cmps      manA,#0 wc, nr                             
:zeroSubnormal or       manA,expA wz,nr                        
              cmps      manA,#0 wc, nr                             
:build_lp     or        :csdp,:csdn nr,wz                            
:squ          shl       phase,#1 wc,nr                                               
        if_c  add       btn_shift,#1 wz,nr                                   
              or        right_count,left_count nr,wz                            
              shl       acc,#1 nr,wc
              shl       outL,#1 wc,nr                                            
              shl       phase1,#1 wc,nr                                 
              shl       phase2,#1 wc,nr                               
        if_c  add       btn_shift,#1 wz,nr                                   
:coord_ok     or        right_count,left_count nr,wz

(Not shown are the instances where I could have used another instruction without the modifier.)

Although I didn't use the modifiers very often, they were definitely useful when I needed them.

-Phil
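
For readers unfamiliar with the nr modifier in the listing above, a minimal Python sketch of the `shl ...,#1 wc,nr` idiom follows. It models the P1 behaviour where SHL puts the original bit 31 into C and nr suppresses the result write; the `shl` function, its keyword arguments and the dictionary register file are hypothetical stand-ins for illustration.

```python
MASK32 = 0xFFFFFFFF

def shl(regs, d, count, wc=False, nr=False):
    """Shift regs[d] left; return the new C flag if wc is set, else None."""
    value = regs[d]
    carry = (value >> 31) & 1              # P1 SHL reports the original bit 31 in C
    if not nr:                             # nr: compute flags but do not write the result
        regs[d] = (value << (count & 31)) & MASK32
    return carry if wc else None

regs = {"phase": 0x80000001}
c = shl(regs, "phase", 1, wc=True, nr=True)    # like: shl phase,#1 wc,nr
print(c, hex(regs["phase"]))
```

With nr the instruction acts as a non-destructive test of the top bit, which is how several of the listed lines use it.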
