The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

evanh · 2014-04-20 02:19

Cluso99 wrote: »

Wouldn't you find a larger LIFO more beneficial???

That's one that needs trimmed. :P

Question: What does CMPR do? I'm keen on a new compare that can perform a Carry = (D - S < #0) as per my post - http://forums.parallax.com/showthread.php/155298-FullDuplexSerial-for-P1/page2

Roy Eltham · 2014-04-20 02:21

RossH,
Stop trying to spew FUD here too.

Coley · 2014-04-20 02:58

Roy Eltham wrote: »

RossH,
Stop trying to spew FUD here too.

Roy, you seem a bit tetchy today.

Like Ross, Cluso99 and all the others we are entitled to our opinions, just as you are.

Like it or not, the fact is that the recent additions to the P2 have introduced additional delays and have led us to this point.
This may ultimately turn out to be the best thing that could have happened, only time will tell, but please stop telling people they can't express their opinion.

Regards,

Coley

Roy Eltham · 2014-04-20 03:06

Coley,
I will tell him to stop expressing his opinion when it's FUD and when it's blaming other forum members for P2 problems. Go read the other thread where he does that!

As for the current form of the chip, I think it's better than it was before, and Chip has a better handle on the power issues now, so he'll reach a better result.

Cluso99 · 2014-04-20 03:47

Roy Eltham wrote: »

Coley,
I will tell him to stop expressing his opinion when it's FUD and when it's blaming other forum members for P2 problems. Go read the other thread where he does that!

As for the current form of the chip, I think it's better than it was before, and Chip has a better handle on the power issues now, so he'll reach a better result.

I respect your opinion Roy. But you have not given it. You just got on your horse instead. That doesn't help anyone, does it?

I was one of those who originally asked for SETCOND, and others supported it. However, in hindsight, and given where we are now heading, I can do this other ways by using more instructions. It is not something I would use regularly, and with hubexec (which cannot use it anyway), I see its benefit as being less than when I proposed it.

I noticed Chip has just added GETS, GETD, GETI and GETCOND. These are brand new, and again, I cannot see great use for these. Maybe Chip has a valid reason. In the meantime, I am asking whether anyone else sees a valid use for these.

evanh:
I have used the LIFOs and found them extremely useful and simple. I used them both in hubexec and cog mode (in the previous P2 FPGA). But 4 deep is really too small to be of much use.

Each level takes about 19 flops (from Chip's info a while back). While I would hate to lose them, 4 levels can only be used in tight subroutines, due to the likelihood of crashing otherwise.

CMPR is D=S-D, Z=(result==0), C=unsigned borrow. It is a SUB in reverse and NR (the result is not written back).
CMP is D=D-S, Z=(result==0), C=unsigned borrow. It is a SUB with NR.
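
To make the two definitions above concrete, here is a minimal Python sketch of the flag results as described in this post. The function names and the 32-bit masking are illustrative assumptions, not the actual P1+ logic, and the sketch only models the Z and C outputs (neither instruction writes D).

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit registers, as on the Propeller

def cmp_flags(d, s):
    """Model CMP as described above: compute D - S, return (Z, C)."""
    d &= MASK32
    s &= MASK32
    result = (d - s) & MASK32
    z = result == 0        # Z is set when the result is zero
    c = d < s              # C is the unsigned borrow out of D - S
    return z, c

def cmpr_flags(d, s):
    """Model CMPR as described above: the same compare with operands swapped (S - D)."""
    return cmp_flags(s, d)
```

In this model the two instructions return opposite C values whenever D and S differ; when D equals S both return Z set and C clear.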

ozpropdev · 2014-04-20 03:51

evanh wrote: »

What does CMPR do?

CMPR works in reverse to CMP.
CMP uses D-S
CMPR uses S-D

same for SUBR.

Edit: Cluso beat me to it!

evanh · 2014-04-20 03:59

Cluso99 wrote: »

evanh:
I have used the LIFOs and found them extremely useful and simple.

I was being cheeky. I thought you may have followed my recent postings on the subject. A few of us have noted it seems easy for the hardware stack to be re-purposed to use CogRAM in place of a LIFO without any penalty. Particularly now that there are 4 stages.

CMPR is D=S-D, Z=(result==0), C=unsigned borrow. It is a SUB in reverse and NR.

Oh, is that all. All that does is change the polarity of the carry. D-S<#0 has the nice advantage of ignoring any absolute rollovers.

Cluso99 · 2014-04-20 04:22

evanh wrote: »

I was being cheeky. I thought you may have followed my recent postings on the subject. A few of us have noted it seems easy for the hardware stack to be re-purposed to use CogRAM in place of a LIFO without any penalty. Particularly now that there are 4 stages.

OK. Yes, it would be nice if Chip could do this. But it may add quite a bit of silicon???

Oh, is that all. All that does is change the polarity of the carry. D-S<#0 has the nice advantage of ignoring any absolute rollovers.

I actually found it made my code easier when comparing using CMPR rather than CMP. If I had a choice of only 1 I would choose CMPR except for the backward compatibility.
Maybe it is just mindset, but I seem to recall it worked better for decoding a lot of values/ranges.

evanh · 2014-04-20 04:35

Cluso99 wrote: »

OK. Yes, it would be nice if Chip could do this. But it may add quite a bit of silicon???

I'd be surprised if it did increase the footprint. The extra mux'ing or whatever should be offset by the elimination of the LIFO.

The only question mark I heard of is the possibility of fetch timings causing an extension of critical path. This is what killed the INDA feature, but then that might have been using a CogRAM general register as the indexing register which makes a difference to the timing.

Cluso, looking back, it was you that got me thinking about this you know! - http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1261140&viewfull=1#post1261140

RossH · 2014-04-20 04:35

Roy Eltham wrote: »

Coley,
I will tell him to stop expressing his opinion when it's FUD and when it's blaming other forum members for P2 problems. Go read the other thread where he does that!

As for the current form of the chip, I think it's better than it was before, and Chip has a better handle on the power issues now, so he'll reach a better result.

FUD? No, just a slice of reality.

Ross.

Cluso99 · 2014-04-20 04:50

evanh wrote: »

I'd be surprised if it did increase the footprint. The extra mux'ing or whatever should be offset by the elimination of the LIFO.

The only question mark I heard of is the possibility of fetch timings causing an extension of critical path. This is what killed the INDA feature, but then that might have been using a CogRAM general register as the indexing register which makes a difference to the timing.

Cluso, looking back, it was you that got me thinking about this you know! - http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1261140&viewfull=1#post1261140

Yes. Sometimes hindsight is wonderful. I really would like one alternative, either deeper LIFO, or in-cog LIFO/FIFO/stack.
The last two timing diagrams reinforced where these mux delays are adding up. I am still hopeful of there being a solution though.
I would just like one of them, as a hub based stack is very wasteful of hub slots, and should be avoided where possible.

If not, then I posted a possible sw solution and maybe Chip can provide an instruction or two to support that method and reduce the instruction count to do it.

Heater. · 2014-04-20 05:13

Kye,

Interesting post. You know, it would be possible nowadays to make a processor that could execute tens of SUBLEQ instructions in one clock at something like 1 GHz. It would be funny to see how fast such a simple processor could execute code.

As it happens Oleg Mazonka and Alex Kolodin built an FPGA with 28 subleq machines in it in 2011. They claimed:

"Our test results demonstrate that computational power of our Subleq OISC multi-processor is comparable
to that of CPU of a modern personal computer."
http://arxiv.org/pdf/1106.2593.pdf

I have often wondered how this would go if built into a real chip. The small simple SUBLEQ can probably be clocked very fast. Problem is you need a lot of processors to make up for the inefficiency of the instruction set. Which brings you to the shared RAM bandwidth problem.

Expect a SUBLEQ VM for the Propeller II at some point.

ctwardell · 2014-04-20 06:58

Cluso99 wrote: »

I have been looking at the new proposed instruction set.

Some of these instructions seem to be a carry over from the P2 and do not seem to be so relevant now that we will have hubexec. Some of them cater more for cog execution where we did tricks because of the cog space restriction. Some just seem to be there because they are reverse equivalents.

You may wonder why I am asking this. Well, everything uses silicon. If some of these instructions are unnecessary, maybe we can save enough silicon to add something more beneficial, such as a deeper LIFO.

Are these instructions now really necessary ???
SETCOND, GETCOND, GETI, GETD, GETS

Are these necessary (maybe they are for video modes???) ???
RORNIBn, RORBYTn, RORWRDn, ROLNIBn, ROLBYTn, ROLWRDn

I presume these are necessary, but worthy for comment...
INCMOD, DECMOD, ESWAP4, ESWAP8, SPLITW, MERGEW, TOPBIT, DECOD, PICKZC

PICKZC - helps restore Z & C flags quickly from any bit pair. Extremely useful IMHO. Also permits 4 state decoding from 2 bits simply.
ESWAP8 - useful for reordering bytes in a long (big vs little endian)

Do we need this many compare variations (note we no longer have NR option on ADDnn/SUBnn) ???
TESTB, TESTN, TEST, CMP, CMPX, CMPS, CMPSX, CMPR, CMPSUB

TEST, TESTN - both extremely useful.
CMP - Obviously required IMHO.
CMPR - I have found this extremely useful.

A TARG instruction may avoid some of these compare variations?
TARG was an instruction Chip used in the P2 to redirect the result of the following instruction to another location (does not overwrite D). By selecting a read-only register, NR could be simulated (using the additional instruction). But the real benefit of TARG was the ability of directing the result to a different register, thereby making a simple C = D xxx S (does not overwrite D).

Be careful with that axe, Eugene.

We went through the phase of everybody wanting to put their mark on the P2 by adding something, now it has become wanting to do it by suggesting a cut.

I completely disagree with considering the existence of HUBEXEC as a reason to neuter the capabilities of COG execution. It is looking like at best HUBEXEC will run at half the speed of COG code and that will be for straight line code that isn't hitting the hub for data.

When considering removing an instruction because it could be replaced with several other instructions we need to consider how it is normally used and if it has side effects.

For example:

Is it typically used within tight loops where the increased time is an issue?

Is it used frequently where the added space is an issue?

Does the replacement sequence alter the flags and require them to be saved and restored?

C.W.

Brian Fairchild · 2014-04-20 10:41

I think jmg might have a deeper insight into how the P1+ is structured, but it's not always true to say that more opcodes = more gates = more power. Once you have a certain collection of different stages, like registers, adders, shifters etc., connected together with things like multiplexers, you can often create new opcodes just by connecting them together in a different way. So the total number of gates stays the same, they just get used differently.

Where power consumption goes up is where you need to add in whole extra sections of latches and multiplexers to re-time and reroute internal signals, especially if those gates need to be clocked at high frequencies.

As an example, a ripple counter, where each section counts at half the rate of the previous section, is quite efficient for power but bad for speed. A synchronous counter, where every flip-flop is always being clocked at full speed, consumes more power but can reach higher speeds.
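
As a rough illustration of the counter example above, the following Python sketch counts how many flip-flop clock events each style of counter sees. It is only a crude proxy for dynamic power under stated assumptions: an up-counter starting at zero, each ripple stage clocked by the falling output of the previous stage, and every synchronous flop clocked on every input clock. The function name is hypothetical.

```python
def clock_events(bits, cycles):
    """Count flip-flop clock events for a `bits`-wide up-counter over `cycles` input clocks.

    Returns (ripple_events, synchronous_events).
    """
    ripple_events = 0
    value = 0
    for _ in range(cycles):
        nxt = (value + 1) % (1 << bits)
        ripple_events += 1                    # stage 0 sees every input clock
        for stage in range(1, bits):
            prev_before = (value >> (stage - 1)) & 1
            prev_after = (nxt >> (stage - 1)) & 1
            if prev_before == 1 and prev_after == 0:
                ripple_events += 1            # clocked by the previous stage falling
        value = nxt
    synchronous_events = bits * cycles        # every flop is clocked on every input clock
    return ripple_events, synchronous_events

print(clock_events(4, 16))   # (30, 64)
```

Over one full count of a 4-bit counter the ripple version sees 30 clock events against 64 for the synchronous one, which is the trade-off described above: fewer clocked flops, but each stage has to wait for the one before it.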

Seairth · 2014-04-20 11:09

Drawing from discussion about FullDuplexSerial, I had a thought about how WAITCNT could work a little differently. Suppose that each cog maintained two additional registers:

cnt_trigger_value (32-bit)
cnt_trigger_flag (1-bit)

Along with this, there would be two instruction changes:

A new instruction: NEXTCNT D, #n/S. This instruction will set cnt_trigger_value to D+#n/S and clear cnt_trigger_flag.
WAITCNT will become zero-operand. This instruction will stall until cnt_trigger_flag is 1. Additionally, if WC is set, it will return immediately with C=cnt_trigger_flag.

Internally, the code to update looks something like:

always_ff @(posedge clk)
begin
    if (cnt_trigger_value == cnt)
        cnt_trigger_flag <= 1;
end

With this, you can take the typical "polled wait":

                mov     _cnt,cnt                          
                add     _cnt,delay
				
:wait           mov     temp,_cnt
                sub     temp,cnt
                cmps    temp,#0           wc
        if_nc   jmp     #:wait

And turn it into:

                nextcnt cnt, delay
				
:wait           waitcnt                   wc
        if_nc   jmp #:wait

And, of course, simple stalled waits:

                waitcnt cnt, delay

Becomes:

                nextcnt cnt, delay
                waitcnt

You get the idea. Here are a few additional thoughts:

The original WAITCNT D, #n/S could still be supported, where it basically combines the NEXTCNT and zero-operand WAITCNT.
It would be possible to maintain multiple triggers/flags. This might allow a single cog to maintain multiple tasks with separate timing requirements.
There could also be a JMPCNT and JMPNCNT instruction that would branch based on the value of cnt_trigger_flag. This would make for even tighter polled waits.
And, this approach should work well with hardware multitasking, if it is added later on.
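
The proposal above can be sketched as a small behavioural model in Python. This is only a sketch under assumptions: the class and method names are hypothetical, CNT is modelled as a 32-bit counter that advances once per tick, and the equality compare is assumed to sample CNT before it increments.

```python
MASK32 = 0xFFFFFFFF

class CntTrigger:
    """Hypothetical model of one cog's cnt_trigger_value / cnt_trigger_flag pair."""

    def __init__(self):
        self.cnt = 0
        self.trigger_value = 0
        self.trigger_flag = False

    def nextcnt(self, d, s):
        """NEXTCNT D, S: arm the trigger at D + S and clear the flag."""
        self.trigger_value = (d + s) & MASK32
        self.trigger_flag = False

    def tick(self):
        """One system clock: do the equality compare, then advance CNT."""
        if self.cnt == self.trigger_value:
            self.trigger_flag = True
        self.cnt = (self.cnt + 1) & MASK32

    def waitcnt_wc(self):
        """WAITCNT with WC: return C = cnt_trigger_flag without stalling."""
        return self.trigger_flag

    def waitcnt(self):
        """Zero-operand WAITCNT: stall until the flag is set; return clocks stalled."""
        stalled = 0
        while not self.trigger_flag:
            self.tick()
            stalled += 1
        return stalled
```

Because the compare is a plain equality, the model works unchanged when D + S wraps past $FFFFFFFF. The flip side, in this model at least, is that arming the trigger for a value CNT has already passed stalls for nearly a full 2^32 wrap.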

jmg · 2014-04-20 13:41

Seairth wrote: »

Drawing from discussion about FullDuplexSerial, I had a thought about how WAITCNT could work a little differently. Suppose that each cog maintained two additional registers:
cnt_trigger_value (32-bit)

cnt_trigger_flag (1-bit)

Yes, I agree a more flexible poll-able form is very useful.

We are still waiting on Counter MHz values which will indicate the adder-delay costs, but this idea of adding a way to use the faster logic of equals is an alternative I had considered.

Advantages of a Counter Waypoint polling opcode, as already mentioned, are:

It also can easily support Power-Gating, i.e. clock gating of the whole COG to significantly lower power.
It is compatible with SW task switching
It is also compatible with HW Task switching, should that make the cut.

However, I think a HW cell (as above) has issues with multiple way-point compares, as SW tasking may be looping many such way-point checks ?

I would implement the details slightly differently, so that WAITCNT is still supported in a backward-compatible manner.
The key item here is adding a sibling opcode that can check for a CNT way-point without stalling.

Pseudo code, for UNTILCNT opcode :

        REPEAT                    ' repeat while (CNT not yet at D)
          .. other code
          jmpsw   rxcode,txcode   ' optional TaskSw
        UNTILCNT  rxcnt           ' RELJMP: repeat until (CNT passes D)
        add     rxcnt,bitticks    ' Set next way-point

Pass test is like (CNT >= WP) & (CNT[MSB] == WP[MSB])

This is more granular than WAITCNT, but more flexible.

Seairth · 2014-04-20 20:46

jmg wrote: »

Yes, I agree a more flexible poll-able form is very useful.

We are still waiting on Counter MHz values which will indicate the adder-delay costs, but this idea of adding a way to use the faster logic of equals is an alternative I had considered.

Advantages of a Counter Waypoint polling opcode, as already mentioned, are:

It also can easily support Power-Gating, i.e. clock gating of the whole COG to significantly lower power.

It is compatible with SW task switching

It is also compatible with HW Task switching, should that make the cut.

However, I think a HW cell (as above) has issues with multiple way-point compares, as SW tasking may be looping many such way-point checks ?

Do you mean that multiple independent tasks could be attempting their own NEXTCNT/WAITCNT at the same time? If so, that's a good point. I can certainly see how that would be a problem. In fact, I can see how FDS would have this very issue.

One of my suggestions was to support multiple such cells, which would potentially allow a limited number of tasks to use their own. Of course, in software-only tasking, this would require a bit of coordination on the programmer's part. With hardware multitasking, you could probably have one per hardware task. On the other hand, the current WAITCNT can only be used by one task at a time anyhow. So, if you were to have just one of these cells, you would still have the same limitation.

(Chip: does the P1+ support the WC polled version of WAITCNT?)

jmg wrote: »

I would implement the details slightly differently, so that WAITCNT is still supported in a backward-compatible manner.
The key item here is adding a sibling opcode that can check for a CNT way-point without stalling.

Pseudo code, for UNTILCNT opcode :

        REPEAT                    ' repeat while (CNT not yet at D)
          .. other code
          jmpsw   rxcode,txcode   ' optional TaskSw
        UNTILCNT  rxcnt           ' RELJMP: repeat until (CNT passes D)
        add     rxcnt,bitticks    ' Set next way-point

Pass test is like (CNT >= WP) & (CNT[MSB] == WP[MSB])

This is more granular than WAITCNT, but more flexible.

Wouldn't that test potentially fail if WP were very near $FFFFFFFF and cnt wrapped around to $00000000?

I agree that there should still be a compatible form of WAITCNT. So maybe you could have WAITCNT (which blocks, using operands, exactly as it does now) and TESTCNT (which works with NEXTCNT, never blocks, and sets C as above).

And I really like the JMPCNT D (which would also work with NEXTCNT, conditional on C). With hardware multitasking, JMPCNT could effectively be used as the now-defunct (for the current version of P1+, at least) PASSCNT.

Hmm... this obviously requires some more thought.

jmg · 2014-04-20 21:32

Seairth wrote: »

Wouldn't that test potentially fail if WP were very near $FFFFFFFF and cnt wrapped around to $00000000?

Hmm, I thought I had covered all wrap cases, but you are right, a sluggish task could return with a Ceiling WP and a CNT that had already wrapped.

I think this works, over a 2^31 range/reach from CNT to next WP.
(WP-CNT)[MSB]
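
A minimal Python sketch comparing the two pass tests discussed in this exchange. The function names are hypothetical, and reading (WP-CNT)[MSB] = 1 as "CNT has passed WP" is an assumption about the intended polarity.

```python
MASK32 = 0xFFFFFFFF
MSB = 0x80000000

def passed_first_attempt(cnt, wp):
    """The earlier test: (CNT >= WP) & (CNT[MSB] == WP[MSB])."""
    return cnt >= wp and (cnt & MSB) == (wp & MSB)

def passed_by_difference(cnt, wp):
    """The revised test: bit 31 of the 32-bit difference WP - CNT.

    Assumed to mean CNT has moved past WP, valid while the two are
    within 2^31 counts of each other.
    """
    return bool(((wp - cnt) & MASK32) & MSB)

# A sluggish task returns with WP near the ceiling after CNT has wrapped.
wp, cnt = 0xFFFFFFFE, 0x00000002
print(passed_first_attempt(cnt, wp))   # False: the way-point is missed
print(passed_by_difference(cnt, wp))   # True: the wrap is handled
```

Taking the sign of the 32-bit difference is the same idea as the Carry = (D - S < #0) compare raised earlier in the thread: only the distance between the two values matters, so absolute rollover of CNT drops out.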

evanh · 2014-04-20 23:20

jmg wrote: »

(WP-CNT)[MSB]

http://forums.parallax.com/showthread.php/155298-FullDuplexSerial-for-P1/page2

cgracey · 2014-04-22 20:12

I've got the ALU done. It exists as six separately-clocked sections, so that only the one that is needed (if any) gets clocked and undergoes state changes (this cuts power consumption). The whole ALU is currently compiling at 140MHz and it uses 1800 LE's. There is some fanout coming and another mux delay for the data-forwarding circuits, which will drop the speed a little. Because the ALU gets two clocks and it is, by far, the longest path, we should get the Cyclone IV FPGAs (DE0-Nano and DE2-115) to clock at a Quartus-approved 200+ MHz. And you can always overclock by 25%, in my experience.
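
A rough worked check of the timing figures above, as a Python sketch. It assumes the ALU is treated as a two-clock path and ignores register setup, fanout and the extra mux delay mentioned in the post. Only the 140 MHz and 200 MHz figures come from the post; the function name and the derived numbers are illustrative.

```python
def two_clock_budget(alu_fmax_mhz, sysclk_mhz):
    """Return (alu_delay_ns, budget_ns, slack_ns) for an ALU given two system clocks."""
    alu_delay_ns = 1000.0 / alu_fmax_mhz      # delay implied by the single-cycle Fmax
    budget_ns = 2 * 1000.0 / sysclk_mhz       # two system clock periods
    return alu_delay_ns, budget_ns, budget_ns - alu_delay_ns

delay, budget, slack = two_clock_budget(140, 200)
print(round(delay, 2), budget, round(slack, 2))   # 7.14 10.0 2.86
```

Under these assumptions a 140 MHz ALU path has a little under 3 ns of slack at a 200 MHz system clock, which is the headroom available to absorb the fanout and data-forwarding delays still to come.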

potatohead · 2014-04-22 20:21

Cool! Thanks for giving us the update Chip.

BTW: How do you test at this stage?? If it's a long answer, don't, but I'm just wondering how it comes together.

jmg · 2014-04-22 20:24

cgracey wrote: »

I've got the ALU done. It exists as six separately-clocked sections, so that only the one that is needed (if any) gets clocked and undergoes state changes (this cuts power consumption). The whole ALU is currently compiling at 140MHz and it uses 1800 LE's. There is some fanout coming and another mux delay for the data-forwarding circuits, which will drop the speed a little. Because the ALU gets two clocks and it is, by far, the longest path, we should get the Cyclone IV FPGAs (DE0-Nano and DE2-115) to clock at a Quartus-approved 200+ MHz. And you can always overclock by 25%, in my experience.

Do you have a list of supported Operations and Cycles for the ALU Block ?
How does that 1800 LEs compare with a (current) COG ?

cgracey · 2014-04-22 20:35

potatohead wrote: »

Cool! Thanks for giving us the update Chip.

BTW: How do you test at this stage?? If it's a long answer, don't, but I'm just wondering how it comes together.

You just put some flops after the ALU, so that it has some register-to-register paths, and then tell it to compile for some impossibly-high speed.

cgracey · 2014-04-22 20:37

jmg wrote: »

Do you have a list of supported Operations and Cycles for the ALU Block ?
How does that 1800 LEs compare with a (current) COG ?

The entire Prop1 cog was 1850 LE's, while the ALU is always the biggest chunk. The Prop2 cog was ~30,000 LE's.

I'll post the latest instruction set in a bit. All ALU operations take two clocks.

Bill Henning · 2014-04-22 20:51

Very cool!

cgracey wrote: »

I've got the ALU done. It exists as six separately-clocked sections, so that only the one that is needed (if any) gets clocked and undergoes state changes (this cuts power consumption). The whole ALU is currently compiling at 140MHz and it uses 1800 LE's. There is some fanout coming and another mux delay for the data-forwarding circuits, which will drop the speed a little. Because the ALU gets two clocks and it is, by far, the longest path, we should get the Cyclone IV FPGAs (DE0-Nano and DE2-115) to clock at a Quartus-approved 200+ MHz. And you can always overclock by 25%, in my experience.

jmg · 2014-04-22 20:57

cgracey wrote: »

The entire Prop1 cog was 1850 LE's, while the ALU is always the biggest chunk. The Prop2 cog was ~30,000 LE's.

What about the current P1+ COG size ? (with or without the Counter/adder)

2 Clocks sounds fast, is there still 32x32-> 64 and 64/32->32 (rem32) ALU support ?

If the ALU takes 2 clocks and is shared, how does that interact with those COGs running odd-SysClk aligned, vs those even-SysClk aligned ?

cgracey · 2014-04-22 21:05

jmg wrote: »

What about the current P1+ COG size ? (with or without the Counter/adder)

2 Clocks sounds fast, is there still 32x32-> 64 and 64/32->32 (rem32) ALU support ?

If the ALU takes 2 clocks and is shared, how does that interact with those COGs running odd-SysClk aligned, vs those even-SysClk aligned ?

Each cog has its own ALU, so they are not shared. There will be shared pipelined mul/div/CORDIC in the hub, though.

T Chap · 2014-04-22 21:08

I am curious: since the ALU is slower than the cog, does the cog simply wait for a reply from the ALU and then continue?

jmg · 2014-04-22 21:21

cgracey wrote: »

Each cog has its own ALU, so they are not shared. There will be shared pipelined mul/div/CORDIC in the hub, though.

Ahh.. oops, when you said 'the ALU', I took the Arithmetic to mean the single shared Arithmetic Block in the Hub.

So the COG ALU does MUL, MULS and Addition and Subtraction opcodes ?

Roy Eltham · 2014-04-22 21:25

Sounds excellent Chip! Looking forward to this!
