The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

Seairth · 2014-04-08 06:43

Heater. wrote: »
Seairth,

Your analysis of FullDuplexSerial needs a closer look:

Below is the receive loop. It consists of 9 instructions.

That means more than 10% of the code space (and time) is wasted on the yield (jmpret).

If we home in on the inner loop, that is only 5 instructions, so we could say 20% of the time is wasted on the yield.

Moreover there are 4 instructions used for what would normally be a single WAITPxx instruction if this were a thread.

So we are looking at about 50% wasted time. And no hope of improving edge timing and sampling jitter. Looks like hardware scheduled threads would help a lot here.
:bit                    add     rxcnt,bitticks        'ready next bit period


:wait                   jmpret  rxcode,txcode         'run a chunk of transmit code, then return


                        mov     t1,rxcnt              'check if bit receive period done
                        sub     t1,cnt
                        cmps    t1,#0           wc
        if_nc           jmp     #:wait


                        test    rxmask,ina      wc    'receive bit on rx pin
                        rcr     rxdata,#1
                        djnz    rxbits,#:bit
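
For reference, the percentages in the quoted analysis can be reproduced by counting instructions in the loop above. This is only a rough sketch: it assumes every instruction costs the same time, and the two "waste" figures are two possible readings of "about 50%", since the quote does not spell out its accounting.

```python
# Instruction counts read off the quoted receive loop.
LOOP_INSTRUCTIONS = 9    # add, jmpret, mov, sub, cmps, jmp, test, rcr, djnz
INNER_INSTRUCTIONS = 5   # jmpret, mov, sub, cmps, jmp (the :wait loop)
YIELD_INSTRUCTIONS = 1   # the jmpret
POLL_INSTRUCTIONS = 4    # mov, sub, cmps, jmp (stand-in for one wait instruction)
WAIT_EQUIVALENT = 1      # what a single wait instruction would cost


def percent(part, whole):
    """Share of `whole` taken by `part`, in percent."""
    return 100.0 * part / whole


yield_of_loop = percent(YIELD_INSTRUCTIONS, LOOP_INSTRUCTIONS)    # about 11%
yield_of_inner = percent(YIELD_INSTRUCTIONS, INNER_INSTRUCTIONS)  # 20%

# Reading 1: the yield plus the three instructions a single wait would save.
waste_low = percent(YIELD_INSTRUCTIONS + POLL_INSTRUCTIONS - WAIT_EQUIVALENT,
                    LOOP_INSTRUCTIONS)
# Reading 2: the yield plus all four polling instructions.
waste_high = percent(YIELD_INSTRUCTIONS + POLL_INSTRUCTIONS, LOOP_INSTRUCTIONS)

print(f"yield: {yield_of_loop:.1f}% of loop, {yield_of_inner:.1f}% of inner loop")
print(f"waste: between {waste_low:.1f}% and {waste_high:.1f}%")
```

Either reading lands within a few points of the 50% quoted.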

I was not arguing your point about the execution time ("is a dog"), rather Bill's comment about the waste of cog space. My apologies for the lack of clarity. Again, I believe the new pin approach that Chip is discussing will mitigate your concerns (baud rate, jitter, etc) to some extent. No, it's not a perfect solution. No, you won't be able to push the I/O as far as other tasking approaches might allow. But there are always going to be use cases where the hardware doesn't suffice. So, as I said before, I am trying to offer a solution with minimal hardware requirement (and minimal impact on time and risk). I think we can afford to wait until the P2 for full tasking support.

Frankly, I'd take the P1+ with no additional task support if it would mean we get it sooner.

ctwardell · 2014-04-08 06:56

I'm starting to fear that 16 of these beasties will not end up fitting the available space.

Even with the items that have been factored out into common resources, these are going to be way more complex than P1 cogs.

C.W.

Seairth · 2014-04-08 07:03

Heater. wrote: »
Seairth,

Your analysis of FullDuplexSerial needs a closer look:

Below is the receive loop. It consists of 9 instructions.

That means more than 10% of the code space (and time) is wasted on the yield (jmpret).

If we home in on the inner loop, that is only 5 instructions, so we could say 20% of the time is wasted on the yield.

Moreover there are 4 instructions used for what would normally be a single WAITPxx instruction if this were a thread.

So we are looking at about 50% wasted time. And no hope of improving edge timing and sampling jitter. Looks like hardware scheduled threads would help a lot here.
:bit                    add     rxcnt,bitticks        'ready next bit period


:wait                   jmpret  rxcode,txcode         'run a chunk of transmit code, then return


                        mov     t1,rxcnt              'check if bit receive period done
                        sub     t1,cnt
                        cmps    t1,#0           wc
        if_nc           jmp     #:wait


                        test    rxmask,ina      wc    'receive bit on rx pin
                        rcr     rxdata,#1
                        djnz    rxbits,#:bit

Here's another thought as well: add a SWTASK #n/D variant that would be just like the zero-param SWTASK, but store #n/D instead of PC+1. The FDS receive code would then look like:

receive                 test    rxtxmode,#%001  wz    'wait for start bit on rx pin
                        test    rxmask,ina      wc
        if_z_eq_c       swtask  #receive

                        mov     rxbits,#9             'ready to receive byte
                        mov     rxcnt,bitticks
                        shr     rxcnt,#1
                        add     rxcnt,cnt                          

:bit                    add     rxcnt,bitticks        'ready next bit period

:wait                   mov     t1,rxcnt              'check if bit receive period done
                        sub     t1,cnt
                        cmps    t1,#0           wc
        if_nc           swtask  #:wait

                        test    rxmask,ina      wc    'receive bit on rx pin
                        rcr     rxdata,#1
                        djnz    rxbits,#:bit

                        shr     rxdata,#32-9          'justify and trim received byte
                        and     rxdata,#$FF
                        test    rxtxmode,#%001  wz    'if rx inverted, invert byte
        if_nz           xor     rxdata,#$FF

                        rdlong  t2,par                'save received byte and inc head
                        add     t2,rxbuff
                        wrbyte  rxdata,t2
                        sub     t2,rxbuff
                        add     t2,#1
                        and     t2,#$0F
                        wrlong  t2,par

                        swtask  #receive              'byte done, receive next byte

This has zero impact on code space or execution time. No, it doesn't get rid of the jitter issue and may only marginally increase baud rate. Again, that's where the new pin stuff comes in.
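
The difference between the zero-param SWTASK and the proposed SWTASK #n/D can be modelled in a few lines. This is a toy sketch, not a description of any real hardware: the round-robin switching order, the class name, and the addresses are all assumptions made for illustration. The only behaviour taken from the proposal is which resume address gets stored.

```python
class TaskSwitchModel:
    """Toy model of cooperative switching between saved program counters.

    Assumptions (not from the thread): tasks run round-robin and each
    task has exactly one saved resume address.
    """

    def __init__(self, entry_points):
        self.resume = list(entry_points)  # one resume address per task
        self.current = 0                  # index of the running task

    def swtask(self, pc, target=None):
        """Yield from the instruction at `pc`.

        Zero-param form (target is None): save pc + 1 as the resume point.
        Proposed variant: save `target` instead.
        Returns the address where the next task resumes.
        """
        self.resume[self.current] = pc + 1 if target is None else target
        self.current = (self.current + 1) % len(self.resume)
        return self.resume[self.current]


# Hypothetical addresses, chosen only for the example.
RECEIVE, TRANSMIT = 0x10, 0x40

cog = TaskSwitchModel([RECEIVE, TRANSMIT])

# Receive task finds no start bit at 0x12: yield, retry from RECEIVE later.
next_pc = cog.swtask(pc=0x12, target=RECEIVE)

# Transmit task yields with the zero-param form at 0x45: resumes at 0x46.
back_pc = cog.swtask(pc=0x45)

print(hex(next_pc), hex(back_pc), [hex(a) for a in cog.resume])
```

In this model the variant folds "jump back to the top of the wait loop" and "yield" into one operation, which is why the listing above needs no separate jmp after the yield.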

John Abshier · 2014-04-08 07:46

It looks like we are starting to stray (featuritis) from the chip described in post #1. Hopefully we will not go another 9 months and have another stillborn chip.

Ken, time for the wet blanket. I recommend wool.

John Abshier

User Name · 2014-04-08 07:57

cgracey wrote: »

One aspect of having 4-long hub transfers is that with a few simple address tags (TLB, David?) we could direct out-of-cog addresses into 4-long register blocks which are serving as instruction caches. Part of the cog register RAM becomes the cache! No cache-line flipflops and mux's needed!

I like this idea a lot!! Nothing would turn a 200 MHz peregrine falcon into an 80 MHz buzzard faster than throwing a lot of mux's into the critical execution path. At this point I don't much care what this chip has or has not, so long as it doesn't compromise speed for gadgets.

jazzed · 2014-04-08 07:58

John, wool is good. Just hope we don't need PKP foam.

Chip,

Is the full HUB RAM space accessible by one hubexec COG at byte granularity? Saw something about 17 bits and that isn't enough for 512KB a byte at a time.

Bill Henning · 2014-04-08 08:07

Hi Brian!

Thank you for the replay I asked for, NOW I remember

Excellent question.

(the pace of postings these last few days has been overwhelming.... stack overflow...)

From what I have seen so far, a video cog, or a high speed sampling / signal generation cog could make use of every second instruction, which is actually every fourth hub slot.

Which means we could have 3 "fast" cogs, and many "slow" peripheral cogs.

Bandwidth

200MHz / 16 cogs * 4 slots to fast cog = 50M hub slots per second for the fast cog, 50M * 16 bytes = 800MB/sec bandwidth (with 32 / 128 slot "fast" cog)

Hubexec

(assuming executing out of the hub 4-long buffer, same as above)

200MHz / 16 cogs * 4 slots to fast cog... we get prefetch for free! (with 32 / 128 slot "fast" cog)

(as next hub cycle delivers the next 4 longs, all it needs is auto address increment)

100MIPS :-) *for simple instructions, **4x faster than non-cached LMM

You like? I LIKE!
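
The bandwidth arithmetic above can be checked with a short sketch. It assumes one hub slot is handed out per clock in a 16-way rotation and that each slot moves 4 longs (16 bytes), as described in the thread; the constant names are made up for the example.

```python
CLOCK_HZ = 200_000_000     # 200MHz system clock, from the post
COGS = 16                  # slots rotate among 16 cogs
BYTES_PER_SLOT = 4 * 4     # one slot transfers 4 longs of 4 bytes each


def hub_bandwidth(slots_per_rotation):
    """Bytes per second for a cog that owns `slots_per_rotation` of every 16 slots."""
    slots_per_second = CLOCK_HZ // COGS * slots_per_rotation
    return slots_per_second * BYTES_PER_SLOT


fast_cog = hub_bandwidth(4)    # a "fast" cog holding 4 of every 16 slots
plain_cog = hub_bandwidth(1)   # a cog with the default single slot

print(f"fast cog:  {fast_cog / 1e6:.0f} MB/s")
print(f"plain cog: {plain_cog / 1e6:.0f} MB/s")
```

Under these assumptions the fast cog gets 800 MB/s, matching the figure in the post, and a cog with a single slot gets a quarter of that.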

Brian Fairchild wrote: »

My question is now...."if your new proposal was implemented, given the above scenario, what speed increase would we see in the COGs running from HUB which currently run at 23.5MIPS?"

cgracey · 2014-04-08 08:08

jazzed wrote: »

John, wool is good. Just hope we don't need PKP foam.

Chip,

Is the full HUB RAM space accessible by one hubexec COG at byte granularity? Saw something about 17 bits and that isn't enough for 512KB a byte at a time.

17 bits is the PC width, which addresses longs that are instructions. A hub exec program can do RDBYTE/RDBYTEC/WRBYTE using a 19-bit PTRA or PTRB. Because of the way the DCACHE works, only PTRA/B accesses are allowed - no S register. The reason is that I must force the DCACHE address ($1F0) into the S fetch to access the DCACHE quads. In all the code I wrote for the Prop2, only once did I use S for address. The pointers are very easy to load and use.
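
A quick check of the address widths, as a sketch: a 17-bit PC counting longs and a 19-bit pointer counting bytes both span exactly 512KB. The helper name is hypothetical.

```python
HUB_BYTES = 512 * 1024   # hub RAM size from the thread title
PC_BITS = 17             # program counter width, addresses longs
PTR_BITS = 19            # PTRA/PTRB width, addresses bytes
BYTES_PER_LONG = 4

pc_reach = (1 << PC_BITS) * BYTES_PER_LONG   # bytes reachable as instructions
ptr_reach = 1 << PTR_BITS                    # bytes reachable through a pointer


def pc_to_byte_address(pc):
    """Byte address of the long-aligned instruction at long address `pc`."""
    return pc << 2


print(pc_reach == HUB_BYTES, ptr_reach == HUB_BYTES)
print(hex(pc_to_byte_address((1 << PC_BITS) - 1)))  # last instruction slot
```

The two missing bits are simply the byte offset within a long, which is always zero for aligned instructions.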

ctwardell · 2014-04-08 08:30

Hardware threads...

One of the problems with all the add-ons is they seem to get to *almost* something else, so "just one more little thing" and pretty soon we are back at P2.

With the hardware threads it will soon become...

- Can each thread do HUBEXEC?
- Now that they do HUBEXEC can we have preemptive HUBEXEC?
- Can each thread get its own set of pointers?
- How do we divvy up hub access?
- Can each thread...well you get the point.

I don't hate threads, and I do see some nice use cases, so how about coming up with something really simple and closed ended.

May I suggest:

I know we hate modes, but...

Normal Mode: One thread, can use HUBEXEC.
Threaded Mode: Two threads, equal time each, NO HUBEXEC, NO Duplicate Pointers, only the flags and PC unique to each thread, first come, first served on HUB access.

I would rather just leave threading out of this chip, but if it is deemed a requirement please keep it simple and don't let it be a stepping stone to "just one more thing"...

C.W.

jazzed · 2014-04-08 08:34

cgracey wrote: »

17 bits is the PC width, which addresses longs that are instructions. A hub exec program can do RDBYTE/RDBYTEC/WRBYTE using a 19-bit PTRA or PTRB. Because of the way the DCACHE works, only PTRA/B accesses are allowed - no S register. The reason is that I must force the DCACHE address ($1F0) into the S fetch to access the DCACHE quads. In all the code I wrote for the Prop2, only once did I use S for address. The pointers are very easy to load and use.

Great.

Can you post an update of the feature spec (minus instruction lists)?

I know you are trying to exercise restraint and optimize too, but a spec that can be maintained can be followed. The P2 spec just sat there and never got updated.

Please avoid too much threading.

I suppose it's also useful to know what P1 compatibility if any may be disappearing. For example, I noticed HUBOP in the list(s) of unused instructions ... that is a misnomer of course because COGNEW is a HUBOP ;-)

Thanks.

cgracey · 2014-04-08 08:39

jazzed wrote: »

Great.

Can you post an update of the feature spec (minus instruction lists)?

I know you are trying to exercise restraint and optimize too, but a spec that can be maintained can be followed. The P2 spec just sat there and never got updated.

Please avoid too much threading.

I suppose it's also useful to know what P1 compatibility if any may be disappearing. For example, I noticed HUBOP in the list(s) of unused instructions ... that is a misnomer of course because COGNEW is a HUBOP ;-)

Thanks.

HUBOP is becoming a clearinghouse for all functions that have to do with the hub, or possibly even video! I'm trying to focus the cog on being efficient at flow control and computation. The more generic we can make its peripheral interfaces, the simpler and faster it can become. I'm really tired now, as I've been up over 24 hours, so I need to get some sleep.

Dave Hein · 2014-04-08 08:39

It seems my previous post was ignored, so I'll ask again. Is this the planned list of features for P1+?

8-COG → 16-COG P1 Core
4-port → 2-port cog memory
20-bit → 16-bit multiplier
256K → 512K RAM/ROM
32-bit Multiply/Divide Engine in each cog → in the hub
Cordic Engine in each cog → in the hub
PTRA/PTRB
INDA/INDB
256-long CLUT/FIFO
PTRX/PTRY
Data Cache
360+ → 200+ Instructions
256-bit → 128-bit Hub Bus
4 tasks
hubex
4 Instruction Caches
serial I/O
Pre-emptive threads

cgracey · 2014-04-08 08:42

ctwardell wrote: »

Hardware threads...

One of the problems with all the add-ons is they seem to get to *almost* something else, so "just one more little thing" and pretty soon we are back at P2.

With the hardware threads it will soon become...

- Can each thread do HUBEXEC?
- Now that they do HUBEXEC can we have preemptive HUBEXEC?
- Can each thread get its own set of pointers?
- How do we divvy up hub access?
- Can each thread...well you get the point.

I don't hate threads, and I do see some nice use cases, so how about coming up with something really simple and closed ended.

May I suggest:

I know we hate modes, but...

Normal Mode: One thread, can use HUBEXEC.
Threaded Mode: Two threads, equal time each, NO HUBEXEC, NO Duplicate Pointers, only the flags and PC unique to each thread, first come, first served on HUB access.

I would rather just leave threading out of this chip, but if it is deemed a requirement please keep it simple and don't let it be a stepping stone to "just one more thing"...

C.W.

I agree with not making duplicate pointers per task, etc. But to get bare bones multi-tasking, all we need is 1..3 more Z/C/PC's and some mux's for them. It's a big deal for a program like the ROM_Monitor. It wouldn't work in one cog without multi-tasking.

rjo__ · 2014-04-08 08:48

(related to Chip's estimate above) The fact that the new chip will have fabulous cog bandwidth but only 1/8 the effective processing power of the prior design, speaks to the power of the instruction set that is in that prior design. There have been references to design creep. To me it seemed that the driving ethic of the design effort was to optimize performance by replacing common functions with single instructions. Every time you can replace a sequence of common code with a single instruction, the instruction set seems to bloat…but in fact these new instructions function more like macros than the instructions in the original P1, and the impact that they can have on total functionality is obviously huge. In my view, knowing that Chip was looking for places in the design, where this ethic could be applied, there were lots of suggestions about potential opportunities. I hope we can eventually get back to this design ideology and can find a way to better characterize the instruction set… so that the core instructions are set well apart and easily distinguished from these compound instructions. This should be easy to do. The only issue I was concerned about was the complexity of the addressing. I have never been good with this kind of programming and I was really afraid that I might never fully understand it… Of course that wouldn't stop me from using it:)

ctwardell · 2014-04-08 08:52

cgracey wrote: »

I agree with not making duplicate pointers per task, etc. But to get bare bones multi-tasking, all we need is 1..3 more Z/C/PC's and some mux's for them. It's a big deal for a program like the ROM_Monitor. It wouldn't work in one cog without multi-tasking.

My main concern is that it not lead to more feature creep.

C.W.

Bill Henning · 2014-04-08 08:55

excellent suggestion!

jmg wrote: »

i'll relabel this a little with a more descriptive term, that better reflects what actually happens here

video (clut) memory sharing

borrowing a term from the pc world of cheap systems where they have one memory array and code & video share the bus.
Saves the die area of a separate clut, but shares cog ram and slots (50%) to do this.

The video hw shifts each 8b (/4b/2b/1b?) pixel and uses that as the clut index, and sends that 32 bit read, split to the dac or direct to pins.

Bill Henning · 2014-04-08 09:01

Excellent! Saves a ton of hub memory, better for elegant pasm, better for compilers... THANK YOU!

cgracey wrote: »

Yes, but I can't get it from my work computer (that hates the internet) onto my laptop, without my thumb drive that is back in the house.

It needs refining, anyway. It's yet a mess, with opcodes undefined. Just the instructions are there properly.

For what it's worth, I added PTRA and PTRB into it, to facilitate efficient hub access and hub exec. There's also AUGS and AUGD and the immediate 17-bit JMP/CALL/LINK instructions. The hub exec cache is a section of cog ram that is used for hardware registers, so nothing there gets wasted. There's another section up top for DCACHE that is otherwise used for read-only registers like CNT/RND/INA/INB. I figure that for hub exec, there's no benefit in having more than one 4-long cache line, since it will be exhausted after every four instructions, anyway. So, it should run 50 MIPS without branches or hub reads/writes.

Bill Henning · 2014-04-08 09:03

Only instruction addresses are 17 bit long address encoded, saves two bits on encoding, and Px's don't support non-aligned instructions.

jazzed wrote: »

John, wool is good. Just hope we don't need PKP foam.

Chip,

Is the full HUB RAM space accessible by one hubexec COG at byte granularity? Saw something about 17 bits and that isn't enough for 512KB a byte at a time.

cgracey · 2014-04-08 09:05

Dave Hein wrote: »

It seems my previous post was ignored, so I'll ask again. Is this the planned list of features for P1+?

8-COG → 16-COG P1 Core
4-port → 2-port cog memory
20-bit → 16-bit multiplier
256K → 512K RAM/ROM
32-bit Multiply/Divide Engine in each cog → in the hub
Cordic Engine in each cog → in the hub
PTRA/PTRB
INDA/INDB
256-long CLUT/FIFO
PTRX/PTRY
Data Cache
360+ → 200+ Instructions
256-bit → 128-bit Hub Bus
4 tasks
hubex
4 Instruction Caches
serial I/O
Pre-emptive threads

That looks correct, but INDA and INDB will exist. In a non-pipelined architecture like this, they are very simple to do.

I'm counting about 170 instructions now. There is only one instruction cache line, since adding a bunch more would only make loops faster and necessitate a bunch of 15-bit comparators. With the planned setup, hub execution will be half the speed of cog execution. If it turns out we can go 256 bits wide, after all, it will be full speed in a straight line.
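
The "half speed" and "full speed" figures follow from how many instruction longs arrive per hub slot. A minimal sketch, assuming a 200MHz clock, one hub slot per cog every 16 clocks, one instruction per long, and straight-line code with no branches or hub reads/writes (the function name is made up):

```python
CLOCK_HZ = 200_000_000   # assumed system clock
COGS = 16                # one hub slot per cog every 16 clocks
BITS_PER_LONG = 32       # one instruction per long


def hub_exec_mips(bus_bits):
    """Straight-line hub exec rate, limited by instruction fetch bandwidth."""
    longs_per_slot = bus_bits // BITS_PER_LONG
    slots_per_second = CLOCK_HZ // COGS
    return slots_per_second * longs_per_slot / 1e6


narrow = hub_exec_mips(128)   # 4 instructions per slot
wide = hub_exec_mips(256)     # 8 instructions per slot

print(f"128-bit bus: {narrow:.0f} MIPS, 256-bit bus: {wide:.0f} MIPS")
```

With these assumptions the 128-bit bus gives 50 MIPS and the 256-bit bus gives 100 MIPS, which fits the statement that doubling the bus width would bring straight-line hub execution up to cog speed.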

Here is the new register map. The DCACHE and ICACHE areas are in locales where the RAM is neither read nor written by instructions:

addr		read		write		name		background
--------------------------------------------------------------------------
000..1EF	RAM		RAM		-		-

1F0		CNT		-		CNT		DCACHE0
1F1		RND		-		RND		DCACHE1
1F2		INA		-		INA		DCACHE2
1F3		INB		-		INB		DCACHE3
1F4		RAM		RAM+OUTA	OUTA		-
1F5		RAM		RAM+OUTB	OUTB		-
1F6		RAM		RAM+DIRA	DIRA		-
1F7		RAM		RAM+DIRB	DIRB		-
1F8		RAM		RAM+CTRA	CTRA		-
1F9		RAM		RAM+CTRB	CTRB		-
1FA		RAM		RAM+FRQA	FRQA		-
1FB		RAM		RAM+FRQB	FRQB		-
1FC		PHSA		PHSA		PHSA		ICACHE0
1FD		PHSB		PHSB		PHSB		ICACHE1
1FE		indirect	indirect	INDA		ICACHE2
1FF		indirect	indirect	INDB		ICACHE3
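
For anyone wanting to experiment with the map above (for example in a simulator or disassembler), it can be transcribed into a lookup table. This is a sketch only: the type and function names are hypothetical, and the strings are copied from the table as posted.

```python
from collections import namedtuple

Register = namedtuple("Register", "read write name background")

# Special registers, transcribed row by row from the posted map.
SPECIAL = {
    0x1F0: Register("CNT", "-", "CNT", "DCACHE0"),
    0x1F1: Register("RND", "-", "RND", "DCACHE1"),
    0x1F2: Register("INA", "-", "INA", "DCACHE2"),
    0x1F3: Register("INB", "-", "INB", "DCACHE3"),
    0x1F4: Register("RAM", "RAM+OUTA", "OUTA", "-"),
    0x1F5: Register("RAM", "RAM+OUTB", "OUTB", "-"),
    0x1F6: Register("RAM", "RAM+DIRA", "DIRA", "-"),
    0x1F7: Register("RAM", "RAM+DIRB", "DIRB", "-"),
    0x1F8: Register("RAM", "RAM+CTRA", "CTRA", "-"),
    0x1F9: Register("RAM", "RAM+CTRB", "CTRB", "-"),
    0x1FA: Register("RAM", "RAM+FRQA", "FRQA", "-"),
    0x1FB: Register("RAM", "RAM+FRQB", "FRQB", "-"),
    0x1FC: Register("PHSA", "PHSA", "PHSA", "ICACHE0"),
    0x1FD: Register("PHSB", "PHSB", "PHSB", "ICACHE1"),
    0x1FE: Register("indirect", "indirect", "INDA", "ICACHE2"),
    0x1FF: Register("indirect", "indirect", "INDB", "ICACHE3"),
}


def lookup(addr):
    """Return the map entry for a 9-bit cog address."""
    if not 0x000 <= addr <= 0x1FF:
        raise ValueError(f"cog address out of range: {addr:#x}")
    if addr <= 0x1EF:
        return Register("RAM", "RAM", "-", "-")   # plain cog RAM
    return SPECIAL[addr]


def cache_backed():
    """Addresses whose underlying RAM is free to serve as DCACHE or ICACHE."""
    return sorted(a for a, r in SPECIAL.items() if r.background != "-")


print([hex(a) for a in cache_backed()])
print(lookup(0x1F4))
```

The eight cache-backed addresses are exactly the rows where neither the read nor the write column touches RAM, which is the point made in the post.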

Bill Henning · 2014-04-08 09:05

Since without pipelining it seems cheap, I'd go for it. Monitor in one cog? Heck, I might leave it running all the time!!!!

cgracey wrote: »

I agree with not making duplicate pointers per task, etc. But to get bare bones multi-tasking, all we need is 1..3 more Z/C/PC's and some mux's for them. It's a big deal for a program like the ROM_Monitor. It wouldn't work in one cog without multi-tasking.

mindrobots · 2014-04-08 09:13

Bill Henning wrote: »

Since without pipelining it seems cheap, I'd go for it. Monitor in one cog? Heck, I might leave it running all the time!!!!

16 cogs? A primitive monitor, debugger, serial HIM running all the time - you bet! (Forth kernel? - I think I can mention it parenthetically!)

Heater. · 2014-04-08 09:32

You know me. 170 instructions is already quite enough.

cgracey · 2014-04-08 09:36

Heater. wrote: »

You know me. 170 instructions is already quite enough.

Yes, I cringed when I thought you'd see that.

It just takes that much to get program flow and computation running smoothly.

I'd post the instruction set, but it's still a mess. I just got rid of the pixel blending instructions. They are totally fun to play with, but an excess in this chip.

Sapieha · 2014-04-08 10:21

Hi Chip.

I understand your problem.
And I think you now have good insight into what can be done, and even what is usable from the P2 work.
I know you will still come up with clever solutions to make that IC usable.
I will hang on as long as my health allows.

BUT still -- is it possible to have instruction info for the last BIN FPGA code --- so I can have something to work on, even if it is not so usable in the NEXT STEP of the IC?
Only work with electronics keeps me going a little more (as my life in recent years is sleeping and sitting at the computer programming/thinking); otherwise it would be only sleeping, and that does not help my health.

cgracey wrote: »

Sapieha,

The new chip will have a lot of good things in it, including hub exec. We just couldn't get the P2 to fit in 180nm in any adequate manner. So, we are going back to the basics, but adding a few key elements from the Prop2 development. It's true that these cogs won't be as fast, but there will be more of them, so that the total MIPS will be higher, but the power will be 1/8th.

When this is done, we'll pick up where we left off on Prop2. That's the best we can do right now, in order to get a real chip into production. Hang in there, please.

Baggers · 2014-04-08 10:25

I agree, the pixel blending will be overkill for this chip.

jazzed · 2014-04-08 10:45

Bill Henning wrote: »

Only instruction addresses are 17 bit long address encoded, saves two bits on encoding, and Px's don't support non-aligned instructions.

Yes, Chip answered that already. Guess you missed that.

Roy Eltham · 2014-04-08 10:53

cgracey wrote: »

Yes, I cringed when I thought you'd see that.

It just takes that much to get program flow and computation running smoothly.

I'd post the instruction set, but it's still a mess. I just got rid of the pixel blending instructions. They are totally fun to play with, but an excess in this chip.

Could they possibly live as a hub resource alongside the cordic/math stuff? Maybe a trimmed down subset? They are not strongly needed, but they are super nice for doing GUIs.

Martin Hodge · 2014-04-08 11:03

Seairth wrote: »

Frankly, I'd take the P1+ with no additional task support if it would mean we get it sooner.

+1x10¹⁰⁰

Bill Henning · 2014-04-08 11:17

Yep, I missed his answer to you.

I was trying to save him time

I really don't know how Chip manages with so little sleep.

jazzed wrote: »

Yes, Chip answered that already. Guess you missed that.

potatohead · 2014-04-08 11:25

Please write the monitor so that we can hook into it.

Say, a U command, for user debugger, or anything really. Monitor arguments passed in, and our program can return to it easily. This allows an upload to include whatever debugging package the developer deems necessary, and it takes advantage of the serial link already set up and established.
