The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

jmg · 2014-04-07 18:08

dr hydra wrote: »

Would it be possible to set the number of cogs that can access the hub memory...therefore increasing the bandwidth to hub memory...

Yes, there are a number of ways to add hub-allocations. A mapping array, with the right controls, seems universally flexible.

Seairth · 2014-04-07 18:21

cgracey wrote: »

I don't know if cog-to-cog 32-bit links will be practical.

From my perspective, this doesn't need to act like memory-mapped I/O. This could just as easily be a simple pair of RDxxx/WRxxx instructions. By pairing every two cogs, this would allow at least each pair to communicate (at full speed) without using the hub or a large, muxed, bus ring.

Seairth · 2014-04-07 18:29

jmg wrote: »

Yes, there are a number of ways to add hub-allocations. A mapping array, with the right controls, seems universally flexible.

I'm not sure if I'm alone in this, but I think the hub access should be left as-is for this chip. Not even the P2 had this sort of feature. Designing something entirely new like this at this point in time seems like a risky endeavor. And this chip needs to minimize risk if it's going to get out the door as quickly as possible.

Cluso99 · 2014-04-07 18:44

WOW... Only catching up now! WTG CHIP !!! Wondered why you were so quiet

64 I/O, 16 P1+ cogs, 512KB hub, security, QFP100, etc. Yeah

Comments/answers in no particular order...

P1 instruction compatibility is not required, although PASM source compatibility, as close as possible, is desirable.
Will the NR bit be removed? If so, then could you guys counting instructions check this. IIRC I use ANDN xx,xx NR rather than TESTN.
Most of the little used instructions I won't miss either except CMPSUB. But I did like CMPR in the P2.
I noted a long time ago that DJNZ/TJZ/TJNZ could be combined and expanded. Worth a quick look vs P2.
Color Composite Video - I am really going to miss this. The car videos are cheap, nice and standalone. IMHO NTSC is fine w/o PAL.
- If it takes too much space, could perhaps 1 cog (maybe the last) be different and include this, or could a separate single circuit be added like cordic?
Hubexec - I am really going to miss some form of hubexec.
- I believe it solves the argument against the 2K limitation of cogs.
- It reduces power usage over LMM modes (which can still be done anyway)
- It does not need to be anywhere near the complexity of the P2 (no autofetch, no LRU, no 4 deep 8 long cache)
- Those interested in this, could you please continue discussion over on the P16+ Hubexec thread
- http://forums.parallax.com/showthread.php/155121-Possible-simple-HUBEXEC-method-for-P16-X32B-discussion
- Bill, could you please copy your Post#73 there to begin as a basis?
Simple video read/write for use as serial in and out would be nice.
Some form of hub slot allocation would be nice.
Some form of comms between props w/o going via hub would be nice
- R/W 32bit directly between adjacent cogs would be nice (or between pairs?)
- Else a simple Port C & D (64 bits, simply OR'ed, no DIR required, just read & write)
Simply love the Quad width hub & cogs!
I would really like to see some simple hub allocation table and mooch if possible.
Any spare silicon would be nice to have a few additional cogs w/o video to fit more in. (I am not concerned they are crippled)

W9GFO · 2014-04-07 19:12

Seairth wrote: »

By the way, what nickname are we giving this thing?

I would suggest "Lemonade" - I think it is appropriate.

tonyp12 · 2014-04-07 19:21

> Will the NR bit be removed? If so, then could you guys counting instructions check this. IIRC I use ANDN xx,xx NR rather than TESTN.

TESTN is already not its own opcode; it's an ANDN with NR.
Could the NR bit be removed? There are 64 opcodes, and 16 more are 'emulated' with NR.
Getting rid of NR and having 128 opcodes, though some would be near duplicates like SUB vs CMP, would give more choices but would also use more logic.
Or don't allow the compiler to use the NR flag on opcodes where it does not make sense, and instead use NR as the 7th bit of a new opcode; that should add another 10-15 opcodes.

Beau Schwabe · 2014-04-07 19:28

W9GFO - "I would suggest "Lemonade" - I think it is appropriate."

If you can take "Melons" and make "Lemonade" then that's just the dyslexia talking.

Personally I like "Phoenix" , but that is totally an unofficial personal preference.

jmg · 2014-04-07 19:37

Seairth wrote: »

I'm not sure if I'm alone in this, but I think the hub access should be left as-is for this chip. Not even the P2 had this sort of feature. Designing something entirely new like this at this point in time seems like a risky endeavor. And this chip needs to minimize risk if it's going to get out the door as quickly as possible.

It is not really entirely new, it is very similar to the Task allocator already done in P2, and it has a safe-superset operation.

So it can power up, looking just like the locked-choice users are familiar with now.
There is only one Mapper needed, not one per COG, so die impact is very small.
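
A minimal sketch of how such a mapper might behave, written in Python purely as a behavioral model. The 16-entry table, the function names, and the idea that a cog can simply be handed another cog's entry are assumptions for illustration, not a description of any committed design. At power-up the table is the identity mapping, so hub timing looks exactly like the fixed round-robin.

```python
NUM_COGS = 16  # assumed: one table entry per hub slot, one slot per cog by default


def default_map(num_cogs=NUM_COGS):
    """Power-up table: slot i belongs to cog i (plain round-robin)."""
    return list(range(num_cogs))


def slots_per_rotation(slot_map):
    """Count how many hub slots each cog owns in one pass through the table."""
    counts = {}
    for cog in slot_map:
        counts[cog] = counts.get(cog, 0) + 1
    return counts


def worst_case_wait(slot_map, cog):
    """Largest gap, in slots, between consecutive slots owned by `cog`.

    Returns None if the cog owns no slot at all.
    """
    positions = [i for i, owner in enumerate(slot_map) if owner == cog]
    if not positions:
        return None
    n = len(slot_map)
    gaps = []
    for k, pos in enumerate(positions):
        nxt = positions[(k + 1) % len(positions)]
        gap = (nxt - pos) % n
        gaps.append(gap if gap else n)  # a single slot wraps the full table
    return max(gaps)
```

With the default table every cog waits at most 16 slots. If an idle cog 8 gives its entry to cog 0 (`m[8] = 0`), cog 0 owns two evenly spaced slots and its worst-case wait drops to 8.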

jmg · 2014-04-07 19:41

Seairth wrote: »

From my perspective, this doesn't need to act like memory-mapped I/O. This could just as easily be a simple pair of RDxxx/WRxxx instructions. By pairing every two cogs, this would allow at least each pair to communicate (at full speed) without using the hub or a large, muxed, bus ring.

Is serial too slow ? Chip's new discussion idea of per-pin state engines could allow something like this ?

rjo__ · 2014-04-07 19:43

P16X32 works for me. I am not exactly sure how he does it… but every time Chip makes this sort of turn, I like what comes out better than what went in.
My nearly 3 year old grandson absolutely loves it when I slam my car to a halt to avoid driving through a red light. Laughs and giggles his heart out.
I am just sitting here laughing my heart out right now. My grandson always knows where we are going, because we discuss it before he will let me put his shoes on him.
By the time we get to the end of our drive he will ask "where is( ______)". And I say…"over there and point in the general direction." At every stop, he asks again…and
he gets the exact same answer, with just as much enthusiasm as I can muster.
It is our little game.

Where are we going next? Over there-->P3.

Cluso99 · 2014-04-07 19:53

Reading thru' all these posts, I missed if the cogs will be 2clock instructions ?

Bob Lawrence (VE1RLL) · 2014-04-07 19:54

By the way, what nickname are we giving this thing?

Maybe it should be called Finally! LOL

Seairth · 2014-04-07 19:55

jmg wrote: »

Is serial too slow ? Chip's new discussion idea of per-pin state engines could allow something like this ?

I haven't seen that discussion yet. But if the serial link takes longer than 16 clock cycles to transfer 32 bits, then the hub would be faster. Direct communication would allow 2-cycle writes and potentially 2-cycle reads, though with a bit of handshaking it might more reasonably be 8-10 clock cycle reads.

In the end, there's always the hub approach, which works well enough for many applications. So this is more of a nice-to-have than a need-to-have.

And if the simple two-task approach is supported, I suspect many of the cases where paired cogs would be used could be replaced by a single cog with two tasks.
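
To put rough numbers on the comparison above, here is a small Python sketch. The 16-clock hub figure and the 2-clock direct write come from this post; the clocks-per-bit value for a serial link is an assumed parameter, since no serial mechanism has been specified.

```python
HUB_WORST_CASE_CLOCKS = 16  # threshold used in the post above
DIRECT_WRITE_CLOCKS = 2     # proposed paired-cog write


def serial_transfer_clocks(bits=32, clocks_per_bit=1):
    """Clocks needed to move `bits` over a one-wire link (assumed model)."""
    return bits * clocks_per_bit


def fastest_path(clocks_per_bit=1):
    """Name the cheapest of the three options for one 32-bit transfer."""
    options = {
        "serial": serial_transfer_clocks(32, clocks_per_bit),
        "hub": HUB_WORST_CASE_CLOCKS,
        "direct": DIRECT_WRITE_CLOCKS,
    }
    return min(options, key=options.get)
```

Even at one bit per clock, a single-wire link needs 32 clocks for a long, which is already past the 16-clock mark, so under this model the serial option only wins if it moves more than one bit per clock.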

Seairth · 2014-04-07 19:57

Cluso99 wrote: »

Reading thru' all these posts, I missed if the cogs will be 2clock instructions ?

Yes. See Chip's response to my list of questions above.

porcupine · 2014-04-07 19:58

Question, since I can't follow the threads, they're so active.

Will the proposed new "Prop 1 TNG" have fast multiply/divide instructions?
SERDES?
Faster external RAM?

Those are things I've heard bandied about for Prop 2 that got me excited. Frankly, I'm actually fine with the P1 as it is, just with more (much more RAM) and better fixed point math performance. I'm doing simple hobby audio DSP (synthy stuff... FM synthesis and wavetable) and I just flat out run out of RAM. I know I could squeeze more in by writing in raw PASM, but I like working in a subset of C++.

BTW I think using something close to standard C/C++ puts the prop ahead of the XCore stuff (which I've also been playing with), with their proprietary "XC" language. To me it's a long term winning strategy, but the memory thing is the biggest problem. 512k or 1m on-chip RAM would certainly help, though.

Heater. · 2014-04-07 19:58

Cluso,

...I missed if the cogs will be 2clock instructions ?

From the very first post by Chip: "This also gives cogs running at 200MHz (100 MIPS)"

Two clocks per instruction it is.

mindrobots · 2014-04-07 20:00

Since we're all starting to make requests on the specs of this new chip, can I ask for the SERIN/SEROUT instructions? Are they big power hogs or large blocks of silicon? Maybe just SERA and not SERB? I thought it was pretty impressive that with those two instructions, you can eliminate 30-40 LONGs and still have serial I/O in a COG at up to 2 Mbit/s. If we don't have HUBEXEC, then COGRAM is back to being a precious commodity.

W9GFO · 2014-04-07 20:20

Beau Schwabe (Parallax) wrote: »

Personally I like "Phoenix" , but that is totally an unofficial personal preference.

The problem with "Phoenix" is there are no ashes, the P2 will still move forward - right?.

Also, Phoenix is bloody hot, kinda like the P2, not the P16X32B. ;-)

Beau Schwabe · 2014-04-07 20:36

W9GFO,

"The problem with "Phoenix" is there are no ashes, the P2 will still move forward - right?. Also, Phoenix is bloody hot, kinda like the P2, not the P16X32B. ;-) "
You're reading too much into it :-) ... I like the mystery of what the Phoenix represents.

Ashes are a sign that something new and exciting is about to happen.
Hot can also be a metaphor for setting a path toward better speed, efficiency, and a unique approach.

All while still keeping the idea of a Propeller: the wings of a bird that can take flight to new horizons, tying back to the mystery... the things people will be able to program in the future that are impossible to foresee now.

W9GFO · 2014-04-07 20:38

Cluso99 wrote: »

Reading thru' all these posts, I missed if the cogs will be 2clock instructions ?

Yes, see this post: http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1257727&viewfull=1#post1257727

Cluso99 · 2014-04-07 20:51

Thanks guys - I missed the 2 clock although I presumed it to be the case. Then your mind says did I read it or assume it - or at least my senior mind does sometimes

Over here http://forums.parallax.com/showthread.php/155121-Very-Simple-HUBEXEC-for-New-16-Cog-512KB-64-analog-I-O?p=1257800&viewfull=1#post1257800
I posted a very basic implementation of hubexec that ought to be real simple to implement. As this stands, it's not the fastest - only comparable with LMM but at 25% of the power.

Hubexec is really important to GCC and Catalina C, and to other HL languages.

It is also real important to remove the 2K limit of cog ram. LMM and overlaying can both help resolve this, but they are by no means simple to the uninitiated. Hubexec is simple to program, with just a few caveats compared to normal cog mode.

Here is the guts of my basic dead-simple hubexec...

Cluso99 wrote: »

.....
Basic HUBEXEC

Requires a new JMPRETX (JMPX/CALLX/RETX) instruction with 17bit for direct addressing.
CALLX will store return address (17bit) and flags in the fixed register $1EF

JMPX will not store the return address nor flags

RETX will JMPX to the return address stored in register $1EF, restoring the flags via WC & WZ

If the goto address is <$200 (ie hub byte address <$800) then the resulting jump will be to cog, else it will be hub.

The 17bit address (hub long address >>2) can be held in D+S for immediate mode.

The cog's program counter will be increased from 9bits to 17bits (17bits hub long address)

When an instruction needs to be fetched from hub, it will wait for the hub cycle and the hub instruction will be read from the hub long.
ie the instruction can come from either cog ram or a fetched hub long.

No hub caching for instructions will be performed.
This keeps the design simplest.

Hubexec will execute 1 instruction for each available hub slot, except when delayed due to a hub data access.

Due to 1 per hub slot, the remaining clocks will cause the cog to idle in low power mode, reducing power consumption considerably.

Improved HUBEXEC
By increasing the complexity slightly, the following improvements are possible...
Adding a single QUAD register to hold the fetched longs would permit up to 4 instructions to execute in HUBEXEC mode per hub slot (presuming 1:16 clocks = 1:8 instructions).
This would give HUBEXEC up to 4x improvement in speed over the Basic Hubexec mode.

I will leave the details for Chip to work this out simply if he decides it has merit.

There seems no point in cache tags although saving hub fetching if it is the same address might be simple and save power.
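
The address rule and the throughput figures in the proposal above can be sketched in a few lines of Python. This is only a behavioral model: the function names are made up for illustration, and the timing assumes one hub slot per 16 clocks and 2 clocks per instruction, with straight-line code and no hub data accesses.

```python
COG_LONGS = 0x200   # cog register space in longs ($000-$1FF)
PC_BITS = 17        # widened program counter, holding a hub long address


def fetch_source(long_addr):
    """Apply the <$200 rule: low addresses execute from cog, the rest from hub."""
    if not 0 <= long_addr < (1 << PC_BITS):
        raise ValueError("address does not fit in a 17-bit PC")
    return "cog" if long_addr < COG_LONGS else "hub"


def hub_byte_address(long_addr):
    """Convert a long address to the byte address used by hub instructions."""
    return long_addr << 2


def instructions_per_rotation(mode, clocks_per_rotation=16, clocks_per_instruction=2):
    """Instructions completed per hub rotation in each execution mode."""
    native = clocks_per_rotation // clocks_per_instruction
    if mode == "cog":
        return native           # no hub fetch needed
    if mode == "basic":
        return 1                # one instruction fetched per hub slot
    if mode == "quad":
        return min(4, native)   # one quad (4 longs) fetched per hub slot
    raise ValueError("unknown mode: " + mode)
```

Under these assumptions basic hubexec runs at 1/8 of cog speed and the quad-register variant at up to 1/2, which matches the "up to 4x" figure above. A 17-bit long address also covers exactly 512KB of hub.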

David Betz · 2014-04-07 20:53

Cluso99 wrote: »

Thanks guys - I missed the 2 clock although I presumed it to be the case. Then your mind says did I read it or assume it - or at least my senior mind does sometimes

Over here http://forums.parallax.com/showthread.php/155121-Very-Simple-HUBEXEC-for-New-16-Cog-512KB-64-analog-I-O?p=1257800&viewfull=1#post1257800
I posted a very basic implementation of hubexec that ought to be real simple to implement. As this stands, it's not the fastest - only comparable with LMM but at 25% of the power.

Hubexec is really important to GCC and Catalina C, and to other HL languages.

It is also real important to remove the 2K limit of cog ram. LMM and overlaying can both help resolve this, but they are by no means simple to the uninitiated. Hubexec is simple to program, with just a few caveats compared to normal cog mode.

Here is the guts of my basic dead-simple hubexec...

This sounds good and is pretty much exactly what we started with in P2 before we added the LRU cache. It was suggested over two years ago. It's nice to see that people are still interested in it! :-)

W9GFO · 2014-04-07 20:54

Beau Schwabe (Parallax) wrote:

While still keeping the idea of a Propeller...

Phoenix + propeller =

Cluso99 · 2014-04-07 20:55

A number of you have asked about more than 512KB of hub. With the QUAD x128bits, the hub is divided into 4 blocks of 128KB. It is not going to be possible to add anything other than another 512KB to keep this format. There is insufficient silicon space for 1MB.

cgracey · 2014-04-07 20:57

David Betz wrote: »

This sounds good and is pretty much exactly what we started with in P2 before we added the LRU cache. It was suggested over two years ago. It's nice to see that people are still interested in it! :-)

One aspect of having 4-long hub transfers is that with a few simple address tags (TLB, David?) we could direct out-of-cog addresses into 4-long register blocks which are serving as instruction caches. Part of the cog register RAM becomes the cache! No cache-line flipflops and muxes needed!

I think hub exec is going to happen, because it won't take much. What would really blow it wide open would be to have a 256-bit hub data path, so that each cog could do an 8-instruction fetch every 8 instructions. That would have the effect of jacking the power up quite a bit, I'm afraid. All cogs could run at 100% speed from the hub without branching or hub accesses.

I think Cluso thought up this possibility on the Prop2 effort.
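
A rough Python model of the tag idea described in this post, offered as a sketch only. The number of register blocks, the round-robin replacement, and every name here are assumptions; the post only says that a few address tags could steer out-of-cog addresses into 4-long register blocks. The second function restates the bus-width arithmetic for straight-line code, assuming 8 instructions execute per hub rotation.

```python
class QuadTagCache:
    """Tags map a hub quad address to one of a few 4-long cog register blocks."""

    def __init__(self, num_blocks=4):
        self.tags = [None] * num_blocks  # hub quad address held by each block
        self.next_victim = 0             # assumed round-robin replacement
        self.hub_fetches = 0

    def fetch(self, long_addr):
        """Return the block index holding this address, loading it on a miss."""
        quad = long_addr >> 2            # 4 longs per hub transfer
        if quad in self.tags:
            return self.tags.index(quad)
        block = self.next_victim
        self.tags[block] = quad
        self.next_victim = (block + 1) % len(self.tags)
        self.hub_fetches += 1
        return block


def straight_line_speed(bus_bits, instructions_per_rotation=8):
    """Fraction of full cog speed sustainable with one hub fetch per rotation."""
    longs_per_fetch = bus_bits // 32
    return min(1.0, longs_per_fetch / instructions_per_rotation)
```

In this model an 8-long loop that fits in the tagged blocks costs two hub fetches no matter how many times it runs, while straight-line code is limited to half speed on a 128-bit path and reaches full speed on a 256-bit path.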

Cluso99 · 2014-04-07 21:04

re HUBEXEC mode

David Betz wrote: »

This sounds good and is pretty much exactly what we started with in P2 before we added the LRU cache. It was suggested over two years ago. It's nice to see that people are still interested in it! :-)

It stops the detractors who see 2KB cog ram as too small.
It places the Call return address in a fixed location like GCC wants.
In its most basic form it runs at LMM speed and only uses 25% power.

You will note I used a separate JMPRETX instruction. This will force the user to know when he is writing hubexec code, and the compiler can be enhanced to check certain caveats.

The biggest thing to me is that I can write Hubexec Pasm simply. I went thru' all the problems of using LMM (without macros) for my P2 Debugger which made me understand the sw issues involved with LMM. Once Chip added hubexec, almost all of them went away. I could convert cog code to hubexec quite simply.

So, basic hubexec is definitely worth pursuing.

JRetSapDoog · 2014-04-07 21:10

Is the package size with its available die space really set in stone at this point? Are there any alternatives for packages that would allow for more die space and thus RAM? I'm delighted with the plan for 512KB! But more would definitely be nice for video support (or whatever). If there is any possibility that we'll eventually move to a larger package at 180nm, it's best to do so early in the game so as to avoid re-design work and not unnecessarily limit things.

I can see the P16X32 being kind of a SOC of sorts if it had 1MB or so of RAM (and even with 512KB/768KB). With more RAM, it could command a higher price and be suitable for more Apps. In that a highly-tuned but still general purpose niche device that stands out in the market is being designed, it's good to hit as big a niche as possible. With more RAM, the chip could overlap into the LCD driver space, but that would require digital video out to the pins (bypassing the DACs), which I don't think Chip has embraced at this point (perhaps because the analog pin features are so "juicy" it's hard to imagine not using them for video). By the way, in using (or abusing) the term SOC, I'm obviously thinking of when the chip isn't paired with external SDRAM, though that would really expand the power (if enough pins exist).

But anyway, my inquiry has to do mostly with package/die size. Of course, a larger die with more RAM only would make sense if the RAM power requirements were modest (Chip's plan to "dynamically" shut off unused RAM could help there). If the package/die size is set in stone now, then fine. But if not, maybe pin it down early (and at Day 1 or 2, it's "early days" for this design). That way, the design(er) can sprint more directly towards the center target instead of spiraling inwards. Spiraling is often unavoidable, but when it is avoidable, it's usually good to do so.

Anyway, whatever the case, I'm lovin' the chip so far. It appears that Chip is quite pleased, too, which gives me more confidence. The refactoring is exactly as Ken described and some of us have experienced when we've lost code/work or had to start over (from a new beginning closer to the finish line). The result is even better than before. Lastly, seeing the rapid progress being made, I'd say it's time to begin a new general forum area for the P16X32B (as I speculated about elsewhere) and migrate the applicable threads over to it.

Bill Henning · 2014-04-07 21:14

80 I/O is almost infinitely better than 64 - precisely because it makes SDRAM practical!

Kerry S wrote: »

Ok... So then we would have 36 I/O available after SDRAM. With 4 used for VGA we are left with 32. Same as what I have now to work with (P1). I would have to give up my 4 hard inputs (direct to Prop) to get the Mouse and Keyboard serial ports that I am now getting from the grafted Raspberry Pi. Not optimal, but doable.

As for memory, if you are planning on this to have SDRAM typically, can you not (don't hang me) make the I/O pins for that interface just digital and drop the analog from them? That would free up area for more memory for the LCD/VGA guys. If you don't need the SDRAM they would still be available for regular digital I/O applications. Would there really be a practical use for 80 analog pins on one chip?

Even with the extra I/O 16 cogs is fine. Please don't give Ken a stroke! He has been very good with our insanity up til now and we need him to be 100% working his marketing magic.

jazzed · 2014-04-07 21:17

Beau Schwabe (Parallax) wrote: »

W9GFO,

"The problem with "Phoenix" is there are no ashes, the P2 will still move forward - right?. Also, Phoenix is bloody hot, kinda like the P2, not the P16X32B. ;-) "

Well at least no one has suggested Chernobyl.

I suppose Cavitator is out of the question too, i.e. one that creates cavitation:

Cavitation "a : the formation of partial vacuums in a liquid by a swiftly moving solid body (as a propeller) or by high-intensity sound waves; also : the pitting and wearing away of solid surfaces (as of metal or concrete) as a result of the collapse of these vacuums in surrounding liquid. b : the formation of cavities in an organ or tissue especially in disease"

Part a is ok, part b not so much. The piece about forming vacuums feels oddly familiar though.

Lawson · 2014-04-07 21:19

cgracey wrote: »

I think a 16x16 multiplier would be good, per cog. Any thoughts on whether that would be precise enough? 16x16 yields a convenient 32-bit result, at least.

16x16 multiply sounds fine to me. I'd use it right now, and be perfectly happy to build up a 32x32 multiply out of the 16x16 multiply, shifts, and adds.
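
For reference, a sketch of that build-up in Python. `mul16` is a hypothetical stand-in for the proposed per-cog hardware multiply, and everything here is unsigned; signed handling would need extra steps not shown.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def mul16(a, b):
    """Stand-in for a hardware 16x16 -> 32-bit unsigned multiply."""
    assert 0 <= a <= MASK16 and 0 <= b <= MASK16
    return a * b


def mul32(a, b):
    """Full 64-bit product of two unsigned 32-bit values from four 16x16 multiplies."""
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16
    lo = mul16(a_lo, b_lo)
    mid = mul16(a_lo, b_hi) + mul16(a_hi, b_lo)  # both cross terms carry weight 2^16
    hi = mul16(a_hi, b_hi)
    return (hi << 32) + (mid << 16) + lo


def mul32_low(a, b):
    """Low 32 bits only: the high*high term shifts out, so three multiplies suffice."""
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16
    mid = mul16(a_lo, b_hi) + mul16(a_hi, b_lo)
    return (mul16(a_lo, b_lo) + (mid << 16)) & MASK32
```

If only a 32-bit result is wanted, the three-multiply form is the cheaper one, since the `a_hi * b_hi` partial product never reaches the low long.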

I have used ABSNEG. It's nice for setting PHSx to a negative immediate value. (used with NCO mode for fixed width pulse generation) If it was eliminated, I could work around that with a one long constant and a MOV.

I've also used a P1 counter in NCO mode with FRQx = 0 as a serial data output shifter (just RCL PHSx to output the next bit). It'd be nice if the counters had a mode where PinA was set by Bit31 of PHSx and PinB sets the state of Bit0 of PHSx. This would allow bi-directional SPI using an unrolled 2-instruction loop (potentially 1-instruction, if a free-running counter can be phased and gated correctly).
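
A behavioral sketch of that wished-for counter mode, in Python. This mode does not exist on the P1; the model simply assumes PinA mirrors bit 31 of the register and that each shift pulls the PinB level into bit 0, and it ignores clock generation entirely.

```python
MASK32 = 0xFFFFFFFF


def spi_exchange(tx_word, miso_bits):
    """Shift a 32-bit word out MSB-first while shifting received bits in.

    `miso_bits` is the sequence of PinB levels sampled on each shift.
    Returns (list of PinA levels driven, final register contents).
    """
    phs = tx_word & MASK32
    mosi = []
    for bit_in in miso_bits:
        mosi.append(phs >> 31)                      # PinA follows bit 31
        phs = ((phs << 1) & MASK32) | (bit_in & 1)  # shift left, PinB lands in bit 0
    return mosi, phs
```

After 32 shifts the register holds the received word with the first bit sampled in bit 31, so the same register serves as both transmit and receive buffer.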

16cogs, 512KB, 100MIPS each with analog I/O and the "cheap" high value bits from the P2 work? Oh heck yes!

Don't worry about "wasting" the work done on the P2 so far. That is a useful exploration of the parameter space, and will make work on this chip go significantly faster and produce a better result. I'd say the 128-bit cog/hub memory buses MUXd down to 32-bits is just one example of this.

Looking at the power pad TQFP-100 package that's planned, I bet I can make a layout for it on a 2-layer board that will be hand-solderable. (i.e. a few BIG solder filled thermal vias) I'm thinking a VCC and VIO ring on the top side between the ground pad and pins, with VIO connecting directly and VCC making a short connection via the bottom layer. The bottom layer would be mostly ground plane with most of the ground plane intact between the power pad thermal vias and the rest of the board. I think for bypass caps I'd use 0603 on a minimal footprint arranged radially around the chip. If space gets tight, the VCC bypass caps could easily move to the back side.

Marty