P16+X32B - what might you want as a minimum ?
Cluso99
Posts: 18,069
What might make a nice compromise to add some P2 features into a P16X32B ?
Let's start with some basics which seem to have a lot of consensus
1. 512KB hub ram (32bits wide) Bill wants 512MB - couldn't resist that typo
2. 16-32 P1 2-clock little lean and mean Cogs
3. 160-200 MHz
4. 64 I/O with ADC etc from the P2
5. TQFP100 0.5mm with thermal ground pad
6. P2 Security fuses and mode
7. P2 simple boot and monitor
8. P2 style boot from SPI Flash (not I2C Eeprom)
9. WAITPEQ/WAITPNE - need method for PORT A/B selection
-- could be NR bit if P1 instruction set, but not if P2 subset
What minimum features are necessary to inherit from P2 ?
10. Some form of hub slot allocation by sw (Bill has made some excellent suggestions)
11. Simple hubexec as I suggested
-- jumps/calls become relative when executed from hub
-- runs at hub mode (no cache)
-- add a jump direct instruction (17 bits required)
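For what it's worth, the 17-bit figure in item 11 can be checked with a couple of lines. This is only a sketch and assumes hub instructions are long-aligned (one 32-bit long per instruction), so the jump target is a long index rather than a byte address; `long_address_bits` is a made-up helper name.

```python
def long_address_bits(hub_bytes):
    """Bits needed to address every 32-bit long in a hub of the given size."""
    longs = hub_bytes // 4              # assumption: instructions are long-aligned
    return (longs - 1).bit_length()     # bits needed for the highest long index

bits_512k = long_address_bits(512 * 1024)   # the 512KB hub from item 1
bits_256k = long_address_bits(256 * 1024)   # a smaller 256KB hub, for comparison

print(bits_512k, bits_256k)
```

So a 512KB hub needs a 17-bit long address, and halving the hub would save one bit.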
What minimum features would be nice to inherit from P2 ?
12. P2 instruction format for P1 instructions (becomes subset for later P2) ???
13. Some form of cog-cog parallel communication (relieves hub bandwidth, and faster) ???
-- I suggested a RD & WR 32bit register between each adjacent cog so that each cog
-- can R/W to both the next lower and next upper cog (with cog wrapping to 0)
14. Add P2 single MATHS, CORDIC, maybe a P2 COUNTER, etc shared subsystem ???
-- note the new P2 counter takes the same space as the old P1 cog took!
15. P1 Video block - add read capability to perform simple serial input (can already do output) ???
16. P2 Video block - is it possible to add a single shared video block ??? Is 1 enough ???
17. Add P2 instructions TARG (simple variant), maybe AUGS and AUGD ???
-- TARG (only valid for next instruction) - if P2 instruction subset we don't have NR
-- Is it simple and doable ???
Anything else ?
18. A "WAIT n" instruction to conserve power would be nice
NOTES:
We need to keep this simple and doable with no surprises.
My preference is to have > 16 cogs. Maybe 24 would be workable.
Obviously this list has things that I see as being really helpful to add. YMMV.
Comments
9b) At least one P2b* COG
* A P2b COG is a smaller P2 one, with the BIG MathOps removed into common space, as Chip is looking at doing now.
That nicely swallows most of 11 thru 18, without needing the delays of re-testing proven FPGA code.
1) maximum acceptable power ceiling - if you don't specify that, history will repeat itself.
2) requirements frozen on mm/dd/yy to make mm/yy shuttle so chip can be GA from Parallax by mm/yy. People have expectations of when they want/expect this to be available; if you don't set those expectations, the schedule will creep just like features.
After you set these two requirements, you figure out what fits as far as features.
(Not that anybody reads this stuff unless they want to feel offended)
All it needs is three additional instructions. Existing JMP/CALL can stay the same for "cog" mode. Adds hubexec mode much faster than pure cacheless LMM, best guess 1/2 P2 style hubexec performance. Not bad for not needing WIDE/QUAD/caches. Extremely simple, totally P1 "KISS" principle compliant.
May allow for HD video and one fast hubexec cog if implemented as I proposed (32 cogs, 128 entry hub slot assignment table, mooch).
With Chip's hub based cordic/mul/div, gives us a good chunk of P2 capabilities practically free.
32 cogs, most can run 1/128 hub cycles, freely mix/match hubexec objects, identical to P1 cogs if you ignore 3 new instructions for hubexec support.
http://forums.parallax.com/showthread.php/155083-Consensus-on-the-P16X32B?p=1257101&viewfull=1#post1257101
Correct, and note we do not yet have ANY OnSemi Sim Power values on the 2 Clock COG yet.
Die area info for a P2b COG, and area info for a P1E COG, with P2-morphs, are also needed.
Once that is available, then possible mix combinations of P1E/P2S, P2b, Memory can be defined within the target die area, IIRC ~ 7x7mm
A practical power target is ~2W (Max-Sim) for typical code use, used to set typical use profiles of MHz/Volts/Watts,
with about 4W as a package-pushed ceiling, for less typical code use profiles of MHz/Volts/Watts.
Be interested to see Chip's resource usage on this.
A low-impact Hub Exec on a P2S COG would be great.
Note we do not have any power Sim values yet on 2 Clock COGS, so COG counts cannot be defined.
What I think the P1b should have is far more modest and much closer to the current P1.
The changes should be limited to what the large scale customers are asking for. That list has already been posted in another thread so I'm not going to post it here. The P1 came with the implied promise that there would be another version of the chip with a total of 64 IO pins; that is where I would start, and so far no one has seriously suggested more. Chip has made it clear that the P1b would have 16 - 32 cogs, but everyone that's looked into it has agreed that 16 is a realistic limit. It's also been made clear by Chip that the IO pins would be D/A, A/D.
What we need to be careful of is the feature creep that gave us a P2 that uses 5+W of power. I think we can all agree that the P1 needs an update and the sooner it's released the better.
I couldn't agree more and can't say it better!
I think even your minimum list is way too long. For instance, I do not recall seeing requests for either 7, or 8. Did I miss these somewhere along the line (entirely possible)?
On 9, Chip has proposed using whether D is even or odd, so this may not require any instruction set change.
Nothing from 10 onward should be included in a "minimum" (yes, I know we need a way to address the larger hub size, but your suggested solution is only one possible way of doing so).
Setting some arbitrary "minimum" above and beyond what Chip himself has suggested doesn't seem very sensible at this point, and only reduces the likelihood we will see a P16X32B at all.
Ross.
I knew that if anyone could figure it out, you could. Simple Hubexec compatible with P1. Nice! This could work out pretty well.
What I do not get here is your wish for 32 cogs. IMHO 16 would be a nice increase. You can still keep 128 entries and have an even finer granularity. 16 cogs would halve the needed power/size, right? wrong?
16 cogs would give the same hub-ratio as the current prop but 5x faster? if all of them need the slot.
I think 0.5 cog per pin is absolute overkill. Save the space for less heat and more speed.
Because the argument that the P2 has fewer cogs than the P1(xxx) will haunt us again, just later.
Enjoy!
Mike
The 32?16?N? counts are only vague indications, until there are Sim & Area numbers from OnSemi.
Certainly, fewer COGs will free up space for more memory, but the Power Envelope may have more say here, just as it has done on the 8 COG P2.
The 32 cogs is as a replacement to losing tasks & threads.
With the slot mapping, no need to have merged kb/mouse/serial drivers!
Here is how I'd use a lot of cogs - with lots of extremely simple Obex drivers!!!
insanely simple mouse driver, most of the time waiting, 1/128 hub slots.
insanely simple keyboard driver, most of the time waiting, 1/128 hub slots.
insanely simple serial port number 1, most of the time waiting, 1/128 slot
insanely simple serial port number 2, most of the time waiting, 1/128 slot
insanely simple serial port number 3, most of the time waiting, 1/128 slot
insanely simple serial port number 4, most of the time waiting, 1/128 slot
insanely simple serial port number 5, most of the time waiting, 1/128 slot
insanely simple serial port number 6, most of the time waiting, 1/128 slot
insanely simple serial port number 7, most of the time waiting, 1/128 slot
insanely simple serial port number 8, most of the time waiting, 1/128 slot
extremely simple SPI master driver #1 for MCP3208, most of the time waiting, 1/128 slots
extremely simple SPI master driver #2 for ENCJ NIC, most of the time waiting, 4/128 slots
extremely simple SPI master driver #3 for SD card, most of the time waiting, 4/128 slots
insanely simple I2C driver #1, most of the time waiting, 1/128 slot
insanely simple I2C driver #2, most of the time waiting, 1/128 slot
Cogs used so far: 15, Hub slots used so far: 21/128
hubexec fast C code, 1 cog, 64/128 slots
simple vga display refresh cog, 4/128 slots
sprite engine #1, 8/128
sprite engine #2, 8/128
sprite engine #3, 8/128
sprite engine #4, 8/128
Cogs used so far: 21, Hub slots used so far: 121/128
Resources left free: 11 cogs, 7/128 hub slots
100% deterministic timing, by user! TOTALLY configurable!
Removes need for tasks, threads totally!
Don't need determinism? Fire up a cog without an assigned hub slot, let it mooch the leftovers!
Don't need so many cogs? Don't COGINIT them.
No need for complex power management.
Only assign as much bandwidth per cog as you need.
This is everything P1 style cog enthusiasts said they wanted - 100% deterministic, extremely simple to program, no tasks, no threads, even simpler for drivers than original P1, no need for merged drivers.
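A rough sketch of how the budget above could be checked and laid out in a 128-entry slot table. The table layout, the even-spacing placement and the "first requesting moocher wins" rule are assumptions for illustration only, not a worked-out hardware design, and the names (`build_table`, `grant`) are hypothetical.

```python
TABLE_SIZE = 128
TOTAL_COGS = 32

# (driver, number of cogs, hub slots per cog), copied from the list above
ALLOCATION = [
    ("mouse", 1, 1), ("keyboard", 1, 1), ("serial", 8, 1),
    ("spi_mcp3208", 1, 1), ("spi_nic", 1, 4), ("spi_sd", 1, 4),
    ("i2c", 2, 1), ("hubexec_c", 1, 64), ("vga", 1, 4), ("sprite", 4, 8),
]

def build_table(allocation, size=TABLE_SIZE):
    """Return (table, owners): table[i] is a cog id or None for a free slot."""
    requests = [(slots, f"{name}{i}")
                for name, cogs, slots in allocation for i in range(cogs)]
    if sum(slots for slots, _ in requests) > size:
        raise ValueError("more hub slots requested than the table holds")
    # Place the biggest consumers first so their slots stay evenly spaced.
    requests.sort(key=lambda r: -r[0])
    table = [None] * size
    owners = {}
    for cog_id, (slots, name) in enumerate(requests):
        owners[cog_id] = name
        stride = size / slots
        for k in range(slots):
            pos = int(k * stride) % size
            while table[pos] is not None:    # probe forward to the next free entry
                pos = (pos + 1) % size
            table[pos] = cog_id
    return table, owners

def grant(table, cnt, wants, moochers):
    """Which cog gets the hub on clock slot cnt (assumed mooch policy)."""
    owner = table[cnt % len(table)]
    if owner is not None and owner in wants:
        return owner                         # the assigned cog uses its own slot
    for cog in moochers:                     # otherwise the leftover slot is mooched
        if cog in wants:
            return cog
    return None

table, owners = build_table(ALLOCATION)
slots_used = sum(1 for entry in table if entry is not None)
print(len(owners), "cogs,", slots_used, "slots used,",
      TOTAL_COGS - len(owners), "cogs free,", table.count(None), "slots free")
```

With largest-first placement the hubexec cog ends up as cog 0 on every second slot, and the totals match the tally above. A cog started with no entry in the table would only ever get the hub through `grant`'s mooch path.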
The minimum is P1 running at 160-200 MHz with 64 I/O and 256K Hub RAM.
Chip says 16 cogs and 2 cycle instruction execution are easy, so I'll take those.
EVERYTHING ELSE IS GRAVY. DAC or ADC resistors under the I/O pads? If easy do it, if not drop. That is the watchword for everything else. If it's easy do it, but if it's not easy just get us a chip with the better clock, more I/O and RAM. We've been waiting too long.
P1 is bottlenecked because it doesn't have enough business logic RAM and the best schemes for external RAM use too many I/O pins. More RAM! More pins! Everything else is gravy! Tasty delicious gravy and we want it very much, sure, but before we get to the gravy we have to have the MEAT. I would rather have the processor that is twice as fast with twice the pins and twice (well all right eight times) the RAM that can actually get produced than the one with five times, ten times, twenty times, and magic pixie dust that remains a simulation on a $500 FPGA.
With this mapping system, does the silicon even need mooch support ?
Almost, HUB bandwidth alone does not equate to Power management, and Power management will be needed.
That's why I've suggested the mapping fields include a COG enable field.
The same mapping place you allocate that 1/128 HUB slice, you can allocate Power Resource, and you can still optionally use WAITCNT in SW for an idle-form of power control, (~10%) but you do not have to use WAITCNT.
If users want to push timing to 200MHz and run a lot of COGS, then some form of Power Envelope control will be needed.
I wasn't voting for more ram but just empty space. Just 16 cogs, the same ram and less used space. Unused die area will not consume any power. No need to max everything out.
It may even have a thermal effect, but I have no clue.
Less overall functionality may get a tighter and faster synthesis block, and some empty space around until we hit the outer ring. So what.
Enjoy!
Mike
Or in other words, the P16X32B.
Ross,
Either you misread or I didn't make it clear.
Items 1-9 seem to be the base model.
If the instruction coding is kept as per P1, then it would be nicer/simpler for WAITPEQ/WAITPNE to use the NR bit to define the B port. I don't like aligning instructions if it is at all possible to avoid it.
10 & 11 have been discussed and seem to be the next 2 most preferable items to add to aid in performance.
Some form of hub slot allocation seems inevitable. Currently Bill's proposal makes the most sense.
Hubexec solves the 512/496 cog limitation. I am going to post a separate thread to illustrate its benefits.
12 to 18 are ideas up for discussion. WAIT n perhaps should have the most priority as it saves power. Basically it is a NOP n, or a WAITCNT without the overhead, used for short delays leaving WAITCNT available.
Chip is already talking about adding some parts of 14.
Let's see if there is some middle ground that will satisfy the P2 crowd in the short-term?
Edit: But still gravy. Give me 32 cogs if it's easy. If it starts making things complicated, 16 will do.
Yes.
Imagine fsck running off only mooch cycles, or a garbage collector. There is no real cost to allowing mooch.
If a cog is not started, it is not toggling flops at all, so only using quiescent current.
With any running cog, drivers will mostly be waiting on time or pin, so only using quiescent current.
Cogs waiting for hub cycle access (for cogs allocated only a few slices) will also be only using quiescent current.
I think I am having a blond moment. I do not see how the above would save any additional power; I see nothing wrong with WAITCNT with this number of cogs available that only "sip" hub bandwidth. If they are in a WAITCNT, and don't use 1/128 slots, it can be mooched.
I really, really would love to understand how your power resource bit would allow saving more power. I am NOT being sarcastic.
I love learning new stuff!!!!
Okay, I am happy to vote for 1-9. Items 7, 8 and 9 are technical details that I'm sure Chip can sort out.
However, I'd still like to know where 7 & 8 crept in, because I must have missed that part of the discussion. Can you provide a link?
Ross.
These ?
7. P2 simple boot and monitor
8. P2 style boot from SPI Flash (not I2C Eeprom)
The larger memory pushes the move from i2c to SPI, and Chip has indicated SPI already, so that is 8.
The way ROM is managed in this flow, pretty much dictates 7. ( ROM is stuck RAM, so needs to be kept small )
?
That is not quite my understanding. COGS can interleave a lot of other stuff between those HUB slots, surely ?
What about a COG with no HUB slots ?
I think we have different use profiles, you have somewhat combined HUB and COG Clock enable.
I am treating those separately, so a user can give a COG 1 HUB cycle and 16 other clocks for example.
(whatever it needs to just keep up with feeding the Hub )
In my case, both Not-Hub and not-Clocked (overlaps hub anyway) are needed to draw power.
Having to use a lot of WAITCNTs to manage power envelopes drops power management into running code, which is possible, but more clumsy, and rather works against OBEX libraries.
A part this small, ( 8 COGS, 180nm) might even also fit into P1 bonding ?
I wonder if OnSemi have a regulator macro, to save an external regulator - but that would increase on chip heating.
P1 COG Sim and Sim Power Numbers would be needed to know what MHz is possible.
Yes, those.
On 7 - The P16X32B is a P1 variant, not a P2. I'd expect it to boot into Spin the same way the P1 does.
On 8 - I2C EEPROMs up to 512Kb are available. You don't need flash, and in some cases wouldn't want it. Flash is much less reliable and more difficult to program - and I suspect it is more expensive.
I just don't believe these are a "done deal", but I can't be sure, since I appear to have missed all the discussions on these issues.
Ross.
IMHO, this is a near freebie and a very nice way to interact with the chip compared to what P1 currently does. Not a done deal, but compelling!
Should be discussed.
I don't disagree - but it is merely a "nice to have", and should not really form part of the "minimum requirement".
Would you be prepared to forego the P16X32B because of the lack of this?
Ross.
Those simple little cogs will not burn much power at all as they will mostly be waiting for something to do. And it is the props philosophy in a nutshell.
Then simple hubexec rounds this out to handle the big and faster programs.
Do you want P1 instruction compatibility or P2S (the P1 subset using P2 opcodes) ?
Chip worked out we could have 32 P1 2clock cogs and 512KB hub.
Let us scale that back slightly to say 24+ cogs so we have a little room for the hub slot mechanism and the simple hubexec mode.
If we kludge in the P1 ROM, does it plop down right in the new chip memory space? Or, is it at the top of RAM? Bottom?
Does booting into SPIN hose up code protect? Yeah, I'll walk on that most likely. It's important.
Like I said, it should be discussed.
Given the modest cost, having a baseline interactivity isn't an unreasonable minimum, and we need to do code protect, fuses, crypto anyway, and we know how to do it all already.
Honestly, if it's too much of a dog, I might walk on that too.
Those interleaved instructions only execute after a request has been filled, until the next hub cycle.
Example:
After the first RDLONG, the above code is locked to a P1 hub cycle, and a complete loop will take 32 clock cycles (plus extra sync time for the first RDLONG)
Remove all the add's and sub's, leaving only the two RDLONG's and the jmp to loop, it will execute in exactly the same time as above .
While waiting for the next access hub slot, it is my understanding that the pipeline stalls, and cog power usage plummets. My understanding is that at the low level, this pipeline stall is the same as those used by WAITPEQ/WAITPNE/WAITCNT.
Add the fourth add, and now the loop will take 48 clock cycles per iteration. 'RDLONG dummy' will try to execute immediately after 'add d' and will stall the pipeline while waiting for the next hub window.
My understanding is that stalling the pipeline reduces power utilization to quiescent current, thus the power savings for WAITPEQ/WAITPNE/WAITCNT.
Hope that helps, and hope someone corrects me if I misunderstood any of that.
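A small model of that timing, as a sketch only. It assumes the usual P1 figures as I understand them (a cog gets a hub window every 16 clocks, a hub instruction takes 8 clocks when already in sync and up to 23 otherwise, ADD/SUB/JMP take 4), and the loop bodies are hypothetical ones that match the description rather than the exact listing. It counts stalled clocks but says nothing about how much power a stalled cog actually draws.

```python
HUB_PERIOD = 16   # clocks between one cog's hub windows on the P1
HUB_MIN = 8       # hub instruction length when already in sync with its window
ALU_CLOCKS = 4    # ordinary instruction such as ADD, SUB or JMP

def simulate(loop, iterations=4, phase=HUB_MIN):
    """Return (clocks, stalled_clocks) for the last iteration of the loop.

    loop  -- sequence of "hub" (e.g. RDLONG) and "alu" (ADD/SUB/JMP) entries
    phase -- clock, modulo 16, at which hub instructions complete (arbitrary)
    """
    t = 0
    result = (0, 0)
    for _ in range(iterations):
        start, stalled = t, 0
        for op in loop:
            if op == "hub":
                # wait for the first completion point at least 8 clocks away
                wait = (phase - HUB_MIN - t) % HUB_PERIOD
                stalled += wait
                t += HUB_MIN + wait
            else:
                t += ALU_CLOCKS
        result = (t - start, stalled)
    return result

# two RDLONGs with three add/sub instructions plus the jmp
with_adds = ["hub", "alu", "alu", "hub", "alu", "alu"]
# only the two RDLONGs and the jmp
bare = ["hub", "hub", "alu"]
# one more add than fits between hub windows
one_too_many = ["hub", "alu", "alu", "hub", "alu", "alu", "alu"]

for name, loop in [("with adds", with_adds), ("bare", bare),
                   ("fourth add", one_too_many)]:
    clocks, stalled = simulate(loop)
    print(f"{name}: {clocks} clocks per loop, {stalled} of them stalled")
```

In this model the first two loops both take 32 clocks per iteration (the bare one simply spends 12 of them stalled), while the extra add pushes the loop out to 48.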
Most of my crazy drivers rely on precise pipeline synch and waitcnt computed to single clock cycle resolution; that is how I was able to burst read external SRAM on Morpheus at 20MB/sec (minus extremely small setup delta, way less than 1%) and do 1024x768x2bpp XGA graphics on Morpheus. And pull off other crazy tricks.
I think that due to my understanding the P1, I've always assumed that any hub access stalls the pipeline (thus only drawing quiescent current) until the hub cycle arrives, un-stalling the pipeline.
Based on your message, I think your understanding was that while waiting for the next hub cycle, the cog runs at full power, and only drops to quiescent current on WAITPEQ/WAITPNE/WAITCNT.
Based on my understanding, my proposal went to 32 cogs and a flexible hub cycle array, as when a simple driver (kb/mouse/ser) was assigned 1/128 hub slots, it would save just as much power as if its clock was turned off.
I now see that (if I am correct) your understanding was that while waiting for the next hub cycle, the cog could continue to burn power, flipping lots of flops, instead of the pipeline stall freezing it.
I think your understanding would be correct if the cog was running interleaved RDxxxx and WAITVID, but if it was doing waitvid to stream data out, it needs the clocks.
If my understanding is correct, then gating the clock would not save any power, as the stalled pipeline would have the same effect, thus any cogs that are stopped (or never started), or whose slots were all taken away (and not mooching) should only consume quiescent current.
At this point, I am REALLY interested in how much current a cog uses while waiting for a hub cycle!
If it is noticeably more than quiescent current, then we should look for some clock enable; however, given that the hub slot assignment array is indexed by say cnt&0x1f, I think we would need a separate 32 entry array, indexed by cogid, that said "ok, give my cogid a clock enable every X clkfreq cycles".
However if my understanding is correct, I don't see any benefit to this clock gating.
EXCELLENT DISCUSSION!
I've been thinking about the cool COGRUN options we worked out. It would be really nice to get the ability to start a COG without loading it, and/or at a given address with or without loading it. Seems this one is cheap, and given the mapping scheme, might allow for a couple of high throughput COGS, able to jump on a task quickly, then go away when done, saving power.
Edit: I see the waiting may not consume power at all. (catching up)