Radical P2 design changes - discussion only
Cluso99
Posts: 18,069
Multiple threads and pre-emptive multitasking seem to be chewing up far too much silicon, on top of the complexity they bring.
Maybe we can find a simple alternative, maybe we cannot. But it is worth a separate discussion.
Die sizes were (before removing some of the DAC bus and increasing hub size)
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223695&viewfull=1#post1223695
DAC bus = 10.33 square mm - this is being removed, freeing up the area
hub RAM = 1.76 square mm (x4)
cog RAM = 0.62 square mm (x8)
aux RAM = 0.20 square mm (x8)
core = 14.71 square mm
So a cog was basically..
0.62 + 0.20 + (14.71 / 8) = 0.82 + 1.84 = 2.66 mm2
Then hub was doubled from 1.76 x 4 to 1.76 x 8 = 14.08 mm2
DAC bus saved 10.33 less new hub 1.76 x 4 = 10.33 - 7.04 = 3.29 mm2 free for cog space
Thus we have 3.29 / 8 = 0.41 mm2 extra per cog (now 2.66 + 0.41 = 3.07 mm2)
So we have a total of 3.07 x 8 = 24.56 mm2 for the 8 cogs.
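The area budget above can be re-derived with a short script. This is only a sketch of the arithmetic in the post: the constant and function names are made up for illustration, and it assumes, as the post does, that the 14.71 mm2 core splits evenly across the 8 cogs.

```python
# Areas in square mm, as quoted in the post above.
DAC_BUS = 10.33        # being removed
HUB_RAM_BLOCK = 1.76   # one hub RAM block (was x4, now x8)
COG_RAM = 0.62         # per cog
AUX_RAM = 0.20         # per cog
CORE = 14.71           # assumed to split evenly across the cogs
COGS = 8


def cog_area_before():
    """Per-cog area before the DAC bus / hub RAM changes."""
    return COG_RAM + AUX_RAM + CORE / COGS


def freed_area():
    """Area left over after removing the DAC bus and adding 4 hub RAM blocks."""
    return DAC_BUS - 4 * HUB_RAM_BLOCK


def cog_area_after():
    """Per-cog area if the freed area is shared equally by all cogs."""
    return cog_area_before() + freed_area() / COGS


if __name__ == "__main__":
    print(f"cog before : {cog_area_before():.2f} mm2")
    print(f"freed area : {freed_area():.2f} mm2")
    print(f"cog after  : {cog_area_after():.2f} mm2")
    print(f"all 8 cogs : {cog_area_after() * COGS:.2f} mm2")
```

Changing COGS to 16 in this sketch shows how little area each cog would get if the same budget were split sixteen ways, which is the trade-off the rest of the thread argues about.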
Now the question becomes, could we better utilise the space being consumed by the multitasking ?
See subsequent posts for some possibilities...
Comments
I'm lost. You now want to remove the multitasking Chip has designed into the P2!?
Nope. The space is well utilized as it is.
Tasking gives us up to 32 baby-cogs, even without the threading being put in right now. The hub/cog/aux memory architecture serves us well as it largely bypasses the Von Neumann bottleneck.
It was a huge waste to tie up a whole cog for simpler drivers, and for drivers, cooperative multi-tasking is not really appropriate (except for low speed devices) - ie not suitable for USB, video, or other high speed usage cases.
The new threading will give us a whole passel of threads for the higher level software threads (pthreads) for usages like waiting for incoming sockets / signals. Mind you, some instruction sequences will need to be guarded by TLOCK/TFREE, but that is par for the course (it avoids lots of extra logic and potential development delays).
I agreed with your other post - threading is simple and sufficiently defined for now, time for SERDES & USB
I assume Cluso99 will be starting threads for those as well.
C.W.
It would have to be replaced with something so brilliant to deaden the pain!
Could we expand this further?
What if we added some special "trimmed" cogs that...
- Cog ram had 64 longs (8 wides)
- No Aux ram
- No video
- Perhaps no Cordic and no multiply/divide, macs, etc
These cogs could be used as the "multi-tasking" cogs. They would be configured at run-time to be shared with a normal cog. This would mean we would have "real" cogs (with "limitations") to do multitasking, which would operate in parallel. They would not share clocks, and they would have their own stacks/pointers/etc. No need to share, no need to have special versions of waitcnt, waitpxx, etc.
The cog ram could be loaded in 1 cycle using wides (ie build the cog with 256bit access).
Might it be worthwhile to have a couple of cogs that only displayed video ?
Maybe there are other cog builds that would benefit ?
Maybe we could have 16+ cogs. With HUBEXEC mode, and a small wide cache (which could be a part of cog ram), the requirement for large cog and aux ram goes away to some extent. What would be the best compromise? Do we just need a deep FIFO for the stack? Do we now only need one stack?
How much silicon would we save by not having 4 sets of INDa/b, PTRa/b, PTRx/y, sharing the Instruction Caches, Data Caches, Bank switching (lower parts of cog ram), Task switching instructions, and other parts, all supporting multi-tasking?
Have we reached a point where we could totally do away with cog mode (only use hubexec with say 8 wide instruction cache and 2 wide data cache) ???
What if the cog ram were built as...
- 8 wide instruction cache
- 8 wide register space
- ie a total of 16 wide cog ram = 16 * 8 longs = 128 longs
- maybe 8 wide stack(s) making a total of 24 * 8 = 192 longs
What if we had 16 of the above cogs???
With wide loads, could we now use one of these cogs to do video using just the 8 wide register space in cog, instead of requiring aux memory (another big simplification)??? Just use the cog for outputting video. Use another cog to generate the bitmaps etc? Am I missing something regarding the CLUT interface???
Could we have only 4 video generation blocks, where any cog could claim one of them?? Would this overcome the DAC bus sharing?
You will note in all of this, I am presuming we now re-build the cog/cache blocks as 256bit wide blocks so that a rd/wrwide instruction now takes 1 hub clock and no longer requires 8 clocks to load/store the cache to cog. So we now have WIDE cog transfers !!!
So does that mean that the only parts remaining in the custom layout are the cog ram, aux ram and hub ram blocks?
May I again suggest making the cog and aux rams wides, perhaps using standard cells??? What about the hub ram???
The other thing is, we shouldn't really assume anything with regard to size, complexity, importance, or Chip's time. Parallax are best qualified to estimate those. Some things that appear complex may boil down to just a few verilog lines and relatively few flops. The "trace" addition illustrated this nicely.
It will be good to get stuck into USB but understand the need to wrap complex things up before moving on. If Chip put out a DE0 release tomorrow which included those two instruction requests, (and other features deleted as necessary to fit), how long would you estimate before we would know if USB was workable, Cluso?
regards
Lachlan
The DAC pin decision really doesn't change that, and in many cases it doesn't change at all, as all the DACS can be driven by all the COGS anyway. The difference is in whether or not they can be driven automatically, and that has pin group limits. None of us really liked that very much, but it allowed Chip some considerable freedom and it relaxed some critical path timing, some of which made other innovations possible. We did realize there will be a COG start order in some cases now.
In all other ways, the COGS are equal, and maintaining that is really important for reuse and scope of applicability of the chip.
Frankly, I think a bunch of assorted COGS would be a complete mess. I would not be inclined to participate further.
Not quite. Any cog can access any dac, but there is an "affinity" that a cog can only use one of three sets of four "nearby" dacs for video.
STRONGLY DISAGREE FOR P2
We have been feeling the limits of cog memory since early P1 days.
AUX ram allows color palettes, very fast stack, FIFO's, fast decodes, FIR's and much more.
I could almost see only having say four video capable cogs, except the video/dacs has so many more potential uses than video.
I like the cogs being identical.
For P3, we could think of asymmetric cogs, but I doubt the transistor savings justify the headache. For the idea to make sense to me, we'd need "large" cogs, something like the current ones - say eight of them - and a whole passel of "baby" cogs, dedicated for small drivers. The catch: they'd still need about 256 longs each.
Except we could not fit 8 full cogs and 32 baby cogs. And a 200Mhz baby cog would be a waste of silicon, cycles, and power for most drivers.
In my opinion, this idea is a non-starter.
No it would not. A 64-long cog would need 8 hub cycles, which each take 8 clock cycles.
And 64 longs is far too little memory space even for a baby cog.
For P3, it may be worth reducing video-capable cogs to say 4.
But we'd keep the AUX anyway, too many great uses for it - including local fast 1 cycle stack, 1 cycle lookup etc.
The cache cannot be part of the cog ram, the implementation is too different.
AUX is needed for color lookup, fast stacks, other lookup etc - cannot go away without losing a lot of features & performance.
Two stacks is good , or one stack and FIFO. Flexibility is good.
Also, even though hubexec is great, it cannot be as deterministic as cog only mode. cog mode is here to stay, for the highest performance drivers.
Not much. Best guess? 5% of the cogs transistors?
DEFINITELY NOT!
We would lose guaranteed determinism.
Hubexec is FANTASTIC replacement for LMM. It is NOT a replacement for cog mode.
16 of those crippled cogs would take about the same amount of space as 12 of our current cogs. And complicate everything, and lose symmetry.
Yes, you are.
Different memory bus. Video access to it would block cog access for at least one port. Two busses are much better than one for this. We would also lose CLUT modes.
Not quite.
We could have just four video capable cogs, pinned to specific pins.
This would save some transistors, however AUX is still useful for lookups, FIR, stacks etc so we can't toss it.
For P3, switching to a 256 bit wide memory architecture for cog & aux is worth considering.
I am not sure it will work well for cog memory due to the additional MUXes required.
BUT
Even if the switch was made, it would still take 8 clock cycles to transfer one WIDE between the cog/aux and the hub - NOT 1 clock cycle.
For 1 cycle hub access, the hub would have to be eight-ported, which would reduce the hub size by a factor of 8.
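The timing argument above (one wide per hub window, and a window only every 8 clocks) can be put into numbers. This is a rough sketch, not a description of the actual P2 logic: it assumes a strict round-robin hub with one window per cog per rotation, and it ignores the initial wait for the first window. The function names are invented for illustration.

```python
LONGS_PER_WIDE = 8  # a wide is 8 longs (256 bits)


def wides_needed(longs):
    """Number of wide transfers needed to move `longs` longs (rounded up)."""
    return -(-longs // LONGS_PER_WIDE)


def load_clocks(longs, hub_period=8):
    """Clocks to move `longs` longs at one wide per hub window.

    hub_period is the number of clocks between a cog's hub windows:
    8 for the eight-cog round-robin discussed above, 16 if there
    were sixteen cogs sharing the same hub.
    """
    return wides_needed(longs) * hub_period


if __name__ == "__main__":
    for longs in (64, 256, 512):
        print(f"{longs:3d} longs: {load_clocks(longs):4d} clocks at 8 cogs, "
              f"{load_clocks(longs, hub_period=16):4d} clocks at 16 cogs")
```

Under these assumptions a 64-long cog takes 8 hub windows of 8 clocks each, which matches the figure quoted earlier in the thread, rather than loading in a single clock.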
Sorry, did not mean to sound harsh, but your suggestions would cripple the prop.
Now for P3...
If there was a process shrink, and we could have 4x the hub size, we could consider trading it off for 2x hub size, with 4 cycle access by dual porting the hub. But that is a few years away.
BTW I am not familiar with any of the upper levels of USB. However, I do have a comms design background from the 70s and 80s.
+1
Keep multi-tasking. It's a brilliant feature add for the P2.
Implement simple multi-thread support to let people play with it on the P2
SERDES and USB? YES!
Anything beyond this? P3 and OpenFPGA.
Let's get something finalized that is simple and elegant and symmetrical like the P2 should be!
More headroom is always good, and less risk is also good...
Is this 'very close to working' code, re-syncing to the edges ?
Is the 80MHz locked into the bitstream, or can you nudge that up enough, to overclock to fit in the 1 instruction ?
It seems to be hard enough getting the current PII design in focus and off the drawing board without running into the weeds with "radical ideas".
My Radical idea is this:
The P3 will be a 64 bit machine. It will be optimized as a JavaScript engine.
To that end it will have a hardware implementation of a garbage collector that runs in parallel with the main JS engine.
I'm sure there are a lot of operations a JavaScript engine does frequently that could be turbo charged if done in dedicated hardware: hash table lookup for object properties and functions, for example, array bounds checking, and so on.
Of course for those drivers and other real time speedy bits one will be able to access PASM code in COGs from JS seamlessly.
With the help of someone like Gordon Williams who has made JS for micro-controllers this could really fly.
There, that's my "radical idea"
P.S Any talk of asymmetrical COGS and all the chaos that would cause has to be nipped in the bud immediately.
Rather than taking each point, here are some misconceptions...
If the cog was built in wides, then only 1 clock would be needed for the r/w plus perhaps a setup. The other 7 clocks would be usable for code - a big performance boost.
If the cog was used as the instruction cache, because the cogs are in wide blocks, a cog cache could be loaded in a single clock.
As far as I understand, the CLUT/AUX is no different to the cog in that the read port is just that, a read port. If the Cog ram were used, then one of the read ports formerly used for say the "I" read port would now be used as the clut output port. Just a simple 32bit mux.
By configuring the cog as blocks of wides, the cog can simply be reconfigured as blocks of cog ram (registers), cog ram (instructions & registers), instruction cache (instructions), aux (clut), aux (stacks).
I respect your comments that P1 was always short of cog space; that is why I requested (as did you and others) that the new aux space be usable as register and stack space when not used for video. However, a lot of those requirements can now be solved with hubexec. When Chip asked if the cog space could be lowered, we all said "NO". We haven't even tested hubexec yet, but I do know that the cog space is no longer the big issue it was.
With Chip's latest post, I think we could have the old (pre November) simple multi-tasking mode, and say 256 longs of cog ram (wide access) where blocks of the cog could be reconfigured as ICACHE, STACK, AUX. Without all the new multitasking instructions and banks of registers per task, I believe we could have 16 new-standard cogs. With the addition of 1 extra block, we could configure 16 hub cycles to those cogs in whatever fashion we like.
If 4 of these cogs shared a hub cycle, we would have almost 400% improvement over multitasking 4 threads. ie 4 x 160/200MHz rather than 1/4 * 160/200MHz. What would you rather???
What would be needed to have 16 cogs rather than 8
- A few simple Verilog changes to make 16 copies rather than 8
- A few instruction changes (4 bits rather than 3)
Now we can simplify the code by removing the multiple copies of pointers etc. (ie back to the base multitasking that was prior to November). But use the new instruction set and features.
What is expensive and/or time consuming...
- Making the cog ram 256bit wide
- Do we get the advantage of standard cells out of this???
We need to add some possible reconfigurable hub access method.
If we get standard cells out of cog ram redesign, can we do the same for hub ram???
Would this then mean that all the P2 is standard Verilog code??
If yes, does this also mean we could shoot for a slightly smaller geometry ???
Wouldn't you rather an extra 8 real cogs at full speed, than full-blown threads at 25% speed that always will have caveats???
That's a different CPU. No thanks. Maybe it's not a bad CPU. Maybe it's a better one. But I'm not going to stick around for another 4 years to find out, unless I've got the one we've spent how many years working up to in the meantime.
At some point, we really do need to realize the P2 is the P2. And if we finish it, we might all have the pleasure of enjoying the "what is a P3 really going to be?" discussion.
If we don't finish it, then we are in a "nobody can agree on what the P2 CPU EVEN IS" discussion, which I have no desire to be a part of.
Seriously. People want to make some stuff, not continue to plan for the best thing ever next year, while not making stuff.
Chip already has added a load mode that can load 1 long from a wide per clock (ie 8 longs from a wide, hitting hub cycles)
I do like the cog memory being organized as wides, however that's too big a change for now, and I am concerned about the extra multiplexers. Chip posted earlier that the reason we can't have more icache lines is that the multiplexers are already the critical time path...
But cog memory cells cannot be used as an instruction cache without MAJOR re-working of the cog's guts.
Cog memory longs (or even wides) do not have the needed tag bits, counters etc either, and while it may be possible to have a small tag/counter block "on the side" for a small number of lines, it is unrealistic for every cog long/wide.
Sorry, that is incorrect.
As I understand it, cog ram has three read ports, and one write port (instruction read, source read, dest read, dest write), and uses all those ports.
NONE available for video. Even if re-organized for wides, it does not give us extra ports, so for a video read port it would have to go to five-ported memory, adding a transistor to every single cog memory bit (25% increase in cog memory transistors). If we don't want to lose ports for simultaneous transfer as well, it would have to be six-ported (50% increase).
Calculations based on my memory of how many different transistors cog/aux/hub take:
4 port: 512 longs * 32 bits * 4 transistors = 65536 transistors, five port: 81920 transistors, six port: 98304 transistors
CLUT/AUX ram is dual ported, one r/w port for the cog, and one r/w port for either video or sdram XFER. Also used as stack, lookups, fifo, FIR's etc.
2 port: 256 longs * 32 bits * 2 transistors = 16384 transistors
Therefore:
eliminating AUX, but having six port cog memory actually increases the transistor count!!! BAD
eliminating AUX, but having five port cog memory is the SAME transistor count as separate AUX, but cripples CLUT and XFER!!! VERY BAD
eliminating AUX, leaving four port cog memory saves 16K transistors at the cost of video, clut, fifo, lifo, fir, xfer capabilities!!! UNACCEPTABLE
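The three cases above follow from the poster's rule of thumb of one transistor per port per bit. A minimal sketch of that comparison, using only the figures quoted in the post (512-long cog RAM, 256-long AUX RAM); the function and variable names are hypothetical, and real RAM cells will not scale this simply.

```python
LONG_BITS = 32
COG_LONGS = 512   # cog RAM size used in the post
AUX_LONGS = 256   # AUX/CLUT RAM size used in the post


def ram_transistors(longs, ports):
    """Rule of thumb from the post: one transistor per port per bit."""
    return longs * LONG_BITS * ports


# Current design: four-port cog RAM plus separate two-port AUX RAM.
current = ram_transistors(COG_LONGS, 4) + ram_transistors(AUX_LONGS, 2)

if __name__ == "__main__":
    print(f"current (4-port cog + 2-port aux): {current}")
    for ports in (4, 5, 6):
        merged = ram_transistors(COG_LONGS, ports)
        print(f"{ports}-port cog RAM, no AUX: {merged} "
              f"({merged - current:+d} vs current)")
```

On these numbers only the four-port option saves anything, and it does so by giving up the second memory that video, CLUT and XFER rely on.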
No, it cannot. See above.
Totally disagree. I can fit four good drivers into one cog.
Hubexec is possibly the biggest win with P2, but it cannot be as deterministic as cog mode, not for very high speed drivers. It cannot replace cog mode.
I disagree on technical grounds. I believe you misunderstand the memory architecture: it cannot be shared the way you would like.
And we definitely could not fit 16 cogs with enough registers, and not having aux would totally cripple them.
These cogs CANNOT share a hub cycle.
They would also be very wasted on slower tasking - hardware tasks make much better use of the hardware. And we have 32 of them.
Ray,
I strongly recommend you look more into the cog/aux transistor requirements, and the requirements for four ports for cog memory and two for aux, and the overlapped execution of cog access to aux along with video or xfer. Your proposed benefits are simply not possible without sacrificing a lot of capability, and even then, they are not really benefits.
Now for P3, if you want to discuss asymmetric cogs, with some "full" cogs, and lots of "baby" cogs, that is a different discussion, and as long as all cogs support hardware tasks, and they are not crippled for a lack of memory, then we have something really worth discussing
Let's face it, even a "baby" cog at 200Mhz would be wasted if it did not have tasks for the majority of drivers.
I am discussing rationally, and making technical arguments backed up with actual calculations, that you are ignoring.
ANY DAY I'LL TAKE 8 FULL COGS / POSSIBLE 32 TASKS OVER 16 CRIPPLED COGS!!!
And I'll be able to make much more interesting products with 8/32 vs 16 crippled.
I'LL TAKE THE 8 CURRENT COGS / 32 TASKS VS. 16 CRIPPLED COGS!!!!
It seems like a decade ago that Chip opened a thread by asking would we like more COGs or more RAM seeing as he had a lot more silicon available.
If I remember the overwhelming response was for more RAM.
Without doing any calculations, I'm convinced that doubling COGS would, obviously, halve HUB bandwidth and drag the PII peak MIPS down for anything that is bigger code than fits in a COG or needs to access HUB for data.
You can't fight Amdahl's law. Adding more processors has diminishing returns. I don't know where the sweet spot is for a shared RAM system like the Prop but I suspect 8 cores is about there.
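Both points in this post can be illustrated with a few lines. This is a sketch only: it assumes a round-robin hub where each cog gets an equal share of hub windows, and the 90% parallel fraction is a made-up figure for illustration, not a measurement of any Propeller workload.

```python
def hub_share(cogs):
    """Fraction of hub windows each cog gets under equal round-robin sharing."""
    return 1.0 / cogs


def amdahl_speedup(parallel_fraction, processors):
    """Amdahl's law: speedup when only part of the work can run in parallel."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / processors)


if __name__ == "__main__":
    p = 0.9  # hypothetical parallel fraction, chosen only for illustration
    for n in (8, 16):
        print(f"{n:2d} cogs: hub share {hub_share(n):.4f} per cog, "
              f"Amdahl speedup {amdahl_speedup(p, n):.2f}x")
```

With these assumptions, going from 8 to 16 cogs halves each cog's hub share while the Amdahl speedup grows by well under 2x.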
Hmm, 1:8 might be a bit much to overclock.
IIRC Chip said one FPGA build had more room above the nom 80MHz than the other ?
You are missing the precise point of being able to reuse blocks of cog ram for different purposes. If you use it as aux ram, then two ports don't get used, you don't require a 5th port. As for caching, the extra tags would be just that, extra tags which would be linked to a cog block. So if you didn't use icache, only the tags would be unused, not the whole set of 4-8 * wides.
In other words, you can reconfigure the blocks of cog ram to be what you like. It makes better use of the transistors without adding lots of blocks that may or may not get used.
It doesn't prevent cogs doing simple 4 thread drivers. That is the same as it was prior to November (where ozpropdev showed us what could be done). Now of course, without the aux he would probably need 2 cogs, but if you have 16, then whats the difference, other than a lot more processing power. Now, the waitcnt/waitpxx/etc can truly power down the cog and save power. But when you want the full processing power, you now have 16 cogs.
BTW It is way easier to write drivers for singleton cogs than multithreaded cogs.
And what really concerns me is the problems with the latest round of multitasking additions...
- What bugs are lurking???
- Everyone was so opposed to slot sharing because it could wreck determinism. Now we have totally destroyed any possible determinism for multiple threads.
- Wasted silicon: no one is actually presenting a real-life use for this - it's all theoretical.
- And it's taking weeks!!!
- And many are still arguing for more things to be added to make each task self contained!!!
What I am suggesting is simple, doable, and allows us to get on with USB and SERDES. In fact, I would rather get onto USB and SERDES now, and consider these options later versus any more multitasking changes.
You are the one missing the point. Those blocks CANNOT be reconfigured the way you want without having six memory ports. Even then, caching is an issue.
Linking tag blocks to a wide is possible, but there would be a limit on tag blocks.
To repeat:
Even if organized in wides, you still only have 4 ports. You CANNOT carve off some for AUX, etc. the way you want.
Actually there is a way to do it, but it would require a LOT of muxes, increase critical path lengths etc - have at least two 256 bit wide busses, chop memory up into say 256 long blocks, and "assign" (multiplex) them to cog or aux or cache. It would be slower, more complicated, and take more transistors.
But that is slower and takes more logic than keeping them separate. Don't forget, each use case has separate addresses that change at the same time, on the same clock - what you propose is putting back the Von Neumann bottleneck.
My friend, you fell in love with the idea of a single type of memory, and are trying to fit a square peg into a round hole.
And as Heater points out... You would cut the hub bandwidth in half (16 cycle hub access)
And waste a lot of power (due to getting rid of multi-tasking)
Sorry, it is a very bad idea.
FYI, multi-tasking has been working for months - just ask ozpropdev
And cooperative multi-tasking is NOT useful for something like propinvaders. While maybe not impossible at low resolution, it would be insanely bad to write cooperatively.
I'll respond to your bullets tomorrow, I'm off to bed.
This is the crux of the 'many COGS' problem: opcode decode and execute, and mathlib support, come at a very high silicon cost.
Threads were a clever way to mitigate that cost, by allowing users to fill up the time and memory much more efficiently.
Threads do not have to run at 25%, there are EIGHT COGS, if you have something that is really thread-averse, give that its own COG to play in. You still have 7 left, with 28 threads.
Those ram blocks, the 3.3V analog I/O pins, the PLLs, the fuses, and a reset/clock control block would be the only full-custom circuits.
No time to rearrange width on the RAMs.
Strawman/FUD. It is easier to write drivers for tasks than cooperative. No one should use threads for real time drivers.
Good question, We will find out. PropInvaders shows tasking works. Instructions may have bugs.
Factually Incorrect.
1) Not everyone was opposed
2) Not true for tasks
3) Determinism was never there for threads (not even for hubexec at a very fine grain)
Factually Incorrect.
1) Silicon not wasted, provides fantastic capabilities
2) many threads waiting for slow events (see real life select(), tons of code)
3) for tasks - easily write and pack four drivers into one cog.
That's how development goes. Most of the "weeks" is usually arguments against features.
Factually incorrect.
Most are quite happy with TLOCK/TFREE, some would prefer theoretically "nicer" solutions.
Factually incorrect.
Not simple, not easy, would not work the way you want/think, we'd lose a ton of video/clut capabilities.
Yes, leave the cogs as currently defined (barring minor tweaks whose need comes up during fpga testing) and get on with SERDES/USB.
Finally, we agree on something.