Assigning unused cog's hub slots to a "supercog"
Peter Jakacki
Posts: 10,193
I've experimented with higher speeds and fewer cogs, and now that I've hooked up an SD card and a W5500 for Ethernet I'm looking at optimizing the hub access priority, which is normally fixed. As soon as I finish converting the W5200 driver to suit the W5500 I intend to implement a "supercog" which uses up all the spare hub slots. If for instance I have Tachyon running in cog 0, serial receive in 1, and background functions in 2, then 3,4,5,6,7 are not being used and can essentially give their hub slots to cog 0. This doesn't change how the other cogs operate, as they will still get their 1 in 8 as they do now, but a supercog could easily speed up Spin too. Any thoughts?
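To make the slot arithmetic concrete, here is a small sketch (in Python rather than Verilog, and with invented names - the real change would live in the hub logic) counting how many hub slots each cog sees per rotation under the donation scheme described above:

```python
# Hypothetical sketch (not the actual hub change): count per-cog hub slots
# per full 8-slot rotation when slots owned by unstarted cogs are donated
# to a designated "supercog" instead of going unused.

def hub_slots(active_cogs, supercog=0, rotations=1):
    """Return a dict of cog -> hub slots granted over N full 8-slot
    rotations. A slot whose owner is inactive goes to the supercog."""
    grants = {}
    for _ in range(rotations):
        for owner in range(8):
            cog = owner if owner in active_cogs else supercog
            grants[cog] = grants.get(cog, 0) + 1
    return grants

# Tachyon in cog 0, serial RX in cog 1, background in cog 2 (as in the post):
print(hub_slots({0, 1, 2}))   # {0: 6, 1: 1, 2: 1} - cog 0 gets the 5 spares
```

With all eight cogs started, every cog gets exactly one slot per rotation, i.e. the stock behaviour.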
Comments
This doesn't just apply to hub slots, although this was possibly the most controversial topic discussed for the P2.
I am looking forward to seeing the results!
However, that depends on the implementation of the P1V (or the P2) and on the clock frequency - say a 200MHz/100MIPS variant, in a vacuum, will definitely not be outperformed by a $1 ARM, and there are the rest of the cores...
My basic stance is leaving performance on the table is silly, and limits what the chip could otherwise do.
Seems to me that if you need that kind of raw compute performance a Prop is not the way to get it.
A Prop has talents in a different area.
This debate has been going on for ages, notably during the old P2 fiasco. I guess we should let it lie.
Will a single P2 core outrun a $8 ARM for say Dhrystone or similar? Not likely.
Would it be better for precise timing? Indubitably.
Would 16 100MIPS P2 cores - in total - outperform an $8 168MHz ARM on a multi-threaded integer compute benchmark? Yes.
Would the same 16 P2 cores win a floating point benchmark against an $8 M4 with FPU? NOT likely.
Does the above mean we should not go for the maximum single core Px performance we can get? NO WAY.
Basically, if I need to slap an ARM/MIPS/QUARK/whatever on the side, I will.
However, if I can get a Px core fast enough that I don't need to slap another processor on for more applications, that's a big win (and a lower BOM cost), which is why I try to maximize performance.
Yes, we are well aware that the purists don't approve of what the heretics are doing. How about we let Peter do his experiments without the threat of being excommunicated or burned at the stake.
C.W.
This may mean the difference between getting a design win and not, because otherwise the main program has to be done in an external processor and the BOM cost blows the design out of the water. Other cogs would be unharmed.
Why so much negativity if it was an option??? Don't answer! Let's just get on with the testing.
WTG Peter
Regarding the ARM core, I think it's much better to improve a COG in basic ways so that it has good performance on large code run from the HUB. We have LMM and we have HUBEX.
I would actually prefer LMM, as that presents us with a nice "supervisor" type capability, in that native PASM runs in the COG, and if it's possible to get say 80 percent of that speed running LMM somehow, it's a no-brainer. We get pseudo-ops, we get "interrupts", and we get illegal instruction trapping implemented how we want it implemented, because PASM is micro-code in that model. Your program ends up in a very lean, capable virtual machine, which is an advanced capability for a micro-controller.
And that keeps the COG von Neumann, allowing for the maximum tricks and a very highly functioning multi-processing capability. Whatever ends up in the HUB can be a superset of PASM, with some very specific things implemented as pseudo-ops. Raw performance would be higher with HUBEX, as we more or less saw on the "hot" edition design. But maybe LMM performance can hit that 80 percent number, or get close. If so, I would gladly trade a little speed for that sophistication.
That's the line of reasoning I'm taking on larger programs at present. Will be good to see what everybody does and understand the dynamics better.
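The LMM idea described above can be illustrated with a toy fetch-execute loop (in Python, not PASM, and with invented names): a small kernel resident in the cog fetches "instructions" from hub memory one at a time and dispatches them, so the hub-side code can be a superset of the native set, with pseudo-ops and illegal-instruction trapping handled by the kernel.

```python
# Toy illustration of the LMM model: a cog-resident kernel fetches hub
# "instructions" and executes them natively. Anything the kernel doesn't
# recognise is trapped, which is where pseudo-ops would hook in.

def lmm_run(hub_code, regs):
    pc = 0
    while pc < len(hub_code):
        op, *args = hub_code[pc]          # rdlong-style fetch from "hub"
        pc += 1                           # advance the virtual PC
        if op == "mov":
            regs[args[0]] = args[1]
        elif op == "add":
            regs[args[0]] += regs[args[1]]
        elif op == "jmp":
            pc = args[0]                  # branches handled by the kernel
        elif op == "ret":
            return regs
        else:
            raise ValueError(f"illegal instruction trapped: {op}")

print(lmm_run([("mov", "a", 2), ("mov", "b", 3),
               ("add", "a", "b"), ("ret",)], {}))   # {'a': 5, 'b': 3}
```

The real question in the thread - whether such a loop can reach ~80 percent of native speed - is of course about the PASM fetch overhead, which this sketch doesn't model.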
@Cluso: Indeed!
By now you have 2 instructions between hub instructions. Freeing up more slots for cog 0 will not really gain much - maybe the first one, but after more than one free slot?
What do you want to do with those slots? One rdlong after the other, without processing the data?
Maybe I have some gross misunderstanding here on my side, but I do not really see any speed improvement gained through more slots. Maybe with one more slot, OK, but more?
Please enlighten me!
Mike
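Mike's point can be checked with some back-of-envelope arithmetic, assuming stock P1 timing (a hub slot per cog every 16 clocks, ordinary instructions at 4 clocks, a synced hub op at roughly 8 clocks - all assumptions here, not measurements of any P1V variant):

```python
# Rough model of diminishing returns from extra hub slots: a cog doing
# back-to-back rdlongs is limited either by available slots per rotation
# or by how fast it can issue hub ops at all.

def rdlong_throughput(slots_per_rotation, rotation=16, hub_op_min=8):
    """Longs read per 16-clock rotation, assuming the cog can issue at
    most one hub op every hub_op_min clocks (best-case rdlong timing)."""
    issue_limit = rotation / hub_op_min
    return min(slots_per_rotation, issue_limit)

for s in (1, 2, 3, 4):
    print(s, rdlong_throughput(s))   # gains stop once the cog saturates
```

Under these assumptions a single cog saturates at two slots per rotation, which matches the "maybe the first one" intuition - the extra donated slots mainly help latency (less waiting for *your* slot to come around), not raw streaming bandwidth.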
This is what I like about this FPGA adventure. New (and some old) ideas can now be tested.
Although I've looked at other more complicated schemes, this one just seems sensible, even if for whatever illogical reason it affects some people's sensibilities. If for instance I only have a Spin cog running and all the other cogs aren't enabled, it just means that all those unwanted hub slots are now automatically directed to cog 0 and Spin flies. Now if we load up all the cogs, including cog 0, then that is no different to before, as each cog will only have a single slot. Making cog 0 the supercog is a no-brainer as it's normally the startup cog running Spin, and having it run faster is only a bonus. This implementation also does not require any configuration; all cogs are identical, even cog 0 - it's just the hub (dig.v) that's modified.
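The selection logic amounts to a one-line change in the hub grant mux. A rough model (in Python - the real change is in dig.v, in Verilog) of that decision, assuming the hub counter scans 0..7 exactly as before:

```python
# Model of the modified hub grant: the 3-bit hub counter scans 0..7 as in
# the stock design, but if the scheduled cog was never started, the grant
# falls through to cog 0. Running cogs keep their 1-in-8 exactly.

def hub_grant(slot, enabled):
    """slot: current hub counter position (0..7).
    enabled: set of cogs that have been started."""
    return slot if slot in enabled else 0

# Only cog 0 running: Spin gets every single slot.
print([hub_grant(s, {0}) for s in range(8)])          # [0, 0, 0, 0, 0, 0, 0, 0]
# All 8 cogs loaded: identical to the stock hub, one slot each.
print([hub_grant(s, set(range(8))) for s in range(8)])  # [0, 1, 2, ..., 7]
```

Note this needs the hub to know which cogs are enabled, which the cognew/cogstop logic already tracks, so no per-cog configuration is added.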
Sounds simple enough to trial, but could be tricky to pipeline so it has no speed impact.
Rather than a simple and fast scan, this now has to check each slot for 'Mine(N)' or 'Yield to M'
Other ideas for higher bandwidths :
Another method of giving more speed to some COGS, is to split HUB memory so instead of all memory being 1:8 time sliced, some blocks can be allocated to only 1 or 2 COGS.
This becomes more private rather than openly public, but is much faster.
A FPGA allows this to be quite easily re-defined, so is an ideal test platform.
Silicon has to be more generic, and that can be less optimal for any one design.
Chip has done something clever in the P2 so that it was 1:8 for 8 cogs. Perhaps there is something here that can be done to make it 1:8 too (ie a slot every clock instead of every 2 clocks)?
This alone would double the hub bandwidth.
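The doubling is simple arithmetic; here it is spelled out, assuming an 80MHz system clock and 32-bit hub transfers (both assumptions for illustration, not measurements):

```python
# Quick arithmetic for the "slot every clock" idea: total hub bandwidth
# scales directly with how often a slot is issued.

CLOCK_HZ = 80_000_000     # assumed P1V system clock
BYTES_PER_SLOT = 4        # one 32-bit long per hub slot

def hub_bandwidth(clocks_per_slot):
    """Total hub bandwidth in MB/s when one slot is issued every
    clocks_per_slot system clocks."""
    return CLOCK_HZ / clocks_per_slot * BYTES_PER_SLOT / 1_000_000

print(hub_bandwidth(2))   # stock: one slot every 2 clocks -> 160.0 MB/s
print(hub_bandwidth(1))   # a slot every clock -> 320.0 MB/s, doubled
```

As noted above, this helps aggregate bandwidth but does nothing for the latency a single cog sees waiting for its own slot.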
I have no objections to anyone taking the P1 design files and mangling them in any way they like. How could I have? It's their code to do with as they please.
It's educational, especially for anyone getting into HDL and/or CPU design for the first time. It will show us what works and what does not. What has useful benefits and so on.
If I find the will and the time I'll tweak on the Verilog myself. Not that I expect to be achieving anything significant or useful.
Besides, someone may well have optimizations in mind that perfectly suit their particular application but are not of general interest.
This whole "why asymmetric COGs are bad" idea may soon be demonstrated in a concrete way :)
Full speed ahead Peter.
I like Chip's "sausage grinder" on the P2 - way higher overall bandwidth, even if it does not help latency.
Actually if the number of gates was there, the more slots the merrier. My back of the envelope calculations suggested there should be a minimum of four slots per cog, ideally 8 or more. This was based on expected bandwidth usage of low-to-medium bandwidth peripherals, vs. bandwidth hungry video and byte code / vm / lmm / hubexec cogs.
The best argument I remember that was made against many slots was the lower fMax due to the additional lookup.
What I really like about the Verilog release is that we can all try many different ideas!
Heck, I think it is only a matter of time before someone grafts on a Wishbone bus, and an FPU.
A simple trial/minimal auto-sizing version, which has no config requirements, can be made by making the scan wrap at the top used cog ID (ie a modulus of top cog + 1).
eg If you know you will use 4 COGS eventually, then assign a top CogID of 3 to give a 4:1 scan density, or CogID=2 for 3 COGS and a 3:1 scan density.
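The scan sequence under this scheme is easy to sketch (Python model, not the HDL): the hub counter counts modulo (top cog ID + 1) instead of modulo 8, so each running cog sees its slot proportionally more often, with no configuration beyond knowing the top cog ID.

```python
# Sketch of the auto-sizing idea: scan the hub counter modulo
# (top_cog + 1) rather than modulo 8.

def slot_sequence(top_cog, length=12):
    """The sequence of cog IDs granted hub access, one per scan tick."""
    return [i % (top_cog + 1) for i in range(length)]

print(slot_sequence(3))   # 4:1 scan: [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
print(slot_sequence(2))   # 3:1 scan for three cogs
```

With top_cog = 3, each of the four cogs gets a slot twice as often as in the stock 8-way scan, at the cost of giving up the upper four cogs entirely.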