Assigning unused cog's hub slots to a "supercog"

Peter Jakacki · 2014-08-16 08:14

I've experimented with higher speeds and less cogs and now since I've hooked up an SD card and a W5500 for Ethernet I'm looking at optimizing the hub access priority which is normally fixed. As soon as I finish off converting the W5200 driver to suit the W5500 I am intending to implement a "supercog" which uses up all the spare hub slots. If for instance I have Tachyon running in cog 0, serial receive in 1, and background functions in 2, then 3,4,5,6,7 are not being used and can essentially give their hub slots to cog 0. This doesn't change how the other cogs operate as they will still get their 1 in 8 as they do now but a supercog could easily speed up Spin too. Any thoughts?

Cluso99 · 2014-08-16 08:24

Why not. We now can test the benefits.
This doesn't just apply to hub slots, although this was possibly the most contraversial topic discussed for the P2.

Bill Henning · 2014-08-16 09:09

I think vacuuming up unused slots to increase performance for cogs that can use it is the way to go... I strongly disagreed with all the arguments against hungry/mooch/green/vacuum.

I am looking forwards to seeing the results!

Heater. · 2014-08-16 11:32

So you don't actually want to use a Propeller, but something more like an ARM Cortex whatever. They are very cheap and do what you want. No messing with HUB slot timing required.

Bill Henning · 2014-08-16 11:54

No, I just hate seeing wasted slots, that could speed up video/HLL cogs on a p1 variant, with no impact on determinism of other cogs.

Heater. · 2014-08-16 12:02

I know what you mean but it does not matter how many HUB slots you steal performance will still fall behind a $1 ARM Cortex whatever.

Bill Henning · 2014-08-16 12:16

I largely agree re/ the ARM (other than the stealing bit)

however that depends on the implementation of the p1v (or the P2), and clock frequency - say a 200MHz/100Mips variant, with vacuum, will definitely not be outperformed by a $1 ARM, and there are the rest of the cores...

My basic stance is leaving performance on the table is silly, and limits what the chip could otherwise do.

Heater. wrote: »

I know what you mean but it does not matter how many HUB slots you steal performance will still fall behind a $1 ARM Cortex whatever.

David Betz · 2014-08-16 12:42

Bill Henning wrote: »

I largely agree re/ the ARM (other than the stealing bit)

however that depends on the implementation of the p1v (or the P2), and clock frequency - say a 200MHz/100Mips variant, with vacuum, will definitely not be outperformed by a $1 ARM, and there are the rest of the cores...

My basic stance is leaving performance on the table is silly, and limits what the chip could otherwise do.

Probably not but would a P1V or P2 outperform a $8 ARM? I'd still like a combination of the two. If there was an open source ARM core I guess we could try that but I don't think there is.

Heater. · 2014-08-16 12:49

I was counting all cores working together vs a cheap ARM Cortex whatever.

Seems to me that if you need that kind of raw compute performance a Prop is not the way to get it.

A Prop has talents in a different area.

This debate has been going on for ages, notably during the old P2 fiasco. I guess we should let it lie.

David Betz · 2014-08-16 13:00

Heater. wrote: »

I was counting all cores working together vs a cheap ARM Cortex whatever.

Seems to me that if you need that kind of raw compute performance a Prop is not the way to get it.

A Prop has talents in a different area.

This debate has been going on for ages, notably during the old P2 fiasco. I guess we should let it lie.

I guess I thought it was odd that you would compare a $1 ARM with what is likely to be a $8-$10 P2. Anyway, I certainly agree that the Propeller has advantages in some areas that the ARM can't touch.

Bill Henning · 2014-08-16 15:10

That would depend on the application, and weather the aggregate performance of the cores is greater than the arms.

Will a single P2 core outrun a $8 ARM for say Dhrystone or similar? Not likely.

Would it be better for precise timing? Indubidably.

Would 16 100Mips P2 cores - in total - outperform a multi-threaded integer compute benchmark when compared to a $8 168MHz ARM? Yes.

Woult the same 16 P2 cores win in a floating point benchmark against a $8 M4 with FPU? NOT likely.

Does the above mean we should not go for the maximum single core Px performance we can get? NO WAY.

David Betz wrote: »

Probably not but would a P1V or P2 outperform a $8 ARM? I'd still like a combination of the two. If there was an open source ARM core I guess we could try that but I don't think there is.

Bill Henning · 2014-08-16 15:13

Yes, old debate.

Basically, if I need to slap an ARM/MIPS/QUARK/whatever on the side, I will.

However, if I can get a Px core fast enough not to need to slap another processor on for more applications, thats a big win (and lower BOM cost), which is why I try to maximize performance.

Heater. wrote: »

I was counting all cores working together vs a cheap ARM Cortex whatever.

Seems to me that if you need that kind of raw compute performance a Prop is not the way to get it.

A Prop has talents in a different area.

This debate has been going on for ages, notably during the old P2 fiasco. I guess we should let it lie.

ctwardell · 2014-08-16 15:16

Heater. wrote: »

This debate has been going on for ages, notably during the old P2 fiasco. I guess we should let it lie.

Yes, we are well aware that the purist don't approve of what the heretics are doing. How about we let Peter do his experiments without the threat of being excommunicated or burned at the stake.

C.W.

User Name · 2014-08-16 15:35

Isn't it lovely that someone can try such a thing out now? No consensus required! No intractable arguments that drive everyone crazy!

Cluso99 · 2014-08-16 15:54

As some of you know, I totally agree with Bill's comments. No point in leaving unused performance on the table.

This may mean the difference between getting a design win and not because it may mean that the main program has to be done in an external processor and the BOM cost blows the design out of the water. Other cogs would be unharmed.

Why so much negativity if it was an option??? Don't answer! Lets just get on with the testing

WTG Peter

potatohead · 2014-08-16 16:06

I'm really pleased this is getting tested out. Nice work.

Regarding the ARM core, I think it's much better to improve a COG in basic ways so that it has good performance on large code run from the HUB. We have LMM and we have HUBEX.

I would actually prefer LMM, as that presents us with a nice "supervisor" type capability, in that native PASM runs in the COG, and if it's possible to get say 80 percent of that speed running LMM somehow, no brainer. We get pseudo-ops, we get "interrupts" and we get illegal instruction trapping implemented how we want it implemented because PASM is micro-code in that model. Your program ends up in a very lean, capable, virtual machine, which is advanced capability for a micro-controller.

And that keeps the COG Von Newman, allowing for the maximum tricks and very highly functioning multi-processing capability. Whatever ends up in the HUB can be a super set of PASM, with some very specific things implemented as pseudo ops. Raw performance would be higher with HUBEX, as we more or less saw on the "hot" edition design. But maybe LMM performance can hit that 80 percent number, or get close. If so, I would gladly trade sophistication for a little speed.

That's the line of reasoning I'm taking on larger programs at present. Will be good to see what everybody does and understand the dynamics better.

@Cluso: Indeed!

KeithE · 2014-08-16 17:43

This could be especially interesting for FPGA designs running Tachyon. If you reduce the number of cogs to free up logic and assign the super cog as the main tachyon cog, but leave the others alone it might make it easier to use existing soft-peripherals as-is but give Tachyon a performance boost.

msrobots · 2014-08-16 18:27

I think the performance boost is not as big as anybody thinks.

By now you have 2 instructions between Hub instructions. freeing up more slots for cog0 will not really gain much. Maybe the first one. But after more then one free slot?

What do you want to do with those slots? One rdlong after the other, without processing of the data?

Maybe I do have some gross misunderstanding here on my side, but I do not really see any speed improvement gained thru more slots. Maybe with one more slot. OK. but more?

Please enlighten me!

Mike

ozpropdev · 2014-08-16 18:27

Looking forward to seeing what you come up with Peter.

This is what I like about this FPGA adventure. New(and some old) ideas can now be tested.

Peter Jakacki · 2014-08-16 18:48

I haven't had much time to spend on this and I'm just getting around again to finishing off the W5500 driver although it mostly works, just remapping the way the sockets are accessed. I am running P1V @120MHz and with 4 cores unused the basic Tachyon VM cycle time is 200ns or 600ns for stacking ops and empty FOR NEXT loops take an average 266ns.

Although I've looked at other more complicated schemes this one just seems sensible, even if for whatever illogical reason it affects some people's sensibilities. If for instance I only have a Spin cog running and all the other cogs aren't enabled it just means that all those unwanted hub slots are now automatically directed to cog 0 and Spin flies. Now if we load up all the cogs including cog 0 then that is no different to before as each cog will only have a single slot. Making cog 0 the supercog is a no-brainer as it's normally the startup cog running Spin and having it run faster is only a bonus. This implementation also does not require any configuration, all cogs are identical, even cog 0, it's just the hub (dig.v) that's modified

jmg · 2014-08-17 01:55

Peter Jakacki wrote: »

..... If for instance I have Tachyon running in cog 0, serial receive in 1, and background functions in 2, then 3,4,5,6,7 are not being used and can essentially give their hub slots to cog 0. This doesn't change how the other cogs operate as they will still get their 1 in 8 as they do now but a supercog could easily speed up Spin too. Any thoughts?

Sounds simple enough to trial, but could be tricky to pipeline so it has no speed impact.
Rather than a simple and fast scan, this now has to check each slot for 'Mine(N)' or 'Yield to M'

Other ideas for higher bandwidths :
Another method of giving more speed to some COGS, is to split HUB memory so instead of all memory being 1:8 time sliced, some blocks can be allocated to only 1 or 2 COGS.
This becomes more private rather than openly public, but is much faster.
A FPGA allows this to be quite easily re-defined, so is an ideal test platform.
Silicon has to be more generic, and hat can be less optimal for any one design.

Cluso99 · 2014-08-17 02:13

I presume that the FPGA is giving 1:16 hub slots.
Chip has done something clever in the P2 so that is was 1:8 for 8 cogs. Perhaps there is something here that can be done to make it 1:8 (ie a slot every clock instead of every 2 clocks)?
This lone would double the hub bandwidth.

Heater. · 2014-08-17 02:57

Just for the record...

I have no objections to anyone to taking the P1 design files and mangling them in anyway they like. How could I have? It's their code to do with as they please.

It's educational, especially for anyone getting into HDL and or CPU design for the first time. It will show us what works and what does not. What has useful benefits and so on.

If I find the will and the time I'll tweak on the Verilog myself. Not that I expect to be achieving anything significant or useful.

Besides someone may well have optimizations in mind that perfectly suit the application they have in mind that are not of general interest.

This whole "why asymmetric COGs are bad" idea may soon me demonstrated in a concrete way:)

Full speed ahead Peter.

KeithE · 2014-08-17 08:33

It looks like hub_mem is a single ported RAM, but FPGAs often have dual-ported RAMs for free. Someone might take advantage of this in an FPGA as part of these experiments. There are potentially really crazy things. But without thinking too much could odd/even cogs go to different ports and the hub rate of rotation be doubled?

Todd Marshall · 2014-08-17 14:15

Bill Henning wrote: »

No, I just hate seeing wasted slots, that could speed up video/HLL cogs on a p1 variant, with no impact on determinism of other cogs.

In the discussion on P2 I was concerned that 16 slots would be problematic and could hurt performance. I suggested indirect addressing the cogs through a vector with a settable modulus. For one thing, this would allow atomic rd/update/wr (even multiple) in a cycle. One issue is the mechanism for setting the cog sequence in the vector. The other is setting the modulus. For experimenting, it can be hardwired in the verilog code. I haven't studied deep enough to see how other run time parameters like clkmode and xinfreq get set ... but the same mechanism should allow populating this sequence vector and modulus. Then if you like the supercog idea, set all unused slots to cog 0 in the vector. If you like a quick 3 cog round robin, set the modulus to 3.

Bill Henning · 2014-08-17 14:29

Yep, I also suggested the slot mapping approach - with 128 slots to make it very fine grained

I like Chip's "sausage grinder" on the P2 - way higher overall bandwdith, even if it does not help latency.

Todd Marshall wrote: »

In the discussion on P2 I was concerned that 16 slots would be problematic and could hurt performance. I suggested indirect addressing the cogs through a vector with a settable modulus. For one thing, this would allow atomic rd/update/wr (even multiple) in a cycle. One issue is the mechanism for setting the cog sequence in the vector. The other is setting the modulus. For experimenting, it can be hardwired in the verilog code. I haven't studied deep enough to see how other run time parameters like clkmode and xinfreq get set ... but the same mechanism should allow populating this sequence vector and modulus. Then if you like the supercog idea, set all unused slots to cog 0 in the vector. If you like a quick 3 cog round robin, set the modulus to 3.

Todd Marshall · 2014-08-17 14:38

Bill Henning wrote: »

Yep, I also suggested the slot mapping approach - with 128 slots to make it very fine grained

And I suggested 1000 slots figuring the cost of such a vector should be minimal. This would allow some "sequences" requiring a relatively large but fixed number of clocks to be executed multiple times in a cycle with zero waiting ... e.g. for DSP. I was told I was wrong ... it's not minimal to set up such a vector and indirect through it. We'll see. I've compiled on a Cyclone V and I'm basking in space and resources.

Bill Henning · 2014-08-17 15:05

1000 slots? Hmmm in many ways that would be really nice, but it would be a fair number of gates. 1024 would be nicer

Actually if the number of gates was there, the more slots the merrier. My back of the envelope calculations suggested there should be a minimum of four slots per cog, ideally 8 or more. This was based on expected bandwidth usage of low-to-medium bandwidth peripherals, vs. bandwidth hungry video and byte code / vm / lmm / hubexec cogs.

The best argument I remember that was made against many slots was the lower fMax due to the additional lookup.

What I really like about the Verilog release is that we can all try many different ideas!

Heck, I think it is only a matter of time before someone grafts on a Wishbone bus, and an FPU.

Todd Marshall wrote: »

And I suggested 1000 slots figuring the cost of such a vector should be minimal. This would allow some "sequences" requiring a relatively large but fixed number of cycles to be executed multiple times in a cycle with zero waiting ... e.g. for DSP. I was told I was wrong ... it's not minimal to set up such a vector and indirect through it. We'll see. I've compiled on a Cyclone V and I'm basking in space and resources.

jmg · 2014-08-17 15:10

Bill Henning wrote: »

Yep, I also suggested the slot mapping approach - with 128 slots to make it very fine grained

A simple trial/minimal auto-sizing version which has no config requirements, can be made by making the modulus equal to the Top used Cog.
eg If you know you will use 4 COGS eventually, then assign CogID of 3. to give a 4:1 scan density, or CogID=2 for 3 COGS and a 3:1 scan density.

Bill Henning · 2014-08-17 15:33

Agreed. What I like is that we get to try out many different approaches on the P1V.

jmg wrote: »

A simple trial/minimal auto-sizing version which has no config requirements, can be made by making the modulus equal to the Top used Cog.
eg If you know you will use 4 COGS eventually, then assign CogID of 3. to give a 4:1 scan density, or CogID=2 for 3 COGS and a 3:1 scan density.

Peter Jakacki · 2014-08-17 16:06

Had a busy weekend although I hardly any Prop time and I will get around to posting some code soon. Although I looked at mapping slots and how to manage the maps etc I chose instead the simpler method of gating the output of the bus select and the cog enables back to cog 0 so it's a completely automatic method and if we increased the number of cogs beyond eight for instance then it still works the same. So if the bus select sequence goes to select cog 4 and it's not enabled then it just selects cog 0 instead, simple! On startup cog 0 will have all the slots to itself and lose them one by one when cogs are assigned. It's like joining a queue but if nobody else is there then you get in straight-away every time.

Assigning unused cog's hub slots to a "supercog"

Comments