The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

David Betz · 2014-04-07 10:42

Bill Henning wrote: »

In order to avoid arguing with half the forum here, I pm'd him a very simple hubexec that suits this that would be ~50MIPS (simple instructions) for incredibly few gates (relatively speaking). With 32 entry slot mapping, it would get close to 100MIPS. WAY better C performance.

Would you mind posting this proposal here? Since you are probably not going to be the one to update PropGCC to use this new hub exec feature it would be nice if the people who are going to do the work could see any proposal before it gets set in stone. In any case, I'm sure others will be interested to read the proposal especially if it is substantially different from what Chip had planned for P2.

Thanks,
David

Bill Henning · 2014-04-07 10:42

Yet there is no need to remove actual personal attacks?

jazzed wrote: »

I see this thread has degenerated already. Moderators, please remove perceived personal attacks and posts about perceived personal attacks (including this one).

Bill Henning · 2014-04-07 10:43

I would be happy to, but I request Roy's, your, and hopefully others' help in case of personal attacks, unsupported technical arguments, etc.

Can I count on you?

David Betz wrote: »

Would you mind posting this proposal here? Since you are probably not going to be the one to update PropGCC to use this new hub exec feature it would be nice if the people who are going to do the work could see any proposal before it gets set in stone. In any case, I'm sure others will be interested to read the proposal especially if it is substantially different from what Chip had planned for P2.

Thanks,
David

David Betz · 2014-04-07 10:45

Bill Henning wrote: »

I would be happy to, but I request Roy's, your, and hopefully others' help in case of personal attacks, unsupported technical arguments, etc.

Can I count on you?

I certainly won't personally attack you although I might not agree entirely with the technical details of your proposal. I don't promise not to debate some of the points.

User Name · 2014-04-07 10:47

I very nearly sent Chip a private note, begging him to resist pressure to add HubEx. Few features drive up power consumption and slow down critical execution paths more than HubEx. If there is a magical way to implement it w/o shooting the whole project in the foot, that would be super! Otherwise, it's the same folly, over and over and over.

Roy Eltham · 2014-04-07 10:48

Bill,
I'm not a moderator, so I can't do anything about posts other than reply to them.

I do feel that sometimes things are taken as personal attacks that are not meant to be. I know things get pretty riled up in here, but I seriously doubt anyone is truly intending to personally attack someone when arguing/debating about things. They just come across that way when riled up in the midst of it all. It would be nice if particularly bad cases were addressed by moderation, but it's a tough job.

Heater. · 2014-04-07 10:48

About code compatibility with Prop I

Is this absolutely necessary?

When Intel created the 8086 it was not binary compatible with the old 8 bit 8080/8085. It was not even assembler source compatible. BUT they had made it similar enough that a tool could take 8080 source and translate it to 8086 source code that could be assembled and run on the new device. Mostly a one to one translation of assembler statements, but not always. There was a great deal of architectural similarity that made this possible.

I'm not sure if anyone ever actually used that translation facility for real. But it was a "programmer mindset" compatibility rather than an actual code compatibility that was important.

That is the level of P1 compatibility I'm expecting. Anything else will be too restrictive of moving forward.

ctwardell · 2014-04-07 10:48

David Betz wrote: »

Would you mind posting this proposal here? Since you are probably not going to be the one to update PropGCC to use this new hub exec feature it would be nice if the people who are going to do the work could see any proposal before it gets set in stone. In any case, I'm sure others will be interested to read the proposal especially if it is substantially different from what Chip had planned for P2.

Thanks,
David

I think open discussion can have a benefit, and that in general it is a good thing. However, sometimes it is better to keep the discussion group small until a good general direction is set.

I encourage Chip to consider keeping some of these things limited to small groups so this can move quickly.
That said, maybe the HUBEXEC might be one of those cases for now and the discussion can be kept to a small group including some of the PropGCC team.

Once a good direction is set it can be presented to the rest of us for comment.

Chris Wardell

David Betz · 2014-04-07 10:49

User Name wrote: »

I very nearly sent Chip a private note, begging him to resist pressure to add HubEx. Few features drive up power consumption and slow down critical execution paths more than HubEx. If there is a magical way to implement it w/o shooting the whole project in the foot, that would be super! Otherwise, it's the same folly, over and over and over.

I thought that too but I think he's nearly there with increasing the hub width to 128 bits. He may have already paid a large part of the price in complexity. If that's not true then maybe it would be better to leave it out. That still leaves Ken's comment about improved C efficiency being a customer request though.

pedward · 2014-04-07 10:50

jazzed wrote: »

I see this thread has degenerated already. Moderators, please remove perceived personal attacks and posts about perceived personal attacks (including this one).

I didn't quite get this, then read a bit further back.

In general, I rely on my gut to tell me things a lot. The gut is intuition, which makes decisions that haven't been vetted by the analytical process, just your instinct.

Trying to invalidate an opinion or idea, or decision, simply because you didn't subject it to the analytical scientific process, doesn't mean the invalidation is correct.

"Gut" is a synonym for an internal vision you have, whether you justify it with umpteen statistical models or not, it's a design decision that you have and sometimes you are extrapolating because there is no ready data.

Take for instance, my suggestion for 1MB of RAM. This isn't predicated on any deep market analysis, but is based on personal experience and extrapolation of that experience.

Asking to keep cmpsub is based on actually seeing it in use, and my general recollection that it is a useful macro instruction, so half analytic and half gut.

Keeping the muxXX instructions is based on actual use, since it modifies a bit array based on flags, and all 4 variations are necessary for efficient bit twiddling. That one isn't a gut feeling, but based on actual use and review of code.
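
For readers who have not used them, the four P1 variants are MUXC, MUXNC, MUXZ and MUXNZ: each forces the destination bits selected by a mask to the state of one flag or its inverse. A small Python model of that behavior follows; the function names are illustrative only, and the model ignores flag write-back.

```python
MASK32 = 0xFFFFFFFF

def mux(dest, mask, bit):
    """Force the masked bits of dest to `bit`; leave all other bits alone."""
    if bit:
        return (dest | mask) & MASK32
    return dest & ~mask & MASK32

# The four variants differ only in which flag (or its inverse) drives the bits.
def muxc(dest, mask, c):
    return mux(dest, mask, c)

def muxnc(dest, mask, c):
    return mux(dest, mask, not c)

def muxz(dest, mask, z):
    return mux(dest, mask, z)

def muxnz(dest, mask, z):
    return mux(dest, mask, not z)
```

Without all four, the inverted cases would need an extra instruction to flip the flag or the mask first, which is presumably the efficiency argument being made here.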

So, just because people have an intuition about the right thing to do, doesn't mean their opinions are invalid because they haven't provided a deep technical analysis. This goes both ways.

Just to clarify, I wasn't riffing on you. I just think that debate can occur without a senate subcommittee ordering a technical study on how the universe works.

Seairth · 2014-04-07 10:52

I'm really liking the sound of this! I agree that code compatibility isn't necessary. Some additional questions:

What is the expected clock speed?
Will all cogs still be identical?
Will it use the 2-clock or 4-clock design?
Will it keep the ROM lookup tables or CORDIC?
Will it keep the old or new bootstrap?
Will it have the new monitor?
Will it have the debug/trace?
Will the CLUT become AUX?
Will the HUB access be every 32 clock cycles?
Will there be an equivalent to PORT_D?
Will there be 16 HUB locks?

And a few thoughts on the above questions:

For PORT_D, if it would be easy to add a hardwired bus between pairs of cogs (0 and 1, 2 and 3, etc.), this might make it easier to write efficient protocols that require two cogs. Additionally, it might be possible to add software support for an 8-bit (assuming the limited number of I/O pins) external RAM driver that is controllable via PORT_D from the "main" program. In other words, the driver would run in COG 1 and the main program would run in COG 0, commanding it over the hardwired port. The driver would still most likely transfer between external and HUB RAM, which would allow for larger memory models in the "main" program. If I had a choice in the bus architecture, I'd say two 32-bit registers that are cross-coupled such that the first is write-only and the second is read-only (i.e. no need for DIRx).
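
One way to read the cross-coupled idea is that each cog in a pair owns a write-only register whose value shows up in its partner's read-only register. A toy Python model of that reading is below; the class and method names are hypothetical and nothing here reflects an actual hardware design.

```python
class CogPairLink:
    """Toy model: two 32-bit registers cross-coupled between cogs 'a' and 'b'."""

    def __init__(self):
        # Each cog's write-only outgoing register, cleared at reset (assumption).
        self._out = {"a": 0, "b": 0}

    def write(self, cog, value):
        """A cog writes its own outgoing register (truncated to 32 bits)."""
        self._out[cog] = value & 0xFFFFFFFF

    def read(self, cog):
        """A cog's read-only register shows what its partner last wrote."""
        partner = "b" if cog == "a" else "a"
        return self._out[partner]


link = CogPairLink()
link.write("a", 0x1234)      # "main" cog posts a command
command = link.read("b")     # driver cog sees 0x1234
```

Because neither side can read back or overwrite the other's register, no direction control (DIRx) is needed in this model.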

If there's not enough room for CORDIC in each cog, could you instead make a single instance available in the HUB? Since you will have a 128-bit data bus, it should be possible to start a CORDIC calculation with a single HUBOP (pointing to a block of registers) and read the results on the following hub slot. With this approach, there is obviously the potential for resource conflict. The simplest solution is to leave it up to the programmer to avoid accessing CORDIC from two cogs at the same time.

Bill Henning · 2014-04-07 10:53

At Roy's and David's request, here is my simplified hubexec proposal for Chip's new base design.

I respectfully request that moderators stomp on any personal attacks (ad hominem), should they occur.

I would very much appreciate my forum colleagues' help in stamping on non-technically supported criticism, such as "handwaving" (strawman), "it's NOT the propeller way!", "LMM is faster" without showing how, etc.

I am delighted to respond to technical questions and to criticism of the proposal that is well supported by a valid technical argument.

I would be extremely happy if someone came up with technical improvements, but please say how/why your improvement is better.

PM, slightly edited to remove irrelevant info

LMM comparisons are made at base rate, i.e. without fcache, as hubexec is also base rate (no reason hubexec cannot use an fcache)

Instead of a physical separate LRU capable cache, could you execute from the four longs from the last hub cycle?

That can't pre-fetch a following cache line, but it would still be MUCH faster than LMM (without fcache, which the compiler guys are not using anywhere near its potential, probably due to GCC code generator limitations)

If you can execute from an internal latched "this is what I got last hub cycle", it would still be minimum 2x-4x faster than any possible LMM.

Still avoids cache lines, etc.

HUNGRY would help the simplified hubexec very significantly

All we really need is 3 new instructions, and executing from the QUAD latch as above

JMP d/#hub17longaddr
CALL d/#hub17longaddr ' writes PC to LR, I suggest $1EF, or whatever the last register is before special registers
LOAD #hub17longaddr ' could write the #hub17longaddr + two low zero bits to LR, or one reg below it - for fast loading of pointers (cheap LOCPTRA replacement)
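
The 17-bit figure follows from the hub size in the thread title: 512KB is 131,072 longs, which is exactly 2^17. A minimal Python sketch of that arithmetic is below; the constant and function names are made up for illustration and are not part of the proposal.

```python
HUB_BYTES = 512 * 1024   # hub RAM size taken from the thread title
LONG_BYTES = 4           # a Propeller long is 32 bits

def long_addr_bits(hub_bytes=HUB_BYTES):
    """Bits needed to address every long in a hub of the given size."""
    return (hub_bytes // LONG_BYTES - 1).bit_length()

def long_to_byte_addr(long_addr):
    """Append two low zero bits to turn a long address into a byte address."""
    return long_addr << 2

print(long_addr_bits())                 # 17
print(hex(long_to_byte_addr(0x1FFFF)))  # 0x7fffc, the last long in 512KB
```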

This would be roughly 4x faster than possible with LMM, and save Parallax $$$$ over having to do a QLMM gcc.

The slot assignment table idea would make it MUCH faster, almost twice as fast

Mooch would also help greatly

Even a single 4 long line (simple latch) and the RDxxxxC instructions would speed up VM's and data access for LMM & hubexec by roughly 4x.

- What is the area/transistor budget effect of a single 4 long cache for data?

- It sounds like you determined that 32 cogs are out (I assume too much die area).

Pity, with slot mapping it would add a lot, but what does not fit, does not fit.

Slot mapping (16 entries with 8 cogs minimum), would allow 2x the simple hubexec performance, twice the video bandwidth

at the cost of some slots from low speed drivers that don't need them

Bill Henning · 2014-04-07 10:57

I LOVE technical debate!

I hate "hand waving", "it's not the propeller way" etc

If there is a potential improvement, I WANT TO HEAR IT

If there is a technical problem, PLEASE say so, but to save us all a lot of time, detail what is wrong with it, and make sure you read my undoubtedly really long, boring seeming post

I will even take unsupported suggestions to improve, but will NOT accept "that can't be right" IF NOT SUPPORTED BY VALID EVIDENCE

David Betz wrote: »

I certainly won't personally attack you although I might not agree entirely with the technical details of your proposal. I don't promise not to debate some of the points.

pedward · 2014-04-07 10:58

I recommend staying away from HUBEX on the P1 variant. Chip already said this would nearly double the complexity of the COG.

We went down this road before, what we got was a hermaphrodite with a penchant for power that is too big to fit in the skinny pants.

Let's try to contain the mods to specific instructions, counters/ADC/DAC/VIDEO, and memory.

We already know the video, counters, and ADC/DAC need to be reworked for the new I/O pads, so those are probably going to be heavy dev items. But, resist trying to add too much here. Adding AUX RAM may just go too far, because it brings in so many other changes.

Bill Henning · 2014-04-07 11:00

Roy, that is all I was asking for.

Some help in responding to personal attacks, unsupported knee jerk reactions etc. It takes too much of my time to address them all myself, and if I don't address them, those who do not clearly address the issue will bury the good ideas - and worse yet - influence others against them.

I really, really don't want to leave again, but without help, I may have no choice.

And I hate that.

Roy Eltham wrote: »

Bill,
I'm not a moderator, so I can't do anything about posts other than reply to them.

I do feel that sometimes things are taken as personal attacks that are not meant to be. I know things get pretty riled up in here, but I seriously doubt anyone is truly intending to personally attack someone when arguing/debating about things. They just come across that way when riled up in the midst of it all. It would be nice if particularly bad cases were addressed by moderation, but it's a tough job.

Dave Hein · 2014-04-07 11:01

Bill, I find your proposal a bit confusing. Aren't you just suggesting hubex with a single cache line without prefetch, or did I miss something?

ctwardell · 2014-04-07 11:02

pedward wrote: »

I recommend staying away from HUBEX on the P1 variant. Chip already said this would nearly double the complexity of the COG.

We went down this road before, what we got was a hermaphrodite with a penchant for power that is too big to fit in the skinny pants.

Let's try to contain the mods to specific instructions, counters/ADC/DAC/VIDEO, and memory.

We already know the video, counters, and ADC/DAC need to be reworked for the new I/O pads, so those are probably going to be heavy dev items. But, resist trying to add too much here. Adding AUX RAM may just go too far, because it brings in so many other changes.

I'm pretty sure Chip's comments were in regard to the P2 style HUBEXEC with the caches, prefetch, preemptive tasking, etc.
Bill's suggestion is much more simple, more like LMM with hardware assist.

C.W.

Dave Hein · 2014-04-07 11:04

So is Bill just suggesting RDLONGC?

Bill Henning · 2014-04-07 11:04

100% correct

Hubexec was not really worth it without the wide bus, due to minimal incremental performance (ok, some memory savings too, which is less critical with 512KB)

When I digested his new design, I saw how to cheaply improve it.

It is very likely he already has a 128-bit latch for each hub to latch the value before writing it to 128 bits of hub registers.

If not, maybe he can execute out of the four registers it was written to.

In either case, we are talking about a 128:4 multiplexer, PC changed to 17 bits, and the 3 instructions I proposed. I think the gate count for that would be very small compared to the size of even the tiny P1 cog.

David Betz wrote: »

I thought that too but I think he's nearly there with increasing the hub width to 128 bits. He may have already paid a large part of the price in complexity. If that's not true then maybe it would be better to leave it out. That still leaves Ken's comment about improved C efficiency being a customer request though.

Brian Fairchild · 2014-04-07 11:04

pedward wrote: »

I recommend staying away from HUBEX on the P1 variant.

The trouble is, without *some* sort of hubexec, it's back to square one.

Bill Henning · 2014-04-07 11:08

Sorry pedward, the technical basis of your recommendation is suspect.

Chip said the P2 style hubexec, with 4x8 long icache, 1x8 dcache, 256 bit wide hub memory bus, would nearly double the size of a P1 cog.

He already added a 128 bit wide bus, for memory bandwidth.

pedward wrote: »

I recommend staying away from HUBEX on the P1 variant. Chip already said this would nearly double the complexity of the COG.

We went down this road before, what we got was a hermaphrodite with a penchant for power that is too big to fit in the skinny pants.

Let's try to contain the mods to specific instructions, counters/ADC/DAC/VIDEO, and memory.

We already know the video, counters, and ADC/DAC need to be reworked for the new I/O pads, so those are probably going to be heavy dev items. But, resist trying to add too much here. Adding AUX RAM may just go too far, because it brings in so many other changes.

jazzed · 2014-04-07 11:11

Bill Henning wrote: »

Yet there is no need to remove actual personal attacks?

You don't perceive actual personal attacks as perceived personal attacks?

Kerry S · 2014-04-07 11:12

Brian Fairchild wrote: »

The trouble is, without *some* sort of hubexec, it's back to square one.

Exactly.

Either hubexec (simple version) or some other method to run large C programs efficiently. It also needs to be relatively transparent (to the programmer) so that it compares to other chips' dev tools. Making that happen is up to the compiler wizards and Chip boiling down the minimum things they need to do that.

David Betz · 2014-04-07 11:20

Bill Henning wrote: »

At Roy's and David's request, here is my simplified hubexec proposal for Chip's new base design.

I respectfully request that moderators stomp on any personal attacks (ad hominem), should they occur.

I would very much appreciate my forum colleagues' help in stamping on non-technically supported criticism, such as "handwaving" (strawman), "it's NOT the propeller way!", "LMM is faster" without showing how, etc.

I am delighted to respond to technical questions and to criticism of the proposal that is well supported by a valid technical argument.

I would be extremely happy if someone came up with technical improvements, but please say how/why your improvement is better.

PM, slightly edited to remove irrelevant info

LMM comparisons are made at base rate, i.e. without fcache, as hubexec is also base rate (no reason hubexec cannot use an fcache)

I agree with this although I think we still need a way to load a 32 bit immediate value. I'm not sure 17 bits is good enough for constants. This is because you can't use an LMM-style macro to do this in hubexec mode. What you've described is pretty much exactly what I proposed to Chip several years ago right after you and Chip came up with RDLONGC and long before the LRU cache.

Bill Henning · 2014-04-07 11:21

Dave,

It is actually several potential performance boosts. Maybe I make my technical proposals too dense!

hubexec engine:

Step 1: no prefetch (absolute minimum gates required)

may not need a cache line if the hub bus is latched on read (can execute from there) or if Chip can execute from the destination four registers (don't know). If the above are not possible, needs a minimum of one 4 long icache line (1x4L to save future typing)

Performance is roughly 1/2 what it could be, as there is no possibility of prefetch.

Step 2: prefetch requires 2x4L icache, if there is gate/power budget for two cache lines (uses more gates). Obviously 4x4L would be better.

Step 3: mooch would help a lot (annoys some people on "principle")
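
The gap between Step 1 and plain LMM comes down to how often the cog has to wait for a hub window. The rough Python sketch below assumes a single latched line of four longs, aligned on 4-long boundaries, and only counts how many hub reads a stream of instruction addresses triggers. The function name and the alignment assumption are illustrative, and the count says nothing about actual cycle timing.

```python
def count_hub_fetches(pcs, line_longs=4):
    """Count hub reads for a sequence of long-addressed PCs with one latched line."""
    fetches = 0
    latched_line = None              # index of the line currently in the latch
    for pc in pcs:
        line = pc // line_longs
        if line != latched_line:     # miss: wait for a hub window, reload the latch
            fetches += 1
            latched_line = line
    return fetches

straight = list(range(0x100, 0x110))      # 16 sequential instructions
print(count_hub_fetches(straight))        # 4 fetches with a 4-long latch
print(count_hub_fetches(straight, 1))     # 16 fetches, one per instruction as in LMM
```

For straight-line code the latch cuts hub reads by a factor of four, which lines up with the "roughly 4x" estimate. A loop that bounces between two lines would miss on every transition, which is where the second cache line in Step 2 would start to matter.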

minimum instructions needed for good improvement over LMM

The three instructions I pointed out are the bare minimum, to minimize the required Verilog and gates.

JMP d/#hub17longaddr
CALL d/#hub17longaddr ' writes PC to LR, I suggest $1EF, or whatever the last register is before special registers
LOAD #hub17longaddr ' could write the #hub17longaddr + two low zero bits to LR, or one reg below it - for fast loading of pointers (cheap LOCPTRA replacement)

More would allow for more improvement, but I got it that people want minimal changes (whether I agree or not). I even used a LR as the gcc group prefers.

possible further performance improvements; it is up to Chip which of these are small enough to fit, or even worth putting in

Option 1: A single 4 long data cache so we can get RDxxxxC back. That hugely improves VM's, would even help LMM

Option 2: Simplified slot mapping, 2x number of cogs slots

Allows deterministic bandwidth assignment - less for serial/ps2/etc cogs, more for hubexec and video cogs. Should be very cheap in silicon (32x5 bits), but only Chip knows what % of gates increase it adds

Serial port? only one slot out of 32 (or 64 etc)

Video engine? Give it two slots! One cog can now do 16bpp 1080p60 !!

Obvious extensions are say 64 slots if cheap enough for finer grain deterministic timing control
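
Option 2 can be pictured as a small lookup table that the hub walks once per rotation, with each entry naming the cog that owns that hub window. The Python sketch below assumes 16 cogs and a 32-entry table. The table contents, the cog roles and the helper name are invented for illustration; how (or whether) this would be built in silicon is Chip's call.

```python
from collections import Counter

NUM_COGS = 16
NUM_SLOTS = 2 * NUM_COGS      # "2x number of cogs" slots

def bandwidth_share(slot_table):
    """Return {cog: fraction of hub windows} for one pass over the table."""
    counts = Counter(slot_table)
    return {cog: counts[cog] / len(slot_table) for cog in sorted(counts)}

# Default mapping: plain round-robin, so every cog owns 2 of the 32 slots.
default_table = [slot % NUM_COGS for slot in range(NUM_SLOTS)]

# Hypothetical remap: cog 1 (say, a serial driver) keeps a single slot and
# gives its second slot to cog 0 (say, a hubexec cog).
remapped = list(default_table)
remapped[17] = 0

print(bandwidth_share(default_table)[0])   # 0.0625  (2/32)
print(bandwidth_share(remapped)[0])        # 0.09375 (3/32)
print(bandwidth_share(remapped)[1])        # 0.03125 (1/32)
```

The timing stays deterministic because the table is fixed while the program runs: a cog's windows always fall at the same positions in the 32-slot rotation.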

Dave Hein wrote: »

Bill, I find your proposal a bit confusing. Aren't you just suggesting hubex with a single cache line without prefetch, or did I miss something?

Bill Henning · 2014-04-07 11:27

I agree that a load const 32 would be very nice!!!

My proposal was pared down to the bone, for minimal gate requirements.

LOADK dest,##n

or AUGS ##

whichever takes fewer gates, would be great!

But everyone wanted super-simple, I tried to do that to keep everyone happy, so I omitted practically everything to keep it tiny. Heck, yesterday I posted a version for 32 bit P1+ bus!!!

David Betz wrote: »

I agree with this although I think we still need a way to load a 32 bit immediate value. I'm not sure 17 bits is good enough for constants. This is because you can't use an LMM-style macro to do this in hubexec mode. What you've described is pretty much exactly what I proposed to Chip several years ago right after you and Chip came up with RDLONGC and long before the LRU cache.

David Betz · 2014-04-07 11:29

Bill Henning wrote: »

I agree that a load const 32 would be very nice!!!

My proposal was pared down to the bone, for minimal gate requirements.

LOADK dest,##n

or AUGS ##

whichever takes fewer gates, would be great!

But everyone wanted super-simple, I tried to do that to keep everyone happy, so I omitted practically everything to keep it tiny. Heck, yesterday I posted a version for 32 bit P1+ bus!!!

Of course, LOADK is impossible since it would have to have a zero bit opcode! :-)
I think AUGS is the solution.
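
The quip is about encoding space: a 32-bit instruction word cannot hold both an opcode and a full 32-bit constant. A prefix instruction gets around this by carrying the upper bits for the instruction that follows. The Python sketch below assumes the split used in the P2 design, where AUGS supplies the upper 23 bits and the following instruction's 9-bit immediate source field supplies the rest. The helper names are hypothetical, and the exact encoding for the new chip is not specified anywhere in this thread.

```python
S_FIELD_BITS = 9                      # width of the immediate source field
S_MASK = (1 << S_FIELD_BITS) - 1
MASK32 = 0xFFFFFFFF

def split_constant(value):
    """Split a 32-bit constant into (upper 23 bits for AUGS, low 9 bits for #S)."""
    value &= MASK32
    return value >> S_FIELD_BITS, value & S_MASK

def join_constant(upper, lower):
    """Recombine the two fields the way the cog would when executing the pair."""
    return ((upper << S_FIELD_BITS) | (lower & S_MASK)) & MASK32

upper, lower = split_constant(0xDEADBEEF)
print(hex(join_constant(upper, lower)))   # 0xdeadbeef
```

The cost is one extra instruction word per large constant, so small constants still fit in a single instruction.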

Dave Hein · 2014-04-07 11:30

The thing that makes it confusing is that you are suggesting several things. It seems like the minimal implementation would just be a RDLONGC. Latching the hub bus is equivalent to implementing a single cache line.

The cache line could be used for data, instructions or both. I.E., the latched hub bus would be a shared instruction/data cache.

I think all you need is the long JMP and CALL instructions. What's the purpose of the LOAD instruction? Isn't that the same as RDLONGC?

Kevin Wood · 2014-04-07 11:31

If Bill's simplified hubexec can be implemented as straightforwardly as he believes, then IMO, it should be seriously considered, fine-tuned, and implemented.

LMM appeared quite by accident, and truthfully, it allowed the Propeller to move past a couple of limitations.

However, this is an opportunity to provide a mechanism by design vs by accident, and it shouldn't go unconsidered.

davejames · 2014-04-07 11:31

Lest anyone think that the Moderators are not active on this site, understand that there are not that many of us and that there are tens of tens of messages to oversee.

That said - I'm locking this thread until it can be reviewed for moderation.
