
New Hub Scheme For Next Chip


Comments

  • jmg Posts: 15,140
    edited 2014-05-13 20:27
    I see it more like a lazy Susan in a Chinese restaurant. If you want to fill your plate quickly, you do a block read, starting with whatever is in front of you as the thing rotates, grabbing each dish in turn until your plate is full. Seconds on a particular item might happen right away, or you may have to wait awhile for it to come around again.

    Yup, and if you can swallow at a certain rate and want to keep a clean plate, you can set a Skip value, and if that is an odd number, after some circuits you will have covered all dishes, in the smallest time.
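    To make the odd-Skip point concrete, here's a minimal Python sketch (just a toy model of 16 rotating slots, nothing Propeller-specific; the slot count of 16 and the starting position are assumptions for illustration):

    ```python
    # Toy model of the 16-slot "lazy Susan": start at slot 0 and step by a
    # fixed skip each turn; which slots do we ever land on?
    def slots_visited(skip, n_slots=16):
        seen, pos = set(), 0
        for _ in range(n_slots):
            seen.add(pos)
            pos = (pos + skip) % n_slots
        return sorted(seen)

    for skip in (1, 3, 5, 7, 2, 4, 8):
        v = slots_visited(skip)
        print(f"skip={skip}: visits {len(v)} of 16 slots")

    # Any odd skip is coprime to 16, so it cycles through all 16 slots;
    # an even skip repeats after 16/gcd(skip, 16) steps and misses the rest.
    ```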
  • David Betz Posts: 14,511
    edited 2014-05-13 20:28
    kwinn wrote: »
    Wouldn't this require a 32 bit data and 19 bit address bus for each of the 16 ram blocks along with 16x51 (or maybe only 47) bit multiplexers? IOW an 816 bit bus between the hub and cogs? Or is there some other way of doing this?
    Yeah, that's what I was wondering too. There must be some clever way of doing this that I'm not seeing.
  • Cluso99 Posts: 18,066
    edited 2014-05-13 20:30
    I am writing this on the fly as I read the thread.

    Neat idea Roy & Chip. Knew it needed some thinking outside the square.

    There could be a power issue if every cog accessed every slot, although that would be highly improbable.

    A few ideas while reading...
    1. Could there be an instruction that reads a huge block, such as RDHUGE <hubaddr>,<cogaddr>,<count>?
      • We can sort out the precise instruction(s) details later. (maybe <hubaddr> = <count> << 16 | <hubaddr>)
      • Would permit a full block of code/data to be loaded (program or video) extremely fast !!!
      • Coginit could use this too
    2. HUBEXEC: Instructions effectively take 2 clocks. If hubexec mode used every 2nd hub long, it could execute hubexec at full speed (100MIPS).
      • Hubexec mode could be set to:
        • Interleaved every 2 hub longs (+2 and every 2 clocks = full speed 100 MIPS)
        • Interleaved every 4 hub longs (+4 and every 4 clocks = 50% speed 50 MIPS)
        • Interleaved every 8 hub longs (+8 and every 8 clocks = 25% speed 25 MIPS)
        • Sequential every 1 hub long (+1 and every 17 clocks = ~12% speed 12 MIPS)
      • The compiler would become more complicated to support the interleaving. Without modification 12 MIPS would work.
      • If necessary, minimise hubexec support instructions to jmp/call/ret relative/direct
      • Perhaps add deeper LIFO to 16+ (could it use the cog $1F0-1FF shadow ram ???)
      • Requires the JMPRET fixed return address for GCC.
    3. When reading or writing sequential bytes/words/longs, successive hub slots will be either 16 or 17 clocks
      • presume we start on a long boundary...
        • Bytes: 0-15, 16, 16, 16, 17 etc
        • Words: 0-15, 16, 17 etc
        • Longs: 0-15, 17 etc
      • therefore we can determine how many instructions can be performed between hub accesses
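    A rough way to check those slot spacings (an outsider's toy model of the rotation, not the real logic; the assumption that a cog cannot issue its next hub access sooner than 2 clocks after the previous one is mine, chosen to reproduce the figures above):

    ```python
    # The hub rotates one RAM bank past each cog per clock, so bank b is in
    # front of a given cog at clocks where (clock % 16) == b.
    def clocks_between(last_bank, next_bank, min_gap=2, n_banks=16):
        wait = (next_bank - last_bank) % n_banks
        while wait < min_gap:       # that slot already went by; wait a full turn
            wait += n_banks
        return wait

    def sequential_spacing(bytes_per_item, n_items=8):
        # 4 bytes per long, one long per bank, banks striped by low nibble
        banks = [(i * bytes_per_item) // 4 % 16 for i in range(n_items)]
        return [clocks_between(a, b) for a, b in zip(banks, banks[1:])]

    print("bytes:", sequential_spacing(1))   # [16, 16, 16, 17, 16, 16, 16]
    print("words:", sequential_spacing(2))   # [16, 17, 16, 17, 16, 17, 16]
    print("longs:", sequential_spacing(4))   # [17, 17, 17, 17, 17, 17, 17]
    print("long -1:", clocks_between(5, 4))  # 15 (the DEC-by-1 case jmg raises later)
    ```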
    Overall, I am impressed with this scheme. Major upside and a little downside. Of course, as others have said, there are many more wires/busses involved here, and I don't know how much space this will require.
  • jmg Posts: 15,140
    edited 2014-05-13 20:30
    David Betz wrote: »
    Yeah but 1 x 128 = 128 but 16 x 32 = 512. Doesn't that mean that there are four times as many wires in this new scheme?

    Yes, that is one of the losses ;)
  • Cluso99 Posts: 18,066
    edited 2014-05-13 20:31
    I see it more like a lazy Susan in a Chinese restaurant. If you want to fill your plate quickly, you do a block read, starting with whatever is in front of you as the thing rotates, grabbing each dish in turn until your plate is full. Seconds on a particular item might happen right away, or you may have to wait awhile for it to come around again.

    -Phil
    A great analogy Phil.
  • jazzed Posts: 11,803
    edited 2014-05-13 20:39
    T Chap wrote: »
    I know this will sound crazy but can there be a hardware feature that automatically copies a long or group of longs (simultaneously or immediately following a write to the cog's own ram) to another address belonging to another cog. This allows another cog that is to use the updated data immediately or a clock later than it was written. Sort of a dynamic memory.

    Sort of. But that's called DMA :)
  • jmg Posts: 15,140
    edited 2014-05-13 20:40
    Cluso99 wrote: »
    Could there be an instruction that reads a huge block such as RDHUGE <hubaddr>,<cogaddr>,<count>.

    Probably, as it just needs a Nibble-Carry, and a decision on when to stop.
    Cluso99 wrote: »
    HUBEXEC: Instructions effectively take 2 clocks.

    I'm not certain HubExec is still in there, but you CAN grab 16 Longs in 16 Clocks, and work on those.
    Cluso99 wrote: »
    When reading or writing sequential bytes/words/longs, successive hub slots will be either 16 or 17 clocks

    It now depends on how the address changes between HUB accesses. See #72: it can be 15 or 17 for DEC/INC by 1, and other values for other address changes.

    I am exploring ways to improve the pointer adder, to exploit this.

    There are speed gains lurking here, for Burst & block transfers.

    So far, it looks like any Odd value can access all Page locations, and then a Nibble_Carry decision is needed.
    Even values also work, but are more sparse in operation.


    The same adder is used on Read as on Write, but this is not ideal for random access, only for PTR++ / PTR-- type of RW.
  • JRetSapDoog Posts: 954
    edited 2014-05-13 20:46
    David Betz wrote: »
    Yeah but 1 x 128 = 128 but 16 x 32 = 512. Doesn't that mean that there are four times as many wires in this new scheme?

    Considerations like that are part of the reason why I didn't push for the "hub on steroids" scheme I was thinking out loud about. I figured Chip would be able to shoot down the idea after just a couple minutes of inspection due to wiring bottlenecks that could involve so much logic (not to mention possible timing issues) that it would start to eat into the space available for RAM. There were other complexities, too. Still, it's good to see out-of-the-box thinking generating some good discussion...and maybe it will result in the "perfect compromise."
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2014-05-13 20:50
    This idea is brilliant: both novel and non-obvious. There are times when it's okay to play your cards face-up and times when they should be held close to the vest. What I fear is that this should have been one of the latter for IP reasons. But the cat's out of the bag now...

    -Phil
  • JRetSapDoog Posts: 954
    edited 2014-05-13 21:05
    This idea is brilliant: both novel and non-obvious. There are times when it's okay to play your cards face-up and times when they should be held close to the vest. What I fear is that this should have been one of the latter for IP reasons. But the cat's out of the bag now...

    I recall doing a patent search once on Chip's name and believe I found an occurrence under another company's name. These days at Parallax, Chip has clearly signaled that he's not very interested in the big time/money sink that IP can be, preferring to compete directly on the merits of the stuff Parallax designs/manufactures and sells. I'm sure a back-of-the-mind concern is whether anything Parallax innovates on their own has already been claimed by some company with big pockets that could interfere with Parallax's work. I'm not saying that there's no place for patents, but Parallax has a vehicle for making money (their products/store/customer base). A person (or small company) without that could very well benefit from a patent. But from reading Chip's comments on the matter, I think he hopes to stay as far away from patents as possible. Therefore, putting this scheme out in the open (should it be entirely novel, which seems unlikely) actually offers protection by preventing others from claiming it and locking it up. It is true, though, that the unique design of the Propeller and the way the HUB works (or could work) lends itself to this idea in such a way that other companies may not have seen or needed. I think it's mostly the latter, meaning that their architectures don't (or won't likely) require this.
  • jazzed Posts: 11,803
    edited 2014-05-13 21:19
    This idea is brilliant: both novel and non-obvious. There are times when it's okay to play your cards face-up and times when they should be held close to the vest. What I fear is that this should have been one of the latter for IP reasons. But the cat's out of the bag now...

    -Phil
    Well at least it's safe from any third party litigation now for at least 364.5 days.
  • Cluso99 Posts: 18,066
    edited 2014-05-13 21:36
    Thinking this thru a little more...

    I don't see it being as easy to implement, nor as frugal with silicon, as the 2 table hub slot-sharing scheme.
    It is not as easy to understand (from a how-to-take-advantage-of-it point of view) as the 2 table scheme either.

    However, it does offer very significant bandwidth improvements over the 2 table scheme.
    With the use of a more complex compiler, many performance improvements are possible over the standard LMM mode.
    A simple hubexec model ought to be quite possible.

    My current thoughts are this is definitely a WINNER :)
  • jmg Posts: 15,140
    edited 2014-05-13 21:51
    Cluso99 wrote: »
    Thinking this thru a little more...

    I don't see it being as easy to implement, nor as frugal with silicon, as the 2 table hub slot-sharing scheme
    It is not as easy to understand (from how to take advantage of it point of view) as the 2 table either.

    However, it does offer very significant bandwidth improvements over the 2 table scheme.

    Yes. Still waiting on Chip's figures for possible peak bandwidths vs SW loops.
    Delta on Address looks critical, for Bandwidth management.

    I have a general case Special Delta Adder (Nibble, Nibble-Carry) that looks to meet FPGA MHz for this Rotate Scanner, and I think a more special case of Odd-values-only could be smaller (but more restrictive), ideally suited to block-move / Data Buffer applications.
  • kwinn Posts: 8,697
    edited 2014-05-13 21:55
    David Betz wrote: »
    Yeah, that's what I was wondering too. There must be some clever way of doing this that I'm not seeing.

    Do I ever hope you're right about that clever way. It would be fantastic if it could be done.
  • jmg Posts: 15,140
    edited 2014-05-13 22:05
    kwinn wrote: »
    Do I ever hope you're right about that clever way. It would be fantastic if it could be done.

    Be interesting to see what FPGA figures this memory alternative creates.
  • koehler Posts: 598
    edited 2014-05-13 22:15
    Hmm, I've wrapped my head around how the LSB selects the proper bank/page/'slice' of RAM and some of the timing.
    The rest seems somewhat arcane, but then I'm a newbie.

    /party-pooper alert

    The timing issues seem to be the new 'management burden' that was complained about so often with the hub sharing schemes.

    This is quite out of the box; however, it seems like a Groundhog Day Part II, as we're now looking at hundreds of additional wiring/muxes(?), instructions, and probably power and die concerns again? This seems to have the most potential for bringing up the Mega-P2 'unintended consequences' that we've seen before.

    And, personally, it seems to be far more difficult to explain the theory (Lazy Susan is nice though) to newcomers, and the timing idiosyncrasies just to start.

    I personally found Seirth's suggestion to be kind of cool, though I'm not sure if there is any way to avoid restricting a Core to 32K. I get the impression, perhaps wrongly, that the Cores are all going to have their program/ram intertwined/spammed across the range?

    Before this gets too far into 'production', JMG normally asks for P&R or something to get an idea of die and power requirements.

    If that is somehow acceptable, I'm sure the lure of GB/xfer is going to be too powerful to resist.

    Just mentioning this, as several people have been so vociferous in arguing that a simple slot scheme towards this end (speed/ability) is not necessary for the Prop, is far too complex for the average User, is going to destroy the OBEX, kills symmetry/determinism, etc, etc.
    I concede I am probably missing the simplicity of this plan at this time, however this idea seems to do all/most of this?
    Having to worry about an address, which can jitter timing/determinism? Talk about putting a burden on the programmer?

    /end-party-pooping

    It's Chip's chip though, so he gets to make the call of course.

    BTW, if there is a timing difference between Inc/Dec/Same, for the sake of determinism, why not just stretch the shortest one/s out with a NOP so they all are the same, and deterministic?
  • jmg Posts: 15,140
    edited 2014-05-13 22:35
    koehler wrote: »
    ... as we're now looking at hundreds of additional wiring/muxes(?), instructions, and probably power and die concerns again?

    Chip has claimed RAM power can go down, and it should be easy to get FPGA indicators on MUX costs.
    koehler wrote: »
    And, personally, it seems to be far more difficult to explain the theory (Lazy Susan is nice though) to newcomers, and the timing idiosyncrasies just to start.
    Perhaps this increases demand for a good simulator? I think one(+) is already in the works.
    koehler wrote: »
    Before this gets too far into 'production', JMG normally asks for P&R or something to get an idea of die and power requirements.
    :) Already done above - and I've got some numbers for a Special Delta Adder that can wring more BW from this.
    koehler wrote: »
    Having to worry about an address, which can jitter timing/determinism? Talk about putting a burden on the programmer?

    BTW, if there is a timing difference between Inc/Dec/Same, for the sake of determinism, why not just stretch the shortest one/s out with a NOP so they all are the same, and deterministic?

    A challenge and an opportunity :)

    The timing difference is set by the Slot-scanner, so NOPs will not really fix that, but it is deterministic for a given Address delta, provided you know which Rotate you can land in.
    Code does not tend to change from inc/dec/hold; it either polls, or writes in one direction, but yes, there is more to think about.
  • Tubular Posts: 4,620
    edited 2014-05-13 23:30
    Chip, do you foresee any big problems with getting this up and running on a DE0-Nano, so we can code some test cases? I'm not so worried about the timeline as about major obstacles that might make it not practical.

    As I understand it the M9K memories can be arranged as 256 elements * 32 bits, and there must be something like 70 of them on the 4CE22. Even if we just started with using 16 of them (~16kByte / 4 kLong hub) to keep it simple initially
  • User Name Posts: 1,451
    edited 2014-05-13 23:46
    koehler wrote: »
    ... it seems like a Groundhog Day Part II, as we're now looking at hundreds of additional wiring/muxes(?), instructions, and probably power and die concerns again?

    Inasmuch as the muxes aren't being stuffed in and around the ALUs, I'm a happy camper.
  • Cluso99 Posts: 18,066
    edited 2014-05-14 00:05
    Simplest HUBEXEC scheme for the new HUB SCHEME
    1. Program Counter "PC" expanded to 17 bits (hub long addresses are 17bits + "00"). If address > $1EF then it's hub.
    2. If PC points to hub, waits until hub long can be read, and reads this directly into ALU (ie is muxed with the instruction read from cog)
    3. Additional support instructions (minimal)
      • JMPF @/#nnnnnnnnnnnnnnnnn
        • @/# for relative/direct, n is 17 bits
      • RETF
        • RETF is merely a JMP $1EF (indirect jump)
        • ie the return address (hub or cog) is stored at fixed cog address, say $1EF
      • CALLF @/#nnnnnnnnnnnnnnnnn
        • @/# for relative/direct, n is 17 bits
        • same instruction for both jmp & call (ie jmp always stores return address) - else use another bit for 2 instructions
        • always stores return address in fixed location, say $1EF
        • operates the same for cog and hubexec modes
        • required by GCC guys
      • Above instructions may optionally store/restore Z & C flags
      • If relative is too complex, direct is adequate
    4. Desirable (but not mandatory) instructions for hubexec
      • JMP/CALL/RET @/#nnnnnnnnnnnnnnnnn
        • uses an internal LIFO (as in older P2)
        • Can it be at least 16 deep ?
        • maybe it can use cog $1F0-1FF shadow ram ???
        • also very useful for cog mode
      • LOAD #nnnnnnnnnnnnnnnnnnnnnnn
        • n is 23 bits (IIRC that was the max that could be embedded in one instruction)
        • always stores an immediate value in a fixed location, say $1EE (better not to use $1EF ???)
    5. Nice to have - this instruction would help hubexec run much faster (but would require interleaved code from a complex assembler/compiler)
      • SETMODE #n
        • Sets the hubexec mode to increment the PC by 1(default) / 2 / 4 / 8 longs if the current PC address is in hub
        • This permits hub instructions to be interleaved to squeeze higher performance from the new hub scheme.
    I think with these we could run hubexec nicely, although not as easy as the old P2 fpga.
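    To show how little machinery the minimal version needs, here's a behavioral sketch in Python of the addressing and the fixed-return-slot call convention (my reading of the list above, not Chip's design; $1EF as the return slot is just the "say" value from the proposal):

    ```python
    COG_TOP  = 0x1EF    # long addresses above this are hub, per point 1
    RET_SLOT = 0x1EF    # fixed cog location holding the return address

    def is_hub(pc):
        return pc > COG_TOP

    def callf(state, target):
        # CALLF: save the return address in the fixed slot, then jump (hub or cog)
        state['cog'][RET_SLOT] = (state['pc'] + 1) & 0x1FFFF
        state['pc'] = target & 0x1FFFF            # 17-bit long address

    def retf(state):
        # RETF: effectively an indirect jump via the fixed return slot
        state['pc'] = state['cog'][RET_SLOT]

    state = {'pc': 0x00100, 'cog': {}}
    callf(state, 0x12345)                         # call into hub code
    print(hex(state['pc']), is_hub(state['pc']))  # 0x12345 True
    retf(state)
    print(hex(state['pc']))                       # 0x101
    ```

    Note that a single fixed slot means nested calls have to save $1EF themselves, which is presumably why the deeper LIFO sits in the "desirable" list.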
  • jmg Posts: 15,140
    edited 2014-05-14 00:14
    Cluso99 wrote: »
    • Additional support instructions (minimal)
    ...
    • If relative is too complex, direct is adequate
    ..

    Relative opcodes would also help copy-then-run code designs, and local (COG) function code loading.
    Hopefully they get included.
  • Lawson Posts: 870
    edited 2014-05-14 00:21
    This... seems rather brilliant! This is going to need a mucking big cross-bar switch, so it's not without cost. The sequential, or fixed pattern hub bandwidth looks plenty high. Only fully random reads will really suffer, and the worst bit there is that timing becomes as random as the address sequence.
  • jmg Posts: 15,140
    edited 2014-05-14 00:36
    Lawson wrote: »
    ... Only fully random reads will really suffer, and the worst bit there is that timing becomes as random as the address sequence.

    I'm wondering if a Nibble/tiny sized variant of Waitcnt would help?
    ie a 4-5-6-9? bit reload counter, that can Wait_TC snap.

    It would cost another opcode to be used when some timing-rate snap is needed, and work more like P1 HUB where the next-slot is snapped to.
  • Brian Fairchild Posts: 549
    edited 2014-05-14 00:43
    Lawson wrote: »
    Only fully random reads will really suffer, and the worst bit there is that timing becomes as random as the address sequence.

    So that's only 99% of your average compiled code that'll suffer then.


    I have to ask the question....WHY? As in, why do we need this? As in, why the extra complication?

    To make this new scheme work there are already loads of new opcodes needed.
  • koehler Posts: 598
    edited 2014-05-14 01:17
    So that's only 99% of your average compiled code that'll suffer then.
    I have to ask the question....WHY? As in, why do we need this? As in, why the extra complication?
    To make this new scheme work there are already loads of new opcodes needed.

    That's kind of what I was thinking.
    Probably great in many cases for a subset of problems; however, for the average problem, it sounds like there will be ___ what sort of effect?

    This is an interesting out of the box idea; however, it seems far, far more complicated, at least to me.
    It seems like whereas before additional b/w was not needed, or the impact of sharing was too much, we've jumped the shark, and there are shedloads more potential impacts for an increase in bandwidth used by a small/moderate subset of Prop use cases?

    Funny, I was previously really for Parallax to find some way to increase b/w.
    However, this seems far more complicated an undertaking, both in re-design effort and tools, and most especially, that oft heralded "No time for it, Ship it now, I want it NOW!, Parallax needs to ship now so it can get revenue", and especially so for all the 'dumb' users that we were previously so worried about not being intelligent enough to add a couple of Share lines to a CogInit/New...

    However, if all of the Standard OBEX objects will still work without issue, and this is somehow voluntary and will not affect them, or normal usage, then sure.
    Hopefully I am just sadly confused.
  • Tor Posts: 2,010
    edited 2014-05-14 01:57
    A very interesting scheme. It kind of screams for a simple software simulation (of the concept, not the Propeller) to explore the impact on the different usage scenarios discussed here.

    -Tor
  • Roy Eltham Posts: 2,996
    edited 2014-05-14 02:13
    Guys,
    The old scheme had a 128bit bus for each cog to hub and the hub was still divided into 16 memories because the stuff/synthesis/process/objects they are working with can't do a single block of ram large enough. So that 128bit bus was ultimately going to several memory blocks. Now each cog has 16 32bit busses each going to one of the 16 different memories. So yes, the cog has 4x more data lines connected to it directly, however all of the hub memories were connected to all 16 cogs before so they have the same number of data lines connected to them.

    The key difference right now is that both cog and hub memories can be 32bit instead of 128bit, so they consume less power and can run at higher clock rates. The cog memories consuming less power is going to be the most important, because their memories are running all the time when the cog is on. Also, having to route 128bit busses between cogs and multiple hub memories is more difficult than routing 32bit busses.

    So yes there are more data lines connected to each cog, but overall I think the synthesis will be better with the 32bit busses instead of 128bit ones. Ultimately, this memory design is simpler than what Chip was going to have to do with the 128bit busses, and it gives us higher bandwidth in the process.

    Anyway, the main thing I like about this setup is that we get higher read/write bandwidth between cogs and hub for all cogs equally, and all cogs are the same still.

    There is a slight downside in that it's difficult to synchronize with the hub for random access reads, but even with compiler generated code, memory access is rarely random access per long/element. It's x longs in order, then a random new address. So you'll get the x longs in order faster, and then have a slightly worse "worst case" wait on the random new address. Also, that random new address is often actually very near where it was (like branching over a few longs), so you'll likely either already have it (in the 16 long block burst read cases), or get it without waiting as long.

    Non-branching code (rare) executes at ~64% of full speed using block reads to fill the execution memory. Since the rdbloc instruction consumes 2 clocks itself plus the 16 clocks for the actual read, and then those 16 longs take 32 clocks to execute, we have a ratio of 50 clocks to run 32 clocks worth of stuff. I think from talking with Chip about this, in hubexec it's just the 16 clocks of reading (the 2 clocks of instruction execution get hidden someplace, I may be misremembering here) for 32 clocks of executing, so it's ~66% of full speed. Anyway, with typical small branch stuff (like conditionals causing branches over 1 or a few longs) the performance drops down some because you hit your "read stall" more often, but for small loops that fit into 16 longs your perf jumps up to 100%. Chip and I discussed having 2 or 4 cache buffers in cog memory space for hubexec that would make larger loops and backward branching perform at full speed much more often. It's just a matter of getting there in incremental steps and determining if the perf gain is worth the added cost in complexity/time/power/space.
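    For anyone checking that arithmetic, the two percentages fall straight out of the figures in the paragraph above:

    ```python
    read_setup = 2    # the rdbloc instruction itself
    block_read = 16   # 16 longs, one per clock
    execute    = 32   # 16 longs * 2 clocks each

    print(execute / (read_setup + block_read + execute))  # 32/50 = 0.64
    print(execute / (block_read + execute))               # 32/48 ~= 0.667 (hubexec case)
    ```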

    For hand written PASM, this scheme rocks so much. You get 4x R/W bandwidth to hub for in order buffer dumps/copies/etc. I'm really excited about the potential for some massive performance on some low level drivers and what not...
  • RossH Posts: 5,333
    edited 2014-05-14 02:19
    Roy Eltham wrote: »
    Anyway, the main thing I like about this setup is that we get higher read/write bandwidth between cogs and hub for all cogs equally, and all cogs are the same still.

    I agree with this, which is why I think it is perhaps the best solution to maximizing hub bandwidth so far offered.
    Roy Eltham wrote: »
    For hand written PASM, this scheme rocks so much. You get 4x R/W bandwidth to hub for in order buffer dumps/copies/etc. I'm really excited about the potential for some massive performance on some low level drivers and what not...

    I disagree with this - for anything that requires determinism, this scheme will be very hard to use (or you have to give up the speed advantage altogether). And complexity is a potential killer for a chip like the Propeller.

    But overall, the benefits of this scheme probably outweigh the disadvantages.

    Ross.
  • Roy Eltham Posts: 2,996
    edited 2014-05-14 02:41
    I don't understand why you disagree on the PASM code advantage? I also don't understand why everyone keeps calling this complex or more complex? It's not complex at all, it's very simple.

    One way to think about it is: HUB is not 1 big HUB, but instead 16 smaller HUBs. Each cog has access to each hub once every 16 cycles (or, each cog always has access available to one of the HUBs). The hub memory addressing is arranged so that if you read all 16 HUBs, you'll end up with 16 longs that are sequential. Issuing a WR/RDBLOC has no delay; it always starts reading/writing immediately, and when it's done you have looped around all the HUBs once.
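    One mapping that gives exactly that "read all 16 and you get 16 sequential longs" property is low-nibble striping, which also matches koehler's earlier comment about the LSBs selecting the slice (an illustration of the idea, not a confirmed memory map):

    ```python
    # Sequential hub long addresses stripe across the 16 RAMs, so one long from
    # each RAM yields 16 consecutive longs.
    def bank_of(long_addr):
        return long_addr % 16      # which of the 16 hub RAMs
    def row_of(long_addr):
        return long_addr // 16     # row within that RAM

    for a in (0, 1, 15, 16, 17):
        print(a, bank_of(a), row_of(a))
    # longs 0..15 land in banks 0..15, row 0; long 16 wraps to bank 0, row 1
    ```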

    All this has done is take something that was serial (only one cog can access HUB at a time) and made it parallel (all cogs can access HUB at the same time, just different parts of it), and it's exactly the same way that COGs make things parallel. Each one is independent of the others.

    It's really simple, straightforward, and fast. I think you guys are overthinking it, and making it more complex in your minds than it really is...
  • potatohead Posts: 10,253
    edited 2014-05-14 03:03
    I agree so far, Roy. Now I, along with everybody else, am really itching for a new FPGA image.

    The speed / determinism trade-off and "works this time" being opposed to "works all times" are core disagreements surrounding the other singular HUB access schemes. We valued things differently and that meant some wanted to emphasize control, others consistency, etc...

    IMHO, this scheme puts a lot of options out there in software now. And it's fast! Like with the round robin, there will be some learning on how to make best use of the COGS and their HUB interaction. However, like the round robin, singular access scheme, once that learning happens, it's gonna work for everybody, and having that be true is important.

    Personally, I really like how it's parallelism expressed in more ways now! And there are lots of ways people can choose to work to emphasize what they want to get done. We can work in big chunks, or little ones, use the striped data default, or structure it to center on one memory, etc... in the same kinds of ways we would use one COG, or a few, etc...
    Issuing a WR/RDBLOC has no delay, it always starts reading/writing immediately and when it's done you have looped around all the HUBs once.

    IMHO, this will work kind of like RDLONG did for people, in that sometimes it made sense to organize data in specific ways to maximize a HUB data transfer, only now it's a bigger transfer, and if there is a bit of waste, it's a bit bigger waste, but then again it's faster and there is more RAM too.