New Hub Scheme For Next Chip - Page 6 — Parallax Forums

New Hub Scheme For Next Chip


Comments

  • Brian FairchildBrian Fairchild Posts: 539
    edited 2014-05-14 03:09
    Roy Eltham wrote: »
    Guys,
    ...but even with compiler generated code, memory access is rarely random access per long/element. It's x longs in order, then a random new address, So you'll get the x longs in order faster, and then have a slightly worse "worst case" wait on the random new address.

    Where, for 90% of real world embedded processor cases, x=1 or 2.
  • ctwardellctwardell Posts: 1,714
    edited 2014-05-14 03:11
    It is a clever scheme, but I fear it is one of those cure worse than the disease situations.
    The hub slot schemes were for the most part about latency, this scheme doesn't improve latency and makes the existing latency less determinate.
    The scheme will also lead to the desire to use sparse data structures to allow hub sync in many cases which makes that nice big 512k HUB more like something between 32k to 512k in actual use.

    This definitely should be explored, but I also think a quad access with at least hub slot pairing should also be evaluated and the two should be compared against common usage patterns and for gate usage/power consumption.

    Chris Wardell
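The latency behaviour being debated can be sketched with a toy model of the scheme as described in this thread: hub RAM split into 16 one-long-wide banks, with the bank a given cog can touch advancing by one every clock. The low-nibble bank mapping and the one-clock transfer are assumptions; the thread doesn't pin down every detail.

```python
# Toy model of the rotating hub window: 16 banks, bank = long_address % 16
# (assumed), and the window a cog sees advancing one bank per clock.

def wait_for_bank(bank, clock):
    """Clocks until the rotating window reaches `bank` for this cog."""
    return (bank - clock) % 16

def read_longs(start_addr, n, clock):
    """Total clocks to read n sequential longs starting at long start_addr."""
    total = 0
    for i in range(n):
        bank = (start_addr + i) % 16
        total += wait_for_bank(bank, clock + total) + 1  # wait, then transfer
    return total

# Sequential bursts stay in phase: after the first hit, each next long sits
# in the next bank, so it costs exactly one clock per long.
print(read_longs(0, 16, clock=0))   # 16: in phase from the start
print(read_longs(5, 16, clock=0))   # 21: 5-clock first wait, then 1/clock
```

The model makes both halves of the argument visible: sequential transfers approach one long per clock, while the first (random) access pays a variable 0..15-clock wait, which is the indeterminacy being discussed.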
  • jmgjmg Posts: 14,950
    edited 2014-05-14 03:13
    Roy Eltham wrote: »
    It's really simple, straight forward, and fast. I think you guys are over thinking it, and making it more complex in your minds than it really is...

    Some are more prone to hyperbole and chanting 'complex' than others....
    Roy Eltham wrote: »
    For hand written PASM, this scheme rocks so much. You get 4x R/W bandwidth to hub for in order buffer dumps/copies/etc. I'm really excited about the potential for some massive performance on some low level drivers and what not...

    Agreed, there is great potential for speed gains for Data Buffer and FIFO movement too, where tight loops can keep phase with the Rotate Scanner.

    Info on the opcode latencies and up to date Timing on inline, DJNZ and REPS loops will help tune this.

    Look forward to FPGA Build info, and any OnSemi sim runs on this.
  • jmgjmg Posts: 14,950
    edited 2014-05-14 03:32
    ctwardell wrote: »
    It is a clever scheme, but I fear it is one of those cure worse than the disease situations.
    The hub slot schemes were for the most part about latency, this scheme doesn't improve latency and makes the existing latency less determinate.

    'HubExec' (SW or HW or mixed) benchmarks will be interesting to see, but Latency outcomes are not all bad.
    It will be possible to get faster than 16x cycle-deterministic data sampling/transfer with some nibble-carry arithmetic additions to Pointer AutoINC/AutoDEC.

    First-access operations on P1 were also less determinate, & once the loop was running, you knew what the delays were (always 16N).
    On this new design, the data transfer delays are also predictable, once you know the Pointer rules (next address) used.
    ctwardell wrote:
    The scheme will also lead to the desire to use sparse data structures to allow hub sync in many cases which makes that nice big 512k HUB more like something between 32k to 512k in actual use.

    Yes and no. You can improve Data latency to match Code-loop speeds, and avoid going too sparse, in some cases.

    ctwardell wrote:
    This definitely should be explored, but I also think a quad access with at least hub slot pairing should also be evaluated and the two should be compared against common usage patterns and for gate usage/power consumption.

    That depends on Chip, but certainly Dual-Table based hub slot Allocate is easy to implement, with various ways of doing the Secondary Table decision.

    With ReLoad there are some cases where it can boost determinism, and remove the 16x shackles.
  • SeairthSeairth Posts: 2,474
    edited 2014-05-14 03:58
    rogloh wrote: »
    Reading through this I was initially intrigued by the ideas presented above but I think you just identified the main pitfall here. For non-OBEX code it can be dealt with internally by the coder, but for OBEX you'd definitely need some rather strict convention to limit the RAM ranges used by the COG drivers, otherwise you'd have a hard time carving up the RAM when you combine objects and also when you want to make use of burst block transfers with RDBLOCX/WRBLOCX. However having strict upper/lower conventions like this put in place will limit what the OBEX code will be able to do and how it can use its memory. I just feel that's a bit of a problem.

    Right now (with P8X32A), all drivers and user code must carefully share a single 32K hub memory. Split 8 ways, that means about 4K per cog. Assuming, of course, that no single cog needs more than 4K. Now, with what I was suggesting, each cog would have 16K AND a shared 256K. You still have to be careful, regardless of the scheme. But I think that the 16K/256K approach would still provide quite a lot of working space compared to what we have with the P8X32A. And the benefits you get in return would go a long way to helping cogs play nice with each other for hub space.
  • RossHRossH Posts: 4,849
    edited 2014-05-14 04:16
    ctwardell wrote: »
    It is a clever scheme, but I fear it is one of those cure worse than the disease situations.
    The hub slot schemes were for the most part about latency, this scheme doesn't improve latency and makes the existing latency less determinate.
    The scheme will also lead to the desire to use sparse data structures to allow hub sync in many cases which makes that nice big 512k HUB more like something between 32k to 512k in actual use.

    This definitely should be explored, but I also think a quad access with at least hub slot pairing should also be evaluated and the two should be compared against common usage patterns and for gate usage/power consumption.

    Chris Wardell

    Yes, that's my view as well. Some people can't see what it is we would be giving up in order to gain this additional hub bandwidth. I must say I find this quite difficult to understand, because we all know the Propeller will never win new market share based simply on speed, no matter how cleverly we maximize the hub bandwidth.

    Even with this scheme, the Propeller will still be quite a slow chip compared to its competitors - its advantages lie in other areas, such as its determinism, its symmetry and its simplicity.

    However, at least with this scheme you can regain determinism should you really need it (but at the cost of some additional code complexity and of course the loss of any speed advantage this scheme might have had).

    Ross.
  • RossHRossH Posts: 4,849
    edited 2014-05-14 04:19
    jmg wrote: »
    Some are more prone to hyperbole and chanting 'complex' than others....

    ... and others can't seem to understand that additional complexity always incurs additional cost.
  • Brian FairchildBrian Fairchild Posts: 539
    edited 2014-05-14 04:19
    @Ross,

    with your compiler hat on, how do you see this new scheme panning out? Is Catalina going to be able to make use of the speed-up feature at all?
  • RossHRossH Posts: 4,849
    edited 2014-05-14 04:25
    Roy Eltham wrote: »
    It's really simple, straight forward, and fast. I think you guys are over thinking it, and making it more complex in your minds than it really is...

    It's not that we don't understand it Roy, or fail to appreciate the speed advantage. It's that the need to resynchronize with the hub when you need determinism will complicate and slow down the code. We don't have to worry about that on the P1.

    Of course, not everyone will need to do it on the P2 - there are lots of objects that won't suffer from a little bit of "jitter", but there are others that will - and now we have to understand this and compensate for it.

    I still say overall it seems to be a positive - but it is not without its disadvantages.

    Ross.
  • JRetSapDoogJRetSapDoog Posts: 954
    edited 2014-05-14 04:26
    Regarding random access reads or writes (or the first access of a sustained transfer, hereinafter subsumed under the term "random access") being somewhat indeterminate (as is the case for the first hub-op with the P1), I wonder if an instruction could set a bit (individually for each cog) to turn determinism on/off for any access that isn't contiguous with that cog's last access. That way, if one needed total determinism, one would set the bit(s) for the relevant cog(s) and random access would always take 16 hub sub-cycles (even if the access could have been performed in as little as 1 sub-cycle). But if strict determinism was not needed for random accesses, then the bit could be cleared (not sure which setting, determinism or non-determinism, would be the default).

    With such a mechanism, deterministic access would always take 16 sub-cycles (X + Y = 16, where X = the current sub-cycle for the cog at hand and Y = 16 - X as "filler" sub-cycles). In contrast, non-deterministic random access would take about 8.5 sub-cycles on average (the mean of 1..16). That would make non-deterministic random access roughly twice as fast as deterministic random access (or deterministic random access about half the speed of the non-deterministic kind), the factor of roughly 2 being the price for the determinism.

    The foregoing has nothing to do with sustained reads in the contiguous (or logically contiguous) memory space (whichever way it's implemented), as it relates just to random access (again, including the first read of a sustained contiguous burst transfer). However, determining whether the last access was contiguous or not might be the tricky part (not sure how to check for that so as to prevent every access, including contiguous ones, from taking 16 sub-cycles).
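The trade-off described above can be quantified with the same assumed rotor model (bank = low nibble of the long address, one transfer clock): padding every random access to a fixed 16 sub-cycles buys determinism at roughly a 2x cost over free-running access.

```python
import random

# A "deterministic" random access is padded to a fixed 16 sub-cycles; a
# free-running one costs wait + transfer = 1..16 sub-cycles (assumed model).

def random_access_cost(bank, clock, deterministic=False):
    cost = (bank - clock) % 16 + 1      # 0..15 wait plus 1 transfer clock
    return 16 if deterministic else cost

random.seed(1)
samples = [random_access_cost(random.randrange(16), random.randrange(16))
           for _ in range(100_000)]
print(sum(samples) / len(samples))      # approaches 8.5, the mean of 1..16
```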
  • RossHRossH Posts: 4,849
    edited 2014-05-14 04:30
    @Ross,

    with your compiler hat on, how do you see this new scheme panning out? Is Catalina going to be able to make use of the speed-up feature at all?

    Of course! Catalina could easily use just about any of the schemes so far proposed. Cogs executing C code would not care about the resulting "jitter" this scheme introduces - we would have had this even with the much simpler SuperCog scheme. Nor (I think) would any other high level language care much - the downside of this scheme is all about the other cogs, usually executing code written in hand-crafted PASM - it would certainly complicate those cogs if they needed strict determinism.

    Ross.
  • JRetSapDoogJRetSapDoog Posts: 954
    edited 2014-05-14 05:02
    As I said in my one-and-only follow-up post (I dared not push it any further) to a post with a similar hub access scheme in another thread, using Chip/Roy's scheme (which appears to differ from what I proposed in how memory is wired up, as I had relied on the compiler to make data (such as a picture or code) logically contiguous) it's kind of comforting to know that two or more cogs won't be able to access the same memory location (long) at the same time.
    The memory, though broken up into banks of a sort, would appear flat to the user due to the way the hardware would automatically span the banks. And all cogs could access any part of memory, but cogs would generally have to wait...for random access as opposed to sustained access...But if reading or writing a "stream" of data (logically contiguous but not physically so), no waiting would be needed (except for a likely initial wait to start the "ball" rolling), and all cogs could do so simultaneously. The scheme doesn't allow for multiple cogs writing the same memory locations at the same time, but that's probably a feature. <bold emphasis added>
  • Brian FairchildBrian Fairchild Posts: 539
    edited 2014-05-14 05:05
    Thanks Ross.


    Here goes...

    Since it's all up in the air (again) why not just admit that the original P1 scheme doesn't scale well and start again?

    16 cores (cores 1-16) each with 4k of 32-bit longs for registers, no hubexec, smart pins as described
    1 additional (identical) core 0 with no IO and hubexec
    256k of central memory accessed in a strict round robin core0, core1, core0, core2, core0, core3....
    central cordic and big multiplier

    that's it, no slot allocation, no tables, no complicated schemes.

    I hear the cry "but the core only has 9-bit address fields" to which I reply "so what, that's the last processor, the proposed one bears no resemblance to it, and existing objects won't run without modification anyways."
  • Kerry SKerry S Posts: 163
    edited 2014-05-14 05:10
    How about a variant of this:

    Each cog is assigned one 32K block of memory as its primary memory. It has access to that block every other clock (0, 2, 4, 6, ...).

    Cogs can reach into other cog's blocks on the odd clocks using the round robin approach (1, 3, 5, 7, ...).

    Now each cog has an integral block of fast access memory and yet can exchange data with each other on a round robin format.


    This would eliminate the data interleaving and simplify how data is arranged.

    As long as your program/data fits into 32 K you run full speed.
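This variant can be sketched as follows. The exact orderings are guesses the post doesn't pin down: even clocks always serve the cog's own 32K block, and odd clock t = 2k+1 serves foreign block k mod 16, round robin.

```python
# Kerry S's variant, under assumed slot orderings.

def own_wait(clock):
    """Wait for the next even clock (own-block slot)."""
    return 0 if clock % 2 == 0 else 1

def foreign_wait(target, clock, n_cogs=16):
    """Wait until the odd-clock round robin reaches `target`'s block."""
    t = clock
    while t % 2 == 0 or ((t - 1) // 2) % n_cogs != target:
        t += 1
    return t - clock

print(max(own_wait(c) for c in range(64)))          # 1: own block never far
print(max(foreign_wait(b, 0) for b in range(16)))   # 31: a full round robin
```

The trade against the rotating scheme is visible in the numbers: private access is fast and fully deterministic, but reaching into another cog's block pays up to a 31-clock round-robin wait under these assumptions.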
  • evanhevanh Posts: 13,372
    edited 2014-05-14 05:28
    jmg wrote: »
    Chip has claimed RAM power can go down, and it should be easy to get FPGA indicators on MUX costs

    My understanding is Chip will be implementing a 16x16 crosspoint switch. Or maybe 8x8 with two-way muxing for alternating accesses. I've never tried to work out how hungry crosspoint switches tend to be but it wouldn't surprise me if this one ends up worse than the quad wide databus it's replacing.

    I'm keen to know the answer.
  • JRetSapDoogJRetSapDoog Posts: 954
    edited 2014-05-14 05:42
    It might be best to post speculative schemes that substantially differ from the well-defined current one under consideration (i.e., the subject of this thread) in another thread to kind of keep this thread "pure" or avoid "polluting" it too much. I would kind of hate to see this thread BLOW UP like The New 16-Cog Propeller thread did with a gazillion different hub access schemes (25 pages?). I'd say either create a new thread for lots of different schemes or dedicate a separate thread for each one (rather than use this thread or the aforementioned 16-Cog thread). This scheme obviously needs a dedicated (relatively pure) thread because [1] it is so radically different from the "ordinary" hub slot sharing schemes, and, [2] because it is kind of a formal proposal from Parallax (even if still under consideration). Of course, you might have a variant of the scheme that is closely related, and that could be posted here for comparison/contrast. Anyway, post whatever, wherever you want (sometimes having no rules is best); this is just a suggestion.
  • dMajodMajo Posts: 850
    edited 2014-05-14 05:43
    RossH wrote: »
    Of course! Catalina could easily use just about any of the schemes so far proposed. Cogs executing C code would not care about the resulting "jitter" this scheme introduces - we would have had this even with the much simpler SuperCog scheme. Nor (I think) would any other high level language care much - the downside of this scheme is all about the other cogs, usually executing code written in hand-crafted PASM - it would certainly complicate those cogs if they needed strict determinism.

    Ross.

    Correct me if I am wrong.

    Any cog sequentially or randomly accessing addresses of the form Adr*16+CogID (CogID = 0..15) will be totally deterministic and will have 1/16 of the hub bandwidth ... If all cogs do so, nothing changes compared to the traditional scheme (except that each cog is isolated in its own 32K slice, so at least one cog would have to behave differently to allow data exchange).

    Any cog sequentially or randomly accessing addresses of the form Adr*8+K (K = 0..7) will also be totally deterministic, with 1/8 of the hub bandwidth.

    To take advantage of the fast sequential bank/page reading, perhaps a sort of token-ring communication between the hubs could be used.

    Perhaps some inc/dec instructions/pointers can be provided that allow this kind of address math.
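These striding rules check out under the assumed bank = long_address mod 16 mapping: a fixed stride of 16 pins a cog to a single bank (a 16-clock cadence, like the P1 hub), while a stride of 8 alternates between two banks 8 apart (an 8-clock cadence).

```python
# Which banks a strided access pattern touches, assuming bank = addr % 16.

def banks_touched(stride, offset, n=64):
    """Set of banks hit by addresses a*stride + offset for a = 0..n-1."""
    return sorted({(a * stride + offset) % 16 for a in range(n)})

print(banks_touched(16, 5))   # [5]: one bank, 1/16 of the bandwidth
print(banks_touched(8, 5))    # [5, 13]: two banks, 1/8 of the bandwidth
```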
  • T ChapT Chap Posts: 4,083
    edited 2014-05-14 05:59
    The more I think about this scheme, the more I like it. How else could you get as many as 16 simultaneous reads or writes without a 16-port RAM? In terms of efficiency and throughput, it's hard to top! It's like 16 keyholes into 16 small rooms, instead of one keyhole into one large room.

    -Phil

    In a few (very unqualified) responses in some other threads, I made the suggestion that the current scheme of 1 limited access ram was a "flawed concept". This may not be the best term to describe my view, but the new scheme appears to resolve part of the "flawed" nature of 16 cogs needing access to a hub(now maybe it is more a virtual hub?). The only thing that lacks is the ability to access the other ram not attached to that cog as quickly or at precise times. That is why I suggested a method in hardware to "mult" the writes to other cogs simultaneously, which gives a much closer approximation to a 16 port multi access ram.
  • evanhevanh Posts: 13,372
    edited 2014-05-14 06:12
    T Chap wrote: »
    ... The only thing that lacks is the ability to access the other ram not attached to that cog as quickly or at precise times. That is why I suggested a method in hardware to "mult" the writes to other cogs simultaneously, which gives a much closer approximation to a 16 port multi access ram.

    All of HubRAM is still equally accessible in the new scheme. The big change is that there are now multiple blocks, each with a 32-bit databus, spread equally and rotating in lock-step around all the Cogs, rather than one 128-bit-wide data path serving one Cog at a time. You could sort of say there are multiple Hubs now.

    The downside is determinism has to be carefully crafted.
  • evanhevanh Posts: 13,372
    edited 2014-05-14 06:27
    Oh, cool, the 16 long block read/write is impressive. Maybe too good, I'd be happy with half that.
  • tonyp12tonyp12 Posts: 1,950
    edited 2014-05-14 06:57
    Determinism:
    Getting 16 longs will always take 16 cycles starting from now, check.
    Getting single longs from random hub addresses will, with the use of compiler simulation, be deterministic; it's not like you're going to have to count the cycles by hand in your head.
    The simulator will run loops in case the hub addresses keep changing inside them, so you can see if there are any bad spots and then play around with the order in which the hub vars are allocated.

    We will see patterns after a while: if the loop is 5 instructions the hub address should inc, if it's 4 we should use dec, etc. Seasoned P2 programmers won't even need to click the simulator button.
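A miniature of the "simulator button" idea: trace a loop with one hub access per iteration and report the wait each time around, so you can see whether an incrementing hub pointer stays in phase with the rotor. The rotor model (bank = addr mod 16, served when clock mod 16 equals the bank) is an assumption.

```python
# Trace hub-slot waits for a loop doing one hub access per iteration.

def trace_loop(loop_clocks, addr_step, iters=8, start_addr=0):
    clock, addr, waits = 0, start_addr, []
    for _ in range(iters):
        w = (addr % 16 - clock) % 16      # wait for this address's bank slot
        waits.append(w)
        clock += w + 1 + loop_clocks      # slot wait + transfer + loop body
        addr += addr_step
    return waits

# A 16-clock body plus the 1-clock transfer gives a 17-clock loop, which
# lands exactly on the next bank when the pointer increments by one long:
print(trace_loop(loop_clocks=16, addr_step=1))   # [0, 0, 0, 0, 0, 0, 0, 0]
print(trace_loop(loop_clocks=15, addr_step=1))   # settles into 1-clock waits
```

This is exactly the kind of pattern the post predicts: match the loop length to the pointer stride and the waits vanish; miss by a clock and the loop self-corrects to a slightly longer, but still steady, period.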
  • LawsonLawson Posts: 870
    edited 2014-05-14 07:37
    So that's only 99% of your average compiled code that'll suffer then.


    I have to ask the question....WHY? As in, why do we need this? As in, why the extra complication?

    To make this new scheme work there are already loads of new opcodes needed.

    For me the fully random reads are more like 0.1% of my code. On all the tight driver code I've written so far, the read and write pattern has been FIXED at compile time. So even though the initial delay and throughput will be more work to predict before run-time, the actual code will always run at the same speed. When I was talking about "fully random" reads I was thinking of using a random number generator to calculate the address for WRLONG. AFAIK this is a pretty rare scenario. Only interpreters are likely to approach this worst case.

    Marty
  • cgraceycgracey Posts: 13,723
    edited 2014-05-14 07:50
    Tubular wrote: »
    Chip do you forsee any big problems with getting this up and running on a DE0-Nano, so we can code some test cases? I'm not so worried about the timeline as major obstacles that might make it not practical.

    As I understand it the M9K memories can be arranged as 256 elements * 32 bits, and there must be something like 70 of them on the 4CE22. Even if we just started with using 16 of them (~16kByte / 4 kLong hub) to keep it simple initially


    I hope to get this coded up today. There's really not much to it. The DE0-Nano can support this at 32KB of hub RAM.

    All the signals going to the RAMs are gated at the cogs and then OR'd together, kind of like the DIR and OUT signals. This is faster than mux'ing. The only place actual mux's are needed is in the return data paths from the RAMs to the cogs. This will be equivalent to each cog having a sixteen-to-one 32-bit mux - not a big deal.
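The OR'd-signal arrangement described above can be shown in miniature: each cog gates its outgoing signals to zero unless it owns the slot, and the RAM sees the OR of all cogs' outputs. It works because exactly one cog is selected per RAM per clock (a sketch of the idea, not the actual gate-level design).

```python
# OR-bus sketch: gated-to-zero outputs OR'd together, like DIR/OUT signals.

def or_bus(cog_outputs, selected):
    """OR together each cog's gated output; at most one `selected` is True."""
    bus = 0
    for value, sel in zip(cog_outputs, selected):
        bus |= value if sel else 0   # gated at the cog, then OR'd
    return bus

vals = [0x12345678, 0xDEADBEEF, 0xCAFEF00D]
print(hex(or_bus(vals, [False, True, False])))   # 0xdeadbeef
```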
  • Cluso99Cluso99 Posts: 18,062
    edited 2014-05-14 08:08
    Hub slot timing is going to be a little harder to calculate, and is going to be address determined.

    Byte access, incrementing, and starting at a 16 long boundary, will be (presuming 2 clock setup RD/WRBYTE) in clocks...
    2-17, 16,16,16, 17,16,16,16, 17,16,16,16, 17,16,16,16, (11 more sets of 17,16,16,16), 17,16,16,16, (then)
    1,16,16,16, (15 sets of 17,16,16,16), then repeats this line.
    Of course the 1 will miss causing this to be 17.

    Byte access decrementing gives a different sequence, as does word access, and as does long access.

    So calculating determinism is not going to be straightforward.
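The 17,16,16,16 pattern above falls out of a rough model (assumptions: 4 bytes per long, bank = low nibble of the long address, the rotor serving bank b when clock mod 16 equals b, and a 2-clock gap before the next RDBYTE can issue):

```python
# Clock deltas between successive incrementing RDBYTE completions.

def rdbyte_deltas(start_byte, n, min_gap=2):
    """Rough model: re-accessing the same long's bank waits a full
    revolution (16); the next long's bank slot was just missed (17)."""
    times, clock = [], 0
    for i in range(n):
        bank = ((start_byte + i) // 4) % 16   # 4 bytes per long, 16 banks
        t = clock + min_gap                   # earliest the next op can issue
        t += (bank - t) % 16                  # rotate around to the bank slot
        times.append(t)
        clock = t
    return [b - a for a, b in zip(times, times[1:])]

print(rdbyte_deltas(0, 9))   # [16, 16, 16, 17, 16, 16, 16, 17]
```

Within a long the same bank comes back around every 16 clocks; stepping into the next long, its bank slot sits one clock ahead of the one just used, and since the next instruction can't issue that soon, it costs 17.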
  • tonyp12tonyp12 Posts: 1,950
    edited 2014-05-14 08:08
    >This is faster than mux'ing
    Do you get the feeling that the RAM will now be much faster, going down from 128-bit to 32-bit etc., and that 250MHz is now possible?

    >So calculating determinism is not going to be straightforward.
    Did you not see my post above? Who is asking you to do the calculations in your head? The compiler will have a simulator/trace mode.

    We will end up with a few perfect solutions: use 6 instructions and place B before A in hub, etc.
    If both A and B are 'random' (both auto dec/inc), insert a 'rdlong temp,#0' to align yourself again if you want loop determinism. (A nibble waitcnt would be nice to have.)
  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2014-05-14 08:12
    How do the cached reads such as RDBYTEC work with this scheme? Would they read in more than one long at a time? I was thinking that for bytecode it would work fine if it always double buffered as most bytecode instructions have a bit to do before the next read. So a read on the second long would trigger a read on the following long perhaps.
  • SeairthSeairth Posts: 2,474
    edited 2014-05-14 08:16
    How do the cached reads such as RDBYTEC work with this scheme? Would they read in more than one long at a time? I was thinking that for bytecode it would work fine if it always double buffered as most bytecode instructions have a bit to do before the next read. So a read on the second long would trigger a read on the following long perhaps.

    RDxxxxC instructions are gone, I believe.
  • Cluso99Cluso99 Posts: 18,062
    edited 2014-05-14 08:18
    How do the cached reads such as RDBYTEC work with this scheme? Would they read in more than one long at a time? I was thinking that for bytecode it would work fine if it always double buffered as most bytecode instructions have a bit to do before the next read. So a read on the second long would trigger a read on the following long perhaps.
    It may be simpler to just read a 16-long block (16 clocks) and grab the longs from the cog by software (i.e. don't implement RDxxxC).

    Or maybe an instruction with an offset counter to peel off bytes from a set of 16 cog longs???
  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2014-05-14 08:23
    Seairth wrote: »
    RDxxxxC instructions are gone, I believe.

    Is there anywhere where the instruction set is kept up-to-date? RDBYTEC solved the problem of extracting bytes efficiently but how do we do it now? Have we made progress?

    I'm having trouble seeing how this scheme helps real-world sequential non-synchronized situations though. No trouble at all understanding how it works and how it's implemented etc. Don't believe that muttering "the compiler will take care of it" answers anything at all either.
  • markmark Posts: 252
    edited 2014-05-14 08:28
    At first I didn't see the implication of (IIRC) ctwardell's suggestion that hub writes (except for block writes, obviously) shouldn't stall execution, but I think there's a VERY good case for not stalling, given the partitioning of hub RAM.

    Consider the example again of a tight PASM loop which incorporates a WRBYTE

    WRBYTE
    INST 1
    ....
    INST 6
    loop

    As it stands with this new hub scheme, the timing cannot be deterministic if the hub addresses used in the WRBYTE instruction are not limited to one page of memory. However, if WRXXXX instructions don't stall execution, then it can be! There will be jitter on the actual data write in hub ram, but I think this is less of a concern. Now unless I'm making some mistake, you can make the WRXXXX queue 2 levels deep, allowing for a write to *ANY* hub address to be performed every 4 instructions. That 2-level queue, however, would require some logic to evaluate the addresses of the two WRx requests to make sure each request is committed to memory ASAP.


    RDx is a bit more tricky, but we can still maintain loop timing determinism *IF* RDx doesn't stall execution AND we pipeline our loop so that we send a hub request for that data 16 clocks before we intend to use it.
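The posted-write idea above can be sketched as a queue simulation (assumptions: bank = low nibble of the long address, and the rotor retires at most one queued write per clock when it passes that write's bank). The cog's loop never stalls; only the moment the data lands in hub RAM jitters, and a small queue absorbs it.

```python
import random

# Posted writes: WRxxxx drops its bank into a queue and execution continues.

def max_queue_depth(banks, issue_period, clocks):
    q, depth, nxt = [], 0, 0
    for clock in range(clocks):
        if clock % issue_period == 0 and nxt < len(banks):
            q.append(banks[nxt]); nxt += 1   # posted write: no stall
        depth = max(depth, len(q))
        for i, b in enumerate(q):
            if b == clock % 16:              # rotor is at this write's bank
                del q[i]
                break
    return depth

random.seed(2)
banks = [random.randrange(16) for _ in range(200)]
print(max_queue_depth(banks, issue_period=8, clocks=2000))    # stays small
print(max_queue_depth(banks, issue_period=16, clocks=4000))   # 1
```

With one random-address write per 8 clocks (roughly one WRx per 4 two-clock instructions), the queue depth stays around 2 in this model, consistent with the 2-level queue suggested above; same-bank back-to-back writes can push it slightly deeper.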

