New Hub Scheme For Next Chip

tonyp12 · 2014-05-13 15:56

> addresses of an array stored in, say
Spin should have a new var type 'Block' that always allign at a 16long boundry

jmg · 2014-05-13 16:13

Expanding this to two COGs communicating, lets choose COG0 & COG8 as 'Rotator' delays are identical.

Imagine COG0 does ,from t=T
WRLONG Da,PTRx++
WRLONG Db,PTRx++
WRLONG Dc,PTRx++
WRLONG Dd,PTRx++
WRLONG De,PTRx++

and 8 SysClks later, (t=T+8) COG8 starts, with the same Initial pointer value
RDLONG Da,PTRx++
RDLONG Db,PTRx++
RDLONG Dc,PTRx++
RDLONG Dd,PTRx++
RDLONG De,PTRx++

That transfers 5 longs, very quickly.
Questions are then
How fast can it do this, 'short burst' ?
How can it manage the exact 8 sysCLK phase between COGs

I think SETB and JNB $ could work, but JNB will be quantized (2 or 4 cycles?),however the RDLONG will auto-wait, provided it is 'roughly right' ie within one Rotator. it will catch the correct slot.

Opcode jitter here, would impost a limit on how 'close' these paired COGs can be ? Opposite-on-Ring looks ok.

mark · 2014-05-13 16:15

tonyp12 wrote: »

> addresses of an array stored in, say
Spin should have a new var type 'Block' that always allign at a 16long boundry

It's not a matter of not being able to reserve a block of memory spread across the hub ram slices, but say I have an assembly routine that relies on accessing an array in hub ram every 16 sys clocks - it would therefore be desirable to have said array in only once of the ram slices. Problem is, if it's the LSBs that signify the slice, computing the address would require at least an additional instruction than what would typically be necessary, if I'm not mistaking.

Is there a reason it can't be the upper nibble which indicates the slice?

cgracey · 2014-05-13 16:20

[QUOTE=mark

cgracey · 2014-05-13 16:24

jmg wrote: »

I can see the wait detail, but cannot follow the delay with your values.

Can you enter the Address (optimal address?) and Cycles, for each line of

RDLONG Da,PTRx++ ' this one may wait
RDLONG Db,PTRx++ ' ideally, these ones do not wait
RDLONG Dc,PTRx++
RDLONG Dd,PTRx++
RDLONG De,PTRx++

The first one would likely need to wait. The others would go without delays (2 clocks each). I edited your Dc's to be incrementing, instead, as I assume that's what you meant. If you did mean to read the same long twice in a row, it would have to wait for the window to loop.

jmg · 2014-05-13 16:25

[QUOTE=mark

RossH · 2014-05-13 16:27

Very interesting notion, Chip - it solves a lot of the problems very neatly, and with the ability to do block reads at this speed, I don't see much need for also having HubExec.

However, if I understand your proposal correctly, it does "bend" determinism a little - if you are doing individual random reads and writes, your time between each hub access is now dependent on the address you are accessing. However, it is always possible to re-establish synchronicity between a cog and the hub, just by a cog doing a hub read of its own cog id.

I guess we could live with that.

Ross.

tonyp12 · 2014-05-13 16:30

>address would require at least an additional instruction
Say you declared four 16long Blocks with Spin, with cognew you pass along the start address of first block.
You read PAR,and depending how the new readblock opcode wants the address you probably righshift it 6 bits then store it in a cog var called arraypnt etc
If your cog needs to read part2 of the array you use arraypnt+1 and get array[16-31long]

jmg · 2014-05-13 16:31

cgracey wrote: »

The first one would likely need to wait. The others would go without delays (2 clocks each). I edited your Dc's to be incrementing, instead, as I assume that's what you meant. If you did mean to read the same long twice in a row, it would have to wait for the window to loop.

Yes, that was a typo, but wouldn't it just reload the same Dc 3 times - how does the destination register affect the loop ?
I'm assuming PTR is Pointer into HUB, AutoINC, and Dx is COG-Register

I think the AutoINC size matters, if the opcodes can pack to 2 clocks
( is that sysCLKs or OpCode CLKS ?)

then the PTR needs to track the Rotator phase exactly, to avoid adding delays.

Too small ++, for the Opcode delay, and it needs a go-around, and too large, and it adds a few waits.

mark · 2014-05-13 16:33

cgracey wrote: »

It's a lot simpler than you are supposing, I think. The user may never know or care that when he does a block transfer, it fills in in some crazy order. All he knows is that it worked.

To read or write any byte/word/long/block, you just give it the address and register and it does it. No special knowledge is needed.

I think you misunderstood me (but to be fair, it's very possible that I'm wrong).

Take my example that I gave above, and say I have a small array in only one slice of hub ram (in one slice on purpose: because I need hard coded access to it every 16 clocks). Normally to address it, you would have an index address and compute an offset, which may very well affect the LSBs in practice. If I have to maintain the bottom nibble to ensure that I have access to the specific slice which holds by when I need it, I'll need to perform some additional computations on the address.

cgracey · 2014-05-13 16:34

RossH wrote: »

...with the ability to do block reads at this speed, I don't see much need for also having HubExec.

I know, it would be a relief to not worry about hub exec. The thing is, hub exec makes writing large programs very simple. It would be somewhat fatiguing to always be thinking about limited code blocks that could be cached and run, LMM style. Maybe it wouldn't be so bad, actually, as the performance would be likely a bit higher, so there would be some value in it. Hub exec just allows mindless code to be spewed out that doesn't have to be aware of much. Deep down, I'd be fine with the LMM approach, myself.

cgracey · 2014-05-13 16:35

[QUOTE=mark

Rayman · 2014-05-13 16:46

I don't understand any of this, so it must be pure genius...

But, from my sad point of view, this sounds a lot like starting over...
Don't get mad at me, but at this point, I wish we'd (meaning Chip) would just stick to the basic design and leave anything new for the 2016 version...

cgracey · 2014-05-13 16:49

Rayman wrote: »

I don't understand any of this, so it must be pure genius...

But, from my sad point of view, this sounds a lot like starting over...
Don't get mad at me, but at this point, I wish we'd (meaning Chip) would just stick to the basic design and leave anything new for the 2016 version...

I was about to begin implementing the hub and I was languishing, thinking about how we needed some better way. This is it. It's actually dirt simple to implement, too. It's just a different way to finish the thing.

NEWSFLASH:

Rayman, this involves you... I've decided to get rid of the PLL's in the CTR's, since they were only being used for video - and video, having another clock domain to deal with, was complicating everything. Video, a special case of fast DAC output, will now be tied to the system clock with a special NCO/divide-by-N circuit. The CTRs are gone, too! I actually talked with Phil Pilgrim about this and he agreed that smart I/O could completely compensate for lack of CTRs. What this means for you is that video will now be able to go out over the pins again, as straight bits, so you can drive LCD's.

Rayman · 2014-05-13 16:52

Well, I suppose it's also important for the designer to maintain enthusiasm

ctwardell · 2014-05-13 16:55

Chip,

Can you think through reading pins at a fixed rate and writing the data to the hub at high speed, say for a logic analyzer.
The old 'get sync'ed' to the hub method made this easy, having the variable hub write time seems to cause some issues that it looks like would limit the top speed significantly.

I'm a little concerned that while this helps LMM somewhat, it isn't so good for the more traditional deterministic uses of cogs.

I'm starting to wonder if some level of write caching might be beneficial...(I'm putting my asbestos undies on right after I hit 'save changes'...)

Chris Wardell

cgracey · 2014-05-13 16:55

Rayman wrote: »

Well, I suppose it's also important for the designer to maintain enthusiasm

This has gone on forever, it seems, and I want it done. It's shaping up for a deterministic finish (very low risk of it not working the next time we build it).

I added some stuff to my response to you above, in case you didn't notice.

Rayman · 2014-05-13 16:57

Sounds great to me!

Not just, LCDs BTW... You can do other things, like HDMI (well I guess maybe that is like LCD)...

cgracey wrote: »

Rayman, this involves you.................. What this means for you is that video will now be able to go out over the pins again, as straight bits, so you can drive LCD's.

Peter Jakacki · 2014-05-13 16:58

This is a bit like a model railway with a circular track and 16 stations (cogs) with 16 carriages labelled 0 to 15 joined in hoop snake fashion trundling around, there is always one car outside your station. Now if your long address (a5..a2) matched the car then you don't have to wait but this random access works just the same as the P1 method. The advantage is always in incrementing block mode but not random access. The block opcodes take advantage of the fact that there is always a car available, it will simply access that car with one of 16 longs it happens to have waiting at the station (read/write).

This would also mean a coginit would be very fast too but the scheme is optimized for block ops or synchronized access only but doesn't help any random access which is what happens most of the time. We still need something similar to the supercog which of course has nothing to do with the cog itself, but the hub interface.

cgracey · 2014-05-13 16:59

ctwardell wrote: »

Chip,

Can you think through reading pins at a fixed rate and writing the data to the hub at high speed, say for a logic analyzer.
The old 'get sync'ed' to the hub method made this easy, having the variable hub write time seems to cause some issues that it looks like would limit the top speed significantly.

I'm a little concerned that while this helps LMM somewhat, it isn't so good for the more traditional deterministic uses of cogs.

Chris Wardell

You may need to use two cogs to tag team on high-bandwidth I/O tasks.

Markaeric was thinking about how to handle this above, by using a static $xxxxG address, where G doesn't change. That would always have the same phase.

cgracey · 2014-05-13 17:04

Peter Jakacki wrote: »

This is a bit like a model railway with a circular track and 16 stations (cogs) with 16 carriages labelled 0 to 15 joined in hoop snake fashion trundling around, there is always one car outside your station. Now if your long address (a5..a2) matched the car then you don't have to wait but this random access works just the same as the P1 method. The advantage is always in incrementing block mode but not random access. The block opcodes take advantage of the fact that there is always a car available, it will simply access that car with one of 16 longs it happens to have waiting at the station (read/write).

This would also mean a coginit would be very fast too but the scheme is optimized for block ops or synchronized access only but doesn't help any random access which is what happens most of the time. We still need something similar to the supercog which of course has nothing to do with the cog itself, but the hub interface.

I like your analogy.

Increasing random access is, like you said, a whole different problem.

ctwardell · 2014-05-13 17:04

cgracey wrote: »

Markaeric was thinking about how to handle this above, by using a static $xxxxG address, where G doesn't change. That would always have the same phase.

Yeah, that approach chews up memory fast by rendering the corresponding portions of the other slices useless unless you have them doing similar things.

I'm not trying to be difficult, just thinking through caveats that will require different ways of getting things done.

C.W.

tonyp12 · 2014-05-13 17:10

>using a static $xxxxG address, where G doesn't change
If you are planning to only use RDlong for single Longs out of the array and for some reason it have to be accessed every 16cycle with no wait:
If you have a few arrays interleve them in hub that way when you declare them with Spin that way you don't waste to much ram
Or just have to come up with a few hundred single long vars to fill in the space, no worry we will find a way in software/compiler to do this the best way later.

cgracey · 2014-05-13 17:10

ctwardell wrote: »

Yeah, that approach chews up memory fast by rendering the corresponding portions of the other slices useless unless you have them doing similar things.

I'm not trying to be difficult, just thinking through caveats that will require different ways of getting things done.

C.W.

I hear what you're saying.

What I love about this new way is that you can deterministically transfer 16 longs (or a multiple of 16, for that matter) between hub and cog at one long per system clock. That is fantastic, to me. Maybe your determinism is right there, in those block transfers. It would take twice as long for you to handle that data in instructions as it would to transfer it. Think about tag-teaming with another cog.

RossH · 2014-05-13 17:12

cgracey wrote: »

It would be somewhat fatiguing to always be thinking about limited code blocks that could be cached and run, LMM style.

That's what compilers are for!

Ross.

mark · 2014-05-13 17:16

[quote=mark

jmg · 2014-05-13 17:24

cgracey wrote: »

The first one would likely need to wait. The others would go without delays (2 clocks each)..

There is also this case, where burst copy from HUB to/from PINS are needed for Generate/Capture apps
RDLONG OUTA,PTRx++ ' this one may wait
RDLONG OUTA,PTRx++ ' ideally, these ones do not wait
RDLONG OUTA,PTRx++ '

and also the DJNZ and REPS versions.

How fast can the burst-spacing be each of UnRolled(above), DJNZ, REPS Loop cases , in SysCLKs ?

What does the PTR++ need to change by, for peak transfer, in each case.

What changes for RDBYTE, RDWORD cases ?

ctwardell · 2014-05-13 17:25

cgracey wrote: »

I hear what you're saying.

What I love about this new way is that you can deterministically transfer 16 longs (or a multiple of 16, for that matter) between hub and cog at one long per system clock. That is fantastic, to me. Maybe your determinism is right there, in those block transfers. It would take twice as long for you to handle that data in instructions as it would to transfer it. Think about tag-teaming with another cog.

I thinking it might be worth exploring having the hub writes be cached and behave similar to the old type of hub access.

Here is my thinking:

With the P1 style hub access you could get in sync and do useful work between hub accesses as they would always take the fixed minimum time.
With the new method the hub writes will take a variable amount of time depending on the address.
The issue is you cannot get in sync when writing sequential or random addresses so you spend time with the write blocking instead of being able to do useful work, like in our example sample pins at a fixed rate.
A single level cached write could be setup so it can execute at a fixed period like the P1, it would accept the write and let the cog continue executing P1 style.
If it is setup up to allow execution once every 16 cycles it should be able to do the underlying write and be ready for new data in time to accept another value.
If it is called early it would block just like a P1 hub operation.

This would allow getting in sync and doing useful work instead of being blocked by variable wait periods based on the address.

C.W.

jmg · 2014-05-13 17:31

[QUOTE=mark

jmg · 2014-05-13 17:39

ctwardell wrote: »

I thinking it might be worth exploring having the hub writes be cached and behave similar to the old type of hub access..

So you mean an option switch - as in the Default / Smarter choice ?
Could be some latency fish-hooks here ?

New Hub Scheme For Next Chip

Comments