New Hub Scheme For Next Chip

ctwardell · 2014-05-13 17:43

jmg wrote: »

So you mean an option switch - as in the Default / Smarter choice ?
Could be some latency fish-hooks here ?

No, the hub would still work the new way. The only change is that the write instruction no longer blocks until it has proper alignment; the cache accepts the value to write and lets the cog continue executing.
The cache would write the value to the hub when it is in proper alignment.
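
To get a rough feel for what that buys, here is a minimal Python sketch. It assumes a model nobody in this thread has confirmed: 16 hub slots that rotate one per clock, the slot picked by the low nibble of the long address, a buffer one long deep, and a one-clock cost to hand a value over. The names `slot_wait` and `stalled_clocks` are made up for illustration.

```python
BANKS = 16  # assumed: 16 hub slots, one per low address nibble


def slot_wait(addr, t, cog=0):
    """Clocks until the rotating hub slot lines up with addr (assumed model)."""
    return (addr - cog - t) % BANKS


def stalled_clocks(writes, posted, cog=0):
    """Total clocks the cog spends stalled.

    writes is a list of (gap, addr): clocks of other work before each
    write, then the long address written.
    """
    t = 0
    stalled = 0
    pending_done = 0  # clock at which the buffered write reaches the hub
    for gap, addr in writes:
        t += gap
        if posted:
            # Stall only if the previous buffered write has not drained yet.
            if pending_done > t:
                stalled += pending_done - t
                t = pending_done
            pending_done = t + slot_wait(addr, t, cog) + 1
            t += 1  # assumed one-clock hand-off to the buffer
        else:
            # Blocking write: wait for alignment, then spend one clock writing.
            wait = slot_wait(addr, t, cog)
            stalled += wait
            t += wait + 1
    return stalled


spaced = [(16, 5), (16, 9), (16, 2)]
print(stalled_clocks(spaced, posted=False))  # 16
print(stalled_clocks(spaced, posted=True))   # 0
```

With writes spaced 16 clocks apart the buffered version never stalls in this model; two writes issued back to back still make the second one wait for the first to drain.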

C.W.

mark · 2014-05-13 17:44

@ctwardell

If you're satisfied with writing to the hub only once every 16 clocks, then it'll be no different from what it would have been, except that you'll be limited to 32k. For your planned usage, how random do you expect the write addresses to be?

EDIT
RE: write cache
And how deep do you make that cache? 1 long? How would you sync to the hub (or, more specifically now, a page)? RDXXXX?

jmg · 2014-05-13 17:48

cgracey wrote: »

NEWSFLASH:
I've decided to get rid of the PLLs in the CTRs, since they were only being used for video - and video, having another clock domain to deal with, was complicating everything.

OK, I think most of us figured those were going.

cgracey wrote: »

Video, a special case of fast DAC output, will now be tied to the system clock with a special NCO/divide-by-N circuit. The CTRs are gone, too! I actually talked with Phil Pilgrim about this and he agreed that smart I/O could completely compensate for lack of CTRs.

Interesting. The main issue I see with removing the COG counters is backward code compatibility.
Can you fit the extra COG-counter modes into the Smart Pin cells and still meet MHz timing?

cgracey wrote: »

What this means for you is that video will now be able to go out over the pins again, as straight bits, so you can drive LCDs.

That is an important feature. LCD interface is one area the Prop can do well in.

ctwardell · 2014-05-13 17:49

jmg · 2014-05-13 17:52

ctwardell wrote: »

No, the hub would still work the new way. The only change is that the write instruction no longer blocks until it has proper alignment; the cache accepts the value to write and lets the cog continue executing.
The cache would write the value to the hub when it is in proper alignment.

But can you turn this on and off? That was what I meant.
In this write example, does a second write have to queue in a FIFO if the first one is not done yet?
What about reads?

jazzed · 2014-05-13 17:52

Chip,

Will we still be doing rdlong, etc... the same as before to get deterministic timing or do we have to worry about the hub address with that too?

I.E.

rdlong d, s
instruction
instruction
rdlong d, s
instruction
instruction
etc....

ctwardell · 2014-05-13 17:57

jmg wrote: »

But can you turn this on and off? That was what I meant.
In this write example, does a second write have to queue in a FIFO if the first one is not done yet?
What about reads?

Shouldn't need a FIFO because the write only executes once every 16 cycles, so it should always have had a chance to do the underlying write.

A similar approach could be done for reading if we assume sequential reads, but is more complicated, I'm still thinking it through.

More on cached read...

Something along the lines of:

When a hub read occurs, if the address is in the cache and isn't stale, it uses that value and initiates an underlying read of the next sequential value.
If the address isn't in the cache, or the value is stale, it initiates the read of the proper value AND the next sequential value to be cached.
Cached values would become stale after some number of cycles, probably 16.
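
A minimal Python sketch of those three rules, just to pin the idea down. It keeps a single cached entry and ignores hub timing entirely; the class name `ReadAheadCache`, the list standing in for hub RAM, and the 16-clock staleness limit are all assumptions for illustration.

```python
STALE_AFTER = 16  # assumed staleness limit, in clocks


class ReadAheadCache:
    """One-entry read-ahead cache in front of a list standing in for hub RAM."""

    def __init__(self, hub):
        self.hub = hub
        self.addr = None       # address of the cached long, if any
        self.value = None
        self.loaded_at = None  # clock at which the cached long was fetched

    def read(self, addr, now):
        """Return (value, hit) for a read of addr issued at clock `now`."""
        hit = (
            self.addr == addr
            and now - self.loaded_at < STALE_AFTER
        )
        # On a hit use the cached long; on a miss or stale entry go to the hub.
        value = self.value if hit else self.hub[addr]
        # Either way, fetch the next sequential long into the cache.
        nxt = (addr + 1) % len(self.hub)
        self.addr, self.value, self.loaded_at = nxt, self.hub[nxt], now
        return value, hit


hub = list(range(100, 132))
cache = ReadAheadCache(hub)
print(cache.read(4, now=0))   # (104, False)  first read misses
print(cache.read(5, now=10))  # (105, True)   sequential read hits
print(cache.read(6, now=40))  # (106, False)  entry went stale
```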

This likely needs work, just the general idea.

C.W.

mark · 2014-05-13 17:57

jmg wrote: »

'Sparse' I am still trying to get a handle on, via Chip's replies.
Because the Rotate Selector also spins with an address nibble, to me that implies there can be a 'sweet spot' where COGs stay phase-locked with it, provided they match their address to the selector.

In some (most?) cases that will mean not adjacent address, but instead values related to the opcode-spacing.
e.g. if Chip means 2 SysCLKs for the unrolled-loop case, then a PTR that adds 2 every time will stay phase-locked, but will address only every second memory location - hence 'Sparse'.
Likewise a 3 or 5 SysCLK loop, would be fastest with a +3 or +5 on pointer.

It gets a little more complicated on deciding when to leave the nibble-range aka nibble-carry, but that is predictable.

Ah, ok. I have a better understanding of what you're getting at. This is tricky, and I imagine it only works if you're serially working with/through data. Any randomness and forget it. At least wait periods will likely be less than 16 clocks.
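
Here is how I'd sketch that phase-lock idea in Python, under an assumed model: 16 slots rotating one per clock, slot chosen by the low address nibble, and a loop body whose clock count already includes the hub access. The function name `run_loop` is made up.

```python
BANKS = 16  # assumed: one hub slot per low address nibble


def run_loop(loop_clocks, stride, count, cog=0, start_addr=0):
    """Return (total_clocks, stall_clocks) for `count` hub accesses.

    Each pass does one access, then `loop_clocks` clocks of work, then
    bumps the pointer by `stride` longs.
    """
    t = 0
    addr = start_addr
    stalls = 0
    for _ in range(count):
        # Wait until the rotating slot matches this address's low nibble.
        wait = (addr - cog - t) % BANKS
        stalls += wait
        t += wait + loop_clocks
        addr += stride
    return t, stalls


print(run_loop(loop_clocks=2, stride=2, count=8))  # (16, 0)
print(run_loop(loop_clocks=2, stride=1, count=8))  # (121, 105)
```

In this model a 2-clock loop with a +2 pointer never stalls, while the same loop with a +1 pointer waits 15 clocks on every access after the first.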

jmg · 2014-05-13 18:03

cgracey · 2014-05-13 18:07

jazzed wrote: »

Chip,

Will we still be doing rdlong, etc... the same as before to get deterministic timing or do we have to worry about the hub address with that too?

I.E.

rdlong d, s
instruction
instruction
rdlong d, s
instruction
instruction
etc....

The hub address will change your timing. You could arrange your RDLONG/WRLONG sequence to take advantage of the hub order, though, for much improved performance in reading and writing records.

jmg · 2014-05-13 18:07

ctwardell wrote: »

This likely needs work, just the general idea.

I get what you are trying to do, but the Spinning wheel nature of this complicates things.
e.g. even sequential reads by address will be spaced 17 clocks apart if doing INC, and 15 clocks apart if doing DEC.

That may be predictable enough for some applications, if they can work within the Next-Rotate aperture ?

If the code misses that Next-Rotate aperture and still needs deterministic operation, I think the spacing is now
18+16*N (inc by 2), or
17+16*N (inc by 1), or
16+16*N (same address, e.g. HUB polling), or
15+16*N (dec by 1), or
14+16*N (dec by 2)
- and this is non-deterministic only if you have no idea of N, but maybe that much time can use WAITxx?
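
The same numbers fall out of a few lines of Python, assuming 16 slots that rotate one per clock. The helper `next_spacing` and its `min_gap` parameter (the fewest clocks the code can manage between two accesses) are illustrative names only.

```python
BANKS = 16  # assumed: 16 hub slots, rotating one per clock


def next_spacing(delta, min_gap):
    """Clocks between two hub accesses whose addresses differ by `delta` longs.

    The slot for the second address comes around at (delta mod 16) clocks
    after the first access and every 16 clocks after that; take the first
    one that is at least `min_gap` clocks away.
    """
    spacing = delta % BANKS or BANKS  # same address means a full rotation
    while spacing < min_gap:
        spacing += BANKS
    return spacing


for delta in (2, 1, 0, -1, -2):
    print(delta, next_spacing(delta, min_gap=3))
# 2 18
# 1 17
# 0 16
# -1 15
# -2 14
```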

ctwardell · 2014-05-13 18:15

jmg wrote: »

I get what you are trying to do, but the Spinning wheel nature of this complicates reads.
eg Even sequential reads by address will be spaced 17 clocks, if doing INC, and 15 clocks if doing DEC.

That is why the actual hub read grabs two values when loading a dirty cache or the next sequential value on a clean cache hit.
So it is predictively grabbing the value on the next clock instead of waiting until next clock + 16.

I didn't consider descending sequential yet.

C.W.

mark · 2014-05-13 18:15

ctwardell wrote: »

If you actually took an entire 32k array like that you wouldn't be able to have more than 7 contiguously addressed longs in the hub for anything else.

C.W.

As I've more or less stated earlier, I'm not a big fan of the lower nibble being used to reference the page, but I'm not following what you mean in the quote above. Did you mean 15 contiguously addressed longs, by chance?
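
For what it's worth, here is the quick check behind that number, as a Python sketch. It assumes the page is picked by the low nibble of the long address, so one page takes every 16th long; `longest_free_run` is a made-up helper.

```python
PAGES = 16  # assumed: page selected by the low nibble of the long address


def longest_free_run(taken_page, total_longs=256):
    """Longest run of contiguous long addresses not on `taken_page`."""
    longest = run = 0
    for addr in range(total_longs):
        if addr % PAGES == taken_page:
            run = 0  # this long belongs to the fully used page
        else:
            run += 1
            longest = max(longest, run)
    return longest


print(longest_free_run(taken_page=0))  # 15
```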

I guess it really depends on how big of an array you realistically expect a deterministic cog to work with. For practical reasons as you mentioned (I think) above, you wouldn't want to go anywhere near 32k. Is that such a limit? I guess it depends on the application.

ctwardell · 2014-05-13 18:16

jazzed · 2014-05-13 18:17

cgracey wrote: »

The hub address will change your timing. You could arrange your RDLONG/WRLONG sequence to take advantage of the hub order, though, for much improved performance in reading and writing records.

So, we need to be aware of the addresses in DAT or VAR? Before we had a LONG statement to make things LONG aligned in DAT. Will you be introducing something like PAGE to align blocks? What about VAR blocks?

jmg · 2014-05-13 18:29

RossH · 2014-05-13 18:38

jazzed wrote: »

So, we need to be aware of the addresses in DAT or VAR? Before we had a LONG statement to make things LONG aligned in DAT. Will you be introducing something like PAGE to align blocks? What about VAR blocks?

Why would you need to be aware of them at compile time, since you don't generally know what cog you are going to be accessing them from anyway?

potatohead · 2014-05-13 18:48

I think we will know the COG more often, due to the fixed DAC pins per COG.

BTW: Intriguing scheme. I'm catching up with interest.

Electrodude · 2014-05-13 18:50

RossH wrote: »

Why would you need to be aware of them at compile time, since you don't generally know what cog you are going to be accessing them from anyway?

The cog you're running from doesn't matter - once you're locked, you're locked, no matter which cog.

RossH · 2014-05-13 18:53

Electrodude wrote: »

The cog you're running from doesn't matter - once you're locked, you're locked, no matter which cog.

Exactly - so (similarly) the base address of an array in Hub RAM doesn't matter either.

RossH · 2014-05-13 18:54

potatohead wrote: »

I think we will know the COG more often, due to the fixed DAC pins per COG.

Yes, you're right. :frown:

potatohead · 2014-05-13 18:55

...just catching up. Yes, agreed, provided we get a BLOCK alignment operator.

jazzed · 2014-05-13 18:55

RossH wrote: »

Why would you need to be aware of them at compile time, since you don't generally know what cog you are going to be accessing them from anyway?

The determinism of access appears to depend on the HUB address. How did you interpret it?

jmg · 2014-05-13 18:56

jazzed wrote: »

So, we need to be aware of the addresses in DAT or VAR? Before we had a LONG statement to make things LONG aligned in DAT. Will you be introducing something like PAGE to align blocks? What about VAR blocks?

Address handling will be a factor in speed, as will whether you INC or DEC.
I see DEC by 1 being faster than INC by 1, when scanning over HUB memory.

Given how RDBLOC works, it will make sense to Align for those.

RossH · 2014-05-13 18:59

jazzed wrote: »

The determinism of access appears to depend on the HUB address. How did you interpret it?

Yes, that's true. But the actual address doesn't matter, except for the very first access.

jmg · 2014-05-13 19:00

jazzed wrote: »

The determinism of access appears to depend on the HUB address.

My take on this is #72, and I would say determinism of access appears to depend on the relative HUB address.
ie inc/dec/same give different cycle counts.

First access is more of a lottery, and depends on relative delays of Rotate Selector (Guessing that is reset at init), and code delays to that point.

jmg · 2014-05-13 19:01

RossH wrote: »

Yes, that's true. But the actual address doesn't matter, except for the very first access.

How the address changes matters, see #72

mark · 2014-05-13 19:01

jmg wrote: »

Correct, this is for burst(serial) cases, and can work also for burst COG-COG. Burst cases tend to have the most demanding bandwidth issues.

I think it just needs some arithmetic-smarts on the PTR, to keep it phase-locked, and if read and write use the same 'Pointer Rules', they stream correct data.

Ah, right. I thought Chip already addressed this.

"I think it just needs some arithmetic-smarts on the PTR"
I agree. I retract my request for putting the page identifier in the upper nibble, as it makes random addressing of an array spread across multiple pages a PITA. That still leaves calculating address offsets for an array restricted to one page annoying, though. It would be nice if this "smart pointer" could also do something like rotate left automagically when it is used as a src for RD/WRx, or some other method which accomplishes the same thing.

RossH · 2014-05-13 19:05

jmg wrote: »

How the address changes matters, see #72

Exactly - not at all, except for the initial access.

cgracey · 2014-05-13 19:09

jazzed wrote: »

So, we need to be aware of the addresses in DAT or VAR? Before we had a LONG statement to make things LONG aligned in DAT. Will you be introducing something like PAGE to align blocks? What about VAR blocks?

You only need to be aware if you want to optimize timing. I will have to make some 16-long alignment directives, as you noted.
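
The arithmetic behind a 16-long alignment is just rounding up to the next 64-byte boundary. A minimal Python sketch follows, assuming byte addresses and 4-byte longs; `align_to_block` is a hypothetical helper, not the directive itself.

```python
LONG_BYTES = 4    # a long is 4 bytes
BLOCK_LONGS = 16  # align to 16 longs


def align_to_block(byte_addr):
    """Round a byte address up to the next 16-long (64-byte) boundary."""
    block = LONG_BYTES * BLOCK_LONGS
    return (byte_addr + block - 1) // block * block


print(align_to_block(65))  # 128
```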
