Fast hub RAM timing

TonyB_ · 2018-01-19 18:10

Why does RDFAST wait until the FIFO contains read data before completing? Couldn't any wait, if necessary, be added to RFBYTE/RFWORD/etc. instead?

It seems a waste of clock cycles to me. I'd like to do a RDFAST on a certain clock tick, then execute some other code, then do a RFBYTE on the next clock tick. The tick period would be long enough for the FIFO to contain data. In effect, the RDFAST is the start of an internal memory read request and the RFBYTE latches the read data.

Do we simply subtract 8 from the various timings in Instructions v31, now that there won't be 16 cogs in the finished chip? Wouldn't it be better for the documented timings to be changed to the eight-cog values?

cgracey · 2018-01-19 18:40

The data needs to be ready, without delay, after RDFAST. The streamer may be enabled in the next instruction, which starts pulling data via hidden RFLONGs or RFBYTEs.

The timings need to be updated. In some cases, the improvement is more than 8 clocks.

TonyB_ · 2018-01-19 18:52

How many clock cycles with eight cogs for the following?

RDBYTE/RDWORD/RDLONG
WRBYTE/WRWORD/WRLONG
RDFAST

Is the maximum now eight fewer?

cgracey · 2018-01-19 19:14

TonyB_ wrote: »

How many clock cycles with eight cogs for the following?

RDBYTE/RDWORD/RDLONG
WRBYTE/WRWORD/WRLONG
RDFAST

Is the maximum now eight fewer?

I need to run some tests and determine a formula to adjust all numbers by. At least 8 clocks will be saved on each worst-case clock count.

evanh · 2018-01-19 22:18

Oh, on that note of casting the die, for the first production model, what is the Streamer to DAC direct mapping? I'm guessing only half the DACs will be mapped.

cgracey · 2018-01-20 06:13

evanh wrote: »

Oh, on that note of casting the die, for the first production model, what is the Streamer to DAC direct mapping? I'm guessing only half the DACs will be mapped.

Each cog outputs 4 DAC channels. The streamer in each cog can drive new values into all 4 channels on each clock.

Each pin can select which cog's DAC channel it receives, in such a way that pin %xxxxCC can pick DAC channel %CC from any cog.

potatohead · 2018-01-20 08:42

Nvm

evanh · 2018-01-20 10:51

goes and reads docs ...

The background state of these four 8-bit channels can be established by SETDACS:

Oh, wow, I hadn't noticed that at all. I thought the only way to set a DAC by hand was to set up a Smartpin.

And:

Each cog outputs four 8-bit DAC channels that can directly drive the DAC's of pins.
DAC0 can drive the DAC's of all pins numbered %XXXX00.
DAC1 can drive the DAC's of all pins numbered %XXXX01.
DAC2 can drive the DAC's of all pins numbered %XXXX10.
DAC3 can drive the DAC's of all pins numbered %XXXX11.

Okay, that's clear enough. Each channel of each Cog can output to 16 different DACs. And as you've said, all DACs can be reached from every Cog. I hadn't realised that that much routing still existed. That'll be 1/4 of first Prop2-Hot incarnation I presume?

cgracey · 2018-01-20 11:03

evanh wrote: »

cgracey wrote: »

Each pin can select which cog's DAC channel it receives, in such a way that pin %xxxxCC can pick DAC channel %CC from any cog.

Hurm? That reads like any Streamer can DMA to any Pin DAC.

I thought it was 4 dedicated Pins fixed mapped to each Streamer/Cog. In the 16 Cog/64 Pin version of the design that was exactly 4 dedicated Pins per Streamer. Have I got that wrong?

It's been programmable for a long time now. For a while, cog DAC channels were fixed to sets of 4 pins. Now, any pin can select which cog it gets its DAC channel data from.

What IS fixed is that:

Pins 0/4/8/12/...60 can select any cog's DAC channel 0
Pins 1/5/9/13/...61 can select any cog's DAC channel 1
Pins 2/6/10/14/...62 can select any cog's DAC channel 2
Pins 3/7/11/15/...63 can select any cog's DAC channel 3
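
For readers who want the fixed part of the mapping spelled out, here is a minimal Python sketch of the %xxxxCC rule described above. The function names are hypothetical and only model the numbering; which cog a pin listens to is a separate, programmable selection that is not modelled here.

```python
def dac_channel_for_pin(pin):
    """Return the cog DAC channel (0-3) that a pin can select.

    Models only the fixed rule from the post: pin %xxxxCC picks channel %CC.
    """
    if not 0 <= pin <= 63:
        raise ValueError("pin must be in 0..63")
    return pin & 0b11  # low two bits of the pin number


def pins_for_channel(channel):
    """Return the pins that can receive the given DAC channel (from any cog)."""
    if not 0 <= channel <= 3:
        raise ValueError("channel must be in 0..3")
    return list(range(channel, 64, 4))
```

Under this model each channel reaches 16 pins, which matches evanh's reading of the docs.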

evanh · 2018-01-20 11:13

Quick reply there Chip! I went and read the docs the moment I posted that and quickly understood your first reply. I think I edited the first post a minute later.

evanh · 2018-01-20 11:24

The routing resources have become significant I'd guess. The DACs and Smartpins are clearly quite hungry in this area, but one detail that'd be interesting to know is if the crosspoint switching to HubRAM spatially packs well compared to an equivalent bandwidth singular HubRAM block.

cgracey · 2018-01-20 11:34

TonyB_ had a really cool idea that he PM'd me about and I'm implementing:

Special RDFAST and WRFAST that take only TWO clocks (no waiting). You just have to make sure you elapse enough clocks before doing another hub memory operation, to ensure that the RDFAST/WRFAST is done.

Two-clock RDFAST and WRFAST, as TonyB_ pointed out, will enable timing determinism, so that we can better approach FPGA-replacement apps, where timing must be exact.

This alternate behavior is triggered when D[31] is high on RDFAST/WRFAST. The block count is always in D[13:0] and D[31] is normally low.
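
As an illustration of that operand layout, here is a small Python sketch that builds a D value from a block count and the no-wait flag. The helper name and constants are hypothetical, and it assumes the bits between D[13:0] and D[31] are left at zero, which the post does not specify.

```python
NO_WAIT_FLAG = 1 << 31            # D[31]: high selects the two-clock behavior
BLOCK_COUNT_MASK = (1 << 14) - 1  # D[13:0]: block count


def fast_operand(block_count, no_wait):
    """Build a D operand for RDFAST/WRFAST (illustrative helper, not a tool API)."""
    if not 0 <= block_count <= BLOCK_COUNT_MASK:
        raise ValueError("block count must fit in D[13:0]")
    flag = NO_WAIT_FLAG if no_wait else 0
    return flag | block_count
```

For example, fast_operand(5, True) gives $8000_0005, while fast_operand(5, False) gives the ordinary waiting form, 5.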

We could actually have two-clock WRxxxx, as well, but we only have two '{#}D,{#}S' instruction slots. I suppose a WXLONG instruction would be most valuable. Maybe a WXBYTE, also. 'X' is for 'exit'. There's no room for three of them, though. I think long- and byte-writes would be most valuable. No, that actually won't work because the cog must be waiting, after all. RDFAST/WRFAST can be sped up, though.

evanh · 2018-01-20 11:53

Umm, that's pretty significant changes for this late on.

However, I'm all for removing instruction stalls ... and have suggested similar strategies a couple of years back. It came to a head when Cluso was trying to streamline his serial routines and wanted to pair up Cogs so he didn't have to pass any data via HubRAM. And the only reason he didn't like HubRAM was because of the hard to predict instruction stalls.

You may remember me trying to sell you on having a second FIFO per Cog. That was to remove stalls while providing both HubRAM reads and writes.

evanh · 2018-01-20 11:58

Another idea was to implement permanent HubRAM write buffering.

cgracey · 2018-01-20 12:11

I remember you bringing those ideas up.

RDxxxx and WRxxxx are just too sticky to free, but RDFAST/WRFAST can be sped up.

Here is the original code:

assign fast_done =   (rdfast || wrfast) && ~|fast_mx;

And here is the modified code to realize 2-clock RDFAST/WRFAST when D[31] is set:

assign fast_done =   (rdfast || wrfast) && ~|fast_mx ||		// normally 'done'
		     fast_cmd && dx[31];			// immediately 'done' if dx[31]

I'm compiling it now to test it.
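
To make the effect of the extra term easier to see, here is a rough Python model of the modified assignment. It is only a sketch: the signal names come from the Verilog above, but their exact meaning is assumed (fast_mx is treated as a plain integer whose bits must all be zero for the normal 'done', mirroring the ~| reduction NOR).

```python
def fast_done(rdfast, wrfast, fast_mx, fast_cmd, dx):
    """Behavioral model of the modified fast_done expression."""
    # (rdfast || wrfast) && ~|fast_mx : normally 'done'
    normally_done = (rdfast or wrfast) and fast_mx == 0
    # fast_cmd && dx[31] : immediately 'done' if dx[31] is set
    immediately_done = fast_cmd and bool((dx >> 31) & 1)
    return bool(normally_done or immediately_done)
```

In Verilog, && binds tighter than ||, so the two-line assignment groups the same way as the two terms here.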

cgracey · 2018-01-20 12:39

It seems to work just fine. Doing an 8-cog compile now.

evanh · 2018-01-20 13:45

Here's where I first openly pondered this idea - http://forums.parallax.com/discussion/comment/1374376/#Comment_1374376

EDIT: Hmm, that link isn't lining up correctly. Here's the quote:

It would good to have some way to pragmatically avoid the RDFAST instruction stalling when writing code for a soft device. Then one could use it to direct RFxxxx pre-fetches for a certain number of instructions later without having to run through blocks at a time.

EDIT2: It would seem I spelt programmatically way wrong. I suspect the spell-checker must have intervened.

evanh · 2018-01-20 13:48

All the ideas were to solve Cluso's frustration. Although he never saw it that way.

cgracey · 2018-01-20 17:25

Well, I'm glad it came up again. I just couldn't picture it the first time around.

Maybe because things are now more settled, I can think about it more clearly. When you brought it up, it just seemed all too messy.

TonyB_ · 2018-01-20 18:24

Chip, many thanks for implementing my idea, as suggested.

Although late in the day, it is such a simple enhancement, trivial in terms of logic, but the benefits could be massive. As well as 100% predictable timing, otherwise wasted wait states can now be used productively, as described in the top post.

The modified worst-case timings with only eight cogs would be useful to know, sometime. Anyway, the fog of uncertainty has lifted and it's clear blue skies from now on!

jmg · 2018-01-20 19:18

TonyB_ wrote: »

Although late in the day, it is such a simple enhancement, trivial in terms of logic, but the benefits could be massive. As well as 100% predictable timing, otherwise wasted wait states can now be used productively, as described in the top post.

Yes, sounds well worthwhile: a simple and 'safe' superset that gives the designer the control.

cgracey wrote: »

It seems to work just fine. Doing an 8-cog compile now.

If you get the timing wrong (during development), what happens? (E.g. does the read return a copy of the previous read data?)
I.e. how does a developer confirm/prove they have this correct?

If you do have the timing exact, and flip the opcode, does it still work exactly the same?

It would be good to have a table of the cycle trade-offs here (code simplicity vs hand crafting), to allow informed decision making.

TonyB_ · 2018-01-20 21:05

Here are the old 16-cog timings from Instructions v31 :

         16-cog timing
         Cogex cycles    Hubex cycles

RDBYTE   9-24            9-44
RDWORD   9-24*           9-44*
RDLONG   9-24*           9-44*

WRBYTE   3-18            3-38
WRWORD   3-18*           3-38*
WRLONG   3-18*           3-38*
WMLONG   3-18*           3-38*

         * +1 if crosses hub long

         16-cog timing
         Cogex cycles

RDFAST   10-25 + WRFAST finish
WRFAST   3     + WRFAST finish
FBLOCK   2

RFBYTE   2
RFWORD   2
RFLONG   2
RFVAR    2
RFVARS   2

WFBYTE   2
WFWORD   2
WFLONG   2

What happens if WFxxxx follows a RFxxxx without a WRFAST and vice-versa?

evanh · 2018-01-21 00:14

Yes, I'm super glad it's done. It did bother me quite a bit that the FIFO wasn't more streamlined.

TonyB_ · 2018-01-21 00:49

Here's a slightly truncated version of my PM for posterity:

TonyB_ wrote:

Chip, some more thoughts in private about RDFAST. Even if the worst-case delay is nine fewer cycles with eight cogs, that's still a delay of 16 cycles or eight instructions.

My main interest in the P2 is how much it could replace complex programmable logic, especially FPGAs. To that end, I think it would be excellent to have the option of no wait states for fast hub reads that prevent the P2 doing something else, even if the address has just changed and the FIFO needs refilling.

Would it be possible for D[31] to indicate "slow RDFAST" or "fast RDFAST" instructions? If D[31]=0 the cog waits for the FIFO to contain valid data, whereas if D[31]=1 there is no wait and it is entirely up to the programmer to ensure there is enough time between the RDFAST and the next RFBYTE/RFWORD/etc. In both cases WRFAST must finish first.

I didn't mention WRFAST because I was concerned mainly with fast hub reads. It's interesting that a fast WRFAST takes two cycles (keeping the cycle count even, which might be important) compared to three for the slow version (both assuming previous WRFAST has finished). Therefore one instruction must come between fast WRFAST and WFxxxx, as I understand it.
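
The arithmetic in the PM can be checked with a short Python sketch. It assumes the 16-cog RDFAST worst case of 25 clocks from Instructions v31, takes the "nine fewer cycles" saving as the hypothetical it was stated as, and uses two clocks per ordinary instruction. The names are illustrative only.

```python
RDFAST_WORST_16_COG = 25   # upper end of the 10-25 range in Instructions v31
ASSUMED_SAVING = 9         # "nine fewer cycles", a guess in the PM, not a measurement
CLOCKS_PER_INSTRUCTION = 2


def remaining_stall(worst_case, saving):
    """Return (clocks, instruction slots) still spent waiting in the worst case."""
    clocks = worst_case - saving
    return clocks, clocks // CLOCKS_PER_INSTRUCTION


stall_clocks, stall_instructions = remaining_stall(RDFAST_WORST_16_COG, ASSUMED_SAVING)
```

With these inputs the result is 16 clocks, or eight two-clock instruction slots, which is the time a no-wait RDFAST would hand back to the program.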

TonyB_ · 2018-01-21 01:16

On a vaguely related topic, is there any chance at all that, with early zero detection perhaps, NOP could take one cycle, not two? This would allow finer control of timing.

msrobots · 2018-01-21 05:17

+1

a one clock NOP.

I can imagine it being useful for fine synchronization of loops, or even cog-to-cog. An interesting idea.

Mike

cgracey · 2018-01-21 05:59

Instructions have to step through the pipeline and that takes two clocks. WAITX {#}D takes 2+D clocks, which gives precise timing.
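
Given the 2+D rule, picking the WAITX operand for a wanted delay is simple arithmetic. A minimal Python sketch follows; the helper names are hypothetical, and the valid range of D is not modelled beyond being non-negative.

```python
WAITX_BASE_CLOCKS = 2  # pipeline cost stated in the post


def waitx_clocks(d):
    """Total clocks taken by WAITX with operand d (2 + D)."""
    if d < 0:
        raise ValueError("D must be non-negative")
    return WAITX_BASE_CLOCKS + d


def waitx_operand(total_clocks):
    """Operand D that makes WAITX take exactly total_clocks."""
    if total_clocks < WAITX_BASE_CLOCKS:
        raise ValueError("WAITX cannot take fewer than 2 clocks")
    return total_clocks - WAITX_BASE_CLOCKS
```

So although a one-clock NOP is not available, an odd delay such as 3 clocks is reachable with D = 1, which gives single-clock granularity from 2 clocks upward.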

TonyB_ · 2018-01-23 23:51

It's very quiet here, just like a Sunday. I thought the number of posts would go up again once most people were back at "work"!

cgracey wrote: »

TonyB_ wrote: »

How many clock cycles with eight cogs for the following?

RDBYTE/RDWORD/RDLONG
WRBYTE/WRWORD/WRLONG
RDFAST

Is the maximum now eight fewer?

I need to run some tests and determine a formula to adjust all numbers by. At least 8 clocks will be saved on each worst-case clock count.

It would be good to have the 8-cog timings for the different hub RAM reads and writes or formulae so that we can calculate them, while this matter is fresh in our minds. Also the answer to some other points, such as what happens:

(a) if RFxxxx occurs too soon after fast RDFAST?
(b) if WFxxxx follows RFxxxx without intervening WRFAST?
(c) if RFxxxx follows WFxxxx without intervening RDFAST?

I posted the old timings so they could be copied and replaced with the new ones here.

TonyB_ · 2018-01-23 23:53

Another query I have is about the block size in RDFAST/WRFAST/FBLOCK.

64 bytes is 16 longs, or 16 cogs * one long per cog. Is that what determined the minimum size of the hub FIFO? Now that there are only 8 cogs, is the FIFO smaller?

cgracey · 2018-01-24 02:51

TonyB_ wrote: »

Another query I have is about the block size in RDFAST/WRFAST/FBLOCK.

64 bytes is 16 longs, or 16 cogs * one long per cog. Is that what determined the minimum size of the hub FIFO? Now that there are only 8 cogs, is the FIFO smaller?

The FIFO is smaller, yes.

The block size could be smaller, too.

I keep the block size at 16 longs, though, so that we can have software compatibility across a family of chips, with up to 16 cogs.

cgracey · 2018-01-24 06:53

TonyB_ wrote: »

It's very quiet here, just like a Sunday. I thought the number of posts would go up again once most people were back at "work"!

cgracey wrote: »

TonyB_ wrote: »

How many clock cycles with eight cogs for the following?

RDBYTE/RDWORD/RDLONG
WRBYTE/WRWORD/WRLONG
RDFAST

Is the maximum now eight fewer?

I need to run some tests and determine a formula to adjust all numbers by. At least 8 clocks will be saved on each worst-case clock count.

It would be good to have the 8-cog timings for the different hub RAM reads and writes or formulae so that we can calculate them, while this matter is fresh in our minds. Also the answer to some other points, such as what happens:

(a) if RFxxxx occurs too soon after fast RDFAST?
(b) if WFxxxx follows RFxxxx without intervening WRFAST?
(c) if RFxxxx follows WFxxxx without intervening RDFAST?

I posted the old timings so they could be copied and replaced with the new ones here.

a) It returns errant data. Ideally it should ignore all WFxxxx and RFxxxx operations if 'not ready'.
b and c) I need to look into this. Ideally, it should ignore WFxxxx in RDFAST mode and RFxxxx in WRFAST mode.

Thanks for bringing these things up. I've been really tied up with Treehouse and OnSemi the past few days. I intend to get into these other matters soon.
