@TonyB_ said:
If the streamer is running, write a burst of 8 longs when needed.
If the streamer is not running, write the long/word/byte ASAP.
That may not be so easy to determine since the streamer could be running but is not the one using the FIFO.
Point taken and previous post edited.
Evan, have you tested the streamer/FIFO reading longs with known slice differences of 0-7 between streamer and fast block addresses? Or have you only tested using unknown random differences?
We still don't know exactly when the excessively long block moves occur during FIFO reads but I'm convinced they are not random.
@TonyB_ said:
Evan, have you tested the streamer/FIFO reading longs with known slice differences of 0-7 between streamer and fast block addresses? Or have you only tested using unknown random differences?
We still don't know exactly when the excessively long block moves occur during FIFO reads from hub RAM, but I'm convinced they are not random.
I stopped using random ages back. With the FIFO start address at multiples of a longword there are repeatable measurements, albeit wildly different for some addresses, for each test. You even said I didn't need 12 tests because 8 tests cover the combinations.
Attached is most recent verification of that: Five reruns demonstrating the consistency. Each line of tests has a decreasing hubRAM start address for the FIFO/streamer: 36, 32, 28, 24, 20, 16, 12, 8, 4, 0.
With the FIFO writing hubRAM there is no variation from address offset, except for index 17 using odd hubRAM addressing, i.e. 37, 33, 29, 25, 21, 17, 13, 9, 5, 1.
The attached burstwr1.txt uses odd addressing.
I've written mailbox code that depends on HUB RAM updates in a given order. It's handy to know that certain longs get written or read before others when you do a burst write or burst read. If you randomize that order with FIFO accesses etc it would get more complex to know when the different longs will be written.
@evanh said:
Attached is most recent verification of that: Five reruns demonstrating the consistency. Each line of tests has a decreasing hubRAM start address for the FIFO/streamer: 36, 32, 28, 24, 20, 16, 12, 8, 4, 0.
OK, so 36 and 4 start at same slice? If so, I was expecting 36 and 4 results to always match but they don't always.
@evanh said:
Attached is most recent verification of that: Five reruns demonstrating the consistency. Each line of tests has a decreasing hubRAM start address for the FIFO/streamer: 36, 32, 28, 24, 20, 16, 12, 8, 4, 0.
OK, so 36 and 4 start at same slice? If so, I was expecting 36 and 4 results to always match but they don't always.
/me goes looking ... hmm, yeah, I hadn't really looked for that. The repeat runs were consistent and that's as far as I went.
EDIT: Oops, those are so old they still have the buggy duplicate test results. Deleting ...
EDIT2: Okay, the zip file has been replaced with a new set of runs.
@evanh said:
/me goes looking ... hmm, yeah, I hadn't really looked for that. The repeat tests were consistent and that's as far as I went.
In your burst-test.spin2 I'm looking at, FIFO start = 40 initially and block move start = 0 always, I think. FIFO start reduced by four each loop, 10 loops total. Therefore, FIFO and block move don't start at same slice and FIFO slices go backwards each loop.
EDIT:
New zip file results as expected, last two columns match first two on quick look.
Okay, 40 down to 4 rather than 36 down to 0. Changing the block fill/copy address never made any diff. to the FIFO. At most it might add or subtract eight ticks of total block copy time depending on FIFO burst coincidence.
@rogloh said:
I've written mailbox code that depends on HUB RAM updates in a given order. It's handy to know that certain longs get written or read before others when you do a burst write or burst read. If you randomize that order with FIFO accesses etc it would get more complex to know when the different longs will be written.
Yes, it could be messy to fill in randomly. I was thinking the same thing.
It would be best to manage the FIFO depth, if possible.
@rogloh said:
I've written mailbox code that depends on HUB RAM updates in a given order. It's handy to know that certain longs get written or read before others when you do a burst write or burst read. If you randomize that order with FIFO accesses etc it would get more complex to know when the different longs will be written.
Yes, it could be messy to fill in randomly. I was thinking the same thing.
It would be best to manage the FIFO depth, if possible.
Chip, what is the determining factor for the FIFO depth? Is it streaming longs from hub RAM at sysclk * 1?
@rogloh said:
I've written mailbox code that depends on HUB RAM updates in a given order. It's handy to know that certain longs get written or read before others when you do a burst write or burst read. If you randomize that order with FIFO accesses etc it would get more complex to know when the different longs will be written.
Yes, it could be messy to fill in randomly. I was thinking the same thing.
It would be best to manage the FIFO depth, if possible.
Chip, what is the determining factor for the FIFO depth? Is it streaming longs from hub RAM at sysclk * 1?
Yes, once it locks onto the 1-of-8 address of interest, it starts issuing read commands to the hub RAMs and 5 clocks later, starts pushing the read longs into the FIFO at the full clock rate until the FIFO level reaches 14. After this, it ceases issuing read commands and 5 more longs stream in over the next 5 clocks, due to registered logic delays, making a possible total of 19 longs storable in the FIFO. Any time the FIFO level dips below 14 stored longs, it reloads at the next opportunity, so that up to 19 longs are stored. The only reason that less than 19 longs would be stored is that longs are being simultaneously popped out, as well as pushed in.
When this was being developed, Brian Dennis discovered that hub-exec was not always working, because the FIFO-load rules and FIFO depth were not right. I never could figure out what was needed by contemplating it, so I made a Prop1 program that simulated random FIFO bursts and the rotating address mechanism. This taught me that we needed cogs+11 levels of FIFO storage, given our fixed set of register delays in the multiplexing scheme. After running the simulation for several seconds, this distal level would be hit and never exceeded. That's how I knew how deep to make the FIFO. And thanks to Brian, or I might have realized too late that there was a problem.
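For reference, here is a minimal host-side model (plain Python, not P2 code) of the refill rule as described above: depth = cogs+11 = 19, read commands keep issuing while the stored level is below 14, each commanded long lands 5 clocks after its command, and a command can only issue on the tick when the egg beater presents the slice of the next FIFO address. The consumer popping one long every 3 ticks and the zero slice phase are just illustrative assumptions; it's a sketch, not a gate-accurate description.

```python
COGS, TRIP, PIPE = 8, 14, 5     # cogs, refill trip level, command-to-data delay
DEPTH = COGS + 11               # 19 FIFO stages

level, addr, peak = 0, 0, 0
arrivals = []                   # ticks at which already-commanded longs will land
for tick in range(400):
    # issue decision uses the level as it stood at the end of the previous tick
    if level < TRIP and tick % COGS == addr % COGS:
        arrivals.append(tick + PIPE)    # read command issued, data lands PIPE ticks later
        addr += 1
    if arrivals and arrivals[0] == tick:    # a commanded long lands in the FIFO
        arrivals.pop(0)
        level += 1
    if tick % 3 == 0 and level > 0:         # assumed consumer: RFLONG paced at sysclk/3
        level -= 1
    peak = max(peak, level)

print("peak FIFO level:", peak, "of", DEPTH)    # never exceeds cogs+11 in this model
```

With the consumer removed, the level in this model tops out at exactly 19, matching the cogs+11 figure above.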
In every case, with streamer mode RFLONG, the measured average is 2.67 longwords per burst.
In some cases, with streamer mode RFWORD, the measured average is a flat 1 longword per burst.
PS: Exact assigned streamer modes are ##DM_32bRF | DM_DIGI_IO | $ffff and ##DM_16bRF | DM_DIGI_IO | $ffff
@TonyB_ said:
Chip, what is the determining factor for the FIFO depth? Is it streaming longs from hub RAM at sysclk * 1?
Yes, once it locks onto the 1-of-8 address of interest, it starts issuing read commands to the hub RAMs and 5 clocks later, starts pushing the read longs into the FIFO at the full clock rate until the FIFO level reaches 14. After this, it ceases issuing read commands and 5 more longs stream in over the next 5 clocks, due to registered logic delays, making a possible total of 19 longs storable in the FIFO. Any time the FIFO level dips below 14 stored longs, it reloads at the next opportunity, so that up to 19 longs are stored. The only reason that less than 19 longs would be stored is that longs are being simultaneously popped out, as well as pushed in.
Thanks for the info, Chip.
To recap, here are the two worst examples of FIFO bus hogging. The time T for a fast block write of N longs when FIFO is reading longs was measured eight times, once for each of the eight possible slice differences between block and FIFO start addresses.
sysclk/3: T = 1.5N (4 of 8), or T = 9N (4 of 8)
sysclk/9: T = 1.17N (3 of 8), or T = 9N (5 of 8)
I think the 9N times occur because the fast block write has to yield to the FIFO after writing one long, which happens during every egg beater revolution.
The two SETXFRQ values were $2aaa_aaaa and $0e38_e38e, respectively. These were not incremented as recommended in the doc but I think that makes no difference.
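For scale, a quick back-of-envelope check of those two rates (a hypothetical fair-share estimate, assuming the block write moves one long per tick when unblocked and the FIFO only takes the write slots it actually needs; the formula is not from the doc):

```python
# If the FIFO stole only the slots it actually needs, a fast block write of
# N longs would take roughly N * k/(k-1) ticks, where k = $8000_0000 / SETXFRQ
# is the FIFO's fetch spacing in ticks.
for setxfrq in (0x2AAA_AAAA, 0x0E38_E38E):
    k = 0x8000_0000 / setxfrq
    print(f"${setxfrq:08X}: one FIFO fetch every {k:.2f} ticks, ideal T = {k / (k - 1):.3f} N")
# -> roughly 1.5N for sysclk/3 and about 1.13N for sysclk/9, versus the measured
#    worst case of 9N in both.
```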
@TonyB_ said:
The two SETXFRQ values were $2aaa_aaaa and $0e38_e38e, respectively. These were not incremented as recommended in the doc but I think that makes no difference.
@Yanomani said:
I may be on the wrong path, but, in fact, their effect on NCO rollover will appear way earlier:
It keeps summing after $8000_0000. A rollover, being circular (modulo), is not a reset to zero.
So, $0E38_E38E paces the streamer fetches at every 9 ticks. And $2AAA_AAAA, err, $2492_4924 is effectively 3.5, so it alternates/dithers the streamer fetches between 3 and 4 ticks. $2AAA_AAAA is every 3 ticks.
Yes, there can be a phase difference of one tick from the start. And there can even be slight drift from ideal due to the 32-bit fraction not being exact. None of which is of concern here. All the NCO fractions selected are entirely arbitrary. I could have chosen to use index increments of 10ths instead of halves, for example.
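A minimal sketch of that modulo behaviour (host-side Python; it uses a simplified rule where a fetch occurs each time the accumulator crosses $8000_0000, which reproduces the pacing quoted above but isn't a claim about the exact hardware accumulator):

```python
def fetch_intervals(setxfrq, ticks=64):
    """Return the gaps, in ticks, between successive streamer fetches."""
    acc, last, gaps = 0, 0, []
    for tick in range(1, ticks + 1):
        acc += setxfrq
        if acc >= 0x8000_0000:      # rollover is circular (modulo), not a reset to zero
            acc -= 0x8000_0000
            gaps.append(tick - last)
            last = tick
    return gaps

for f in (0x2AAA_AAAA, 0x2492_4924, 0x0E38_E38E):
    print(f"${f:08X}: {fetch_intervals(f)}")
# $2AAA_AAAA settles on 3-tick gaps, $2492_4924 dithers between 3 and 4,
# $0E38_E38E settles on 9-tick gaps (give or take the start-up phase).
```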
@evanh said:
It keeps summing after $8000_0000. A rollover, being circular (modulo), is not a reset to zero.
That part I understood for sure: any residues will cause slight "hiccups" much further along, as the cycle-yielding progresses.
But what I missed was the real point here; due to some misunderstanding, I was convinced that keeping in sync with the cog-to-hub rotation relationship was of prime importance in order to get the maximum throughput, without incurring many (perhaps frequent) "slip events".
Cool, good to get more input too. You could say we're looking for anything that could be a hiccup. And we sure found more than expected.
I guess my attitude now is: is there anything flawed about the findings? And if not, then how the hell are we seeing averages of less than 6 longwords per burst?
Chip,
In the silicon docs under the HUB RAM INTERFACE section there is this paragraph:
The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously whenever less than (cogs+7) stages are filled, after which point, up to 5 more longs may stream in, potentially filling all (cogs+11) stages.
What do the two little words "up to" mean to you? Can the FIFO just ignore some of the trailing data coming from hubRAM? Or can it mean there aren't always five more? Or do they just not really have any importance?
"The FIFO contains (cogs+11) stages." => it contains 19, at the present incarnation;
"When in read mode, the FIFO loads continuously whenever less than (cogs+7) stages are filled," => whenever less than 15 stages; so, e.g., 14, or less;
"after which point, up to 5 more longs may stream in, potentially filling all (cogs+11) stages." => there comes the real doubt: if it contains 14, then "up to 5 more longs may stream in" will effectivelly make it contains 19, at the end, so, it'll don't ignore none of the possible longs comming in, BUT, if the consumption ratio drains any of the ones, yet present at the Fifo when the "less than (cogs+7)"-trip-point was initially triggered, it can end with less than 19, but more than 14.
I believe that "up to 5 more longs may stream in" is just covering the possibility of any END condition to hit, in the meantime, stopping the IN dataflow.
No, I'm not saying anything bad about the findings, all the more because I believe they're right.
But I believe I also have some clue about the "17" dilemma:
within a long, bytes can be any of X, Y, Z or W, but words can only be "XY" or "ZW" (or is there any real chance of a "YZ" word???).
So, as Chip indicated in an earlier post, bytes are "packed" into longs (with the possible exception of the very first ones (up to three) and, sure, the very last ones), but those don't make any meaningful difference to the total count or transfer time.
But, depending on the "index" used, the "packing" action will burn clock cycles, so it can affect the total number of rotations around the hub needed to "fill" the longs where the transition occurs.
17 is just too close to (19 + 14) / 2 not to be taken into account. Whenever the opportunity of keeping in sync with the hub rotation is lost, another round will be necessary just to complete the byte-to-long translation, so the final yield will be affected by the need to resync with the rotation.
Once the 19 FIFO long slots have been populated, at least 5 need to be consumed before another round can be started, so perhaps in those relationships lies the reason the final yield gets halved.
@evanh said:
Chip,
In the silicon docs under the HUB RAM INTERFACE section there is this paragraph:
The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously whenever less than (cogs+7) stages are filled, after which point, up to 5 more longs may stream in, potentially filling all (cogs+11) stages.
What do the two little words "up to" mean to you? Can the FIFO just ignore some of the trailing data coming from hubRAM? Or can it mean there aren't always five more? Or do they just not really have any importance?
There's nothing new in that, Tony. Not really any different from the doc. It basically says that six is the minimum. But evidence suggests the hardware actually allows less.
Comments
Point taken and previous post edited.
Evan, have you tested the streamer/FIFO reading longs with known slice differences of 0-7 between streamer and fast block addresses? Or have you only tested using unknown random differences?
We still don't know exactly when the excessively long block moves occur during FIFO reads but I'm convinced they are not random.
I stopped using random ages back. With the FIFO start address at multiples of a longword there are repeatable measurements, albeit wildly different for some addresses, for each test. You even said I didn't need 12 tests because 8 tests cover the combinations.
Attached is most recent verification of that: Five reruns demonstrating the consistency. Each line of tests has a decreasing hubRAM start address for the FIFO/streamer: 36, 32, 28, 24, 20, 16, 12, 8, 4, 0.
EDIT: Replaced zip with fresh runs
With the FIFO writing hubRAM there is no variation from address offset, except for index 17 using odd hubRAM addressing, i.e. 37, 33, 29, 25, 21, 17, 13, 9, 5, 1.
The attached burstwr1.txt uses odd addressing.
I've written mailbox code that depends on HUB RAM updates in a given order. It's handy to know that certain longs get written or read before others when you do a burst write or burst read. If you randomize that order with FIFO accesses etc it would get more complex to know when the different longs will be written.
OK, so 36 and 4 start at same slice? If so, I was expecting 36 and 4 results to always match but they don't always.
/me goes looking ... hmm, yeah, I hadn't really looked for that. The repeat runs were consistent and that's as far as I went.
EDIT: Oops, those are so old they still have the buggy duplicate test results. Deleting ...
EDIT2: Okay, the zip file has been replaced with a new set of runs.
In your burst-test.spin2 I'm looking at, FIFO start = 40 initially and block move start = 0 always, I think. FIFO start reduced by four each loop, 10 loops total. Therefore, FIFO and block move don't start at same slice and FIFO slices go backwards each loop.
EDIT:
New zip file results as expected, last two columns match first two on quick look.
I just need to work out FIFO - fast block slice difference for each column ...
Okay, 40 down to 4 rather than 36 down to 0. Changing the block fill/copy address never made any diff. to the FIFO. At most it might add or subtract eight ticks of total block copy time depending on FIFO burst coincidence.
Yes, it could be messy to fill in randomly. I was thinking the same thing.
It would be best to manage the FIFO depth, if possible.
Chip, what is the determining factor for the FIFO depth? Is it streaming longs from hub RAM at sysclk * 1?
Yes, once it locks onto the 1-of-8 address of interest, it starts issuing read commands to the hub RAMs and 5 clocks later, starts pushing the read longs into the FIFO at the full clock rate until the FIFO level reaches 14. After this, it ceases issuing read commands and 5 more longs stream in over the next 5 clocks, due to registered logic delays, making a possible total of 19 longs storable in the FIFO. Any time the FIFO level dips below 14 stored longs, it reloads at the next opportunity, so that up to 19 longs are stored. The only reason that less than 19 longs would be stored is that longs are being simultaneously popped out, as well as pushed in.
When this was being developed, Brian Dennis discovered that hub-exec was not always working, because the FIFO-load rules and FIFO depth were not right. I never could figure out what was needed by contemplating it, so I made a Prop1 program that simulated random FIFO bursts and the rotating address mechanism. This taught me that we needed cogs+11 levels of FIFO storage, given our fixed set of register delays in the multiplexing scheme. After running the simulation for several seconds, this distal level would be hit and never exceeded. That's how I knew how deep to make the FIFO. And thanks to Brian, or I might have realized too late that there was a problem.
Chip,
Here's an example, index 33, where there are multiple cases of the average falling below the minimum of 6 longwords per burst.
In every case, with streamer mode RFLONG, the measured average is 2.67 longwords per burst.
In some cases, with streamer mode RFWORD, the measured average is a flat 1 longword per burst.
PS: Exact assigned streamer modes are ##DM_32bRF | DM_DIGI_IO | $ffff and ##DM_16bRF | DM_DIGI_IO | $ffff
Thanks for the info, Chip.
To recap, here are the two worst examples of FIFO bus hogging. The time T for a fast block write of N longs when FIFO is reading longs was measured eight times, once for each of the eight possible slice differences between block and FIFO start addresses.
sysclk/3: T = 1.5N (4 of 8), or T = 9N (4 of 8)
sysclk/9: T = 1.17N (3 of 8), or T = 9N (5 of 8)
I think the 9N times occur because the fast block write has to yield to the FIFO after writing one long, which happens during every egg beater revolution.
The two SETXFRQ values were $2aaa_aaaa and $0e38_e38e, respectively. These were not incremented as recommended in the doc but I think that makes no difference.
They show up as over 500_000 ticks measured.
Index 6 calculations for average burst lengths:
I may be on the wrong path, but, in fact, their effect on NCO rollover will appear way earlier:
And...
It keeps summing after $8000_0000. A rollover, being circular (modulo), is not a reset to zero.
So, $0E38_E38E paces the streamer fetches at every 9 ticks. And $2AAA_AAAA, err, $2492_4924 is effectively 3.5, so it alternates/dithers the streamer fetches between 3 and 4 ticks. $2AAA_AAAA is every 3 ticks.
Yes, there can be a phase difference of one tick from the start. And there can even be slight drift from ideal due to the 32-bit fraction not being exact. None of which is of concern here. All the NCO fractions selected are entirely arbitrary. I could have chosen to use index increments of 10ths instead of halves, for example.
That part I understood for sure: any residues will cause slight "hiccups" much further along, as the cycle-yielding progresses.
But what I missed was the real point here; due to some misunderstanding, I was convinced that keeping in sync with the cog-to-hub rotation relationship was of prime importance in order to get the maximum throughput, without incurring many (perhaps frequent) "slip events".
Cool, good to get more input too. You could say we're looking for anything that could be a hiccup. And we sure found more than expected.
I guess my attitude now is: is there anything flawed about the findings? And if not, then how the hell are we seeing averages of less than 6 longwords per burst?
Chip,
In the silicon docs under the HUB RAM INTERFACE section there is this paragraph:
The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously whenever less than (cogs+7) stages are filled, after which point, up to 5 more longs may stream in, potentially filling all (cogs+11) stages.
What do the two little words "up to" mean to you? Can the FIFO just ignore some of the trailing data coming from hubRAM? Or can it mean there aren't always five more? Or do they just not really have any importance?
"Champollion"-mode on:
"The FIFO contains (cogs+11) stages." => it contains 19, at the present incarnation;
"When in read mode, the FIFO loads continuously whenever less than (cogs+7) stages are filled," => whenever less than 15 stages; so, e.g., 14, or less;
"after which point, up to 5 more longs may stream in, potentially filling all (cogs+11) stages." => there comes the real doubt: if it contains 14, then "up to 5 more longs may stream in" will effectivelly make it contains 19, at the end, so, it'll don't ignore none of the possible longs comming in, BUT, if the consumption ratio drains any of the ones, yet present at the Fifo when the "less than (cogs+7)"-trip-point was initially triggered, it can end with less than 19, but more than 14.
I believe that "up to 5 more longs may stream in" is just covering the possibility of any END condition to hit, in the meantime, stopping the IN dataflow.
"Champollion"-mode off:
Can there be such an "end condition" that prevents a minimum of six? Chip hasn't indicated that it's possible so far.
e.g., "the number of NCO rollovers that the command will be active for."...
All tests are set to infinite. So, not that.
Sure, I know; I was just commenting on the meaning of "up to" in the docs.
If that's the only case for "up to", and since that doesn't apply here, then you are saying there is a flaw in the findings.
No, I'm not saying anything bad about the findings, all the more because I believe they're right.
But I believe I also have some clue about the "17" dilemma:
So, as Chip indicated in an earlier post, bytes are "packed" into longs (with the possible exception of the very first ones (up to three) and, sure, the very last ones), but those don't make any meaningful difference to the total count or transfer time.
But, depending on the "index" used, the "packing" action will burn clock cycles, so it can affect the total number of rotations around the hub needed to "fill" the longs where the transition occurs.
17 is just too close to (19 + 14) / 2 not to be taken into account. Whenever the opportunity of keeping in sync with the hub rotation is lost, another round will be necessary just to complete the byte-to-long translation, so the final yield will be affected by the need to resync with the rotation.
Once the 19 FIFO long slots have been populated, at least 5 need to be consumed before another round can be started, so perhaps in those relationships lies the reason the final yield gets halved.
Then the minimum of six is not being honoured in the silicon.
Chip explained how the FIFO refills yesterday. Ignore the doc and study this:
https://forums.parallax.com/discussion/comment/1536211/#Comment_1536211
There's nothing new in that, Tony. Not really any different from the doc. It basically says that six is the minimum. But evidence suggests the hardware actually allows less.