HUB RAM interface question

TonyB_ · 2022-02-19 01:22

It seems to me that streaming longs at sysclk/2 with FIFO burst = 8 should leave half the hub RAM bandwidth available for fast moves. Every other egg beater rev for FIFO or fast move. We really need test results when streaming longs.

evanh · 2022-02-19 01:24

I used 16-bit at sysclock/1 for same effect.

It can have an off-by-one difference due to phase shift, that's all. Bandwidth is identical.
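
As a rough check on the "bandwidth is identical" claim, here is a minimal Python sketch of the idealized arithmetic. It assumes a cog can move at most one long per clock to or from hub RAM, and it ignores FIFO burst timing and slice alignment, which are exactly what the tests in this thread are probing. The function name `hub_demand` is made up for illustration.

```python
from fractions import Fraction

def hub_demand(sample_bits, divider):
    """Ideal fraction of one cog's hub bandwidth (assumed 1 long/clock)
    consumed by a streamer moving sample_bits every `divider` clocks."""
    return Fraction(sample_bits, 32) / Fraction(divider)

print(hub_demand(32, 2))  # longs at sysclk/2 -> 1/2
print(hub_demand(16, 1))  # words at sysclk/1 -> 1/2
```

On this simple model both cases leave half the bandwidth free, so any measured difference has to come from timing effects rather than raw data rate.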

evanh · 2022-02-19 01:44

Here's an update for repeated tests with large random delay to phase shift the streamer bursts. Also no longer hand code the NCO divider.

xmod    long    DM_16bRF | $ffff    ' 1/2 forever
wlen    long    $1ffff          ' 128 kLW (512 kByte)

_main
        qfrac   #1, #12     ' calculate divider
        rdfast  #0, #0      ' burst length 6, so 12 * 6 = 72 ticks between bursts
        getqx   pa
        setq    pa      ' set NCO divider
        xinit   xmod, #0

        mov     bcdlen, #8
        call    #itoh       ' print NCO divider
        call    #putnl
        mov     bcdlen, #10

.loop
        getrnd  pa
        zerox   pa, #14
        waitx   pa      ' random pause to bring out difference from burst coincidence tally

        wrlong  pa, #1      ' align first GETCT for zero gap in start time
        getct   ticks
        setq    wlen
        wrlong  #0, #0
        getct   pa
        sub     pa, ticks

        call    #itod
        jmp     #.loop
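
For anyone following along without hardware, the two numeric steps in that listing can be modelled in a few lines of Python. This is only a sketch: it assumes QFRAC with no preceding SETQ returns the 32-bit quotient of (D << 32) / S, and that ZEROX D,#N keeps bits N..0 and clears the rest. The helper names mirror the instructions but are otherwise hypothetical.

```python
def qfrac(d, s):
    """Model of QFRAC D,S without SETQ: 32-bit quotient of (D << 32) / S."""
    return ((d << 32) // s) & 0xFFFFFFFF

def zerox(value, bit):
    """Model of ZEROX D,#N: keep bits N..0, clear everything above."""
    return value & ((1 << (bit + 1)) - 1)

print(hex(qfrac(1, 12)))      # value loaded via SETQ in the listing: 0x15555555
print(hex(qfrac(1, 3)))       # 0x55555555
print(zerox(0xFFFFFFFF, 14))  # largest possible random WAITX delay: 32767
```

So the random pause is somewhere between 0 and 32767 ticks, which should be plenty to scramble the phase between the streamer bursts and the block write.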

rogloh · 2022-02-19 01:45

@evanh said:
I used 16-bit at sysclock/1 for same effect.

It can have an off-by-one difference due to phase shift, that's all. Bandwidth is identical.

It would be best to double check it to be sure there's not some weirdness going on...

evanh · 2022-02-19 01:50

I have. That's why I said it affects the phase. The random start delay just above does the same.

rogloh · 2022-02-19 01:51

Sysclk/9 is way worse for 16 bits vs 32 bits. The transfer size is also having an effect somehow.

evanh · 2022-02-19 01:55

@rogloh said:
Sysclk/9 is way worse for 16 bits vs 32 bits. The transfer size is also having an effect somehow.

32-bit is sysclock/4.5. Bandwidth is doubled. To compensate you'd want to /2 (>>1) on the NCO divider.
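
To make the width-versus-divider trade concrete, here is a small Python sketch that finds the divider giving the same data rate at a different sample width. It is plain proportional arithmetic, not a model of the NCO itself, and `equivalent_divider` is a made-up name.

```python
from fractions import Fraction

def equivalent_divider(divider, from_bits, to_bits):
    """Divider at to_bits per sample that matches the data rate of
    from_bits per sample at sysclk/divider."""
    return Fraction(divider) * Fraction(to_bits, from_bits)

print(equivalent_divider(9, 32, 16))  # 32-bit at /9 matches 16-bit at 9/2
print(equivalent_divider(9, 16, 32))  # 16-bit at /9 matches 32-bit at 18
```

Read that way, a fair comparison against 16-bit at sysclk/9 would be 32-bit at sysclk/18, which is what halving the NCO value amounts to.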

evanh · 2022-02-19 02:11

Hmmm, there are differences ... I wonder if the implied RFBYTE vs RFWORD vs RFLONG does impact available hubRAM bandwidth ...

PS: Most streamer ops are going to be RFBYTE and WFBYTE.

evanh · 2022-02-19 03:59

Exactly the same binary run twice in a row. Bad cases aren't limited to sysclock/5 or /9. EDIT: Uh, block length is 64kLW here.

 BYTE   3   55555555    78648    78648    78648    78640    78640    78640    78640    78640    78640    78648    78640    78640
SHORT   3   55555555    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304
 LONG   3   55555555    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304

 BYTE   3   55555555    78640    78648    78648    78640    78648    78640    78640    78640    78648    78640    78648    78648
SHORT   3   55555555   589768   589784   589784   589784   589824   589800   589800   589784   589776   589800   589824   589776
 LONG   3   55555555    98304    98304    98312    98312    98304    98304    98304    98304    98304    98312    98304    98304
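
One way to read those numbers is as ticks per long written. The Python sketch below divides a few of the reported tick counts by the block length, taking the 64 kLW figure from the EDIT above and ignoring the few ticks of instruction overhead around the block write. The function name is hypothetical.

```python
BLOCK_LONGS = 64 * 1024  # 64 kLW, per the EDIT note above

def ticks_per_long(ticks, block_longs=BLOCK_LONGS):
    """Average sysclock ticks spent per long in the timed block write."""
    return ticks / block_longs

for label, ticks in [("BYTE", 78640), ("SHORT/LONG", 98304), ("SHORT, bad run", 589824)]:
    print(label, round(ticks_per_long(ticks), 3))
```

That works out to roughly 1.2 ticks per long for BYTE, exactly 1.5 for the normal SHORT and LONG cases, and exactly 9 for the worst SHORT result in the second run.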

evanh · 2022-02-19 04:18

Looping test code now
Update: added random streamer (FIFO) start address to encourage more oddball cases
Update2: adjust start address for longword granularity
This one is buggy. Newest source code here - https://forums.parallax.com/discussion/comment/1535803/#Comment_1535803

rogloh · 2022-02-19 04:56

What's so special about 16 bits that makes it so much worse? Somehow varies from run to run as well. Something syncing up badly affecting things from then on. Maybe there is something weird about the test setup, but it seemed ok.

evanh · 2022-02-19 05:11

With more randomising it's just as bad for longwords too. Only byte-wise seems immune.

It's notable that sysclock/5 is afflicted the same as /3 and /9 but the severity is milder.
EDIT: Ah, and I'm seeing it on sysclock/7 too. Just milder still.
EDIT2: Also for sysclock/6

rogloh · 2022-02-19 05:16

Hopefully nibbles and two bit data won't get affected badly, they are useful for streaming to QSPI and RMII devices.

evanh · 2022-02-19 05:18

@rogloh said:
Hopefully nibbles and two bit data won't get affected badly, they are useful for streaming to QSPI and RMII devices.

They should be fine since they all use the same implied RFBYTE and WFBYTE.

rogloh · 2022-02-19 05:23

@evanh said:
They should be fine since they all use the same implied RFBYTE and WFBYTE.

Famous last words... They should be but ARE they? I guess they can't be written to RAM until at least a full byte is already filled so I'll take your word for it (for now).

evanh · 2022-02-19 05:26

@rogloh said:
Somehow varies from run to run as well. Something syncing up badly affecting things from then on. Maybe there is something weird about the test setup, but it seemed ok.

Ya, dunno how but it's gotta be purely down to the FIFO's burst length. It ain't consistent.

evanh · 2022-02-19 05:28

@rogloh said:

@evanh said:
They should be fine since they all use the same implied RFBYTE and WFBYTE.

Famous last words... They should be but ARE they? I guess they can't be written to RAM until at least a full byte is already filled so I'll take your word for it (for now).

I do think it's time for Chip to carefully look at the Verilog. I think we're looking at fixes for future silicon.

rogloh · 2022-02-19 05:41

@evanh said:
I do think it's time for Chip to carefully look at the Verilog. I think we're looking at fixes for future silicon.

Hmm... Did I speak too soon about feeling we should be grateful for a good design there? Hopefully nothing bad in there, just some weird harmonic interaction with addresses killing hub throughput maybe, but it would be good to pinpoint an exact case that performs really poorly every time and run it by Chip to see why.

evanh · 2022-02-19 05:47

@rogloh said:
... but it would be good to pinpoint an exact case that performs really poorly every time and run it by Chip to see why.

I could log the randomising values used to see if there are common alignments. Might not be obvious at all ... No, there's no way that's enough info. The effect was happening anyway, the random address just brought it out some more.

rogloh · 2022-02-19 06:18

It would be good to know that data is reliable when this strange effect is happening. I wonder if a particular test pattern could be streamed out and read back in from the pins, in another COG - might be a PITA to set up a test for this though.

It would be bad if some extra bytes were being transferred due to something wrapping around when it shouldn't be. That might give the appearance of extra hub cycle transfers etc.

If this is a bug, there is a chance it could be corrupting the transfers.

evanh · 2022-02-19 06:25

Now you're hyperventilating.

rogloh · 2022-02-19 06:25

LOL. not really.

evanh · 2022-02-19 10:38

While writing up the situation for Chip I noticed that my inner loop doesn't restart the streamer. I saw it as a consistency feature at the time but now I think it'd be better if the streamer was restarted for each test loop ...

And using the loop count as the FIFO start address creates repeatable changing cases across each line. And, no surprise, using random FIFO start address creates random cases across the line.

Revised program attached
Update: Reintroduced reporting of NCO divider

TonyB_ · 2022-02-19 11:16

@evanh said:
Exactly the same binary run twice in a row. Bad cases aren't limited to sysclock/5 or /9. EDIT: Uh, block length is 64kLW here.

 BYTE   3   55555555    78648    78648    78648    78640    78640    78640    78640    78640    78640    78648    78640    78640
SHORT   3   55555555    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304
 LONG   3   55555555    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304

 BYTE   3   55555555    78640    78648    78648    78640    78648    78640    78640    78640    78648    78640    78648    78648
SHORT   3   55555555   589768   589784   589784   589784   589824   589800   589800   589784   589776   589800   589824   589776
 LONG   3   55555555    98304    98304    98312    98312    98304    98304    98304    98304    98304    98312    98304    98304

Please can we keep to BYTE, WORD and LONG?

Are there some results I could see for various D for sysclk/D long streaming, sysclk/D word streaming & sysclk/D byte streaming?

evanh · 2022-02-19 11:19

https://forums.parallax.com/discussion/comment/1535662/#Comment_1535662

TonyB_ · 2022-02-19 11:32

@evanh said:
https://forums.parallax.com/discussion/comment/1535662/#Comment_1535662

Thanks. Please add names above columns so Chip and the rest of us know what's what and please please change SHORT to WORD. Is this fast move writes and streamer reads? Have you tested fast reads and streamer writes? Or possibly read,read and write,write.

Presumably LONG sysclk/2 time = eternity?

evanh · 2022-02-19 11:51

@TonyB_ said:
Presumably LONG sysclk/2 time = eternity?

yes.

Those are just the emitted reports as is. I've not added any frills. The description is in the prior post to that link.

TonyB_ · 2022-02-19 12:02

@evanh said:

@TonyB_ said:
Presumably LONG sysclk/2 time = eternity?

yes.

Those are just the emitted reports as is. I've not added any frills. The description is in the prior post to that link.

Fast block write and FIFO read, then. What is last column?

I'm wondering whether fast read and FIFO write would be the same. Also, are 12 tests really needed? Could be only eight, one for each slice difference.

evanh · 2022-02-19 12:19

Oh, the last column is new, they're tallies for that line of results >= 80000. Not useful below sysclock/6. It was mainly a way to quickly see something in the mass of larger dividers.

Twelve is just how many I had from earlier. It might show repeats, I haven't looked.

EDIT: Actually, it has highlighted sysclock/3 as an exception where there is variation in its BYTE line now.
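
The tally column described above is simple to reproduce offline. This Python sketch counts the results on one line that reach the 80000-tick threshold, using two lines copied from the earlier report in this thread. The `tally` helper is a made-up name, not something taken from the test program.

```python
THRESHOLD = 80000  # tick count at or above which a result is tallied

def tally(results, threshold=THRESHOLD):
    """Count the results on one report line that are >= threshold."""
    return sum(1 for ticks in results if ticks >= threshold)

byte_line = [78648, 78648, 78648, 78640, 78640, 78640,
             78640, 78640, 78640, 78648, 78640, 78640]
short_line = [589768, 589784, 589784, 589784, 589824, 589800,
              589800, 589784, 589776, 589800, 589824, 589776]

print(tally(byte_line))   # 0
print(tally(short_line))  # 12
```

At sysclock/3 even the normal 98304-tick results are over the threshold, which fits the remark that the tally only becomes informative at larger dividers.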

TonyB_ · 2022-02-19 12:31

@evanh said:

@TonyB_ said:
Presumably LONG sysclk/2 time = eternity?

yes.

Does this mean no cog access at all to hub RAM when streaming longs at sysclk/2?
