FBLOCK feature possibly flawed

evanh · 2024-08-02 06:56

I've bumped into what looks like a hardware flaw in the FIFO mechanics. I set up FBLOCK to split off the 8 bytes of CRC at the end of each block read from SD cards and found it didn't behave well at all.

Checks for software bugs found nothing. So I set up a standalone test to verify what I'm seeing, and it confirmed the odd behaviour. HubRAM writes are significantly more jarring than reads, but both show what I consider broken behaviour.

Here's the core proof routine:

        wrfast  #2, buf1        ' set up FIFO, 2 x 64 = 128 bytes (32 longwords)
        fblock  #0, buf2        ' switch to buf2 after those 32 longwords
        mov     pa, #1          ' PA is the counter value written to the FIFO

        rep     @.rend, #40     ' write 40 longwords
        wflong  pa
        add     pa, #1
.rend

The proof loops 40 times, incrementing PA as a counter and writing it to the FIFO. The first 32 writes get written to buf1, the remaining 8 writes get written to buf2. And that works as expected as long as buf1 and buf2 are a multiple of 64 bytes apart. Here's a print of those two buffers after a correct run:

  &buff1=258c  &buff2=2bcc
buff1:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
buff2:  33  34  35  36  37  38  39  40   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

But if the distance between the buffers is not a multiple of 64 bytes, off by say 4 bytes (1 longword), then we get this:

  &buff1=258c  &buff2=2bd0
buff1:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
buff2:  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

And for good measure, an off by 2 longwords example:

  &buff1=258c  &buff2=2bd4
buff1:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
buff2:  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

evanh · 2024-08-02 08:28

All code is written in C and built using Flexspin. I haven't seen any reason to suspect a bug with the compiler. The same routine is called for each case. Only the buf2 buffer address changes.

Wuerfel_21 · 2024-08-02 10:12

If you think about it, it makes sense that FBLOCK would require this block alignment, but that really just isn't documented and the behaviour is very strange.
My guess is that FBLOCK is intended for / was only tested with sets of buffers tightly packed to each other, so the problem never came up.

evanh · 2024-08-02 10:33

Chip is quite clear in the docs that alignment only has to be on longword boundaries. I'm gonna need convincing that the observed behaviour is as intended. I seriously doubt it.

Actually, I'd like to think I've just got a software bug, but I don't think so.

TonyB_ · 2024-08-02 10:43

Is this beneficial sometimes?

Wuerfel_21 · 2024-08-02 10:49

That's about the alignment of a single wrapping buffer.
If you imagine the FIFO writing at full tilt, one long per cycle, it's easy to see why at least 32-alignment relative between the buffers is required: the first long of the new buffer has to be on the RAM slice following the last long of the old buffer, otherwise the data stream is interrupted. The FIFO really assumes the blocks are 64 bytes (as they would be on a 16-slice hub), so that's why it's actually 64.
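The constraint described above can be written as a one-line check. This is only a sketch in C under assumptions taken from this thread rather than from the official docs: blocks are 64 bytes, and only the long index within a block (address bits [5:2]) has to match, with bits [1:0] treated as don't-care as reported later in the thread. The name `fblock_seamless` is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when switching from buf1 to buf2 with FBLOCK
   should split the stream exactly on the requested block boundary.
   Assumption from this thread: the FIFO tracks its position inside a
   64-byte block, so address bits [5:2] must be equal for both buffers. */
bool fblock_seamless(uint32_t buf1, uint32_t buf2)
{
    return ((buf1 ^ buf2) & 0x3Cu) == 0;
}
```

With the addresses from the opening post, 0x258c and 0x2bcc pass the check, while 0x2bd0 and 0x2bd4 do not.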

evanh · 2024-08-02 10:51

@TonyB_ said:
Is this beneficial sometimes?

Definitely not! It's explicitly not doing as instructed, and this depends on the address delta. It's as useful as putting a relative branch as the last instruction of a REP loop.

Wuerfel_21 · 2024-08-02 10:55

The logic probably is like this:

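// on every write cycle (pointers are byte addresses):
curr_ptr += 4;  // advance by one long
if ((curr_ptr & 63) == (wrap_ptr & 63)) {  // parentheses needed: == binds tighter than &
  block_count -= 1;
  if (block_count == 0) {
    curr_ptr = wrap_ptr;
    block_count = wrap_blocks;
  }
}
// then do the actual write on curr_ptr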
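To see whether that guessed logic explains the dumps in the opening post, it can be run as a small host-side model. The sketch below is plain C, not P2 code, and it makes several assumptions: the hub is a toy array, the block counter only counts down once (as with FBLOCK #0, so there is no further wrap after the switch), the write happens before the pointer advances, and the names `hub` and `fifo_model` are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

#define HUB_LONGS 4096              /* toy hub RAM: 16 KB of longs */

uint32_t hub[HUB_LONGS];

/* Writes the values 1..total starting at byte address buf1, asking for
   `blocks` 64-byte blocks before the switch to buf2.
   Returns how many longs landed in buf1. */
int fifo_model(uint32_t buf1, uint32_t buf2, int blocks, int total)
{
    uint32_t curr = buf1 & ~3u;     /* long-aligned write pointer */
    uint32_t wrap = buf2 & ~3u;     /* FBLOCK target */
    int count = blocks;
    int in_first = -1;              /* -1 until the switch has happened */

    memset(hub, 0, sizeof hub);
    for (int n = 1; n <= total; n++) {
        hub[(curr >> 2) % HUB_LONGS] = (uint32_t)n;
        curr += 4;
        /* The block counter ticks whenever the pointer's position within
           a 64-byte block matches that of the FBLOCK target. */
        if (in_first < 0 && (curr & 63) == (wrap & 63)) {
            if (--count == 0) {
                curr = wrap;
                in_first = n;
            }
        }
    }
    return in_first < 0 ? total : in_first;
}
```

Calling `fifo_model(0x258c, 0x2bd0, 2, 40)` puts 17 longs in the first buffer and starts the second buffer at 18, which agrees with the off-by-one-longword dump. The model says nothing about why the silicon behaves this way; it only shows that the guess is consistent with the observations.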

evanh · 2024-08-02 11:06

@Wuerfel_21 said:
... the first long of the new buffer has to be on the RAM slice following the last long of the old buffer, otherwise the data stream is interrupted. The FIFO really assumes the blocks are 64 bytes (as they would be on a 16-slice hub), so that's why it's actually 64.

Okay, that is convincing.

Okay, a workaround for me then is to allocate a larger CRC buffer so that it extends over all combinations of alignment. It was only 8 bytes, I guess I add 64 to that and make its size 72 bytes.

Or I could stop the SD clock at the beginning of the CRC and bit-bash the last 16 nibbles. Or start a fresh WRFAST + streamer + clock gen.

Or I could use a temporary buffer for both data block and CRC then later copy the data block to the requested hubRAM address.
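For the first workaround (the padded 72-byte CRC buffer), the FBLOCK target inside the padding could be picked as in the sketch below. It assumes both addresses are long-aligned and that the only requirement is a distance of a multiple of 64 bytes from the data buffer. `crc_target` and the two macros are hypothetical names, not part of any driver.

```c
#include <stdint.h>

#define CRC_BYTES   8u
#define CRC_PADDED  (CRC_BYTES + 64u)   /* 72 bytes, as proposed above */

/* Returns the address inside the padded CRC buffer whose low six bits
   match those of the data buffer, so that (target - data) is a multiple
   of 64. With long-aligned inputs the offset is 0..60, so the 8 CRC
   bytes always end within the 72-byte buffer. */
uint32_t crc_target(uint32_t data, uint32_t padded_base)
{
    uint32_t offset = (data - padded_base) & 63u;
    return padded_base + offset;
}
```

For example, with the data buffer at 0x258c and the padded buffer at 0x2bd0, the target comes out as 0x2c0c, 60 bytes into the padding.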

Wuerfel_21 · 2024-08-02 11:57

@evanh said:
Okay, a workaround for me then is to allocate a larger CRC buffer so that it extends over all combinations of alignment. It was only 8 bytes, I guess I add 64 to that and make its size 72 bytes.

That might not actually work still. The FAT layer will give you a pointer into client memory when it can (avoiding memcpy) and that pointer may not be long-aligned.

evanh · 2024-08-02 12:34

Longword aligned could be made a requirement of block level read/write. That's not unreasonable.

Rayman · 2024-08-02 20:06

Does it work if there's some nops between block writes?
Seems like it might....

Rayman · 2024-08-02 20:13

What else needs to be longword aligned? Didn't we used to need "long long" for something?
Or, maybe that got fixed at some point?

Rayman · 2024-08-02 20:15

@evanh You are pretty good at finding these undocumented "features"... Keep it up!

evanh · 2024-08-02 20:28

@Rayman said:
Does it work if there's some nops between block writes?

It'll be wired for worst case of sysclock/1. Keeps the state machine simple. That proof is already at sysclock/4 and shows no sign of variability based on time, just addressing. Not that adding NOPs would be helpful even if they did work because I'm wanting to use FBLOCK for seamless streamer ops.

Rayman · 2024-08-02 20:35

But, in a more general case where you want to transfer two or more very large blocks and a few nops (or something else) in between wouldn't hurt, adding in nops would probably work, right?

evanh · 2024-08-02 20:37

No, makes no diff. The FIFO state machine will have fixed parameters. Ada made that clear to me.
PS: I just tested sysclock/6 (added one NOP in the REP loop) and got same results as opening post.

TonyB_ · 2024-08-02 23:34

@evanh said:

@TonyB_ said:
Is this beneficial sometimes?

Definitely not! It's explicitly not doing as instructed, and this depends on the address delta. It's as useful as putting a relative branch as the last instruction of a REP loop.

Evan, my point was that you've discovered a way to write varying amounts of data to two buffers based on their address difference. Is this potentially a useful feature? In your example using WFLONG, the first buffer can store 17-32 longs and the second buffer the remainder. How about WFWORD and WFBYTE? (I've not had time to test this behaviour yet.)

evanh · 2024-08-03 00:49

@TonyB_ said:
... How about WFWORD and WFBYTE? (I've not had time to test this behaviour yet.)

It's the FIFO's ordered accesses of the hubRAM slices. Ada described it succinctly. It's tuned for handling worst case full speed longwords at sysclock/1. Using different instructions, or differently timed instructions, or using the streamer, doesn't make any difference to this mechanism.

It would be nice if it was more flexible but it isn't.

PS: I've PM'd Chip asking him to write up a full description for the docs. And I guess confirm all this.

evanh · 2024-08-03 01:08

Well, I've ditched using FBLOCK for now. And resetting the streamer+FIFO for the 16 nibbles of the CRC is less code than trying to bit-bash it. So I guess that's going to be the approach.

rogloh · 2024-08-03 03:24

@evanh said:
Well, I've ditched using FBLOCK for now. And resetting the streamer+FIFO for the 16 nibbles of the CRC is less code than trying to bit-bash it. So I guess that's going to be the approach.

I don't imagine resetting the streamer temporarily to redirect the CRC is going to waste that many clocks. Performance should be fine. I know you're trying to keep the clock sustained as much as possible, which is laudable, but in reality it probably has diminishing returns after some point. We might be talking around a percent or so for this, right? Each sector transferred is 1024 data nibbles (or 1024 P2 clocks at the fastest sysclock/1 transfer rate), so a handful of extra P2 clocks to restart the streamer for the CRC portion of the transfer shouldn't be a significant proportion of the total. At sysclock/2 or slower it will be even less again.

evanh · 2024-08-03 03:57

You're poo-poo'ing my poo-poo'ing!

EDIT: Just the extra streamer handling has added 2% to overheads. Example run from history - https://forums.parallax.com/discussion/comment/1560351/#Comment_1560351

Rayman · 2024-08-03 13:40

Very surprising to me that buf1 gets messed up by the second block not being 64 bytes away...
Must be some interesting mechanics inside the fifo control...

TonyB_ · 2024-08-03 15:02

@Wuerfel_21 said:

@evanh said:
Okay, a workaround for me then is to allocate a larger CRC buffer so that it extends over all combinations of alignment. It was only 8 bytes, I guess I add 64 to that and make its size 72 bytes.

That might not actually work still. The FAT layer will give you a pointer into client memory when it can (avoiding memcpy) and that pointer may not be long-aligned.

I've tested this behaviour and the buffers do not need to be long-aligned. FBLOCK address[1:0] are ignored and the FBLOCK data start at the byte within a long given by WRFAST address[1:0].

Wuerfel_21 · 2024-08-03 15:05

Are you sure the last 0..3 bytes of the first buffer are actually written in that case?

TonyB_ · 2024-08-03 16:00

@Wuerfel_21 said:
Are you sure the last 0..3 bytes of the first buffer are actually written in that case?

Yes, if unaligned the last 1..3 bytes of first buffer's data are written to first 1..3 bytes of second buffer.
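Taken at face value, the unaligned behaviour reported above can be put into two small helpers. This is a sketch of that description only, written in C for a host machine. The function names are hypothetical, and the spill count is an inference from the report: a long written at byte offset b within a long leaves its last b bytes in the following long.

```c
#include <stdint.h>

/* Byte address where the stream continues after the FBLOCK switch:
   the long address comes from FBLOCK, the byte lane from WRFAST. */
uint32_t fblock_resume_addr(uint32_t wrfast_addr, uint32_t fblock_addr)
{
    return (fblock_addr & ~3u) | (wrfast_addr & 3u);
}

/* Number of bytes from the end of the first buffer's data that land at
   the start of the second buffer's first long (0 when long-aligned). */
uint32_t fblock_spill_bytes(uint32_t wrfast_addr)
{
    return wrfast_addr & 3u;
}
```

So a WRFAST address ending in binary 11 would leave three bytes to move over manually, and a long-aligned one would leave none.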

Wuerfel_21 · 2024-08-03 19:08

I guess they can just be moved over manually then.

TonyB_ · 2024-08-03 21:00

@Rayman said:
Very surprising to me that buf1 gets messed up by the second block not being 64 bytes away...
Must be some interesting mechanics inside the fifo control...

WRFAST can write as little as one long before FBLOCK takes over, depending on the buffer address difference:

        wrfast  #1, buf1    'writes 1 long
        fblock  #0, buf2    'buf2 = buf1 + 64N + 4
...
        wrfast  #1, buf1    'writes 2 longs
        fblock  #0, buf2    'buf2 = buf1 + 64N + 8
...
        wrfast  #1, buf1    'writes 3 longs
        fblock  #0, buf2    'buf2 = buf1 + 64N + 12
...
        wrfast  #1, buf1    'writes 16 longs, i.e. complete block of 64 bytes
        fblock  #0, buf2    'buf2 = buf1 + 64N

'N = positive or negative integer
'buf1[1:0] and buf2[1:0] are don't care
'and these bits are zeroed in equations above
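The table above fits a simple closed form, sketched here in C. Extending it to block counts above 1 is an extrapolation that happens to match the 17, 18 and 32 long results in evanh's opening post (block count 2); it has not been checked against other counts. `longs_before_switch` is a hypothetical name.

```c
#include <stdint.h>

/* Number of longs written to buf1 before the FIFO switches to buf2.
   blocks: the WRFAST block count (1 or more; 0, meaning no wrap, is
   not modelled). Address bits [1:0] are dropped, as in the table. */
uint32_t longs_before_switch(uint32_t blocks, uint32_t buf1, uint32_t buf2)
{
    uint32_t k = ((buf2 >> 2) - (buf1 >> 2)) & 15u;  /* long delta mod 16 */
    if (k == 0)
        k = 16;                                      /* aligned: full block */
    return (blocks - 1u) * 16u + k;
}
```

For instance, `longs_before_switch(1, buf1, buf1 + 64*N + 12)` gives 3, and a negative N works the same way because the subtraction wraps modulo 16.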

evanh · 2024-08-04 11:12

@TonyB_ said:
I've tested this behaviour and the buffers do not need to be long-aligned. FBLOCK address[1:0] are ignored and the FBLOCK data start at the byte within a long given by WRFAST address[1:0].

Wow, yeah, totally does that. Contradicting Chip's write-up. Could be handy in the future.

I guess it's a feature of misaligned writes that's used for a plain WRLONG and the like. It'll be wired in below the FIFO, at the eggbeater level I presume.

  size = 512   offset = 11   filler = 0xff   &buff1 = 0x2318   &buff2 = 0x2518
buff1:
   0   0 ff000000   0 1000001 2000001 3000001 4000001 5000001 6000001 7000001 8000001 9000001 a000001 b000001 c000001 d000001 e000001 f000001 10000001 11000001 12000001 13000001 14000001 15000001 16000001 17000001 18000001 19000001 1a000001 1b000001 1c000001 1d000001 1e000001   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
buff2:
   0   0 1f000001 20000001 21000001 22000001 23000001 24000001 25000001 26000001   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

TonyB_ · 2024-08-04 11:52

If the two buffers overlap then data written by WRFAST can be overwritten by FBLOCK, with long granularity. EDIT: This is obvious but it allows unwanted data to be discarded.

Also, all of the following write the same data starting at buf1:

'T = Total longs written

        wrfast  #1, buf1    'writes 1 long
        fblock  #0, buf1+4  'writes T-1 longs
...
        wrfast  #1, buf1    'writes 2 longs
        fblock  #0, buf1+8  'writes T-2 longs
...
        wrfast  #1, buf1    'writes 3 longs
        fblock  #0, buf1+12 'writes T-3 longs
...
        wrfast  #1, buf1    'writes 16 longs, i.e. complete block of 64 bytes
        fblock  #0, buf1+64 'writes T-16 longs

Rayman · 2024-08-04 14:06

What if buf1 and buf2 are the same?
