Thank you, I will take a look later at that CRCBIT instruction. I guess the instruction allows defining any polynomial, so it can be used.
I have updated the MDIO_Interactive program and created the MDIO_WRITE_REG function. It's buggy, sorry, but it sometimes works!
Write (option 1) is done with a predefined value $7F, and only for PHY 1 and REG 0.
Option 2 (RESET) is just a write of PHY 1 REG 0 with bit 15 set. (It works sometimes.)
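For reference, the bit sequence a Clause 22 MDIO write has to shift out can be modelled in Python (a sketch only; `mdio_write_frame` is my name, not something from the Spin2 program):

```python
# Model of an IEEE 802.3 Clause 22 MDIO write frame, in wire order.
# The function name is illustrative; it is not from the Spin2 program.

def mdio_write_frame(phyad, regad, data):
    """Return the 64 frame bits as a '0'/'1' string, first bit first."""
    bits  = "1" * 32                       # preamble: 32 ones
    bits += "01"                           # ST: start-of-frame
    bits += "01"                           # OP: write opcode (read is 10)
    bits += format(phyad & 0x1F, "05b")    # PHYAD: 5-bit PHY address
    bits += format(regad & 0x1F, "05b")    # REGAD: 5-bit register address
    bits += "10"                           # TA: turnaround, driven on writes
    bits += format(data & 0xFFFF, "016b")  # 16 data bits, MSB first
    return bits

# the write from option 1: value $7F to PHY 1, REG 0
frame = mdio_write_frame(1, 0, 0x7F)
print(len(frame))  # 64
```

On a read the MAC does not drive the TA slot (the PHY pulls the second turnaround bit low), which is one place a write routine reused for reads can go wrong.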
[2021/04/10] Edit: I have found the bug. It was related to the reg variable not being updated correctly.
I changed reg to mdio_reg, as 'reg' is a reserved keyword in Spin2.
(It worked before because the file extension was '.spin' instead of '.spin2')
(I have fixed the MDIO Interactive v0.3 code and updated it.)
Please, can someone check the following CRC code?
The correct result should contain the following bytes : $5B $00 $6C $87 (but I don't know in which order)
Edited: fixed code loop and other errors (but still wrong FCS, see a few posts later)
Your CRC code was never going to work the way it was accumulating its CRC. Here's a version using CRCNIB that matches your Ethernet CRC. You may want to test it on more packets but if it worked here, it's probably correct.
CON
  poly = $edb88320

OBJ
  ser : "SmartSerial"
  f   : "ers_fmt"

PUB main() | i, crc
  ser.start(115200)
  send := @ser.tx
  crc := -1                                     ' initial CRC value is all ones
  repeat i from 0 to 63
    crc := crc32(byte[@eth_frame][i], poly, crc)
  send("computed ethernet CRC : ", f.hex(crc ^ -1), 13, 10)   ' need to XOR with all ones for final value
  send("captured ethernet CRC : ", f.hex(long[@FCS]), 13, 10)

PUB crc32(data, polynomial, priorcrc) : newcrc  ' data byte in, polynomial & prior CRC longs in, CRC long out
  newcrc := priorcrc                            ' start with the prior CRC
  org
        rev     data                            ' crcnib uses bits 31 to 28 of Q
        setq    data                            ' setup data in Q
        crcnib  newcrc, polynomial              ' low nibble (LSB first on the wire)
        crcnib  newcrc, polynomial              ' high nibble
  end
DAT
' Captured ethernet frame :
'
' ff ff ff ff ff ff 11 11 11 11 11 11 81 00 00 64
' 81 00 00 c8 08 06 00 01 08 00 06 04 00 01 11 11
' 11 11 11 11 c0 a8 02 01 00 00 00 00 00 00 01 01
' 01 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00
' 5B 00 6C 87
'
' (last four bytes are the FCS/CRC)
'
eth_frame byte $ff, $ff, $ff, $ff, $ff, $ff, $11, $11, $11, $11, $11, $11, $81, $00, $00, $64
          byte $81, $00, $00, $c8, $08, $06, $00, $01, $08, $00, $06, $04, $00, $01, $11, $11
          byte $11, $11, $11, $11, $c0, $a8, $02, $01, $00, $00, $00, $00, $00, $00, $01, $01
          byte $01, $01, $00, $00, $00, $00, $00, $00, $00, $00, $00, $00, $00, $00, $00, $00
FCS       byte $5B, $00, $6C, $87
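The same computation can be cross-checked off-chip with a small Python model of what REV+SETQ+CRCNIB do (a sketch; the function names are mine):

```python
import zlib

POLY = 0xEDB88320          # same polynomial as the CON section above

def crcnib(crc, nib):
    """Model one CRCNIB step: fold 4 data bits into a reflected CRC-32."""
    for i in range(4):
        bit = (nib >> i) & 1
        crc = (crc >> 1) ^ POLY if (crc ^ bit) & 1 else crc >> 1
    return crc

def eth_crc32(data):
    crc = 0xFFFFFFFF                  # initial value is all ones
    for b in data:
        crc = crcnib(crc, b & 0x0F)   # low nibble first (LSB-first wire order)
        crc = crcnib(crc, b >> 4)     # then the high nibble
    return crc ^ 0xFFFFFFFF           # final XOR with all ones

frame = bytes.fromhex(
    "ffffffffffff11111111111181000064"
    "810000c8080600010800060400011111"
    "11111111c0a802010000000000000101"
    "01010000000000000000000000000000")

# FCS bytes 5B 00 6C 87 read little-endian, as long[@FCS] does
assert eth_crc32(frame) == zlib.crc32(frame) == 0x876C005B
```

The agreement with `zlib.crc32` confirms the Ethernet FCS is just the standard reflected CRC-32 (init all ones, final XOR, bytes low nibble first).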
This shows that the Ethernet CRC can be computed at a rate of 20 clocks for 8 nibbles (when working on 32 bits at a time). This is 2.5 clocks per nibble. At 25M nibbles / sec transmit or receive timing, we can probably do CRC on the fly if the clock speed is sufficient.
For transmit, while the packet is streaming, it could also be read into LUTRAM by the transmitting COG, and a CRC could then be computed in parallel, in time before the final byte is sent. You can read 1 long per clock into LUT, then 3 clocks for a RDLUT data, ptr++, so 4 clocks for 8 nibbles, or 0.5 clocks per nibble. This plus the 2.5 clocks per nibble for CRC means you could compute the CRC on the fly if the clock rate is at least 75MHz. You'd want to go higher for overheads, say 100MHz.
On receive there is more to do, but if the streamer is working you have some time to do things. However, if you have to byte bang (or more precisely nibble bang), you'd need 2 clocks per nibble for that (rolnib data, ina/inb, #xx), plus you'd want to accumulate your data into a long to write into the LUTRAM, then 1 clock for 32 bits out of LUTRAM to HUB. The trick is going to be delimiting packets correctly and handling the various end cases. It's likely going to need more like 3-4x P2 clocks per nibble for this, especially if it is two-bit-wide RMII at 50MHz instead of 4-bit-wide MII or RGMII at 25MHz. Hopefully something like 150-200MHz would be enough for 25MHz x 4 bit Ethernet. For 2 bit RMII you may also like the RCZL and RCZR instructions if the RMII buses are nibble aligned, or the SPLIT/MERGE stuff that evanh used with Dual SPI and 2 smart pins in sync receiver mode. RX is going to be a lot trickier to get correct vs TX. The smart pin sync mode is nice because it probably lets you operate using a range of P2 frequencies, though we probably want the P2 clock to be a multiple of 25 or 50MHz anyway for the TX side if the streamer is used.
Rogloh, Thank you so much!
I would never have found out how to calculate the CRC with CRCNIB instructions if you hadn't posted the code.
So many things can go wrong (wrong initial value, wrong polynomial, wrong shift direction (SHL or REV?), wrong nibble order, missing the final XOR, etc.)
I have tested with several frames, and it's PERFECT !! Thanks a lot !
No problem. I want to see Ethernet on P2 as much as you do.
I would recommend using MII at 25 MHz, 4 bits at a time; both clocks are generated in the PHY (2 pins for management + 2 x 6 pins for TX/RX, which fits on a double extension board)
on a positive clock edge the data should be valid.
when transmitting, sync to the negative edge to make sure data is in place on posedge
so can the smartpins be used?
the documentation is not clear here: can a smart pin output data using an external clock ?
besides, the pin that signals the packet start is also sampled with the clock. could another smart pin be used? let it count the pulses since RX_DV (packet start) went high, so we know the exact ethernet frame start
would be nice if a) PLL is not used for ethernet b) only 2-3 cogs used for full duplex ethernet c) calculate CRC on the fly for realtime ethernet
at 25 Mhz could a simple wait-for-pin / interrupt scheme be used to drive the signals?
for the crc, a lookup table could be used to process 8 bits at once
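The byte-wide table idea looks like this as a Python model (a sketch only; on a P2 the 256 longs would be preloaded into LUT RAM and the lookup becomes a RDLUT):

```python
import zlib

POLY = 0xEDB88320

# Build the 256-entry table once; on a P2 this is what would be
# preloaded into LUT RAM.
TABLE = []
for n in range(256):
    c = n
    for _ in range(8):
        c = (c >> 1) ^ POLY if c & 1 else c >> 1
    TABLE.append(c)

def crc32_table(data):
    crc = 0xFFFFFFFF
    for b in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ b) & 0xFF]  # one lookup per byte
    return crc ^ 0xFFFFFFFF

# same result as the nibble-at-a-time hardware approach
assert crc32_table(b"123456789") == zlib.crc32(b"123456789") == 0xCBF43926
```

Whether the table wins on the P2 comes down to the RDLUT latency versus the four-instruction REV+SETQ+CRCNIB+CRCNIB sequence per byte.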
@Simonius said:
I would recommend using MII at 25 MHz, 4 bits at a time; both clocks are generated in the PHY (2 pins for management + 2 x 6 pins for TX/RX, which fits on a double extension board)
on a positive clock edge the data should be valid.
when transmitting, sync to the negative edge to make sure data is in place on posedge
so can the smartpins be used?
the documentation is not clear here: can a smart pin output data using an external clock ?
It should be possible based on this information:
%11100 = synchronous serial transmit
This mode overrides OUT to control the pin output state.
Words of 1 to 32 bits are shifted out on the pin, LSB first, with each new bit being output two internal clock cycles after registering a positive edge on the B input. For negative-edge clocking, the B input may be inverted by setting B[3] in WRPIN's D value.
besides, the pin that signals the packet start is also sampled with the clock. could another smart pin be used? let it count the pulses since RX_DV (packet start) went high, so we know the exact ethernet frame start
would be nice if a) PLL is not used for ethernet b) only 2-3 cogs used for full duplex ethernet c) calculate CRC on the fly for realtime ethernet
Yes, it would be nicer if their clocks could be fully decoupled. For MII it might be feasible; possibly not for RMII. At least two COGs are a given for full duplex Ethernet. You might be able to hold off the TX transfers if its COG is needed for some MAC functions. Hopefully a 3rd COG, if needed, is just the client COG itself and not an additional one.
at 25 Mhz could a simple wait-for-pin / interrupt scheme be used to drive the signals?
I would hope so.
for the crc, a lookup table could be used to process 8 bits at once
It could, but I'm not sure in this particular case that a table approach is faster than the 4 instructions per byte of the P2 HW approach (REV+SETQ+CRCNIB+CRCNIB). It might be 5-6, from memory, since I last did this. The RDLUT takes 3 clocks too.
Yeah, its inner loop in PASM becomes 11 clocks vs 8, so a table approach isn't worth it in this case.
The data must be collected in 32 bit words as early as possible... using 4 smart pins @ 25 MHz (MII) or 2 smart pins @ 50 MHz (RMII)
i suspect that using 2 pins is more efficient than dealing with 4 pins
the overhead of starting/stopping the 2 vs 4 smart pins and gathering/interleaving their data differs because in one case we can clock in 16 bits (x2) in a row and in the other only 8 bits (x4)
After it's inside a 32 bit register we are dealing with a mere 3,125,000 words/second... suddenly it sounds quite doable
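Gathering the two 16-bit smart pin results into one 32-bit word means interleaving their bits. This Python model shows the operation; the bit ordering (LSB-first fill, RXD0 on the even bit positions) is my assumption, and the P2's SPLIT/MERGE instructions can perform this kind of shuffle in hardware:

```python
def merge_words(d1, d0):
    """Interleave two 16-bit receive words into one 32-bit data word.
    d0 holds the bits sampled on RXD0, d1 those sampled on RXD1; each
    RMII clock contributes one (RXD1, RXD0) bit pair."""
    out = 0
    for i in range(16):
        out |= ((d0 >> i) & 1) << (2 * i)      # RXD0 -> even bit positions
        out |= ((d1 >> i) & 1) << (2 * i + 1)  # RXD1 -> odd bit positions
    return out

# all-ones on RXD1 and all-zeros on RXD0 gives alternating bits
assert merge_words(0xFFFF, 0x0000) == 0xAAAAAAAA
```

For the 4-pin MII case the same idea applies with four 8-bit words spread across every fourth bit position.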
doing a 32 bit crc should be REV+SETQ+8 times CRCNIB ... 10 instructions
i am thinking about how we can employ the PIN COMPARE interrupt to start the smart pins once the "start frame delimiter" is received (i.e. the first "11" bits on the line that arrive while RX_CLOCK is high, right after the 0101... preamble)
once the header of the packet is received, the packet length is known and the smart pins can be stopped at the exact frame end
after running through the CRC the data could be sent off into the HUB memory or be kept inside the cog to process it there while switching to another cog to receive the next frame
@Simonius said:
The data must be collected in 32 bit words as early as possible... using 4 smart pins @ 25 MHz (MII) or 2 smart pins @ 50 MHz (RMII)
i suspect that using 2 pins is more efficient than dealing with 4 pins
It might be, just needs to be coded to see which is easier.
the overhead of starting/stopping the 2 vs 4 smart pins and gathering/interleaving their data differs because in one case we can clock in 16 bits (x2) in a row and in the other only 8 bits (x4)
After it's inside a 32 bit register we are dealing with a mere 3,125,000 words/second... suddenly it sounds quite doable
doing a 32 bit crc should be REV+SETQ+8 times CRCNIB ... 10 instructions
i am thinking about how we can employ the PIN COMPARE interrupt to start the smart pins once the "start frame delimiter" is received (i.e. the first "11" bits on the line that arrive while RX_CLOCK is high, right after the 0101... preamble)
Yes it is critical to get this part right to find the start of the frame. I think it is potentially the hardest part of the whole thing and needs to work or nothing will work. If we can't delimit at wire speed then we'll have a problem keeping up with the data.
once the header of the packet is received, the packet length is known and the smart pins can be stopped at the exact frame end
Only if you interpret what is in the packet, like the IP header contents (and you don't want to do that). Otherwise you probably don't really know the length until the packet ends or it exceeds some upper limit. This might mean the CRC needs to keep a pipeline of the previous 4 values and select the one from four bytes ago for comparison only after the packet ends.
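That pipeline idea can be sketched in Python: keep the CRC state from four bytes ago, and once the frame ends, that state covers exactly the payload while the last four bytes are the FCS (names and the byte-wise processing are mine; the frame is the capture from earlier in the thread):

```python
from collections import deque

POLY = 0xEDB88320

def step(crc, b):
    """Byte-wise reflected CRC-32 update."""
    crc ^= b
    for _ in range(8):
        crc = (crc >> 1) ^ POLY if crc & 1 else crc >> 1
    return crc

def check_stream(stream):
    """Receive data+FCS without knowing the length in advance."""
    crc = 0xFFFFFFFF
    history = deque([crc], maxlen=5)   # CRC states, delayed by up to 4 bytes
    tail = deque(maxlen=4)             # last 4 bytes seen (candidate FCS)
    for b in stream:
        crc = step(crc, b)
        history.append(crc)
        tail.append(b)
    payload_crc = history[0] ^ 0xFFFFFFFF          # state from 4 bytes ago
    return payload_crc == int.from_bytes(bytes(tail), "little")

frame = bytes.fromhex(
    "ffffffffffff11111111111181000064"
    "810000c8080600010800060400011111"
    "11111111c0a802010000000000000101"
    "01010000000000000000000000000000")
fcs = bytes([0x5B, 0x00, 0x6C, 0x87])

assert check_stream(frame + fcs)                                  # good frame passes
assert not check_stream(bytes([frame[0] ^ 1]) + frame[1:] + fcs)  # corruption fails
```

On the P2 the "history" would just be four registers rotated each long, so the cost is small compared with the CRC itself.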
after running through the CRC the data could be sent off into the HUB memory or be kept inside the cog to process it there while switching to another cog to receive the next frame
I'd hope it would be possible to do the CRC and use the FIFO to write the Ethernet data bits to HUB RAM at the same time. This way there is less dead time and the COG would be ready to receive the next frame sooner, i.e. right after the 12-byte IPG.
There is a possibility of using a cog pair with shared LUT. This is precisely the situation I had in mind for the shared LUT.
In what way do you think it should be shared in this application? To have an RX coordinate with TX COG somehow, or for something else? I know for full duplex operation we'd want independent RX and TX COGs because RX packets could arrive at any time while sending packets.
I am playing with a Saleae-like analyzer (using Sigrok PulseView).
I have found that PulseView has a decoder for MDIO. Here is the source code:
https://sigrok.org/gitweb/?p=libsigrokdecode.git;a=tree;f=decoders/mdio
I was unsure if my code was working correctly or not (it matched the TI MSP430 USB-2-MDIO tool, but sometimes the output is not stable).
I found a good register (register 0x17, 23d) in the DP83848 to experiment with. This register has some bits that are RW and some others that are read-only and always return 0.
So I tested writing to and reading from this register. For this test, I implemented a few new menu options to select the register and do writes (this is version 0.4).
I found that this decoder (in PulseView) is not working correctly for reads (it is impossible to read 67h from register 17h due to the read-only bits).
My code reads and writes correctly most of the time (so there could still be some bugs that I was not able to find). I wonder if it could be something related to the turnaround bit time. I will upload this new code v0.4 to the first page (first post).