Thank you, I will take a look later at that CRCBIT instruction. I guess the instruction allows defining any polynomial, so it can be used.
I have updated the MDIO_Interactive program and created the MDIO_WRITE_REG function. It's buggy, sorry, but it sometimes works!
Write (option 1) is done with a predefined value $7F, and only for PHY 1 and REG 0.
Option 2 (RESET) is just a write of PHY 1 REG 0 with bit 15 set. (It works sometimes.)
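For reference, the bit sequence a Clause 22 MDIO write has to shift out can be modelled in Python (a sketch only; `mdio_write_frame` is my name, not something from the Spin2 program):

```python
# Model of an IEEE 802.3 Clause 22 MDIO write frame, in wire order.
# The function name is illustrative; it is not from the Spin2 program.

def mdio_write_frame(phyad, regad, data):
    """Return the 64 frame bits as a '0'/'1' string, first bit first."""
    bits  = "1" * 32                       # preamble: 32 ones
    bits += "01"                           # ST: start-of-frame
    bits += "01"                           # OP: write opcode (read is 10)
    bits += format(phyad & 0x1F, "05b")    # PHYAD: 5-bit PHY address
    bits += format(regad & 0x1F, "05b")    # REGAD: 5-bit register address
    bits += "10"                           # TA: turnaround, driven on writes
    bits += format(data & 0xFFFF, "016b")  # 16 data bits, MSB first
    return bits

# the write from option 1: value $7F to PHY 1, REG 0
frame = mdio_write_frame(1, 0, 0x7F)
print(len(frame))  # 64
```

On a read the MAC does not drive the TA slot (the PHY pulls the second turnaround bit low), which is one place a write routine reused for reads can go wrong.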
[2021/04/10] Edit: I have found the bug. It was related to the reg variable not being updated correctly.
I changed reg to mdio_reg, as 'reg' is a reserved keyword in Spin2.
(It worked before because the file extension was '.spin' instead of '.spin2')
(I have fixed the MDIO Interactive v0.3 code and updated it.)
Please, can someone check the following CRC code?
The correct result should contain the following bytes : $5B $00 $6C $87 (but I don't know in which order)
Edited: fixed code loop and other errors (but still wrong FCS, see a few posts later)
Your CRC code was never going to work the way it was accumulating its CRC. Here's a version using CRCNIB that matches your Ethernet CRC. You may want to test it on more packets but if it worked here, it's probably correct.
CON
  poly = $edb88320

OBJ
  ser : "SmartSerial"
  f   : "ers_fmt"

PUB main() | i, crc
  ser.start(115200)
  send := @ser.tx
  crc := -1                                     ' initial CRC value is all ones
  repeat i from 0 to 63
    crc := crc32(byte[@eth_frame][i], poly, crc)
  send("computed ethernet CRC : ", f.hex(crc ^ -1), 13, 10)   ' need to XOR with all ones for final value
  send("captured ethernet CRC : ", f.hex(long[@FCS]), 13, 10)

PUB crc32(data, polynomial, priorcrc) : newcrc  ' data byte in, polynomial & prior CRC longs in, CRC long out
  newcrc := priorcrc                            ' start with the prior CRC
  org
        rev     data                            ' crcnib uses bits 31 to 28 of Q
        setq    data                            ' setup data in Q
        crcnib  newcrc, polynomial              ' low nibble (LSB first on the wire)
        crcnib  newcrc, polynomial              ' high nibble
  end
DAT
' Captured ethernet frame :
'
' ff ff ff ff ff ff 11 11 11 11 11 11 81 00 00 64
' 81 00 00 c8 08 06 00 01 08 00 06 04 00 01 11 11
' 11 11 11 11 c0 a8 02 01 00 00 00 00 00 00 01 01
' 01 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00
' 5B 00 6C 87
'
' (last four bytes are the FCS/CRC)
'
eth_frame byte $ff, $ff, $ff, $ff, $ff, $ff, $11, $11, $11, $11, $11, $11, $81, $00, $00, $64
          byte $81, $00, $00, $c8, $08, $06, $00, $01, $08, $00, $06, $04, $00, $01, $11, $11
          byte $11, $11, $11, $11, $c0, $a8, $02, $01, $00, $00, $00, $00, $00, $00, $01, $01
          byte $01, $01, $00, $00, $00, $00, $00, $00, $00, $00, $00, $00, $00, $00, $00, $00
FCS       byte $5B, $00, $6C, $87
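The same computation can be cross-checked off-chip with a small Python model of what REV+SETQ+CRCNIB do (a sketch; the function names are mine):

```python
import zlib

POLY = 0xEDB88320          # same polynomial as the CON section above

def crcnib(crc, nib):
    """Model one CRCNIB step: fold 4 data bits into a reflected CRC-32."""
    for i in range(4):
        bit = (nib >> i) & 1
        crc = (crc >> 1) ^ POLY if (crc ^ bit) & 1 else crc >> 1
    return crc

def eth_crc32(data):
    crc = 0xFFFFFFFF                  # initial value is all ones
    for b in data:
        crc = crcnib(crc, b & 0x0F)   # low nibble first (LSB-first wire order)
        crc = crcnib(crc, b >> 4)     # then the high nibble
    return crc ^ 0xFFFFFFFF           # final XOR with all ones

frame = bytes.fromhex(
    "ffffffffffff11111111111181000064"
    "810000c8080600010800060400011111"
    "11111111c0a802010000000000000101"
    "01010000000000000000000000000000")

# FCS bytes 5B 00 6C 87 read little-endian, as long[@FCS] does
assert eth_crc32(frame) == zlib.crc32(frame) == 0x876C005B
```

The agreement with `zlib.crc32` confirms the Ethernet FCS is just the standard reflected CRC-32 (init all ones, final XOR, bytes low nibble first).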
This shows that the Ethernet CRC can be computed at a rate of 20 clocks for 8 nibbles (when working on 32 bits at a time). This is 2.5 clocks per nibble. At 25M nibbles / sec transmit or receive timing, we can probably do CRC on the fly if the clock speed is sufficient.
For transmit, while the packet is streaming, it could also be read into LUTRAM by the transmitting COG, and a CRC could then be computed in parallel, in time before the final byte is sent. You can read 1 long per clock into LUT, then 3 clocks for a RDLUT data, ptr++, so 4 clocks for 8 nibbles, or 0.5 clocks per nibble. This plus the 2.5 clocks per nibble for CRC means you could compute the CRC on the fly if the clock rate is at least 75MHz. You'd want to go higher for overheads, say 100MHz.
On receive there is more to do, but if the streamer is working you have some time to do things. However, if you have to byte bang (or more precisely nibble bang), you'd need 2 clocks per nibble for that (rolnib data, ina/inb, #xx), plus you'd want to accumulate your data into a long to write into the LUTRAM, then 1 clock for 32 bits out of LUTRAM to HUB. The trick is going to be delimiting packets correctly and handling the various end cases. It's likely going to need more like 3-4x P2 clocks per nibble for this, especially if it is two-bit-wide RMII at 50MHz instead of 4-bit-wide MII or RGMII at 25MHz. Hopefully something like 150-200MHz would be enough for 25MHz x 4 bit Ethernet. For 2 bit RMII you may also like the RCZL and RCZR instructions if the RMII buses are nibble aligned, or the SPLIT/MERGE stuff that evanh used with Dual SPI and 2 smart pins in sync receiver mode. RX is going to be a lot trickier to get correct vs TX. The smart pin sync mode is nice because it probably lets you operate using a range of P2 frequencies, though we probably want the P2 clock to be a multiple of 25 or 50MHz anyway for the TX side if the streamer is used.
Rogloh, Thank you so much!
I would never have found out how to calculate the CRC with CRCNIB instructions if you hadn't posted the code.
So many things can go wrong (wrong initial value, wrong polynomial, wrong shift direction (SHL or REV?), wrong nibble order, missing the final XOR, etc.)
I have tested with several frames, and it's PERFECT !! Thanks a lot !
No problem. I want to see Ethernet on P2 as much as you do.
I would recommend using MII at 25 MHz, 4 bits at a time; both clocks are generated in the PHY (2 pins for management + 2 x 6 pins for TX/RX, which fits on a double extension board)
on a positive clock edge the data should be valid.
when transmitting, sync to the negative edge to make sure data is in place on posedge
so can the smartpins be used?
the documentation is not clear here: can a smart pin output data using an external clock ?
besides, the pin that signals the packet start is also sampled with the clock. could another smart pin be used? let it count the pulses since RX_DV (packet start) went high, so we know the exact ethernet frame start
would be nice if a) PLL is not used for ethernet b) only 2-3 cogs used for full duplex ethernet c) calculate CRC on the fly for realtime ethernet
at 25 Mhz could a simple wait-for-pin / interrupt scheme be used to drive the signals?
for the crc, a lookup table could be used to process 8 bits at once
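The byte-wide table idea looks like this as a Python model (a sketch only; on a P2 the 256 longs would be preloaded into LUT RAM and the lookup becomes a RDLUT):

```python
import zlib

POLY = 0xEDB88320

# Build the 256-entry table once; on a P2 this is what would be
# preloaded into LUT RAM.
TABLE = []
for n in range(256):
    c = n
    for _ in range(8):
        c = (c >> 1) ^ POLY if c & 1 else c >> 1
    TABLE.append(c)

def crc32_table(data):
    crc = 0xFFFFFFFF
    for b in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ b) & 0xFF]  # one lookup per byte
    return crc ^ 0xFFFFFFFF

# same result as the nibble-at-a-time hardware approach
assert crc32_table(b"123456789") == zlib.crc32(b"123456789") == 0xCBF43926
```

Whether the table wins on the P2 comes down to the RDLUT latency versus the four-instruction REV+SETQ+CRCNIB+CRCNIB sequence per byte.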
@Simonius said:
I would recommend using MII at 25 MHz, 4 bits at a time; both clocks are generated in the PHY (2 pins for management + 2 x 6 pins for TX/RX, which fits on a double extension board)
on a positive clock edge the data should be valid.
when transmitting, sync to the negative edge to make sure data is in place on posedge
so can the smartpins be used?
the documentation is not clear here: can a smart pin output data using an external clock ?
It should be possible based on this information:
%11100 = synchronous serial transmit
This mode overrides OUT to control the pin output state.
Words of 1 to 32 bits are shifted out on the pin, LSB first, with each new bit being output two internal clock cycles after registering a positive edge on the B input. For negative-edge clocking, the B input may be inverted by setting B[3] in WRPIN's D value.
besides, the pin that signals the packet start is also sampled with the clock. could another smart pin be used? let it count the pulses since RX_DV (packet start) went high, so we know the exact ethernet frame start
would be nice if a) PLL is not used for ethernet b) only 2-3 cogs used for full duplex ethernet c) calculate CRC on the fly for realtime ethernet
Yes, it would be nicer if their clocks could be fully decoupled. For MII it might be feasible; possibly not for RMII. At least two COGs are a given for full duplex Ethernet. You might be able to hold off the TX transfers if its COG is needed for some MAC functions. Hopefully a 3rd COG, if needed, is just the client COG itself and not an additional one.
at 25 Mhz could a simple wait-for-pin / interrupt scheme be used to drive the signals?
I would hope so.
for the crc, a lookup table could be used to process 8 bits at once
It could, but I'm not sure in this particular case that a table approach is faster than the 4 instructions per byte of the P2 HW approach (REV+SETQ+CRCNIB+CRCNIB). It might be 5-6, from memory, since I last did this. The RDLUT takes 3 clocks too.
Yeah, its inner loop in PASM becomes 11 clocks vs 8, so a table approach isn't worth it in this case.
The data must be collected in 32 bit words as early as possible... using 4 smart pins @ 25 MHz (MII) or 2 smart pins @ 50 MHz (RMII)
i suspect that using 2 pins is more efficient than dealing with 4 pins
the overhead of starting/stopping the 2 vs 4 smart pins and gathering/interleaving their data differs because in one case we can clock in 16 bits (x2) in a row and in the other only 8 bits (x4)
After it's inside a 32 bit register we are dealing with a mere 3,125,000 words/second... suddenly it sounds quite doable
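Gathering the two 16-bit smart pin results into one 32-bit word means interleaving their bits. This Python model shows the operation; the bit ordering (LSB-first fill, RXD0 on the even bit positions) is my assumption, and the P2's SPLIT/MERGE instructions can perform this kind of shuffle in hardware:

```python
def merge_words(d1, d0):
    """Interleave two 16-bit receive words into one 32-bit data word.
    d0 holds the bits sampled on RXD0, d1 those sampled on RXD1; each
    RMII clock contributes one (RXD1, RXD0) bit pair."""
    out = 0
    for i in range(16):
        out |= ((d0 >> i) & 1) << (2 * i)      # RXD0 -> even bit positions
        out |= ((d1 >> i) & 1) << (2 * i + 1)  # RXD1 -> odd bit positions
    return out

# all-ones on RXD1 and all-zeros on RXD0 gives alternating bits
assert merge_words(0xFFFF, 0x0000) == 0xAAAAAAAA
```

For the 4-pin MII case the same idea applies with four 8-bit words spread across every fourth bit position.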
doing a 32 bit crc should be REV+SETQ+8 times CRCNIB ... 10 instructions
i am thinking about how we can employ the PIN COMPARE interrupt to start the smart pins once the "start frame delimiter" is received (i.e. the first "11" bits on the line that arrive while RX_CLOCK is high, right after the 0101... preamble)
once the header of the packet is received, the packet length is known and the smart pins can be stopped at the exact frame end
after running through the CRC the data could be sent off into the HUB memory or be kept inside the cog to process it there while switching to another cog to receive the next frame
@Simonius said:
The data must be collected in 32 bit words as early as possible... using 4 smart pins @ 25 MHz (MII) or 2 smart pins @ 50 MHz (RMII)
i suspect that using 2 pins is more efficient than dealing with 4 pins
It might be, just needs to be coded to see which is easier.
the overhead of starting/stopping the 2 vs 4 smart pins and gathering/interleaving their data differs because in one case we can clock in 16 bits (x2) in a row and in the other only 8 bits (x4)
After it's inside a 32 bit register we are dealing with a mere 3,125,000 words/second... suddenly it sounds quite doable
doing a 32 bit crc should be REV+SETQ+8 times CRCNIB ... 10 instructions
i am thinking about how we can employ the PIN COMPARE interrupt to start the smart pins once the "start frame delimiter" is received (i.e. the first "11" bits on the line that arrive while RX_CLOCK is high, right after the 0101... preamble)
Yes it is critical to get this part right to find the start of the frame. I think it is potentially the hardest part of the whole thing and needs to work or nothing will work. If we can't delimit at wire speed then we'll have a problem keeping up with the data.
once the header of the packet is received, the packet length is known and the smart pins can be stopped at the exact frame end
Only if you interpret what is in the packet, like the IP header contents (and you don't want to do that). Otherwise you probably don't really know the length until the packet ends or it exceeds some upper limit. This might mean the CRC needs to keep a pipeline of the previous 4 values and select the one from four bytes ago for comparison only after the packet ends.
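That pipeline idea can be sketched in Python: keep the CRC state from four bytes ago, and once the frame ends, that state covers exactly the payload while the last four bytes are the FCS (names and the byte-wise processing are mine; the frame is the capture from earlier in the thread):

```python
from collections import deque

POLY = 0xEDB88320

def step(crc, b):
    """Byte-wise reflected CRC-32 update."""
    crc ^= b
    for _ in range(8):
        crc = (crc >> 1) ^ POLY if crc & 1 else crc >> 1
    return crc

def check_stream(stream):
    """Receive data+FCS without knowing the length in advance."""
    crc = 0xFFFFFFFF
    history = deque([crc], maxlen=5)   # CRC states, delayed by up to 4 bytes
    tail = deque(maxlen=4)             # last 4 bytes seen (candidate FCS)
    for b in stream:
        crc = step(crc, b)
        history.append(crc)
        tail.append(b)
    payload_crc = history[0] ^ 0xFFFFFFFF          # state from 4 bytes ago
    return payload_crc == int.from_bytes(bytes(tail), "little")

frame = bytes.fromhex(
    "ffffffffffff11111111111181000064"
    "810000c8080600010800060400011111"
    "11111111c0a802010000000000000101"
    "01010000000000000000000000000000")
fcs = bytes([0x5B, 0x00, 0x6C, 0x87])

assert check_stream(frame + fcs)                                  # good frame passes
assert not check_stream(bytes([frame[0] ^ 1]) + frame[1:] + fcs)  # corruption fails
```

On the P2 the "history" would just be four registers rotated each long, so the cost is small compared with the CRC itself.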
after running through the CRC the data could be sent off into the HUB memory or be kept inside the cog to process it there while switching to another cog to receive the next frame
I'd hope it would be possible to do the CRC and use the FIFO to write the Ethernet data bits to HUB RAM at the same time. This way there is less dead time and the COG would be ready to receive the next frame sooner, i.e. right after the 12-byte IPG.
There is a possibility of using a cog pair with shared LUT. This is precisely the situation I had in mind for the shared LUT.
In what way do you think it should be shared in this application? To have an RX coordinate with TX COG somehow, or for something else? I know for full duplex operation we'd want independent RX and TX COGs because RX packets could arrive at any time while sending packets.
I am playing with a Saleae-like analyzer (using Sigrok PulseView).
I have found that PulseView has a decoder for MDIO. Here is the source code:
https://sigrok.org/gitweb/?p=libsigrokdecode.git;a=tree;f=decoders/mdio
I was unsure if my code was working correctly or not (it matched the TI MSP430 USB-2-MDIO tool, but sometimes the output is not stable).
I found a good register (register 0x17, 23d) in the DP83848 to experiment with. This register has some bits that are RW and some others that are read-only and always return 0.
So I tested writing to and reading from this register. For this test, I implemented a few new menu options to select the register and do writes (this is version 0.4).
I found that this decoder (in PulseView) is not working correctly for reads (it is impossible to read 67h from register 17h due to the read-only bits).
My code reads and writes correctly most of the time (so there could still be some bugs that I was not able to find). I wonder if it could be something related to the turnaround bit time. I will upload this new code v0.4 to the first page (first post).