W5100 Sn_SR undocumented misfeature / bug
kuisma
Posts: 134
When sending UDP, it is necessary to check the Sn_SR register and make sure the socket is in SOCK_UDP before actually transmitting the data. Sometimes after a transmit, the status may change from SOCK_UDP (0x22) to the undocumented status 0x01. I guess this occurs during ARP requests, but it is not documented, so your guess is as good as mine.
The conclusion is: always check the socket status before transmitting. I have not checked whether this particular misfeature also occurs with TCP, but with TCP you should always monitor the socket status anyway, due to the peer's ability to close it.
Maybe this would best be handled by the driver's txUDP method:
    repeat
      readIND((_S0_SR + (_socket * $0100)), @rv, 1)
    until rv.byte[0] & $CF <> $01
    if rv.byte[0] <> _SOCK_UDP
      return false
    ... send ...
In the TCP case, ARP is only used during the connection setup phase, so this is not a problem -- and this is documented.
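For readers who want to try the logic of the polling loop above without a Spinneret, here is a minimal host-side sketch in Python rather than Spin. The `read_status` callable and the `max_polls` bound are assumptions added for illustration; the original loop has no bound. The mask $CF clears bits 4 and 5, so 0x01, 0x11, 0x21 and 0x31 all reduce to 0x01, while SOCK_UDP (0x22) does not.

```python
SOCK_UDP = 0x22      # Sn_SR value for an open UDP socket
ARP_MASK = 0xCF      # mask used in the Spin loop; clears bits 4 and 5
ARP_PENDING = 0x01   # 0x01, 0x11, 0x21 and 0x31 all reduce to this value


def udp_ready(read_status, max_polls=1000):
    """Poll until the socket leaves the ARP states; True only if it is SOCK_UDP.

    read_status is a hypothetical callable returning the current Sn_SR byte.
    max_polls is an added safety bound, not part of the original loop.
    """
    for _ in range(max_polls):
        status = read_status()
        if status & ARP_MASK != ARP_PENDING:
            return status == SOCK_UDP
    return False  # still in an ARP state after max_polls reads


# Simulated register: two reads in 0x01, then back to SOCK_UDP.
readings = iter([0x01, 0x01, 0x22])
print(udp_ready(lambda: next(readings)))
```

The bound only keeps the caller from spinning forever if the status never resolves; whether that can happen on real hardware is not established in this thread.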
Comments
On a similar topic, I was thinking about building user interface software that runs on a PC/Mac and uses a USB serial connection to the Spinneret to regularly poll all the values from the W5100 and display the status. This could be a debugger of sorts for users trying new code, or for when we are implementing new driver changes. Of course, the updates to the PC may be regular or intermittent, depending on where the user branches to include the code that runs through all the W5100 registers. Good idea or bad idea?
Just uploaded a new version of the drivers with the fixes and some more.
Neat, but quite a limited (but exclusive) audience, methinks. I have some similar thoughts, but the other way around. I'm developing quite a complex application that uses up all the cogs and is quite timing sensitive. Sending debug data over the serial interface impacts the timing of the system, so I'm using syslog instead (multiple access, of course).
Going one step further would mean implementing a more general API over the network, used for both debugging and application-to-application communication: a lightweight RPC service.
http://en.wikipedia.org/wiki/Syslog
A bit quick and dirty. You need to fill in your syslog server at "syslog_packet".
syslog.spin
I just got this confirmed by WIZnet, and they'll fix the documentation.
I don't understand, though, why a socket that is listening on a UDP port should/would/could transition to another state (other than if there is an error and it is closed). I understand ARP, why would that be involved?
Thanks for any info you can send, as it will no doubt improve my knowledge in this area.
Without an ARP cache entry, it is not possible for the W5100 to send the packet to the host in question. I find it quite logical that the W5100 chip signals this by changing its socket status from "UDP all ok (0x22)" to "UDP please wait, we have no ARP entry (0x01)", making sure the user does not try to send more data to the (at least for now) unresponsive host.
Good Job! ... It's always nice to find a real bug (in this case a feature creature) and then get confirmation on it.
Since WIZnet managed to forget status code 0x11 in the latest version of the W5100 datasheet (1.2.3), I verified what you mention here about status 0x11. As far as I can see:
Sn_SR 0x01 SOCK_ARP_REQUEST outstanding
Sn_SR 0x11 SOCK_ARP_REPLY processing
Neither of those states requires a socket reset; both are handled by simply waiting for the socket to move on to the next state.
I guess we'll see another update of the datasheet soon.
1st: Since it's sometimes easy to create misunderstandings through text, let me say upfront that I'm trying to understand, not argue.
2nd: It would be great to have some sort of announcement (perhaps on this thread) when the revised datasheet is available. I refer to it often (I have it downloaded locally - which means it's already obsolete).
3rd: Your explanation makes sense in how/why ARP can affect a UDP socket. However, let me add some detail and see if my understanding is correct, or if the observed behavior is explained by the pending datasheet revisions:
When a UDP socket is opened, it is almost stateless (I think of it as in an "active-listen" mode); all it knows is what port it's listening to (Sn_PORT: <n> is port number). It gets the source IP address from the value stored (globally) in the WizNet register (SIPR). The destination IP address (Sn_DIPR) and destination port (Sn_DPORT) are configured on-the-fly whenever needed for UDPtx, without having to close/re-open the socket. [NOTE: it would also be interesting to test to see if it is a valid operation to modify the value of Sn_PORT while the socket is open, or if it must be closed, modified and then re-opened.]
So, if I open a UDP socket and just loop, waiting to receive a packet on the specified port, I would expect that I could never get the 0x01 or 0x11 socket states: I'm never sending (from that socket) so there would be no ARP_REQUEST to be sent so I wouldn't be waiting for a reply (status 0x01), and similarly if no request is sent, I would never have to process an ARP_REPLY so I'd never see that (status 0x11).
Of course, this all changes if I am sending data from this UDP socket. In that case, it would be interesting to see what happens if:
a) the ARP REQUEST results in a valid reply. When processed successfully, I would assume that the queued data would be sent and the socket status would reset (status 0x22). I have not verified this but I'd be surprised if not so. Otherwise, there would be a hang condition whenever the ARP table was updated (since none of us even knew of these two states until recently and therefore couldn't possibly have handled them).
b) the ARP REQUEST times out with no reply. Would the socket hang, waiting for a close/re-open? Upon timeout, would it just toss the packet that was to be transmitted and reset? I could justify either choice, as long as it's consistently applied. This is probably easier to test just by sending a UDP packet with an IP address on the local subnet that has no valid host to reply. I could do that if it would help provide new information (although someone else has probably already figured that out by now).
Thanks for adding to the collective understanding on this topic.
Not necessarily true. If the W5100 UDP state machine behaves like TCP, the UDP socket may enter state 0x11 (ARP_REPLY processing) after receiving the ARP request from the remote host and stay there until the W5100 has replied to this request. I have not seen this in real life, but the scenario is perfectly consistent with the TCP case, and the 0x11 state seems to be very short-lived (for obvious reasons). If not 0x11, maybe some other value of SOCK_ARP, e.g. 0x21 or 0x31 as documented in the pre-1.2.3 datasheet.
Yes, the socket re-enters state SOCK_UDP (0x22) again, and the queued packet is transmitted (or in reverse order, not really significant).
The packet is lost, and any packets queued during the time the socket is in 0x01 are silently lost as well. After 1.8 s the W5100 times out the 0x01 state and re-enters 0x22.
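To make the two cases above concrete, here is a toy Python model of the reported behaviour. It is only a sketch of what was observed in this thread, not the W5100's real internal logic; the class, its method names and the `known_hosts` set (a stand-in for the ARP table) are all made up for illustration, as is the example address.

```python
SOCK_UDP = 0x22
SOCK_ARP_REQUEST = 0x01   # undocumented value reported in this thread
ARP_TIMEOUT_S = 1.8       # timeout observed in this thread


class UdpSocketModel:
    """Toy model of the observed UDP/ARP behaviour (hypothetical names)."""

    def __init__(self, known_hosts=()):
        self.status = SOCK_UDP
        self.known_hosts = set(known_hosts)  # stand-in for the ARP table
        self.pending = None                  # (host, packet) waiting on ARP
        self.waited = 0.0
        self.sent = []
        self.lost = []

    def send(self, host, packet):
        if self.status == SOCK_ARP_REQUEST:
            self.lost.append(packet)         # sent while in 0x01: silently lost
        elif host in self.known_hosts:
            self.sent.append(packet)
        else:
            self.pending = (host, packet)    # triggering packet waits for ARP
            self.waited = 0.0
            self.status = SOCK_ARP_REQUEST

    def arp_reply(self):
        """Case a): a valid reply arrives, the waiting packet goes out."""
        if self.status != SOCK_ARP_REQUEST:
            return
        host, packet = self.pending
        self.known_hosts.add(host)
        self.sent.append(packet)
        self._back_to_udp()

    def tick(self, seconds):
        """Case b): no reply; after the timeout the waiting packet is dropped."""
        if self.status != SOCK_ARP_REQUEST:
            return
        self.waited += seconds
        if self.waited >= ARP_TIMEOUT_S:
            self.lost.append(self.pending[1])
            self._back_to_udp()

    def _back_to_udp(self):
        self.pending = None
        self.status = SOCK_UDP


sock = UdpSocketModel()
sock.send("192.168.0.50", "first")    # no ARP entry: status becomes 0x01
sock.send("192.168.0.50", "second")   # sent during 0x01: lost
sock.tick(2.0)                        # no reply within 1.8 s
print(hex(sock.status), sock.sent, sock.lost)
```

In this model nothing ever requires closing and re-opening the socket, which matches the observation that waiting is enough.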
My pleasure.
I just did a test bombarding the W5100 with ARP requests monitoring the socket state, and I could not see it ever leaving the SOCK_UDP (0x22) state.
You know, I thought about that. Before my previous post, I did some research, but couldn't confirm what I suspect. Basically, I think of ARP as one of the 'tween protocols, meaning in some ways it is straddling OSI Layers 2 & 3. It is aware of IP, but operates at the Data Link layer (just the way I think of it, might not be how others do). While on the requesting host side, the IP address is included in the request, and at the responding host side, it is used to determine if the message is for me ("you talkin' to me...?"), it never really bubbles up to the IP layer. I don't believe there is a notion of UDP or TCP, nor ports. So, as I thought about it, it seems illogical/inconsistent (to me) for the WizNet (or any host) to associate a port with any ARP activity. It is more like a hardware or network address - in the WizNet, that is stored and handled globally, not relative to any socket. Furthermore, particularly because we're limited to only 4 sockets on the WizNet, I'd hope that whatever could be done at a lower level would be. In other words, unless some problem (unreachable address) is encountered, the sockets are all oblivious to ARP functioning. That's my hope, but I realize often hope <> reality.
This raises another interesting question: with UDP, when an address is unreachable, the IP address and port are stored in the global UIPR & UPORT registers. The docs say this is when an ICMP destination unreachable message is received. For the case where the destination IP address is on the same subnet, there will be no valid ARP response, and ICMP isn't used. I wonder if these registers are updated in that case also. If so, I wonder if the WizNet will attempt to find the hardware address for the given destination IP address every time sending a packet to that address is attempted, or if it checks the UIPR and doesn't even try. If the latter, I wonder how one clears that register (say in the case of a host booting and being available). There surely are many details to explore!
My thanks again...
I totally agree in the server-side case (in the client case you'll need this info to determine connectivity). Still, going from TCP SOCK_LISTEN to SOCK_ESTABLISHED sometimes passes the state SOCK_ARP (0x11).
a) UDP and TCP are not handled the same way by the W5100 implementation with respect to ARP.
b) 0x11 is something completely different and not related to ARP at all.
c) Some hardware optimization requires reuse of sockets, and even ARPs need resources otherwise allocated to one of the four sockets.
But I have no idea which of a), b) and/or c) may be true.
Actually, that makes sense to me. SOCK_LISTEN is unaware of the (yet-to-be-connected) other host. When a TCP CONNECT is received, all the L2 stuff is gone - it's just IP address, port, etc. For the server to connect, it has to respond to the requesting host, and it may not have an entry in its ARP table - so it does go through that state. This might not be what is actually happening, but this model seems consistent and works for me.
If my hypothesis/model above is correct, these may not actually be inconsistencies as suspected.
Sadly, that is often the case. As I said, often hope <> reality, and there is a good amount of functionality in the little WizNet. Still, I hope the design doesn't have to consume the precious few user-accessible socket resources for "housekeeping".
Well, no. When the server gets the SYN, this packet already contains the client's MAC address, so no ARP request is needed on the server side. And in fact the W5100 performs no ARP request in this scenario, not even when traversing the 0x11 state.
OK, I'm still dealing with this "feature". I'm experiencing unexplained hangs with my simple HTTP server and continue to suspect this revelation as a potential cause. Several comments/questions:
1) Yes, I agree that every packet sent over Ethernet (L1) contains link (L2) information (MAC address) as well as IP (L3) address. My thought was that by the time TCP/UDP (L4) gets that packet (the exposed WIZnet API), the lowest level information has been stripped off (including MAC address).
2) However, if your hypothesis is correct and the TCP_LISTEN port does receive a CONNECT request, and it does still maintain the MAC of the potential client, then why would ARP be involved?
3) Now I'm really lost. If it's unrelated to ARP, what is it related to (and can you rationalize why it's named "SOCK_ARP"?)
4) Re-reading the datasheet (v1.2.2) I also see there are multiple "SOCK_ARP" status codes: 0x11, 0x21, 0x31. Any idea what they mean and how they differ?
Once again, for the sake of continued constructive communication, my goal here is to learn & understand, not argue.
Thanks!
At TCP/UDP level, yes, but at this time the ARP table is already populated. The W5100 will not send out its own ARP request in this case. This is both as suspected and verified.
Before the peer sends its connect request (SYN) it will need the MAC address of the Spinneret. This is resolved by ARP, but in this case the W5100 is replying to ARP, not requesting.
I still think this is related to ARP, but since this is not documented, I'm emphasizing that this is a guess, and I might be wrong. Also, in the latest version of the datasheet (v1.2.3), it is no longer named "SOCK_ARP"; it is not mentioned at all. I have not gotten any explanation as to why, but I believe this is a mistake.
My best guess is:
0x01 = ARP_REQUEST_OUTSTANDING
0x11 = ARP_REPLY_PROCESSING
0x21 and 0x31 I've never seen.
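When logging Sn_SR during debugging, a small lookup along these lines can make the dumps easier to read. This is just a Python sketch: the names for 0x01 and 0x11 are the guesses above, not datasheet terms, and the `describe` helper is made up for illustration.

```python
# Names for 0x01 and 0x11 are the guesses from this thread, not datasheet terms.
STATUS_NAMES = {
    0x22: "SOCK_UDP",
    0x01: "ARP_REQUEST_OUTSTANDING (guess)",
    0x11: "ARP_REPLY_PROCESSING (guess)",
    0x21: "SOCK_ARP (pre-1.2.3 datasheet, never seen here)",
    0x31: "SOCK_ARP (pre-1.2.3 datasheet, never seen here)",
}


def describe(status):
    """Return a printable label for an Sn_SR byte (hypothetical helper)."""
    name = STATUS_NAMES.get(status, "not covered in this thread")
    return f"0x{status:02X} {name}"


for code in (0x22, 0x01, 0x11, 0x21, 0x31):
    print(describe(code))
```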
Ah, yes! Now I follow your point. I agree completely.
So, let me summarize and let's see if we agree.
For a UDP socket, the flow would be:
SOCK_CLOSED -> SOCK_UDP -> * -> SOCK_CLOSED
where "*" could be:
a) in the case of sending, SOCK_ARP_REQUESTED if this is the first time sending to this remote host (the remote host's MAC is not in the WIZnet's ARP table)
b) in the case of receiving, SOCK_ARP_REPLY if this is the first time the remote host has sent to the WIZnet (the WIZnet's MAC is not in the remote host's ARP table)
For a TCP server:
SOCK_CLOSED -> SOCK_INIT -> SOCK_LISTEN -> SOCK_ARP_REPLY* -> SOCK_SYNRECV -> SOCK_ESTABLISHED -> (various tear-down states not important for this discussion)
where the SOCK_ARP_REPLY would only be traversed if the remote host wanting to connect to the WIZnet's listening (server) port did not already have the WIZnet's MAC in its ARP table.
Finally, for a TCP client:
SOCK_CLOSED -> SOCK_INIT -> SOCK_ARP_REQUEST* -> SOCK_SYNSENT -> SOCK_ESTABLISHED -> (tear-down states)
where the SOCK_ARP_REQUEST state would only be traversed if the WIZnet had never sent to the remote host before (there is no entry in the WIZnet's ARP table for the remote host).
Does this make sense and agree with what you've seen? It is, at least, logical to me. I'd still really like to know what the purpose of the various SOCK_ARP status codes is. I wonder if you'd see the 0x21 and 0x31 states if you were using different/multiple socket flows that would be waiting on an ARP REPLY or were processing a response to an ARP_REQUEST??
No, more like "SOCK_UDP <-> SOCK_ARP_REQUEST". I.e. SOCK_ARP_REQUEST is a transient state going both from and to the SOCK_UDP state.
The above is verified.
The only other occurrence of an ARP state I have verified is during the server-side TCP connection phase: the state SOCK_ARP_REPLY may be traversed someplace in the chain from SOCK_LISTEN to SOCK_ESTABLISHED. I don't know where, and it seems it does not occur every time.
All the rest of your claims might very well be true, but I have not been able to see it myself.
The practical way of solving this is: if "Sn_SR & $CF == 1", wait a while for it to resolve. No other action is needed.
Yes, your description is more complete. If only receiving on a UDP port, no ARP_* state will ever occur. If sending on a UDP port, each time a new host (but not port) is addressed, there is the potential for an ARP_REQUEST state if that host MAC is not in the WIZnet ARP table. Agreed.
Yes, your method is a solution and I have implemented it. Now I'll just have to see if my random (and infrequent) unexplained hangs are gone!
Still, though, for the sake of understanding, I went back to basics and again wonder if the 0x11 state really is ARP-related or not (as you suggested in an earlier post). Clearly in client mode a TCP socket will transition through an ARP_REQUEST state before SOCK_SYNSENT and SOCK_ESTABLISHED, if there is no entry in the WIZnet's ARP table for the remote host being connected to. We agree that in server mode, a TCP socket in the SOCK_LISTEN state will need to wait for any remote host to send an ARP request before a CONNECT request can be sent to the WIZnet. You noted and I concede that although this is not something that *needs* to bubble up to the application or transport layer, it may cause some delay which is reflected by some additional state, perhaps the mysterious 0x11 state. However, if I'm not mistaken, the ARP request exists at the link layer, and doesn't contain an IP address or a PORT value - it's just a broadcast on the local segment and it won't pass through routers. If it has no PORT content, I wonder how any WIZnet *socket* register could be affected by an incoming ARP request? Even if this has to be handled at the socket layer due to practicalities/tradeoffs in the WIZnet hardware/stack implementation, how would the WIZnet decide which socket to associate with the received ARP request? I wonder if that is why the 0x11 state is rarely seen - because it's handled by any one of the 4 possible sockets available, and isn't necessarily associated with the socket that will ultimately be responding to the CONNECT request from the remote server that caused the ARP_REPLY.
I realize this is all speculation. As it seems you've been able to have some productive interaction with the good people at WIZnet (resulting in datasheet 1.2.3), do you think you could (and would you be interested and willing to) contact them to see exactly what the 0x11, 0x21 and 0x31 states really are, and more importantly, how they fit in a complete and correct state diagram?
Thanks for the continued discussion and learning process.
Well, if they can't figure it out then I give up!
Thanks.
Not even a ping.
No, not a line.