Webserver randomly becomes unresponsive

sstandfast · 2011-01-05 09:59

@Beau - Just out of curiosity, what was the break-over point? I.E. how long could you pause before the hang issues came back.

Beau Schwabe · 2011-01-05 10:07

sstandfast,

"Just out of curiosity, what was the break-over point? I.E. how long could you pause before the hang issues came back. " --- Since the problem seems to be random in nature anyway (<--human ping rate, how fast can you click refresh) ... Without any delay in the code we could get thousands of hits... with a 1 second or so delay we might get 2 hits one time, and 25 another before it became 'unresponsive'.

Mike G · 2011-01-05 10:21

Thanks for sharing Beau.

For another point of reference, my server has been up for two days. Some of the processes the web server executes are expensive. Every time a dynamic page is requested, a temp file is created on the SD card and served up. There's also some delay involved in retrieving POST and GET values. I'm grabbing the key/value pairs at the time of the request, putting the values in a stack, decoding the string, and returning a pointer. When there are a lot of requests, it takes a bit of time. See this page and click save: http://spinneret.servebeer.com:5000/config.htm.

I did some stress testing a few weeks ago and found that sometimes a request would come in but with no data. So I set a timer when this condition happens. If time runs out and still no data then I reset the socket. That along with the driver fix seemed to get me up and running consistently.
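
A minimal host-side sketch of the "empty request" timer described above, written in Python as a model rather than as Spinneret code. The names `wait_for_request`, `read_available` and `reset_socket` are hypothetical stand-ins for whatever the server uses to read the received byte count and to close and reopen the socket; the one-second timeout is an assumed value, not one taken from this thread.

```python
import time

def wait_for_request(read_available, reset_socket, timeout_s=1.0, poll_s=0.01):
    """Wait for request data after a connection; reset the socket on timeout.

    read_available and reset_socket are hypothetical callbacks (assumptions).
    Returns True if data arrived, False if the socket had to be reset.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if read_available() > 0:
            return True          # data showed up, handle the request normally
        time.sleep(poll_s)       # connection is open but still empty
    reset_socket()               # timer ran out with no data: recover the socket
    return False
```

The point of the design is that a connection with no data can never hold the socket forever; the worst case is one timeout period.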

Beau Schwabe · 2011-01-05 10:39

Mike G,

That gives me an idea... a PWp ("Please Wait page") that loads the dynamic page once it has been rendered might be a workable solution.

The PWp wouldn't need to say anything indicating that the user is waiting. A simple script in the PWp would reload the rendered html when it was complete. In most cases that should be fast enough that you wouldn't be aware of the PWp.

just a thought.

Mike G · 2011-01-05 10:45

I read your post a couple of times... I'm having a hard time picturing how that would work.

Beau Schwabe · 2011-01-05 10:52

Mike G,

Basically upon the initial connection you have a bare minimum html that responds back to the web request. In that bare minimum html code there is a timer that launches the page >again<. The html timer allows you time to render the "new" html in the background so to speak, without causing any timeout issues from the web request. The actual duration of the html timer would be small enough that to the person making the web request, they would be unaware of the delay.
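
One way to picture the bare-minimum page, sketched in Python as a string builder. This uses an HTML meta refresh as the "html timer", which is an assumption on my part; the post above only says "timer" and a script would work as well. The function name `please_wait_page` and the `/rendered.htm` target are hypothetical.

```python
def please_wait_page(target_url, delay_s=1):
    """Build a placeholder page that makes the browser request target_url
    again after delay_s seconds (meta refresh, an assumed mechanism)."""
    return (
        "<html><head>"
        f'<meta http-equiv="refresh" content="{delay_s};url={target_url}">'
        "</head><body></body></html>"
    )

# The server would answer the first request with this small page, render the
# real page in the background, and serve it when the browser comes back.
page = please_wait_page("/rendered.htm")
```

Because the placeholder is tiny and static, the first response can go out right away, and the expensive rendering no longer sits between the request and the first byte sent.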

chillybasen · 2011-01-05 11:08

FYI, I just moved my servo moving code outside the ethernet code (after ETHERNET.SocketClose(0)) and it seems to be much more stable. I threw a few thousand requests at it and it's still up.

The code is here, and it is built off of Bean's webserver with GET

Mike G · 2011-01-05 12:18

One other thing that I always do is write the HTML header in a separate method right before rendering a dynamic (or static) page. So the browser is only waiting for the message body and usually the body is ready to go. For the most part, I already ran whatever processes and have the rendered content in memory before sending the response header.

@chillybasen, I'd do all the processing right after getting the request not in the middle of sending the response. Basically, get all the data ready to go, spin up any processing etc. then render the page.

For example, I'd put this logic elsewhere if you can. I know it does not take too long to run.

      if (byte[VarStr[1]] <> 0)
        if (byte[VarStr[1]] == "o")
          status := "1"
          StringSend(0, string("opening", CR))
        elseif (byte[VarStr[1]] == "c")
          status := "0"
          StringSend(0, string("closing", CR))
      elseif (status == "1")
        StringSend(0, string("open", CR))
      elseif (status == "0")
        StringSend(0, string("closed", CR))

jstjohnz · 2011-01-05 19:37

I am seeing this same issue. I started monitoring the socket status register. Normally its value is $14 while waiting for a connection. What I found was that when the hangs were occurring, the status register showed either $1C or 0.

What I am trying now is, every time I check for the presence of a connection, if there is no connection I then check the status register. If it's not $14, I close and re-open the socket.

So far so good. Doesn't explain the why, but seems to be a workaround.
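
The workaround above can be modelled on a PC as follows. This is only a sketch in Python, not driver code: `established`, `read_status` and `reopen` are hypothetical callbacks standing in for the driver calls, and the constant names are just labels for the values mentioned in this post ($14 while waiting, $1C or 0 when hung).

```python
STATUS_LISTENING = 0x14   # value seen while waiting for a connection
STATUS_STUCK_1C = 0x1C    # one of the two values seen during hangs
STATUS_STUCK_00 = 0x00    # the other value seen during hangs

def check_listening_socket(established, read_status, reopen):
    """Run once per polling pass; returns True if the socket was reopened.

    All three arguments are hypothetical callbacks (assumptions).
    """
    if established():
        return False               # a client is connected, serve it as usual
    if read_status() != STATUS_LISTENING:
        reopen()                   # close, then open and listen again
        return True
    return False                   # idle and healthy, nothing to do
```

Anything other than $14 on an idle socket is treated as bad, so the check also catches states beyond the two observed so far.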

jstjohnz · 2011-01-06 02:28

kuisma wrote: »

With my code, there is no longer any 2K boundary. It is all handled internally in a sound way, and you can transmit (e.g. enqueue) a 10kB request with txTCP as TCP is supposed to work. Try it out some more, and if you still believe it is the driver, I'll have another look at it.

Kuisma, can you explain your rxtcp fix? I don't understand what this is supposed to do.

-jim-

kuisma · 2011-01-06 02:38

jstjohnz wrote: »

Kuisma, can you explain your rxtcp fix? I don't understand what this is supposed to do.

-jim-

RSR starts out uninitialized, and only its lower 16 bits are assigned data, but the entire 32-bit long was returned. So from time to time it returned garbage instead of the amount of data read.
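
To make the effect concrete, here is a small Python model of the bug as described. It is an illustration only, not the driver source: `stale_long` stands for whatever happened to be in the uninitialized 32-bit variable, and both function names are hypothetical.

```python
def returned_size_buggy(stale_long, rsr16):
    """Only the low 16 bits are overwritten; the whole 32-bit long is returned."""
    return (stale_long & 0xFFFF0000) | (rsr16 & 0xFFFF)

def returned_size_fixed(stale_long, rsr16):
    """Start from zero (or mask), so stale upper bits cannot leak through."""
    return rsr16 & 0xFFFF

# With clean upper bits both versions agree, which is why the bug looks random.
assert returned_size_buggy(0x00000000, 0x0040) == returned_size_fixed(0x00000000, 0x0040)

# With leftover data in the upper 16 bits, the buggy version reports nonsense.
print(hex(returned_size_buggy(0x12340000, 0x0040)))   # 0x12340040
print(hex(returned_size_fixed(0x12340000, 0x0040)))   # 0x40
```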

Beau Schwabe · 2011-01-06 08:04

jstjohnz,

Hmm, on page 29 and 30 of the PDF...(see below)

$14 certainly seems to be the correct status mode. $00 seems to shut down the port indefinitely, while $1C perpetually waits for a close. If it doesn't get a close it will just sit there.

Can you provide the snippet of code you are using to check the status?

$00 - SOCK_CLOSED - It is shown in case that CLOSE commands are given to Sn_CR, and Timeout interrupt is asserted or connection is terminated. In this SOCK_CLOSED status, no operation occurs and all resources for the connection is released.

$1C - SOCK_CLOSE_WAIT - It is shown in case that connection termination request is received from peer host. At this status, the Acknowledge message has been received from the peer, but not disconnected. The connection can be closed by receiving the DISCON or CLOSE commands.

$14 - SOCK_LISTEN - It is shown in case that LISTEN commands are given to Sn_CR at the SOCK_INIT status. The related socket will operate as TCP Server mode, and become ESTABLISHED status if connection request is normally received.

jstjohnz · 2011-01-06 11:27

Beau Schwabe (Parallax) wrote: »

Can you provide the snippet of code you are using to check the status?

I first added this method to the ethernet driver:

'***************************************
PUB SocketStatus(_socket) | temp0
'***************************************
'' return socket status register     new 01/05/2011 jstjohnz for detecting 'bad' socket status
''
''  params: _socket is a value of 0 to 3 - only four sockets on the W5100
''  return: the socket's status register byte (e.g. $14 while listening)

  'If the ASM cog is running, execute the command
  if (W5100flags & _Flag_ASMstarted)

      readSPI((_S0_SR + (_socket * $0100)), @temp0, 1)
      return temp0.byte[0]

Then in my main spin routine, if I don't see a connection established I do a check to see if the status register value is wrong:

  if WIZ.SocketTCPestablished(websocket) == 0   'if no connection is established
    if WIZ.SocketStatus(websocket) <> $14       'check for proper socket status
      webstate := 99                            'if wrong, set flag so we know to close/reopen the socket
                                                'basically this should execute the code that you would normally
                                                'execute after sending the page, i.e. close then re-open the socket.
    return
  else
    'code to send the page goes here

The reason for the "webstate:=" is that I am having to split my web page code into small chunks because I am trying to render the web page without missing incoming UDP packets on the other 3 sockets. So, I can only spend a few ms at a time executing the server code. I use the webstate variable with a CASE command to execute a specific portion of the page rendering code on each pass, then I check and deal with the other 3 sockets.
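
The chunking idea can be sketched on a PC like this. It is a Python model, not the Spin code: `PageSender`, `main_loop` and `service_udp_sockets` are hypothetical names, and the list index plays the role of the `webstate` variable that the CASE statement switches on.

```python
class PageSender:
    """Sends a page one small chunk per call so other work can run in between."""

    def __init__(self, chunks, send):
        self.chunks = chunks      # the page split into small pieces
        self.send = send          # hypothetical callback that transmits one piece
        self.webstate = 0         # which piece to send on the next pass

    def step(self):
        """Send at most one chunk; return True while more chunks remain."""
        if self.webstate < len(self.chunks):
            self.send(self.chunks[self.webstate])
            self.webstate += 1
        return self.webstate < len(self.chunks)

def main_loop(sender, service_udp_sockets):
    """Alternate between one chunk of web work and the other sockets."""
    more = True
    while more:
        more = sender.step()
        service_udp_sockets()     # never starved for longer than one chunk
```

Each pass spends only as long as the largest chunk takes, so the size of the chunks sets the worst-case delay before the UDP sockets are checked again.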

Here's a copy of the page I'm getting now after being up about 11 hours:

-jim-

Beau Schwabe · 2011-01-07 20:22

I am trying out a 'new' correction scheme to monitor the status register... The server has been up almost 24 hours now without hanging ... I'll let it continue over the weekend, but I need some more hits to be certain that this scheme will work. I'll leave the web-cam up as well, so you can see the interactive results. ...I am keeping a tally on this end of bad vs. good status results and will provide code and the results later.

Thanks

Web-Server --> http://24.253.241.231:5555/
Web-Camera --> http://24.253.241.231:8081/

Beau Schwabe · 2011-01-07 22:15

Here is a basic connection flow that I'm using... Within about 1200 hits there are at least 3 that would have resulted in an 'unresponsive' state. The attached flow was able to recognize a transmission error and recover.

Refer to the PDF starting at the bottom of page 29 for the meaning of the Status values

Hope this helps

DynamoBen · 2011-01-08 09:24

While checking the status and resetting the socket is a reasonable workaround for this, I'm concerned by the fact that it is happening in the first place. Has anyone been able to isolate if this is on the WizNet side or in the driver?

kuisma · 2011-01-08 09:30

DynamoBen wrote: »

While checking the status and resetting the socket is a reasonable workaround for this, I'm concerned by the fact that it is happening in the first place. Has anyone been able to isolate if this is on the WizNet side or in the driver?

Using the TCP Echo server and quite a nasty client, I have exercised the driver quite thoroughly without being able to either hang it or lose any data.

DynamoBen · 2011-01-08 10:48

kuisma wrote: »

Using the TCP Echo server and quite a nasty client, I have exercised the driver quite thoroughly without being able to either hang it or lose any data.

Interesting, so the driver may be OK; then it must be an issue with the way we are using the driver.

Anyone know if this problem is before, during, or after a page is served up?

Assuming this issue occurs after a page is served up there might be a timing problem between resetting the socket and waiting for another connection. If that is the case a simple fix could be a short wait between reloading the socket and listening.

kuisma · 2011-01-08 10:54

DynamoBen wrote: »

Interesting, so the driver may be OK; then it must be an issue with the way we are using the driver.

Well, not really. I just can't reproduce any error related to the driver, but still, I assure you, there are bugs in the driver.

If you have code reproducing the bug, and believe it's the driver, please upload the code here.

Mike G · 2011-01-08 15:59

I wanted to mention this... browsers look for favicon.ico. I knew about favicon.ico but it did not cross my mind until I started using Kye's FAT driver with my HttpRequest object. The HttpRequest object processes all requests. So when it sees

GET /favicon.ico HTTP/1.1

the object tries to grab and render the file. Well, if the file does not exist, the FAT driver has a problem. Anyway, just be aware that the browser requests favicon.ico .
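
A small Python sketch of the guard this implies: check that the file exists before handing the request to the file system code, and answer 404 otherwise. This is a model only; `resolve_request` is a hypothetical name and the dictionary stands in for the SD card, so nothing here reflects how the HttpRequest object or the FAT driver is actually written.

```python
def resolve_request(request_line, files):
    """Return (status, body) for a request line such as
    'GET /favicon.ico HTTP/1.1'. files maps path -> bytes (a stand-in
    for the SD card contents, which is an assumption)."""
    parts = request_line.split()
    if len(parts) != 3 or parts[0] != "GET":
        return 400, b""           # not a request line this sketch understands
    path = parts[1]
    if path not in files:
        return 404, b""           # missing file: answer without touching the FAT code
    return 200, files[path]

files = {"/index.htm": b"<html>hello</html>"}
status, body = resolve_request("GET /favicon.ico HTTP/1.1", files)   # status is 404
```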

sstandfast · 2011-01-10 09:24

I have made a similar change to my interrupt-based W5100 as recommended by Beau. I am now checking the status register and resetting the socket if it is not in the "listen" state prior to sleeping the Cog. I also changed the dynamic page generation to get data from a stored variable, rather than polling the node controllers directly. (This was on the ToDo list anyway as I need to read and respond to various room temperatures.) So far it has been up and running the last two days (211800+ seconds uptime) without locking up. Unfortunately, I can't generate enough traffic to really test it out. I know I'm probably gonna be disappointed with the results, but do you guys think you might be able to give me a thousand or so hits to test stability? I would really appreciate it. The address is http://krwsms.dyndns.org. I'm at work now, so I won't be able to reset it if it goes down, but I should be able to check it throughout the day to see if it is still up.

Thanks again,

Shawn

DynamoBen · 2011-01-10 15:15

For those looking to test their webserver you may want to consider Selenium:

http://seleniumhq.org/

Mike G · 2011-01-10 15:27

@sstandfast, it's hard to hit your site with any stress because the googlelead.g.doubleclick.net... script has to load first.

Anyway, the site seems to work fine.

David Carrier · 2011-01-10 16:13

sstandfast,
I think I crashed your Spinneret Web Server. The best test I have found is to enter the IP address and port as the URI (in this case http://74.195.253.203:8000/) and hold Ctrl+R to reload continuously. It has been able to crash everyone else's Spinneret Web Server too.

David Carrier
Parallax Inc.
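
For anyone who would rather script the reload test than hold Ctrl+R, here is a rough Python sketch using only the standard library. It sends requests one after another, so it is not identical to a held Ctrl+R (a browser may abandon requests in flight). `hammer` and `fetch_status` are hypothetical names, and the URL in the usage comment is a placeholder, not a real server.

```python
import http.client
import urllib.request

def fetch_status(url, timeout_s=5.0):
    """Fetch url once and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.status

def hammer(url, count, fetch=fetch_status):
    """Request url count times; return (ok, failed, index_of_first_failure)."""
    ok = failed = 0
    first_failure = None
    for i in range(count):
        try:
            status = fetch(url)
        except (OSError, http.client.HTTPException):
            status = None         # timeout, refused connection, bad reply, ...
        if status == 200:
            ok += 1
        else:
            failed += 1
            if first_failure is None:
                first_failure = i
    return ok, failed, first_failure

# Example (placeholder address): hammer("http://192.168.1.50:8000/", 1000)
```

Knowing the index of the first failure gives a rough figure for how many hits a server survives before it becomes unresponsive.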

David Carrier · 2011-01-10 16:21

sstandfast,
I stand corrected, it is back up and the Site Counter has gone up quite a bit. It may have queued all of those requests and it wasn't responding until it finished them.

David Carrier
Parallax Inc.

Mike G · 2011-01-10 16:24

Give this one your Ctrl-R best.
http://www.agaverobotics.com/spinneret/controlpanel.htm

or this
http://spinneret.servebeer.com:5000/index.htm

sstandfast · 2011-01-10 16:24

@David, I thought you crashed it too at around 395-400 hits, but apparently it was able to recover after about 10 more connection attempts.

@Mike, to avoid the google ads at the top, try using this URI instead: http://krwredirector.dyndns.org:8000/ This is the destination of the WebHop URI and points directly to my IP.

Mike G · 2011-01-10 16:32

I crashed it myself

zapmaster · 2011-01-10 17:30

I'm having the crash issues also. My server is up and running. It auto-resets when it crashes. I'm going to log the hits and crashes to the EEPROM.
Here is my IP address: http://72.55.239.53
I also have an issue with not all the data being transmitted to the destination. The page only partially loads.

I did not have the .8 code running; I had the .6.
I will see how it is tomorrow at work.
Tell me what you see.

Mike G · 2011-01-13 06:38

I made a slight change to my web server that allows it to recover if it gets stuck. I ran a few tests and it seems to be working.
http://spinneret.servebeer.com:5000/
