Forum scraping — Parallax Forums


Heater. Posts: 21,230
edited 2013-07-29 20:17 in General Discussion
Does anyone out there have a script that can:

1) Fetch all the pages of a given forum thread.
2) For each post in the page extract the user name, date and text.
3) Save the above as a plain text file, with white space preserved.
4) Fetch any attachments on those posts.

?

I can't help thinking there might be a Perl monger out there who has already done this.

Edit: Here is a program to do the above task: https://github.com/ZiCog/parallax-scrape
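For reference, step 3 of the wish-list above (save each post's user, date and text as plain text) could be sketched in node.js roughly like this. The post object shape ({user, date, text}) is an illustration only, not the actual parallax-scrape format.

```javascript
// Minimal sketch of step 3 above: render one extracted post as
// plain text, whitespace preserved. The {user, date, text} shape
// is an assumption for illustration, not parallax-scrape's format.
function formatPost(post) {
  return [
    '--------------------',
    post.date,
    post.user,
    '--------------------',
    post.text,
    '',
  ].join('\n');
}
```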

Comments

  • Duane Degn Posts: 10,588
    edited 2013-07-16 09:38
    I have the impression Ron Czapala could do this with Visual Basic.
  • Whit Posts: 4,191
    edited 2013-07-16 17:09
    This might be a start - Courtesy of W9GFO - Click here to search the Parallax and Savage Circuits forums
  • Heater. Posts: 21,230
    edited 2013-07-16 17:16
    I thought maybe this was "impossible" so I started to make my own.
    After half an hour or so I have some JavaScript running under node.js that fetches a given page's URL and pulls out the text of each post.
    Getting the rest of my specification right might take a lot longer.
    God I hate HTML.
  • Whit Posts: 4,191
    edited 2013-07-16 20:10
    Hey Heater - sorry - guess I didn't understand what you really wanted to do...
  • Heater. Posts: 21,230
    edited 2013-07-17 05:08
    No worries Whit.

    The Perl mongers let me down so I had a go in JavaScript. If you want to see how not to parse HTML in JavaScript I posted the code into a github repository: https://github.com/ZiCog/parallax-scrape

    There are some outstanding items, like removing the hardwired page URL and downloading attachments, but it's a start.

    Here is what it outputs for this thread:

    Fetching: http://forums.parallax.com/showthread.php/149173-Forum-scraping?p=1195982#post1195982


    Heater.


    Does anyone out there have a script that can:

    1) Fetch all the pages of a given forum thread.
    2) For each post in the page extract the user name, date and text.
    3) Save the above as a plain text file, with white space preserved.
    4) Fetch any attachments on those posts.

    ?

    I can't help thinking there might be a Perl monger out there who has already
    this.


    Duane Degn


    I have the
    impression
    Ron Czapala could do this with Visual Basic.
  • Ron Czapala Posts: 2,418
    edited 2013-07-17 06:41
    The vbscript I wrote to capture a forum thread outputs HTML, but it could be modified to output plain text.

    This thread's result:
    http://ronczap.home.insightbb.com/testpost.htm
  • Heater. Posts: 21,230
    edited 2013-07-17 07:18
    Ron,

    Can it be modified to run under Linux or Mac?
    :)

    I just managed to get parallax-scrape to extract the date and time of posts with an output like so:
    Fetching: http://forums.parallax.com/showthread.php/149173-Forum-scraping?p=1195982#post1195982
    
    
    --------------------
    07-16-2013, 03:46 PM
    Heater.
    --------------------
    Does anyone out there have a script that can: 
     
    1) Fetch all the pages of a given forum thread. 
    2) For each post in the page extract the user name, date and text. 
    3) Save the above as a plain text file, with white space preserved. 
    4) Fetch any attachments on those posts. 
     
    ? 
     
    I can't help thinking there might be a Perl monger out there who has already 
    done this. 
    
    
    --------------------
    07-16-2013, 04:38 PM
    Duane Degn
    --------------------
    I have the 
    impression 
    Ron Czapala could do this with Visual Basic. 
    
    
    

    Did I say I hate HTML?
  • Ron Czapala Posts: 2,418
    edited 2013-07-17 07:33
    Heater. wrote: »
    Ron,

    Can it be modified to run under Linux or Mac?
    :)

    Don't think so - vbscript is a MS development tool and my script creates an Internet Explorer object (using Object Linking and Embedding - think ActiveX) to access the Document Object Model and parse the HTML tags.

    - Ron
    Heater. wrote: »
    Did I say I hate HTML?

    Yeah, but without HTML, we wouldn't be having this exchange... :smile:
  • Heater. Posts: 21,230
    edited 2013-07-17 08:08
    Ron,

    Yeah, I know, I was just digging. I'm kind of allergic to MS-only products.

    I was wondering which was the easiest way to go: build a DOM tree, or streaming. As it is, I just stream the HTML through a parser and have it spit out events when it sees interesting-looking tags and such. That seems easy enough if one only wants a few items out of the page, but it looks like it might get very complex very rapidly if you want a lot of different items. You have to build a state machine that keeps track of where it is as it goes along.

    Actually I'm amazed to have gotten this far in only 150 lines of code.
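The streaming approach described above, a parser that emits events as tags go by, can be sketched as a toy tokenizer. This illustrates the technique only, not the parallax-scrape code; real-world edge cases (comments, scripts, quoted ">" inside attributes) are ignored here.

```javascript
// Toy event-driven HTML tokenizer: walk the characters, emit one
// event per tag and per text run. A state machine in the caller's
// handler then tracks "where it is" in the page, as described above.
function tokenize(html, onEvent) {
  let state = 'text', buf = '';
  for (const ch of html) {
    if (state === 'text') {
      if (ch === '<') {
        if (buf) onEvent({ type: 'text', data: buf });
        buf = '';
        state = 'tag';
      } else buf += ch;
    } else { // inside a tag
      if (ch === '>') {
        const closing = buf.startsWith('/');
        const name = (closing ? buf.slice(1) : buf).split(/\s/)[0].toLowerCase();
        onEvent({ type: closing ? 'close' : 'open', name });
        buf = '';
        state = 'text';
      } else buf += ch;
    }
  }
}
```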
  • Ron Czapala Posts: 2,418
    edited 2013-07-17 08:25
    Heater. wrote: »
    Ron,

    Yeah, I know, I was just digging. I'm kind of allergic to MS-only products.

    I was wondering which was the easiest way to go: build a DOM tree, or streaming. As it is, I just stream the HTML through a parser and have it spit out events when it sees interesting-looking tags and such. That seems easy enough if one only wants a few items out of the page, but it looks like it might get very complex very rapidly if you want a lot of different items. You have to build a state machine that keeps track of where it is as it goes along.

    Actually I'm amazed to have gotten this far in only 150 lines of code.

    I am only familiar with MS operating systems and development tools, and I'm really a fan of vbscript.
    The DOM of course is not specific to any OS, and I would think it provides the same capabilities regardless.

    I use the All.tags method to return a collection of DIVs, Anchors, etc. when looking for an element that has no specified ID or NAME, and then loop through the collection.

    Here is my vbscript - you may find it helpful

    Of course screen scraping can be a problem if the source web pages change and your code is looking for certain HTML tags or layout.
    option explicit
    '  const defURL =  "http://forums.parallax.com/showthread.php?124495-Fill-the-Big-Brain"
      const defURL =  "http://forums.parallax.com/showthread.php/149173-Forum-scraping"
      const defmaxpage =  35
      const defFname = "bigbrain"
    '  const defURL = "http://forums.parallax.com/showthread.php?135033-IR-Remote-controlled-VMUSIC2-MP3-Player"
    '  const defmaxpage =  1
      dim fname,answer, url,maxpage
      dim objIE, boolBrowserRunning, boolWorking,  pagedone
      dim fso, WshShell, mydocs, outfile,ts
      dim users, user,  i, idx, processed
      i = 0
      Set fso = CreateObject("Scripting.FileSystemObject")
      set WshShell = WScript.CreateObject("Wscript.Shell")
      answer = InputBox("Enter URL of Parallax forum thread", "Forum Thread", defURL)
      if answer = "" then
        wscript.quit
      end if
      url = answer
      answer = InputBox("Enter number of last page", "Forum Thread", defmaxpage)
      if answer = "" then
        wscript.quit
      end if
      maxpage = Cint(answer)
      answer = InputBox("Enter output HTML file name", "Forum Thread", defFname)
      if answer = "" then
        wscript.quit
      end if
      fname =  answer & ".htm"
      mydocs = WshShell.SpecialFolders("MyDocuments")
      outfile = fso.BuildPath(mydocs, fname)
      set ts = fso.CreateTextFile(outfile, True, True)  '  unicode    ' False)
      ts.writeline "<HTML><head><title>" & fname & "</title>"
    '  ts.writeline "<link REL='stylesheet' TYPE='text/css' HREF='forum.css'>"
      ts.writeline "<base href='http://forums.parallax.com/' />"
      ts.writeline  "<LINK rel=stylesheet type=text/css href='css.php?styleid=3&amp;langid=1&amp;d=1320441976&amp;td=ltr&amp;sheet=bbcode.css,editor.css,popupmenu.css,reset-fonts.css,vbulletin.css,vbulletin-chrome.css,vbulletin-formcontrols.css,'>"
      ts.writeline "<LINK rel=stylesheet type=text/css href='css.php?styleid=3&amp;langid=1&amp;d=1320441976&amp;td=ltr&amp;sheet=toolsmenu.css,postlist.css,showthread.css,postbit.css,options.css,attachment.css,poll.css,lightbox.css'>"
      ts.writeline "<LINK rel=stylesheet type=text/css href='css.php?styleid=3&amp;langid=1&amp;d=1320441976&amp;td=ltr&amp;sheet=additional.css'>"
      ts.writeline "<LINK rel=stylesheet href='http://forums.parallax.com/clientscript/ckeditor/skins/kama/editor.css?t=B37D54V'>"
      ts.writeline "</head><body>" 
    '  ts.writeline "<a href='" & URL & "', target='_blank'>Yahoo Comics</a>"
    '  ts.writeline "<center><button onclick='setStyles()'>Show/Hide previous days</button></center>"
      idx = 1
      processed = True
     
      Set objIE = WScript.CreateObject("InternetExplorer.Application","objIE_")
      'objIE.resizable = False
      objIE.MenuBar = False
      objIE.ToolBar = False
      objIE.StatusBar = True
      objIE.Silent = True
      objIE.Visible = True
      pagedone = False
      WSHShell.popup "Page " & idx , 1, "Processing",  vbInformation     
      objIE.navigate URL & "/page" & idx
      boolBrowserRunning = True 
      boolWorking = True
      Do While boolWorking  
         WScript.Sleep 500
          if pagedone = true then
            idx = idx + 1
            if idx > maxpage  then
              boolWorking = false
              exit do 
            end if
            ts.writeline "<hr>"
            pagedone = false
           WSHShell.popup "Page " & idx , 2, "Processing",  vbInformation     
           objIE.navigate URL & "/page" & idx
          end if  
       Loop
      Show_users  
      ts.write "</body></html>" 
      ts.close
      objIE.Visible =  True
      objIE.navigate outfile
    '  objIE.Quit 
    '  set objIE = nothing 
    '  WshShell.Run "iexplore.exe " & outfile, 1, True      'True = wait
      set fso = nothing
      set WshShell = nothing
      wscript.quit
    '- - - - - - - - - - IE Events - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
    Public Sub objIE_DocumentComplete(pDisp, url)
      'msgbox url 
      'wscript.echo url
      dim doc
      dim disp
      If (pDisp Is objIE) Then     'Page is done loading KB article 180366
      Else
        Exit Sub
      End If
      set doc = objIE.document
      ParseDoc doc
      pagedone = true
    end sub
    Public Sub objIE_OnQuit()
        boolBrowserRunning = False
        boolWorking = False
    End Sub 
    '- - - - - - - - - - - - - - - - - - - - - - - - - - - - 
    Public Sub ParseDoc(doc)
      dim  disp, olist, lists, listitem
      set olist = doc.getElementById("posts")
      if olist is nothing then
        msgbox "Posts OL tag missing", vbcritical, "DOM error"
        wscript.quit 
      end if
    '  ts.write olist.outerHTML
      set lists =  olist.All.tags("LI")
      For each listitem in lists
        Process_LI listitem    
      next
      pagedone = true
      if disp <> "" then      
         msgbox disp
      end if 
    end sub
    Public Sub Process_LI(item)
      dim divs,div,head, disp, spans, span, anchors, anchor
      dim  date_time, postnum
      set divs = item.All.tags("DIV")
      for each div in divs
        if div.className <> "" then
    '      disp = disp & div.className & " "
          select case div.className
             case "posthead"
                date_time = div.innerText
                set anchors=div.All.tags("A")
                for each anchor in anchors
                   if anchor.className="postcounter" then
                      postnum = replace(anchor.innerText, "#", "")
                   end if   
                next
    '            ts.write "<DIV>" & div.innerText & "</DIV>"
    '           disp = disp & div.innerText & " " 
             case "username_container" 
               set spans = div.All.tags("SPAN")
               for each span in spans
    '             if span.className = "parauser" then 
                   user = span.innerText 
    '               Process_user user
                   ts.write "<a name='" & postnum & "'></a>"
                   ts.write "<DIV style='FONT-SIZE: 16px; COLOR: #020FC0; FONT-FAMILY: verdana'>" & span.innerText & "&nbsp;"  & date_time & "</DIV>"
    '               disp = disp & span.innerText & " " 
    '             end if
               next 
            case "postbody"
               Process_Body div 
                ts.writeline "<HR>"
          end select 
    '      disp = date_time & " " & user & vbcrlf
        end if
      next 
      if disp <> "" then
        msgbox disp
      end if
    end sub
    Public Sub Process_Body(body)
      dim divs, div, H2s, H2, blocks,block
    '  msgbox body.outerHTML
    '  exit sub
      set H2s = body.All.tags("H2")
      for each H2 in H2s
        ts.write "<H2>" &  H2.innerText & "</H2>" 
      next   
      set divs = body.All.tags("DIV")
      for each div in divs
        if div.className <> "" then
          select case div.className
             case "content"
                set blocks = div.All.tags("BLOCKQUOTE")
                for each block in blocks
    '           msgbox div.outerHTML
    '            ts.write replace(div.outerHTML, chr(34), chr(39))
                   on error resume next
                   ts.write block.outerHTML
                   if err <> 0 then
                      msgbox block.outerHTML
                   end if
                   on error goto 0
                   Process_user user, len(block.outerHTML)
                next
          end select 
        end if
      next 
    end sub
    Public Sub Process_user(user, block_len)
      dim  k
      if isArray(users) then
      else
        ReDim users(2,0)         'add first row
        users(0, 0) = user
        users(1, 0) = 0
        users(2, 0) = 0  
      end if
      k = Match_user(user)
      if k = ""  then                                      'not in table
        k = ubound(users, 2) + 1
        redim preserve users(2, k)
        users(0, k) = user
        users(1, k) = 0
        users(2, k) = 0
      end if 
      users(1,k) = users(1,k) + 1
      users(2,k) = users(2,k) + block_len
    end sub
    Public Function Match_user(user)
       dim  k
       match_user = ""
       if IsArray(users) then
         for k=0 to ubound(users, 2)
           if user = users(0, k) then
             Match_user = k
             exit function
           end if
         next
      end if
    end function
    Public Sub Show_users
      dim k
      ts.writeline "<HR><HR>"
      ts.writeline "<TABLE width='100%' cellpadding='10' cellspacing='10' border='0' style='FONT-SIZE: 12px; COLOR: #020FC0; FONT-FAMILY: verdana'>"
      ts.writeline "<TR><TD>Posting Users</TD><TD># of Posts</TD><TD>Bytes used</TD></TR>"
      for k = 0 to ubound(users, 2)
        ts.writeline "<TR><TD>" & users(0, k) & "</TD><TD>" & users(1, k) & "</TD><TD>" & users(2, k)  & "</TD></TR>"
      next
      ts.writeline "</TABLE>"
    end sub
    
  • Heater. Posts: 21,230
    edited 2013-07-19 07:09
    I don't know if anyone has a use for this but my Parallax forum scraper is working quite nicely now.

    The text of posts is reformatted to 80 column width.
    Quote blocks within posts are indented for clarity.
    Code blocks within posts are output as is, no reformatting. Hopefully the Spin code snippets in posts are still in good shape.
    Links to attachments to posts are extracted and output.
    Output goes to standard output.
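The 80-column reflow in the list above might look something like this greedy word-wrap. This is a plausible sketch only; the real parallax-scrape implementation may differ.

```javascript
// Greedy word-wrap: pack words onto lines of at most `width`
// characters, collapsing runs of whitespace between words.
// Code blocks would bypass this step and pass through verbatim.
function wrapText(text, width = 80) {
  const words = text.split(/\s+/).filter(w => w.length > 0);
  const lines = [];
  let line = '';
  for (const word of words) {
    if (line === '') line = word;
    else if (line.length + 1 + word.length <= width) line += ' ' + word;
    else { lines.push(line); line = word; }
  }
  if (line) lines.push(line);
  return lines.join('\n');
}
```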

    Hopefully I will have time over the weekend to get the attachment downloading working.

    Why am I doing all this?

    Well. There is this rather long thread about a Z80 emulator here and I thought it would be nice to pull it down into a text file for easy reading or printing. I seem to remember there was a lot of useful info in there that should be in some documentation. Whilst I'm at it I can recover all the versions of code attached there.

    As usual these quick hacks take longer than you think...

    parallax-scrape can be downloaded as a ZIP file here: https://github.com/ZiCog/parallax-scrape
  • mindrobots Posts: 6,506
    edited 2013-07-19 07:28
    Heater,

    I looked at this the other day but didn't get a chance to run it. I like the idea very much, as there are a number of interesting threads that could be "journalized" with your tool: PropForth, Tachyon, Elev-8, propgcc discussions, P2 discussions, etc.

    I hope to get a chance to run it and play with it this weekend.

    Yay for Javascript and Node!!
  • Heater. Posts: 21,230
    edited 2013-07-19 07:51
    I was thinking about this. There are lots of great threads with tons of useful stuff buried away here.
    For example, if you extracted all of kuroneko's posts you would have enough for a great Propeller book!
    Being able to pull whole threads into an easy to handle plain text file might be of use in many cases.
    Yay for Javascript and Node!!

    Oh yeah!

    I'm wondering if it would be of use to anyone to make this into a tool in the "cloud". Just paste the URL of the first page of a forum into a form and get back the plain text. Easy to do in node. Just need to find somewhere to host it.
  • max72 Posts: 1,155
    edited 2013-07-19 09:37
    http://forums.parallax.com/archive/index.php/t-149173

    Edit:

    Automating the URL would be nice or maybe Parallax could offer a button to directly open the "archive"?

    Sorry for being late, but reading the thread I remembered having seen something similar.

    Massimo
  • Heater. Posts: 21,230
    edited 2013-07-19 12:38
    max72,

    Oh :smile:. You mean I might have been able to get the same information from the archive more easily than from the normal forum.

    Ah well, it's done now. Anyway what I wanted was plain text output so the archive does not help much.

    The only automation I am planning to add is to suck down all the pages of a thread into one big file, given the URL of the thread. That will be true even if I get it up as a service on the web.

    Edit: It's an odd thing, but I noticed the forum archive pages display code blocks with all the indentation removed, like the page you linked to. That's no good. But if you look at the HTML source of such a page you can see the same code blocks there with indentation preserved. Very odd.

    Edit: Looking at it again it really is true the archive screws up code snippets in posts. It's basically useless.
  • max72 Posts: 1,155
    edited 2013-07-19 13:13
    Strange indeed. Looking at the source, it looks like the indent is present...
    The other (possible) advantage is that you have the thread in fewer pages.

    Massimo
  • Whit Posts: 4,191
    edited 2013-07-19 13:42
    All coming clearer now Heater - I see now how useful this could be. Thanks for your work!
  • Heater. Posts: 21,230
    edited 2013-07-19 14:13
    max72,

    Yes, in the archive page source the indenting is still there. The problem is that the code snippets are contained in <div> or whatever tags instead of <pre>, which means that when the browser displays the code snippet it removes all the spaces! No good.

    Whit,

    Thanks, I'm not sure it's clear to me. I only started hacking it out with a vague idea in mind.

    Any (small) suggestions are welcome.

    Has anyone out there managed to run this? I need some test coverage.

    Windows and Mac users have it easy if they install node.js from here: http://nodejs.org/
  • mindrobots Posts: 6,506
    edited 2013-07-19 15:30
    Heater,

    I've run it on a couple threads. It works fine on my Mac and on the few threads I tried. Your installation instructions work great.

    I need to dig up some of the larger threads and run them through to capture the output.

    Maybe later tonight!
  • Heater. Posts: 21,230
    edited 2013-07-19 15:40
    mindrobots,

    Great. Hopefully I have some hours free this weekend to get it to stitch all the pages of a thread together.
    Oh, and the attachment download thing.
  • dgately Posts: 1,630
    edited 2013-07-20 07:22
    Heater. wrote: »
    Has anyone out there managed to run this? I need some test coverage.

    I ran this on Mac OS X 10.8.4 with nodejs installed...

    Does the scraper have a limit on the size of the scraped page or thread? When I scrape this long thread (it's one of yours), I get about half the first page...

    http://forums.parallax.com/showthread.php/110804-ZiCog-a-Zilog-Z80-emulator-in-1-Cog

    It stops at the end of post #20.

    dgately
  • Heater. Posts: 21,230
    edited 2013-07-20 08:22
    Strangely enough that is the page I have been testing on mostly.
    Post #20 by Ale is the last post on that page.
    Looks like it is working OK from here.
  • mindrobots Posts: 6,506
    edited 2013-07-20 08:32
    I didn't see a way for it to handle multi-page threads. I'm fumbling through a "brute force" way for it to do that. Every single page I've thrown at it has been completely processed so far.
  • Heater. Posts: 21,230
    edited 2013-07-20 08:40
    mindrobots,

    Support for fetching multi-page threads is coming. First I have to clean up the code a bit.

    I have been busy with household chores today, so not much has happened with it.
  • dgately Posts: 1,630
    edited 2013-07-20 13:58
    Heater. wrote: »
    Strangely enough that is the page I have been testing on mostly.
    Post #20 by Ale is the last post on that page.
    Looks like it is working OK from here.

    Must be a settings thing, but I get 40 posts per page in my browser. Is your script set to gather just 20 posts, somehow?

    This image of my browser shows how it displays beyond #20 going to post #21 and beyond.

    Post21.png


    My profile setting is for 40 posts per page. If you set yours beyond 20, does the script work?

    PostSetting.png



    dgately
  • Heater. Posts: 21,230
    edited 2013-07-20 15:33
    dgately,

    parallax-scrape does not log in to the forum, so it cannot change the number of posts per page; it can only get the default 20 posts.

    No matter, I'm going to have it fetch all the pages of a thread and concatenate the output into a single file.

    Hopefully I will have time to work on that again tomorrow.
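The planned fetch-all-pages step could start from a helper like this. It assumes the vBulletin "/pageN" URL scheme seen in Ron's script; in practice the last page number would come from the thread's pagination links rather than being passed in.

```javascript
// Build the list of page URLs for a thread, page 1 through lastPage,
// using the assumed "/pageN" scheme. The driver would then fetch each
// URL in order and concatenate the scraped text into one output file
// (the fetching itself is elided here).
function threadPageUrls(threadUrl, lastPage) {
  const urls = [];
  for (let n = 1; n <= lastPage; n++) {
    urls.push(n === 1 ? threadUrl : threadUrl + '/page' + n);
  }
  return urls;
}
```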
  • Cluso99 Posts: 18,069
    edited 2013-07-20 17:07
    Interesting work, Heater.
    BTW is it the latest ZiCog source you are after? I can send you that if you need it, but I don't have the BIOS/BDOS that you did, only the object - guess we could disassemble it.
  • Heater. Posts: 21,230
    edited 2013-07-21 00:14
    Cluso,

    Thanks. But don't post me any code for ZiCog. I have tons of it. That's part of my problem, and of the current effort to straighten it out.
    If I remember correctly the latest CP/M sources especially the BIOS are included in the CP/M disk images that I put out. Look for ZBOOT.MAC or some such file name.

    This forum scraper is of course part of that housekeeping effort. Another of those "half hour hacks" that has ended up taking all the little spare time I have.
    I want to get all the released versions of ZiCog into a git repository where they will be safe and sound, easily labeled and recoverable. That repo needs a change history and whatever documentation. Most of which is in the long and winding ZiCog thread. Hence parallax-scrape.

    Hopefully others may have a use for this as well.
  • Heater. Posts: 21,230
    edited 2013-07-21 00:53
    Just put up a new version of parallax-scrape to github. Same functionality but now all the high level tag parsing is table driven instead of using long winded case statements in code.
    This should make it easier to fix if Parallax changes the forum HTML layout.
    Might also eventually make it reconfigurable for other websites.
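What "table-driven" might mean here, as a sketch: the class names ('posthead', 'username_container', 'postbody') appear in the forum HTML per Ron's script, but the handler table itself is a guess at the approach, not the actual parallax-scrape tables.

```javascript
// A guess at the table-driven idea: map each interesting CSS class
// to a handler instead of a long case statement. Supporting another
// site layout then means adding table entries, not more code.
const handlers = {
  'posthead':           (text, post) => { post.date = text.trim(); },
  'username_container': (text, post) => { post.user = text.trim(); },
  'postbody':           (text, post) => { post.text = text; },
};

// Look up an element's class in the table and run its handler, if any.
function handleElement(className, text, post) {
  const handler = handlers[className];
  if (handler) handler(text, post);
  return post;
}
```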
  • Heater. Posts: 21,230
    edited 2013-07-21 02:47
    Yet another update of parallax-scrape to GitHub. This one fixes a serious bug: a certain style of quote block was being totally dropped.