Forum scraping
Heater.
Posts: 21,230
Does anyone out there have a script that can:
1) Fetch all the pages of a given forum thread.
2) For each post in the page extract the user name, date and text.
3) Save the above as a plain text file, with white space preserved.
4) Fetch any attachments on those posts.
?
I can't help thinking there might be a Perl monger out there who has already done this.
Edit: Here is a program to do the above task: https://github.com/ZiCog/parallax-scrape
Comments
After half an hour or so I have some JavaScript running under node.js that fetches a given page's URL and pulls out the text of each post.
Getting the rest of my specification right might take a lot longer.
God I hate HTML.
The Perl mongers let me down so I had a go in JavaScript. If you want to see how not to parse HTML in JavaScript I posted the code into a github repository: https://github.com/ZiCog/parallax-scrape
There are some outstanding items, like not having a hardwired page URL and downloading attachments, but it's a start.
Here is what it outputs for this thread:
Fetching: http://forums.parallax.com/showthread.php/149173-Forum-scraping?p=1195982#post1195982
Heater.
Does anyone out there have a script that can:
1) Fetch all the pages of a given forum thread.
2) For each post in the page extract the user name, date and text.
3) Save the above as a plain text file, with white space preserved.
4) Fetch any attachments on those posts.
?
I can't help thinking there might be a Perl monger out there who has already
this.
Duane Degn
I have the impression Ron Czapala could do this with Visual Basic.
This thread's result:
http://ronczap.home.insightbb.com/testpost.htm
Can it be modified to run under Linux or Mac?
I just managed to get parallax-scrape to extract the date and time of posts with an output like so:
Did I say I hate HTML?
Don't think so - vbscript is a MS development tool and my script creates an Internet Explorer object (using Object Linking and Embedding - think ActiveX) to access the Document Object Model and parse the HTML tags.
- Ron
Yeah, but without HTML, we wouldn't be having this exchange...
Yeah, I know I was just digging. I'm kind of allergic to MS only products.
I was wondering what was the easiest way to go, build a DOM tree or streaming. As it is I just stream the HTML through a parser and have it spit out events when it sees interesting looking tags and stuff. That seems to be easy enough if one only wants a few items out of the page but it looks like it might get very complex very rapidly if you want a lot of different items. You have to build a state machine that keeps track of where it is as it goes along.
Actually I'm amazed to have gotten this far in only 150 lines of code.
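The streaming approach described above can be sketched in a few lines of Node.js: feed the HTML through a character-by-character state machine that emits events for tags and text, instead of building a DOM tree. The function name and event shape here are illustrative, not taken from parallax-scrape.

```javascript
// Minimal streaming HTML tokenizer: walks the input one character at a
// time, switching between 'text' and 'tag' states, and emits an event
// whenever a complete run of text or a complete tag has been seen.
function streamParse(html, onEvent) {
  let state = 'text'; // currently reading text or the inside of a tag
  let buf = '';
  for (const ch of html) {
    if (state === 'text') {
      if (ch === '<') {
        if (buf) onEvent({ type: 'text', data: buf });
        buf = '';
        state = 'tag';
      } else {
        buf += ch;
      }
    } else { // inside a tag
      if (ch === '>') {
        onEvent({ type: 'tag', data: buf });
        buf = '';
        state = 'text';
      } else {
        buf += ch;
      }
    }
  }
  if (state === 'text' && buf) onEvent({ type: 'text', data: buf });
}

// Example: '<div class="post">Hello</div>' yields three events:
// tag 'div class="post"', text 'Hello', tag '/div'.
```

A consumer of these events then carries the state-machine burden mentioned above: it has to remember which interesting tags it is currently inside in order to know whether a text event is a user name, a date, or post body.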
I am only familiar with MS operating systems and development tools and really a fan of vbscript.
The DOM of course is not specific to any OS and I would think it provides the same capabilities regardless.
I use the All.tags method to return a collection of DIVs, Anchors, etc. when looking for an element that has no specified ID or NAME, and then loop through the collection.
Here is my vbscript - you may find it helpful
Of course screen scraping can be a problem if the source web pages change and your code is looking for certain HTML tags or layout.
The text of posts is reformatted to 80 column width.
Quote blocks within posts are indented for clarity.
Code blocks within posts are output as is, no reformatting. Hopefully the Spin code snippets in posts are still in good shape.
Links to attachments to posts are extracted and output.
Output goes to standard output.
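The 80-column reformatting in the first item above can be sketched as a simple greedy word-wrap (the function name is illustrative; code blocks would be passed through untouched rather than fed to this):

```javascript
// Greedy word-wrap: split the post text into words, then pack words onto
// lines as long as they fit within the given width.
function reflow(text, width = 80) {
  const words = text.split(/\s+/).filter(w => w.length > 0);
  const lines = [];
  let line = '';
  for (const w of words) {
    if (line === '') {
      line = w; // first word on the line always goes in, even if long
    } else if (line.length + 1 + w.length <= width) {
      line += ' ' + w;
    } else {
      lines.push(line);
      line = w;
    }
  }
  if (line) lines.push(line);
  return lines.join('\n');
}
```

Quote-block indenting would then just be this plus a fixed prefix on each output line.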
Hopefully I will have time over the weekend to get the attachment downloading working.
Why am I doing all this?
Well. There is this rather long thread about a Z80 emulator here and I thought it would be nice to pull it down into a text file for easy reading or printing. I seem to remember there was a lot of useful info in there that should be in some documentation. Whilst I'm at it I can recover all the versions of code attached there.
As usual these quick hacks take longer than you think...
parallax-scrape can be downloaded as a ZIP file here: https://github.com/ZiCog/parallax-scrape
I looked at this the other day but didn't get a chance to run it. I like the idea very much as there are a number of interesting threads that could be "journalized" with your tool: PropForth, Tachyon, Elev-8, propgcc discussions, P2 discussions, etc.
I hope to get a chance to run it and play with it this weekend.
Yay for Javascript and Node!!
For example If you extracted all of kuroneko's posts you would have enough for a great Propeller book!
Being able to pull whole threads into an easy to handle plain text file might be of use in many cases.
Oh yeah!
I'm wondering if it would be of use to anyone to make this into a tool in the "cloud". Just paste the URL of the first page of a forum into a form and get back the plain text. Easy to do in node. Just need to find somewhere to host it.
Edit:
Automating the URL would be nice or maybe Parallax could offer a button to directly open the "archive"?
Sorry for being late, but reading the thread I remembered having seen something similar.
Massimo
Oh. You mean I might have been able to get the same information from the archive more easily than from the normal forum.
Ah well, it's done now. Anyway what I wanted was plain text output so the archive does not help much.
The only automation I am planning to add is to suck down all pages of a forum into one big file given the URL of the thread. That will be true even if I get it up as a service on the web.
Edit: It's an odd thing but I noticed the forum archive pages display code blocks with all the indentation removed. Like the page you linked to. That's no good, But if you look at the HTML source of such a page you can see the same code blocks there with indentation preserved. Very odd.
Edit: Looking at it again it really is true the archive screws up code snippets in posts. It's basically useless.
The other (possible) advantage is you have the thread in less pages.
Massimo
Yes, in the archive page source the indenting is still there. Problem is that the code snippets are contained in <div> or whatever tags instead of <pre>, which means that when the browser displays the code snippet it removes all the spaces! No good.
Whit,
Thanks, I'm not sure it's clear to me. I only started hacking it out with a vague idea in mind.
Any (small) suggestions are welcome.
Has anyone out there managed to run this? I need some test coverage.
Windows and Mac users have it easy if they install node.js from here: http://nodejs.org/
I've run it on a couple threads. It works fine on my Mac and on the few threads I tried. Your installation instructions work great.
I need to dig up some of the larger threads and run them through to capture the output.
Maybe later tonight!
Great. Hopefully I have some hours free this weekend to get it to stitch all the pages of a thread together.
Oh, and the attachment download thing.
I ran this on Mac OS X 10.8.4 with nodejs installed...
Does the scraper have a limit on the size of the scraped page or thread? When I scrape this long thread (it's one of yours), I get about half the first page...
http://forums.parallax.com/showthread.php/110804-ZiCog-a-Zilog-Z80-emulator-in-1-Cog
It stops at the end of post #20.
dgately
Post #20 by Ale is the last post on that page.
Looks like it is working OK from here.
Support for fetching multi-page threads is coming. First I have to clean up the code a bit.
I have been busy with household chores today so not much happened with it.
Must be a settings thing, but I get 40 posts per page in my browser. Is your script set to gather just 20 posts, somehow?
This image of my browser shows how it displays beyond #20 going to post #21 and beyond.
My profile setting is for 40 posts per page. If you set yours beyond 20, does the script work?
dgately
parallax-scrape does not log in to the forum, so it cannot change the number of posts per page and only gets the default of 20 posts.
No matter, I'm going to have it fetch all the pages of a thread and concatenate the output into a single file.
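Fetching every page of a thread mostly comes down to generating the page URLs. A hedged sketch, assuming the forum follows the usual vBulletin convention where page 1 is the bare thread URL and later pages append /page2, /page3, and so on (the function name is illustrative):

```javascript
// Build the list of page URLs for a thread, given the thread URL and the
// number of pages. Assumes vBulletin-style '/pageN' suffixes; page 1 is
// the thread URL itself.
function pageUrls(threadUrl, pageCount) {
  const base = threadUrl.replace(/\/+$/, ''); // drop any trailing slash
  const urls = [base];
  for (let n = 2; n <= pageCount; n++) {
    urls.push(base + '/page' + n);
  }
  return urls;
}
```

The scraper would then fetch these in order and concatenate the plain-text output of each page into one file.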
Hopefully I will have time to work on that again tomorrow.
BTW is it the latest ZiCog source you are after? I can send you that if you need it, but I don't have the BIOS/BDOS that you did, only the object - guess we could disassemble it.
Thanks. But don't post me any code for ZiCog. I have tons of it. That's part of my problem and current effort to straighten it out.
If I remember correctly the latest CP/M sources especially the BIOS are included in the CP/M disk images that I put out. Look for ZBOOT.MAC or some such file name.
This forum scraper is of course part of that house keeping effort. Another of those "half hour hacks" that has ended up taking all the little spare time I have.
I want to get all the released versions of ZiCog into a git repository where they will be safe and sound, easily labeled and recoverable. That repo needs a change history and whatever documentation. Most of which is in the long and winding ZiCog thread. Hence parallax-scrape.
Hopefully others may have a use for this as well.
This should make it easier to fix if Parallax changes the forum HTML layout.
Might also eventually make it reconfigurable for other websites.