Forum scraping
Heater.
Posts: 21,230
Does anyone out there have a script that can:
1) Fetch all the pages of a given forum thread.
2) For each post in the page extract the user name, date and text.
3) Save the above as a plain text file, with white space preserved.
4) Fetch any attachments on those posts.
?
I can't help thinking there might be a Perl monger out there who has already done this.
Edit: Here is a program to do the above task: https://github.com/ZiCog/parallax-scrape
Comments
After half an hour or so I have some JavaScript running under node.js that fetches a given page's URL and pulls out the text of each post.
Getting the rest of my specification right might take a lot longer.
God I hate HTML.
The Perl mongers let me down so I had a go in JavaScript. If you want to see how not to parse HTML in JavaScript I posted the code into a github repository: https://github.com/ZiCog/parallax-scrape
There are some outstanding items, like not having a hardwired page URL and downloading attachments, but it's a start.
Here is what it outputs for this thread:
Fetching: http://forums.parallax.com/showthread.php/149173-Forum-scraping?p=1195982#post1195982
Heater.
Does anyone out there have a script that can:
1) Fetch all the pages of a given forum thread.
2) For each post in the page extract the user name, date and text.
3) Save the above as a plain text file, with white space preserved.
4) Fetch any attachments on those posts.
?
I can't help thinking there might be a Perl monger out there who has already
this.
Duane Degn
I have the impression Ron Czapala could do this with Visual Basic.
This thread's result:
http://ronczap.home.insightbb.com/testpost.htm
Can it be modified to run under Linux or Mac?
I just managed to get parallax-scrape to extract the date and time of posts with an output like so:
Did I say I hate HTML?
Don't think so - vbscript is a MS development tool and my script creates an Internet Explorer object (using Object Linking and Embedding - think ActiveX) to access the Document Object Model and parse the HTML tags.
- Ron
Yeah, but without HTML, we wouldn't be having this exchange...
Yeah, I know I was just digging. I'm kind of allergic to MS only products.
I was wondering what was the easiest way to go, build a DOM tree or streaming. As it is I just stream the HTML through a parser and have it spit out events when it sees interesting looking tags and stuff. That seems to be easy enough if one only wants a few items out of the page but it looks like it might get very complex very rapidly if you want a lot of different items. You have to build a state machine that keeps track of where it is as it goes along.
Actually I'm amazed to have gotten this far in only 150 lines of code.
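The streaming approach described above can be sketched in a few lines of Node.js: feed the HTML through a character-by-character state machine that emits events for tags and text, instead of building a DOM tree. The function name and event shape here are illustrative, not taken from parallax-scrape.

```javascript
// Minimal streaming HTML tokenizer: walks the input one character at a
// time, switching between 'text' and 'tag' states, and emits an event
// whenever a complete run of text or a complete tag has been seen.
function streamParse(html, onEvent) {
  let state = 'text'; // currently reading text or the inside of a tag
  let buf = '';
  for (const ch of html) {
    if (state === 'text') {
      if (ch === '<') {
        if (buf) onEvent({ type: 'text', data: buf });
        buf = '';
        state = 'tag';
      } else {
        buf += ch;
      }
    } else { // inside a tag
      if (ch === '>') {
        onEvent({ type: 'tag', data: buf });
        buf = '';
        state = 'text';
      } else {
        buf += ch;
      }
    }
  }
  if (state === 'text' && buf) onEvent({ type: 'text', data: buf });
}

// Example: '<div class="post">Hello</div>' yields three events:
// tag 'div class="post"', text 'Hello', tag '/div'.
```

A consumer of these events then carries the state-machine burden mentioned above: it has to remember which interesting tags it is currently inside in order to know whether a text event is a user name, a date, or post body.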
I am only familiar with MS operating systems and development tools and really a fan of vbscript.
The DOM of course is not specific to any OS and I would think it provides the same capabilities regardless.
I use the All.tags method to return a collection of DIVs, Anchors, etc. when looking for an element that has no specified ID or NAME, and then loop through the collection.
Here is my vbscript - you may find it helpful
Of course screen scraping can be a problem if the source web pages change and your code is looking for certain HTML tags or layout.
The text of posts is reformatted to 80 column width.
Quote blocks within posts are indented for clarity.
Code blocks within posts are output as is, no reformatting. Hopefully the Spin code snippets in posts are still in good shape.
Links to attachments to posts are extracted and output.
Output goes to standard output.
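The 80-column reformatting in the first item above can be sketched as a simple greedy word-wrap (the function name is illustrative; code blocks would be passed through untouched rather than fed to this):

```javascript
// Greedy word-wrap: split the post text into words, then pack words onto
// lines as long as they fit within the given width.
function reflow(text, width = 80) {
  const words = text.split(/\s+/).filter(w => w.length > 0);
  const lines = [];
  let line = '';
  for (const w of words) {
    if (line === '') {
      line = w; // first word on the line always goes in, even if long
    } else if (line.length + 1 + w.length <= width) {
      line += ' ' + w;
    } else {
      lines.push(line);
      line = w;
    }
  }
  if (line) lines.push(line);
  return lines.join('\n');
}
```

Quote-block indenting would then just be this plus a fixed prefix on each output line.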
Hopefully I will have time over the weekend to get the attachment downloading working.
Why am I doing all this?
Well. There is this rather long thread about a Z80 emulator here and I thought it would be nice to pull it down into a text file for easy reading or printing. I seem to remember there was a lot of useful info in there that should be in some documentation. Whilst I'm at it I can recover all the versions of code attached there.
As usual these quick hacks take longer than you think...
parallax-scrape can be downloaded as a ZIP file here: https://github.com/ZiCog/parallax-scrape
I looked at this the other day but didn't get a chance to run it. I like the idea very much as there are a number of interesting threads that could be "journalized" with your tool: PropForth, Tachyon, Elev-8, propgcc discussions, P2 discussions, etc.
I hope to get a chance to run it and play with it this weekend.
Yay for Javascript and Node!!
For example If you extracted all of kuroneko's posts you would have enough for a great Propeller book!
Being able to pull whole threads into an easy to handle plain text file might be of use in many cases.
Oh yeah!
I'm wondering if it would be of use to anyone to make this into a tool in the "cloud". Just paste the URL of the first page of a forum into a form and get back the plain text. Easy to do in node. Just need to find somewhere to host it.
Edit:
Automating the URL would be nice or maybe Parallax could offer a button to directly open the "archive"?
Sorry for being late, but reading the thread I remembered having seen something similar.
Massimo
Oh. You mean I might have been able to get the same information from the archive more easily than from the normal forum.
Ah well, it's done now. Anyway what I wanted was plain text output so the archive does not help much.
The only automation I am planning to add is to suck down all pages of a forum into one big file given the URL of the thread. That will be true even if I get it up as a service on the web.
Edit: It's an odd thing but I noticed the forum archive pages display code blocks with all the indentation removed. Like the page you linked to. That's no good, But if you look at the HTML source of such a page you can see the same code blocks there with indentation preserved. Very odd.
Edit: Looking at it again it really is true the archive screws up code snippets in posts. It's basically useless.
The other (possible) advantage is you have the thread in less pages.
Massimo
Yes, in the archive page source the indenting is still there. Problem is that the code snippets are contained in <div> or whatever tags instead of <pre>, which means that when the browser displays the code snippet it removes all the spaces! No good.
Whit,
Thanks, I'm not sure it's clear to me. I only started hacking it out with a vague idea in mind.
Any (small) suggestions are welcome.
Has anyone out there managed to run this? I need some test coverage.
Windows and Mac users have it easy if they install node.js from here: http://nodejs.org/
I've run it on a couple threads. It works fine on my Mac and on the few threads I tried. Your installation instructions work great.
I need to dig up some of the larger threads and run them through to capture the output.
Maybe later tonight!
Great. Hopefully I have some hours free this weekend to get it to stitch all the pages of a thread together.
Oh, and the attachment download thing.
I ran this on Mac OS X 10.8.4 with nodejs installed...
Does the scraper have a limit on the size of the scraped page or thread? When I scrape this long thread (it's one of yours), I get about half the first page...
http://forums.parallax.com/showthread.php/110804-ZiCog-a-Zilog-Z80-emulator-in-1-Cog
It stops at the end of post #20.
dgately
Post #20 by Ale is the last post on that page.
Looks like it is working OK from here.
Support for fetching multi-page threads is coming. First I have to clean up the code a bit.
I have been busy with household chores today so not much happened with it.
Must be a settings thing, but I get 40 posts per page in my browser. Is your script set to gather just 20 posts, somehow?
This image of my browser shows how it displays beyond #20 going to post #21 and beyond.
My profile setting is for 40 posts per page. If you set yours beyond 20, does the script work?
dgately
parallax-scrape does not log in to the forum, so it cannot change the number of posts per page and only gets the default of 20 posts.
No matter, I'm going to have it fetch all the pages of a thread and concatenate the output into a single file.
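Fetching every page of a thread mostly comes down to generating the page URLs. A hedged sketch, assuming the forum follows the usual vBulletin convention where page 1 is the bare thread URL and later pages append /page2, /page3, and so on (the function name is illustrative):

```javascript
// Build the list of page URLs for a thread, given the thread URL and the
// number of pages. Assumes vBulletin-style '/pageN' suffixes; page 1 is
// the thread URL itself.
function pageUrls(threadUrl, pageCount) {
  const base = threadUrl.replace(/\/+$/, ''); // drop any trailing slash
  const urls = [base];
  for (let n = 2; n <= pageCount; n++) {
    urls.push(base + '/page' + n);
  }
  return urls;
}
```

The scraper would then fetch these in order and concatenate the plain-text output of each page into one file.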
Hopefully I will have time to work on that again tomorrow.
BTW is it the latest ZiCog source you are after? I can send you that if you need it, but I don't have the BIOS/BDOS that you did, only the object - guess we could disassemble it.
Thanks. But don't post me any code for ZiCog. I have tons of it. That's part of my problem and current effort to straighten it out.
If I remember correctly the latest CP/M sources especially the BIOS are included in the CP/M disk images that I put out. Look for ZBOOT.MAC or some such file name.
This forum scraper is of course part of that house keeping effort. Another of those "half hour hacks" that has ended up taking all the little spare time I have.
I want to get all the released versions of ZiCog into a git repository where they will be safe and sound, easily labeled and recoverable. That repo needs a change history and whatever documentation. Most of which is in the long and winding ZiCog thread. Hence parallax-scrape.
Hopefully others may have a use for this as well.
This should make it easier to fix if Parallax changes the forum HTML layout.
Might also eventually make it reconfigurable for other websites.