
Forum scraping


Comments

  • MJB Posts: 1,235
    edited 2013-07-21 04:56
    I would love to have a button for this conversion on every forum page -
    one that opens a new window with your reformatted content,
    with links to attachments still working.
    MJB
  • MJB Posts: 1,235
    edited 2013-07-21 05:06
    If I go to the Thread Tools menu and choose Show Printable Version,
    there is a link/button to show 50 posts on one page.
    I tried taking this link and changing the 50 to some larger number, but unfortunately that does not work.
    If it worked, a single cut and paste into Notepad++, for example, would give me a wonderful
    text file with formatting and links preserved.

    So if somebody could remove this 50-post restriction, that is all I would need.
    To download the attachments I could then just use a standard website downloader,
    starting from the created page.
    MJB
  • Heater. Posts: 21,230
    edited 2013-07-21 07:11
    MJB,

    You know, I never thought of that printable version. Have I been wasting my time?
    Not quite yet. When I copy and paste the printable version into a text editor, the formatting is all wonky. Or at least not the way I would like it.
    Perhaps the printable version's HTML might be easier to scrape, though.
  • Heater. Posts: 21,230
    edited 2013-07-22 07:20
    New version up. parallax-scrape now extracts and displays the attachment URLs nicely.
    https://github.com/ZiCog/parallax-scrape
  • Heater. Posts: 21,230
    edited 2013-07-23 07:42
    Another version. Now downloads some or all of the pages of a thread. Instructions on github.
    https://github.com/ZiCog/parallax-scrape
  • mindrobots Posts: 6,506
    edited 2013-07-23 09:40
    Very nice!

    I just gulped down 41 pages of Tachyon posts plus references to all the attachments!!

    No more than a couple of minutes for over a MB of processed data - that's certainly acceptable.

    In playing around with it this weekend, trying to get it to do multiple pages, I was curious how you would approach it. The recursion seems to handle it well. I was wondering about sequencing problems, but putting the recursive calls inside the callback takes care of that. I had many, many fetches going on and chaos was reigning. I obviously still have trouble thinking in terms of Node and its asynchronous, non-blocking nature.

    Thanks for sticking with it, I know I'll use this a lot for harvesting knowledge from the forum.
  • Heater. Posts: 21,230
    edited 2013-07-23 10:17
    mindrobots,

    Ah ha. It looks like you are suffering a dose of node confusion. :) Coming from "normal" languages like C, Pascal, etc., JavaScript can be quite a handful. JS has immensely expressive features like first-class functions, inner functions, closures and prototypal inheritance, but they can be a bit of a bear to get your mind around sometimes. Then throw in the event loop and callbacks...

    It kind of looks "normal", like C or Pascal, but it behaves very differently. For example, variables have function scope, not block scope, which can really catch you out, as in the sketch below.
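
    A classic example of the function-scope gotcha (illustrative only, not from parallax-scrape):

        // With var there is ONE variable i shared by the whole function,
        // so every callback sees its final value:
        for (var i = 0; i < 3; i++) {
            setTimeout(function () {
                console.log(i);   // prints 3, 3, 3 - not 0, 1, 2
            }, 100);
        }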

    Now, note well: There is no recursion in that code!

    The fetchPages() function looks like it calls itself. But does it? No.

    What happens is:

    A) fetchPages() creates an anonymous function which it passes as a parameter to request(). That's a callback.
    B) fetchPages() will then return, as will the main() function that called it. No possibility for recursion there.
    C) Some time later, when request() has got the data back from the server, it calls the callback, which parses the data and calls fetchPages() again with an incremented page number.

    So fetchPages() is actually being called by the event loop every time data comes in. The stack is unwound each time. There is no recursion.

    And this all works because of the miracle of closures. Note that fetchPages() contains a local variable "pageNumber". Well, that variable does not disappear when fetchPages() returns, as it would in C. What happens is that pageNumber stays around because it is referenced by the callback function. Only when the callback function is garbage collected (functions are objects, after all) will any variables it references be deleted. The callback function in turn cannot be deleted until the event loop has dropped any references to it.

    Magic hey?
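
    In outline, the pattern looks something like this (a simplified sketch, not the actual parallax-scrape code - threadUrl, parsePage and lastPage are made up for illustration):

        var request = require('request');

        var threadUrl = 'http://forums.parallax.com/showthread.php/12345-Example';

        function parsePage(body) {
            console.log(body.length);   // stand-in for the real HTML parsing
        }

        function fetchPages(pageNumber, lastPage) {
            // The anonymous function is the callback; fetchPages()
            // returns as soon as it has handed it to request().
            request(threadUrl + '/page' + pageNumber, function (error, response, body) {
                // Runs later, called from the event loop when data arrives.
                // pageNumber is still alive here - the closure keeps it.
                parsePage(body);
                if (pageNumber < lastPage) {
                    fetchPages(pageNumber + 1, lastPage); // fresh call on an empty stack
                }
            });
            // The stack unwinds here; nothing is waiting, so no recursion.
        }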

    It is possible that you might want to make many requests in parallel rather than wait for each one to finish, in which case I guess the code might look more like whatever loop you had. But I thought it's probably polite not to hammer on the Parallax servers like that. And besides, you then have the problem of stitching the data back together in the right order, as you found.
  • Heater. Posts: 21,230
    edited 2013-07-23 10:23
    Is anyone in a hurry for a version that automatically downloads all attachments?
  • mindrobots Posts: 6,506
    edited 2013-07-23 10:38
    Heater. wrote: »
    mindrobots,

    Ah ha. It looks like you are suffering a dose of node confusion:)

    OUCH! My poor little brain!!

    I'm suffering from many programmatic maladies... may as well add node confusion to the list! :lol:

    Thank you for the explanation - it makes perfect sense, I just need to context-switch better.
  • dgately Posts: 1,633
    edited 2013-07-23 10:43
    Heater,

    Any way to allow thread URLs that contain characters such as '('?


    I ran a test against the thread named "TACHYON - A Fast and small Forth byte code VM (source code available) + Web pages", with a URL of http://forums.parallax.com/showthread.php/141061-TACHYON-A-Fast-and-small-Forth-byte-code-VM-(source-code-available)-Web-pages and bash tells me:

        -bash: syntax error near unexpected token `('

    I was not able to escape the '(' or ')' with the typical command-line escape character '\'.


    Not sure how many threads have special characters in their title, but there is no restriction in the forum editing process.


    dgately
  • mindrobots Posts: 6,506
    edited 2013-07-23 10:49
    Interesting, that's the same thread I used.

    I cut it out of Firefox and pasted it into my bash (OS X) command line and ended up with %28 for '(' and %29 for ')' - Firefox gave me a "web friendly" URL to work with.
  • Heater. Posts: 21,230
    edited 2013-07-23 10:55
    You should be able to put double quotes around the URL to get what you want. I should note that in the instructions.

    Edit: Sorry, that won't work. You can't have spaces and such in URLs. I would have to get it to fix those characters up.
  • dgately Posts: 1,633
    edited 2013-07-23 15:28
    mindrobots wrote: »
    I cut it out of Firefox and pasted it into my bash (OS X) command line and ended up with %28 for '(' and %29 for ')' - Firefox gave me a "web friendly" URL to work with.

    Yep! Chrome & Safari just copy "as-is", but Firefox does indeed "fix" the special characters! Good catch, Rick...


    dgately
  • Heater. Posts: 21,230
    edited 2013-07-23 22:43
    dgately,

    I just tried out your URL. Copied from Chrome's URL bar, it fails because my bash shell does not like the brackets.
    Wrapping the URL in double quotes fixes that.
    So does escaping the open and close parentheses with \. Not sure why that would have failed for you.

    Not much I can do about that.

    In general, though, I guess I should do some URI encoding there, because potentially the URL can contain spaces and other chars that need encoding.
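
    Something like this would cover it (a sketch - parallax-scrape does not do this yet):

        // encodeURI() percent-encodes characters such as spaces while
        // leaving URL structure (/, ?, etc.) intact. Note that '(' and ')'
        // are legal in URLs, so they pass through unchanged; they only
        // need quoting to get past the shell.
        var url = process.argv[2];
        var safeUrl = encodeURI(url);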
  • Heater. Posts: 21,230
    edited 2013-07-23 23:01
    mindrobots,

    Further to the node confusion I thought I should point something out.

    I said that the fetchPages() function was not recursive. This is true because the request() function calls the given callback after request() and fetchPages() have returned. That is to say, at some point in the future data arrives and causes an event, and the event loop calls something in the request module which eventually calls our callback. In other words, request is asynchronous.

    But what if request were synchronous? It would hang up until data arrived; we would not return from fetchPages() whilst waiting. The callback would call fetchPages() and, boom, the thing would be recursive.

    That would probably still work, but of course it would block any other events going on, so it would be no good in a server situation.
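
    To make that concrete, a hypothetical synchronous version (fetchSync is made up - the request module used here has no synchronous mode; threadUrl and parsePage as in the earlier sketch):

        // Each page is fetched before the function returns, so the next
        // call really does nest on the stack - true recursion:
        function fetchPagesSync(pageNumber, lastPage) {
            var body = fetchSync(threadUrl + '/page' + pageNumber); // blocks here
            parsePage(body);
            if (pageNumber < lastPage) {
                fetchPagesSync(pageNumber + 1, lastPage); // this frame is still live
            }
        }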
  • Heater. Posts: 21,230
    edited 2013-07-24 11:42
    New version: Now downloads all attachments discovered whilst fetching the pages.

    Go mad and bring the forum's server to its knees. Err, no... don't do that, I have to test it some more.
  • Heater. Posts: 21,230
    edited 2013-07-25 04:59
    New version: Cleaned up output formatting somewhat.
    https://github.com/ZiCog/parallax-scrape
  • Heater. Posts: 21,230
    edited 2013-07-25 14:01
    Sorry folks, parallax-scrape has a serious bug.

    It drops Mike Green's name from all his posts.

    Sorry, Mike. It's not just you - it's all moderators, as they have slightly different HTML around their user names in posts.

    It will probably be fixed tomorrow.
  • Heater. Posts: 21,230
    edited 2013-07-26 04:47
    OK, fixed the "Mike Green" bug. All posters' names are now extracted correctly.
    https://github.com/ZiCog/parallax-scrape

    Looks like I'm talking to myself here. Anyone else been testing this thing?
  • mindrobots Posts: 6,506
    edited 2013-07-26 04:54
    Yes, back to it. I hit a bump of real-world stuff that has been keeping me busy.
    Heater. wrote: »
    Looks like I'm talking to myself here.

    Look on the bright side, nobody's talking back to you and nobody's complaining - in my book, that's a good day! :smile:
  • Heater. Posts: 21,230
    edited 2013-07-26 17:50
    Urgent bug fixed. parallax-scrape was not downloading attachments as binary files and hence was corrupting them. Now fixed.
    https://github.com/ZiCog/parallax-scrape
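
    One way to fetch binary data cleanly with the request module (a sketch, not necessarily how parallax-scrape does it - attachmentUrl and the output filename are placeholders):

        var fs = require('fs');
        var request = require('request');

        var attachmentUrl = 'http://forums.parallax.com/attachment.php?attachmentid=12345';

        // encoding: null makes request return the body as a raw Buffer;
        // the default decodes it as a UTF-8 string, which mangles
        // binary attachments like zips and images.
        request({ url: attachmentUrl, encoding: null }, function (err, res, body) {
            if (err) throw err;
            fs.writeFile('attachment.zip', body, function (err) {
                if (err) throw err;
            });
        });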
  • dgately Posts: 1,633
    edited 2013-07-29 05:08
    Sorry, Heater...

    Been traveling. In fact, I'm now on a Danube river cruise. I'll try to run a test soon.


    dgately
  • Heater. Posts: 21,230
    edited 2013-07-29 05:49
    dgately,

    Have fun. I'm in Braunschweig this week.
  • Bob Lawrence (VE1RLL) Posts: 1,720
    edited 2013-07-29 19:29
    @Heater


    Interesting project. It's amazing to see so many projects based on Node.js. The only Node.js scraping that I tried was an example using request and cheerio: cheerio was used to generate a DOM tree and provide a subset of the jQuery function set to manipulate it. It worked very well. I will try your htmlparser2 approach to see how that works. :cool:

    Why text output? Why not PDF with pdfkit? (It works great.)
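
    For reference, the request + cheerio combination is only a few lines (a minimal sketch; the URL and selector are made up, not taken from the forum's actual markup):

        var request = require('request');
        var cheerio = require('cheerio');

        request('http://forums.parallax.com/showthread.php/12345', function (err, res, body) {
            if (err) throw err;
            var $ = cheerio.load(body);   // build a DOM tree from the HTML
            $('a').each(function () {     // jQuery-style selection
                console.log($(this).attr('href'));
            });
        });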
  • Heater. Posts: 21,230
    edited 2013-07-29 20:12
    Bob,

    Perhaps the DOM approach is easier. I tried working with the DOM tree produced by htmlparser2 but did not have much luck. There is something I don't understand going on there.

    I should really pull the HTML detection code out into a separate module so that it can be used, with different state tables, for other jobs.

    I wanted raw text as I envisaged reading through it and cutting and pasting juicy bits of information into ZiCog documentation that might end up in a GitHub repository or wiki page. We will see how that goes.

    In general I like text, as it's easy to grep through and so on.
  • Bob Lawrence (VE1RLL) Posts: 1,720
    edited 2013-07-29 20:16
    I just tried your scraper and it worked fine. The HTML parser is interesting.

    Using a different program I did scrape all of the ZiCog info into HTML pages (as they were on the web) with graphics, attachments, manuals, etc. (screenshots attached). For fun I'll take another shot at it and see if I can save them as text, but they still won't have the correct format for you. I have a Lazarus program that I've been working on that can extract text from HTML, but it was basically designed to extract URLs from HTML files.
    [Three screenshots attached, 1024 x 768]
  • Bob Lawrence (VE1RLL) Posts: 1,720
    edited 2013-07-29 20:17
    @Heater.

    re: DOM approach

    Have you seen this before? It was fun to use.

    http://blog.miguelgrinberg.com/post/easy-web-scraping-with-nodejs