MLUG: Re: [MLUG - DISCUSSION] Re: HTML problem with "<base href=" and "<a name="

On Sun, 2 Dec 2007, Michael wrote:

Wget isn't really a scraper. It's really just a tool for downloading, although the developers are working on a new version that promises to be more powerful. For example, Wget doesn't do anything with style sheets right now, but the new version plans to add support. Wikipedia is also a difficult app to scrape because it does funky things with URLs. :)

I just figured out a couple of things. If you read the wget man page, look for the section about the -k (--convert-links) option. The -k option does not work correctly, at least not in wget 1.10.2 and earlier. It fails when the -O option is used to name the output file. It also puts the file name into "name" links that point to other parts of the same file, which is a problem because the links fail if you rename the file. Finally, it does not deal with style sheets correctly (as you said), even though the wget man page says it will convert links to style sheets.
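One possible workaround for the "name" link problem is to post-process the saved page and strip the file name from links that point back into the same file, so they survive a rename. The following is only a rough sketch, not something wget does itself: the function name strip_self_references and the sample file name page.html are made up for illustration, and a regular expression like this will miss unusual markup (for example, unquoted attribute values).

```python
import re


def strip_self_references(html: str, filename: str) -> str:
    """Rewrite href="<filename>#anchor" as href="#anchor".

    Links to other files are left alone. Hypothetical helper,
    not part of wget.
    """
    pattern = re.compile(
        # Group 1: the attribute name and opening quote.
        r'(href\s*=\s*["\'])'
        # The saved file's own name, matched literally.
        + re.escape(filename)
        # Group 2: the fragment and closing quote.
        + r'(#[^"\']*["\'])',
        # Case-insensitive, so HREF= also matches (this applies
        # to the file name as well).
        re.IGNORECASE,
    )
    # Keep the attribute and the fragment, drop the file name.
    return pattern.sub(r"\1\2", html)


if __name__ == "__main__":
    sample = (
        '<a href="page.html#top">Top</a> '
        '<a href="other.html#x">Other</a>'
    )
    print(strip_self_references(sample, "page.html"))
```

With the sample above, only the first link is rewritten (to href="#top"); the link to other.html is untouched. After this pass the file can be renamed without breaking its internal links.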


So wget is a little buggy.

Mike

_______________________________________________
discussion mailing list
EMAIL:PROTECTED
http://mlug.missouri.edu/mailman/listinfo/discussion