MLUG: Re: [MLUG - DISCUSSION] Re: HTML problem with "<base href=" and "<a name="
The direct db download is probably your best bet. Kind of odd that Wikipedia doesn't offer some sort of XML interface to articles. Dunno, maybe they do and I just don't know about it.

I just figured out a couple of things.  In the wget man page, look for
the section on the -k (--convert-links) option.  That option does not work
correctly, at least not in wget 1.10.2 and earlier.  It fails when the
-O option is used (to name the output file), and it puts the file name
into "name" links that point to other parts of the same file -- this is a
problem because if you rename the file, the links will fail.  It also does
not handle the style sheets correctly (as you said), even though the wget
man page says that it will convert links to style sheets.

So wget is a little buggy.
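
One possible workaround for the rename problem, as a rough sketch only: strip the file name back out of the same-file links after wget has run, so that href="page.html#sec" becomes href="#sec". The function name strip_self_links is made up for illustration, and the sketch assumes the links use double quotes and that the file name contains no "|" or double-quote characters.

```shell
#!/bin/sh
# Hypothetical helper (not part of wget): rewrite same-file anchor links
# such as href="page.html#sec" to href="#sec" so they survive a rename.
# Reads HTML on stdin and writes the result to stdout.
# $1 is the file name that appears in the links.
strip_self_links() {
    # Escape regex metacharacters in the file name so that, for example,
    # the dot in "page.html" only matches a literal dot.
    name=$(printf '%s\n' "$1" | sed 's|[][\.*^$]|\\&|g')
    # Drop the file name from every href that points back into the same file.
    sed "s|href=\"${name}#|href=\"#|g"
}
```

For example, if wget -k saved the page as page.html, something like "strip_self_links page.html < page.html > renamed.html" should leave the in-page links working under the new name. Links to other files are left alone.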

I've attached a little script I wrote for scraping a site. It targets a different site, but I'd just tweak what the script does a little to make it work with other sites like Wikipedia.

Attachment: general-get.sh
Description: Bourne shell script
