On Mon, 3 Dec 2007, Michael wrote:
The direct db download is probably your best bet. Kind of odd that Wikipedia
doesn't offer some sort of XML interface to articles. Dunno - maybe they do
and I just don't know about it.
I just figured out a couple of things. If you read the wget man page,
look for the section on these options: -k, --convert-links. The -k
option does not work correctly, at least not in wget 1.10.2 and earlier.
It fails when the -O option is used (to name the output file). It also
puts the file's own name into the "name" (anchor) links that point to
other parts of the same file, which is a problem because those links
break if you rename the file. Finally, it does not handle style sheets
correctly (as you said), even though the man page says it will convert
links to style sheets.
So wget is a little buggy.
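
One possible workaround for the renamed-file problem is to strip the saved
file's own name out of the same-page links after wget has finished. This is
only a rough sketch, not something wget provides: the function name
fix_self_links and the file name page.html are made up for illustration, and
it assumes the links are written as href="NAME#anchor" with double quotes.

```shell
# Turn href="NAME#anchor" into href="#anchor" so same-page links
# keep working after the file is renamed.
# $1 is the file name that wget wrote into the links (hypothetical: page.html).
# Reads HTML on stdin and writes the fixed HTML to stdout.
fix_self_links() {
    # Escape characters that are special to sed (such as the "." in
    # "page.html") so the name is matched literally.
    escaped=$(printf '%s\n' "$1" | sed 's/[][\\.*^$/]/\\&/g')
    # Links to other files, e.g. href="other.html#x", do not match
    # and are passed through unchanged.
    sed "s/href=\"${escaped}#/href=\"#/g"
}
```

It could be used like this: fix_self_links page.html < page.html > renamed.html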
Attached a lil script I did for scraping a site. Different but I'd just
change what the script did a lil to make it work with other sites like
Wikipedia.
I can see that you know some good bash coding tricks that I don't know, so
I'm sure I'll read that script and study some of those techniques. I've
decided that bash scripting is one of the things that is worth my time to
learn in more depth. It's very handy to be good at that.
Mike
_______________________________________________
discussion mailing list
EMAIL:PROTECTED
http://mlug.missouri.edu/mailman/listinfo/discussion