- To: "MLUG Off-Topic Discussion" <EMAIL:PROTECTED>
- Subject: Re: [MLUG - DISCUSSION] Re: HTML problem with "<base href=" and "<a name="
- From: Michael <EMAIL:PROTECTED>
- Date: Sun, 2 Dec 2007 23:14:13 -0700
- Delivery-date: Mon, 03 Dec 2007 00:14:24 -0600
Wget isn't really a scraper. It's really just a tool for downloading, although they are working on a new version that promises to be more powerful; e.g., Wget doesn't do anything with style sheets right now, but the new version plans to add support. Wikipedia is also a difficult app to scrape because it does funky things with URLs. :)
I'd probably create a shell script that uses wget to download the pages I wanted and then, once the download finished, pulls the list of downloaded pages from wget's log and runs a Perl script over those pages to fix the URLs and do anything else you want.
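A rough sketch of that download-then-fix idea, under some loud assumptions: the function names fix_urls and fetch_and_fix are made up for illustration, sed stands in for the Perl script, and the script simply loops over the files in the download directory rather than parsing wget's log (the log format varies between wget versions). It only rewrites root-relative href and src attributes, which is probably not everything Wikipedia pages need.

```shell
#!/bin/sh
# Sketch only: fix_urls and fetch_and_fix are hypothetical helper names.
BASE="http://en.wikipedia.org"

# Read HTML on stdin and write it to stdout, turning root-relative
# href="/..." and src="/..." attributes into absolute URLs under $BASE.
# Protocol-relative URLs (starting with "//") are left untouched.
fix_urls() {
    sed -e "s#href=\"/\([^/]\)#href=\"$BASE/\1#g" \
        -e "s#src=\"/\([^/]\)#src=\"$BASE/\1#g"
}

# $1 = text file with one URL per line, $2 = output directory.
# Downloads every URL with wget, then writes a fixed copy of each file.
fetch_and_fix() {
    url_list=$1
    out_dir=$2
    mkdir -p "$out_dir"
    # -i reads URLs from a file, -P sets the download directory,
    # -o sends wget's messages to a log file.
    wget -i "$url_list" -P "$out_dir" -o "$out_dir/wget.log"
    for page in "$out_dir"/*; do
        [ -f "$page" ] || continue
        case $page in
            *.log|*.fixed.html) continue ;;   # skip the log and earlier output
        esac
        fix_urls < "$page" > "$page.fixed.html"
    done
}
```

After sourcing the file, something like `fetch_and_fix urls.txt pages` would run both steps; urls.txt and pages are placeholder names.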
If you are trying this with only Wikipedia you might want to look to see if you can just download a copy of their database.
http://en.wikipedia.org/wiki/Wikipedia:Database_download
I download Wikipedia pages, usually for record albums, and keep the HTML
locally, but get the embedded images, CSS, etc., from Wikipedia. I
usually do this, where "$1" is the URL of the Wikipedia page:
lynx -source "$1" | perl -pe 's#<head>#<head>\n<base href="http://en.wikipedia.org">#' > file.html
I'm just grabbing the original source file and adding the base href.
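The same base-href step can be wrapped in a small filter so it is easy to try on a saved page without fetching anything. This is only a sketch: add_base is a hypothetical name, sed is used in place of Perl, and it assumes the page contains a literal lowercase `<head>` tag with no attributes.

```shell
#!/bin/sh
# Sketch only: add_base is a hypothetical helper name.
# $1 = base URL; HTML is read from stdin and written to stdout.
# Inserts a <base> tag right after the first literal "<head>" on each line,
# so relative URLs in the saved page resolve against the base URL.
add_base() {
    sed "s#<head>#<head><base href=\"$1\">#"
}
```

Usage would look like `lynx -source "$1" | add_base "http://en.wikipedia.org" > file.html`.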
What you are saying about wget is not true. Try it. There are various
wget options that are supposed to deal with the issues I'm trying to deal
with, but they do not work correctly with Wikipedia pages (e.g., the -k
option).