MLUG: Re: [MLUG] groking rsync
Re: [MLUG] groking rsync

I find that "rsync -a" (or "rsync -av" if I want to see what's being copied) is totally sufficient for all my needs.


Pottinger, Hardy J. wrote:
Here's the tutorial/cheat sheet I use for rsync. Originally found here:
http://lists.samba.org/pipermail/rsync/1999-December/001598.html but has
since disappeared. I found a copy here:
http://www.complexfission.com/computers/network/file_system/rsync.php a
few years back, but it's gone again. So, now it gets to live in the MLUG
archives.

If you don't want to read this whole thing, here's the magic command
line (with the C option added, which ignores CVS and SVN folders):

rsync -navzSCHe ssh src:srcdir/. dstdir/

--Hardy


I just wrote this note for in-house use here at work, to help kick-start
some projects where rsync will be useful. When I got done spewing it
occurred to me that other folks might find it helpful.

-Bennett


Rsync notes
by Bennett Todd

Rsync is a sexy and elegant replacement for rdist. Like modern releases
of rdist, it transports over rsh by default but can be easily told to
work over ssh, which is great for security. BTW, these days I recommend
OpenSSH <http://www.openssh.org/>, but as long as you don't build against
RSAREF the old ssh-1.2.27 is still useable (outside the US).

Rsync is invoked like rcp, e.g.:	

     rsync srcfile dsthost:dstfile	
     rsync srchost:srcfile dstfile

but it takes a load of options for doing Interesting Things. It has the
most effective directory hierarchy replication code I've seen anywhere,
including working support for preserving sparse files. In fact, I often
use it instead of "cp -r", "cpio -p" or various tar pipelines, for local
copying.

Rsync uses a really clever algorithm to avoid sending an entire file
when there are only minor changes; there's a research paper that
describes the algorithm included with the source, included in the rpm,
available from the web site, etc. Briefly, the dst side sends checksums
for blocks of its old copy of the file; the src side does a sweep over
the new file, finds any blocks the dst end already has, and sends only
the parts that are missing, plus pointers to the matching blocks, so the
dst end can assemble the whole file. One incidental consequence of this
strategy is that rsync assembles the new file under a tmp filename
(.filenameNOISE) and only when it's all written does it rename it into
place, which means the dest is safe and stable; from the point of view
of other programs, the dst file transitions instantly in a single atomic
operation from the old file to the new.
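
The write-to-a-temp-file-then-rename trick is easy to borrow for your
own scripts. Here is a minimal shell sketch of the idea, not rsync's
actual code: the function name atomic_replace is made up for
illustration, and it assumes a mktemp command is available (it is not in
POSIX, though most systems ship one).

```shell
#!/bin/sh
# atomic_replace DEST: read new contents from stdin, write them to a hidden
# temporary file next to DEST, then rename it over DEST in one step.
atomic_replace() {
    dest=$1
    dir=$(dirname "$dest")
    base=$(basename "$dest")
    # Keep the temp file in DEST's directory so the rename never has to
    # cross a filesystem boundary (a cross-filesystem mv is a copy, not atomic).
    # Note: mktemp creates the file mode 0600; chmod "$tmp" before the mv
    # if DEST needs different permissions.
    tmp=$(mktemp "$dir/.$base.XXXXXX") || return 1
    if cat >"$tmp"; then
        mv -f "$tmp" "$dest"    # readers see the old file or the new, never a mix
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Something like "generate_report | atomic_replace /var/www/report.html"
(both names hypothetical) would then never expose a half-written file.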

Now some details on how I use it. The first and most important options
to note and remember and master are "-v" for Verbose, and "-n" for Don't
Actually Do It. Like rdist, rsync is an edged tool, and you can easily
do astonishing amounts of damage with it. I've erased my entire home
directory more than once. ALWAYS ALWAYS ALWAYS use "-vn" to confirm what
an invocation will do before you unleash it, until you get the
invocation frozen and preserved in a script where it's immune from typos
and brainos and so on.
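
One way to make the "-vn first" habit hard to skip is a small wrapper
that always does the dry run and then asks. This is only a sketch: the
names careful_rsync and RSYNC are invented here, and the RSYNC variable
exists just so the rsync binary can be swapped out.

```shell
#!/bin/sh
# RSYNC can be overridden (e.g. for testing); it defaults to plain "rsync".
: "${RSYNC:=rsync}"

# careful_rsync ARGS...: dry run first, then ask before doing it for real.
careful_rsync() {
    # Pass 1: -n (dry run) plus -v shows what would be transferred or deleted.
    "$RSYNC" -vn "$@" || return
    printf 'Run it for real? [y/N] ' >&2
    read -r answer
    case $answer in
        y|Y) "$RSYNC" -v "$@" ;;                   # Pass 2: same arguments, no -n
        *)   echo 'Aborted, nothing changed.' ;;
    esac
}
```

Usage would look like "careful_rsync -azSH -e ssh src:srcdir/. dstdir/".
Anything other than "y" at the prompt leaves the destination untouched.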


So the first handy invocation of rsync to note is "-a", for "archive";
"-a" is a shorthand for a whole bunch of other options to recursively
walk subdirectories, and preserve dates and times and permissions and
(if run as root) owners and groups.

Another one I tend to use is "-z", for compress; it's better to use
rsync's compression than ssh's compression, since rsync can
intelligently compress knowing more about the internal structure of the
datastream. Two other options I habitually toss in are -S, to be
brilliant about sparse files, and -H, to preserve hard links.
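
For reference, the short flags discussed so far can be spelled out as
long options, which tends to read better in a script. A sketch, assuming
a reasonably current rsync whose man page lists these long names;
print_mirror_cmd is a made-up helper that only prints the command, it
does not run anything.

```shell
#!/bin/sh
# print_mirror_cmd SRC DST: print the long-option form of "rsync -nvazSHe ssh".
print_mirror_cmd() {
    # --dry-run    (-n)  show what would happen, change nothing
    # --verbose    (-v)  list each file as it is considered
    # --archive    (-a)  same as -rlptgoD: recurse; keep symlinks, permissions,
    #                    times, group, owner, device and special files
    # --compress   (-z)  compress data in transit
    # --sparse     (-S)  handle sparse files efficiently
    # --hard-links (-H)  preserve hard links
    # --rsh=ssh    (-e ssh)  use ssh as the remote shell
    printf 'rsync --dry-run --verbose --archive --compress --sparse --hard-links --rsh=ssh %s %s\n' "$1" "$2"
}

print_mirror_cmd src:srcdir/. dstdir/
```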

The most dangerous option is "--delete"; with it, files that aren't
present on the source end will be removed if found on the destination
end. This is sometimes needed, but it's the invocation that will erase
the universe if your aim is off.
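
Before trusting a "--delete" run, it can help to see just the list of
files that would be removed. A minimal sketch, relying on the fact that
rsync in verbose mode reports each removal as a line starting with
"deleting "; the names would_delete and RSYNC are invented for this
example.

```shell
#!/bin/sh
# RSYNC can be overridden (e.g. for testing); it defaults to plain "rsync".
: "${RSYNC:=rsync}"

# would_delete ARGS...: dry-run with --delete and print only the doomed paths.
would_delete() {
    # -n keeps this harmless; sed keeps only "deleting PATH" lines and
    # strips the prefix so the output is a bare list of paths.
    "$RSYNC" -nv --delete "$@" | sed -n 's/^deleting //p'
}
```

For instance, "would_delete -azSH -e ssh src:srcdir/. dstdir/" prints
nothing at all if the real run would not remove anything.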

All my rsync invocations include "-e ssh" to tell it to run over ssh. My
habit, acquired when rsync was younger and perhaps not necessary any
more (I'm not sure), is to do my directory mirroring with an invocation
like

rsync -nvazSHe ssh src:srcdir/. dstdir/

That tells it to replicate the "." subdir of the source into dstdir/,
and is the safest approach to use by habit, since it's the least likely
to do something unexpectedly awful if you need to --delete. Once you're
sure it's doing just and only what you want, pull the "n" out of the
options list.
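
How the source path ends matters a lot to rsync, and it is easy to try
out locally. The sketch below copies a scratch tree three ways inside a
throwaway temp directory; slash_demo and the dst1/dst2/dst3 names are
made up, and it assumes rsync and mktemp are installed on the local
machine.

```shell
#!/bin/sh
# slash_demo: copy a scratch tree three ways and print what lands where.
slash_demo() {
    work=$(mktemp -d) || return 1
    mkdir -p "$work/srcdir/sub" "$work/dst1" "$work/dst2" "$work/dst3"
    : >"$work/srcdir/a"
    : >"$work/srcdir/sub/b"
    (
        cd "$work" || exit 1
        rsync -a srcdir   dst1/    # no slash: copies the directory itself
        rsync -a srcdir/  dst2/    # trailing slash: copies the contents
        rsync -a srcdir/. dst3/    # the "/." form: copies the contents too
        find dst1 dst2 dst3 -type f | LC_ALL=C sort
    )
    rm -rf "$work"
}
```

Calling slash_demo shows dst1 holding srcdir/a and srcdir/sub/b, while
dst2 and dst3 hold a and sub/b directly.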

Rsync comes from <http://samba.anu.edu.au/rsync/>. It comes with Red
Hat Linux. You can also do a straight compile and make install on any
platform; it's nice and simple and portable. One recommendation: even if
you do a compile and install that goes into /usr/local/, I advise you to
make at least a symlink so that the rsync executable is available under
the path /usr/bin/rsync. This isn't strictly necessary, but if rsync
isn't visible under that specific path on the destination end, you
probably have to add a "--rsync-path=/usr/local/bin/rsync" option, which
is sorta gabby and clutters up the command line.


One final topic is how to arrange for rsync to run to keep a
more-or-less live mirror of a site. I see two basic choices. The
simplest one, and I find it works very very well indeed, is to make a
simple daemon. Start with

    #!/bin/sh
    while true; do
        rsync -vazSHe ssh src dst
        sleep 5
    done

It works just as well as a puller or a pusher. It's suitable for putting
in an rc script to start up at boot time. Since ssh is not prone to
hanging forever the way rsh is, this doesn't in my experience get
clogged up and freeze. I did a somewhat more elaborate version (code
available on request) for a distributed symmetrical replication; each
web server in a farm was writing new records into files in a directory
hierarchy, using a perl lib that made a hashed directory tree act like a
database table, with the record key as the filename. The replication code
looked something like

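    # Placeholder host list; fill in the other hosts in the farm.
    my @others = qw(web2 web3);
    while (1) {
        for my $other (@others) {
            system('rsync', '-vazuSHe', 'ssh', "$other:/path/to/db/.", '/path/to/db/');
        }
        sleep(1 + int(rand(10)));    # a random length of time (range is a guess)
    }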

The random was chosen to keep the latency down under the threshold of a
Cisco LocalDirector's session lock-in time, while spreading the pulls
around enough to make it unlikely that everyone would be screaming at
once. The "-u" option, for "update" (don't overwrite newer files), makes
this a nice way to merge directories that are getting distributed
updates. Do try to keep the clocks in some kind of reasonable sync if
you're trying this trick. A periodic rdate in cron is probably good
enough, if you don't want the complexity of trying to do NTP. If you do
end up doing NTP in a public webserver farm, I'd probably
packet-filter-screen it entirely within the farm, and use a cheap GPS
timebase as a level-0 ref within the farm. One less thing to worry about
security-wise.
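
If the pull loop is written in plain sh rather than perl, the random
sleep needs a little help, since POSIX sh has no built-in random number
source. A sketch using awk; random_delay is a made-up name and the 5 to
15 second range in the usage comment is an arbitrary example, not the
value used in the farm described above.

```shell
#!/bin/sh
# random_delay MIN MAX: print a whole number of seconds between MIN and MAX.
random_delay() {
    # srand() with no argument seeds from the clock and returns the previous
    # seed; mixing in the shell's PID helps keep loops that start in the
    # same second from all picking the same delay.
    awk -v min="$1" -v max="$2" -v pid="$$" 'BEGIN {
        srand(); srand(srand() + pid)
        print min + int(rand() * (max - min + 1))
    }'
}

# In the loop body, between rounds of pulls:
#     sleep "$(random_delay 5 15)"
```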

Back to the big question of how to make it run, the other choice would
be a cron-driven periodic launch. You'll want the cron interval to be
several times the typical rsync invocation run time, and since that run
time depends on the size of the area being mirrored and the amount that
changes, this will need some attention and tuning. If I wanted to do a
cron-driven periodic launch, I wouldn't bother with the usual lockfile
strategy. Instead, I'd just have the new one hunt down and kill any
preexisting one before it starts. Perhaps	

    #!/bin/sh
    exec >/dev/null 2>&1
    progname=`basename "$0"`
    pidfile=/var/tmp/$progname.pid
    oldpid=`cat "$pidfile" 2>/dev/null`
    # Kill any previous run: polite TERM first, then KILL.
    test -n "$oldpid" && kill "$oldpid" && kill -9 "$oldpid"
    rsync -vazSHe ssh src dst &
    echo $! >"$pidfile"

That's suitable for running out of cron. The theory here is that even if
the last run hasn't _finished_, as long as you've set the cron interval
well over the typical runtime, the previous run will have gotten well
past its scan-and-analyze phase, and will have made a bunch of progress
at updating the files; the new one will spend a little bit on analysis
then pick up where the last one left off. The place where this would
fall down horribly is if a single, gigantic file took longer to copy
than the cron interval. So don't do that.

The cron approach was substantially safer and more robust for
rsh-transported replicators like rdist in the bad old days, since they
tended to lock up forever, but I've never seen ssh hang indefinitely, and
can happily recommend a simpler looping script instead of the cron stuff
for doing live mirroring today.

-Bennett

_______________________________________________
members mailing list
EMAIL:PROTECTED
http://mlug.missouri.edu/mailman/listinfo/members



