VBackup 1.0

created by Vincent Chu

Quick Explanation

Vbackup checks a set of files for changes since the last backup. If changes are found, the changed files are backed up (copied to a location), or listed in a listing file, or both. Vbackup will run on Microsoft Windows NT 4.0, 2000, XP and up.

Table of Contents

A brief history
Technical (slightly) explanation
What vbackup is not/does not come with
Command-line arguments
operations
Options
Tested on these platforms
Backup-specification file (backup-spec)
Rules within parentheses () of Rules 1 and 3
Rule ordering
Wildcards
Tokens
Hashing (digest)
Copying/backup files
Various Implementation details
Order of operations
Pseudo algorithm
Incremental file
File Format
Example of a backup-spec file
Contact/Warranty/Legalities

A brief history

I originally created vbackup to back up my code and data to CD. At the time, I wasn't able to find a program on the Win32 platform that:

  1. Allows me to choose a set of files to include and exclude in my backup that is dependent on my directory structure. Ex. I want to back up my entire c:\source-code directory, but exclude all *.obj files in it. However, I still want to back up *.obj files that are NOT in c:\source-code.
  2. Checks for differences in files based on content instead of file timestamp.
  3. Generates a listing file of all files matched and backed up.

Later on, I found another use for vbackup -- producing incremental updates of websites for release. My work involves improving and updating the company website with new features and content. Usually a feature upgrade involves hundreds of modified files, and picking out the individual files in their appropriate directories for release is tedious manual labor no one wants to do. Vbackup comes to the rescue here. I first tell vbackup to take a snapshot of the production website, then use vbackup with that snapshot on the development website to pick out all changed files. The resulting files are in their correct directories, and all modified files are found regardless of timestamps (our development software sometimes changes the timestamp of files even though logically these files haven't been modified). Two runs of vbackup and I have a ready release. Bonus!

Technical (slightly) explanation

vbackup reads a backup-specification file (backup-spec) and performs various operations, which include copying matched files, generating a list of matched files, and performing checksum comparisons on matched files. The principle behind vbackup is to find all files matching path specifications (path-spec), which can include wildcards, and then optionally compare the matched files' checksums with previously saved checksums of those files to see if they have changed. Any changed files are then acted upon. Unchanged files are not acted upon.

What vbackup is not/does not come with

Command-line arguments

operations

-c path copy matched files to the location specified by path. The resulting directory hierarchy is described in detail in the Copying/backup files section. This option is used most often for backing up files.
-i path-filespec (input) incremental file. If -io is not specified, this file is also used as the output incremental file.
-io path-filespec optional. Output the incremental file to the specified path. If -io is not specified but -i is, the path-filespec given to -i is used for output as well as input. If -io is specified but -i is not, an incremental file is created but none is read as input.
-l path-filespec generate a list of the files that matched the backup-spec and survived the incremental pass (if -i is specified)
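
The interaction between -i and -io can be summarized in a few lines of Python. This is only an illustrative sketch of the rules described above, not vbackup's own code; the function name resolve_incremental_paths is made up for this example.

```python
from typing import Optional, Tuple


def resolve_incremental_paths(
    i_path: Optional[str], io_path: Optional[str]
) -> Tuple[Optional[str], Optional[str]]:
    """Return (input_path, output_path) for the incremental file.

    i_path is the value given to -i (None if absent),
    io_path is the value given to -io (None if absent).
    """
    # Only -i ever supplies an incremental file to read.
    input_path = i_path
    # -io wins for output; otherwise fall back to the -i path.
    output_path = io_path if io_path is not None else i_path
    return input_path, output_path
```

With only -i given, the same file is read and rewritten. With only -io given, nothing is read and a fresh incremental file is written.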

Options

-D show debug output
-f include the incremental file in the backup location. Requires -i and -c. The incremental file is written to the root of the backup location, i.e. if the backup location is "c:\backup", then that's where the incremental file will be. The name of the file is taken from -i or -io. In addition, the incremental file is STILL written to the path specified by -i or -io.
-fl list the incremental file in the list-file. Requires -i and -l. If both -i and -io are specified, the path specified by -i is used.
-hsha use SHA1 hash (slow) for checksum calculation in the incremental stage. If -i is specified and the incremental file was saved using CRC32, then the checksum comparison will be done in CRC32, but the output incremental file will save checksums in SHA1. See the hashing section for more details.
-hcrc use CRC32 hash for checksum calculation in the incremental stage. If -i is specified and the incremental file was saved using SHA1, then the checksum comparison will be done in SHA1, but the output incremental file will save checksums in CRC32. See the hashing section for more details.
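
The rule for mixing hash methods can be sketched as follows. This is a hypothetical Python helper written for illustration only; the name choose_hash_methods and the "crc"/"sha" strings (borrowed from the incremental file format) are assumptions, not part of vbackup's interface.

```python
from typing import Optional, Tuple


def choose_hash_methods(requested: str, stored: Optional[str]) -> Tuple[str, str]:
    """Return (method used for comparison, method saved in the output file).

    requested: "sha" for -hsha or "crc" for -hcrc.
    stored:    method recorded in the input incremental file,
               or None when no input incremental file is used.
    """
    # Old checksums can only be compared using the method that produced them.
    compare_with = stored if stored is not None else requested
    # New checksums are always saved using the method requested on the command line.
    return compare_with, requested
```

When the two methods differ, each matched file ends up being hashed with both methods in this sketch: once to compare against the old checksum and once to record the new one.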

Tested on these platforms

Tests covered local drives, mapped network drives, and UNC paths. File names included regular alphanumerics and Japanese kanji.

Backup-specification file (backup-spec)

A backup-spec contains a list of rules. Rules are separated from one another by newlines. A rule consists of a command and a path.
A path may or may not contain wildcards depending on the command.
A path can refer to a local storage device (like c:\winnt\system32), to a mapped network drive (like x:\net\user\data, where x is a mapped network drive), or to a UNC path (like \\filesrv\users\data).

The backup-spec has 10 types of rules:

1. +path (...) include path in the backup but don't recurse into any subdir of path. No wildcards are allowed in path. path is interpreted as a directory name. Any number of rules can appear within the ().
2. +path include path in the backup. Wildcards are allowed anywhere in path. path is interpreted as a file specification (filespec), which means the last path component in path is treated as a filename (or as a filemask if wildcards are used).
3. r+path (...) exactly the same as rule 1. The "r" is redundant here since the recursive nature of the rule is always overridden by the detail rules within the ().
4. r+path include path in the backup, recursing into subdirectories (see the Wildcards section). Wildcards are allowed anywhere in path. path is interpreted as a filespec.
5. d-path exclude path from the backup if path is a directory (files matching path are retained). path can contain wildcards. All subdirs of path are removed as well. path is interpreted as a directory.
6. f+path deprecated. Use will result in syntax error.
7. f-path exclude path from the backup if path is a file (directories matching path are retained). path can contain wildcards. path is interpreted as a filespec.
8. -path exclude path from the backup. path can contain wildcards. path is interpreted both as a filespec and as a directory name.
9. r-path exclude path from the backup. path can contain wildcards. path is interpreted as a filespec. Once the directory portion of path is matched, it and all its subdirectories are scanned for the filespec, and anything that matches is removed from the backup. The scan for the filespec covers both files and directories. If a directory matches the filespec, it and all its subdirs are removed.
10. #comment a comment, ended by newline

Rules within parentheses () of Rules 1 and 3

Paths within parentheses MUST BE relative. Paths outside of parentheses must be absolute.
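
As an illustration of this requirement, the Python sketch below rejects an absolute nested path and otherwise joins it onto the enclosing directory. How vbackup combines the two internally is not documented here, so treat the joining step and the name resolve_nested as assumptions. Python's ntpath module is used so that Windows path rules apply on any platform.

```python
import ntpath  # Windows path semantics, available on every platform


def resolve_nested(parent_dir: str, nested_path: str) -> str:
    """Join a path found inside ( ) onto the absolute directory of its enclosing rule."""
    drive, _ = ntpath.splitdrive(nested_path)
    # A drive letter, a UNC share, or a leading backslash all make a path non-relative.
    if drive or nested_path.startswith("\\"):
        raise ValueError("path inside ( ) must be relative: " + nested_path)
    return ntpath.join(parent_dir, nested_path)
```

For example, resolve_nested("f:\\proj", "tbc\\tbc.i") gives f:\proj\tbc\tbc.i.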

Rule ordering

All addition rules (Rules 1,2,3,4) are applied before deletion rules (Rules 5,7,8,9). Among addition rules, there is no specific order in which they are applied. The same holds for deletion rules.
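
The effect of this ordering can be shown with plain sets. In the Python sketch below, each rule is assumed to have already been expanded into the set of paths it matches; that representation and the name build_backup_set are hypothetical and chosen only to illustrate that a deletion rule written first in the file still wins over a later addition rule.

```python
from typing import Iterable, List, Set, Tuple

# (kind, paths): kind is "+" for an addition rule, "-" for a deletion rule.
ExpandedRule = Tuple[str, Iterable[str]]


def build_backup_set(rules: List[ExpandedRule]) -> Set[str]:
    """Apply every addition rule first, then every deletion rule."""
    backup_set: Set[str] = set()
    for kind, paths in rules:
        if kind == "+":
            backup_set |= set(paths)
    for kind, paths in rules:
        if kind == "-":
            backup_set -= set(paths)
    return backup_set
```

Because set union and set difference do not depend on order within each phase, the order of rules within the additions (or within the deletions) does not affect the result.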

Wildcards

Whenever wildcards are allowed, they can appear anywhere within a path. So paths like c:\abc\g?n*\1299\doc*-2002 are allowed. When wildcards appear within the directory portion of the path, only directories are matched against those wildcards.

Meaning of recursive add in wildcards: given a path like c:\abc\gen*\1299\doc*-2002, all paths matching the spec are included in the add. This is the normal behavior and not what makes the match "recursive". The recursion comes after a path is matched: all subdirectories beneath the matched path are then searched, and any file matching the filespec is included as well. So in the example, if the path is a filespec, meaning that doc*-2002 is a spec matching filenames (and not directory names), then all files matching doc*-2002 in paths that match c:\abc\gen*\1299 are included. Then, all subdirectories beneath c:\abc\gen*\1299 are searched for files matching doc*-2002, and if found, those files are added as well.
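
The difference between a plain add and a recursive add can be sketched in Python against a small in-memory directory tree. This is an illustration under several assumptions, not vbackup's matcher: the tree is a nested dict (directories are dicts, files are None), the function name match_filespec is made up, and matching uses fnmatch.fnmatchcase, which is case-sensitive and also gives [ ] a special meaning that Windows wildcards do not.

```python
from fnmatch import fnmatchcase
from typing import List, Optional


def match_filespec(tree: dict, dir_masks: List[str], file_mask: str,
                   recursive: bool = False, prefix: str = "") -> List[str]:
    """Return files matching dir_masks (directory portion) plus file_mask."""
    matches: List[str] = []
    if dir_masks:
        head, rest = dir_masks[0], dir_masks[1:]
        for name, child in tree.items():
            # Wildcards in the directory portion are matched against directories only.
            if isinstance(child, dict) and fnmatchcase(name, head):
                matches += match_filespec(child, rest, file_mask,
                                          recursive, prefix + name + "\\")
        return matches
    for name, child in tree.items():
        if child is None:
            if fnmatchcase(name, file_mask):
                matches.append(prefix + name)
        elif recursive:
            # Recursive add: keep looking for file_mask in every subdirectory.
            matches += match_filespec(child, [], file_mask,
                                      True, prefix + name + "\\")
    return matches
```

Called with dir_masks ["c:", "abc", "gen*", "1299"] and file_mask "doc*-2002", the non-recursive form returns only files directly inside the matched 1299 directories, while recursive=True also returns matching files in their subdirectories.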

Tokens

If a path includes any of the characters ( ) + - or any tabs, newlines, spaces or double-quotes, enclose the path in double-quotes: "path". If the path itself includes double-quotes, escape each double-quote as \". For example, file-2002 Aug 19.txt should be specified as "file-2002 Aug 19.txt", and I-am-"special".txt should be specified as "I-am-\"special\".txt".
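
When generating a backup-spec from a script, the quoting rule above can be applied mechanically. The following Python helper is a small sketch of that rule; the name quote_path is hypothetical.

```python
# Characters that force a path to be written in double-quotes.
SPECIAL_CHARS = set('()+- \t\n"')


def quote_path(path: str) -> str:
    """Quote a path for use in a backup-spec if it contains special characters."""
    if not any(ch in SPECIAL_CHARS for ch in path):
        return path  # plain paths can be written as-is
    # Escape embedded double-quotes as \" and wrap the whole path in quotes.
    return '"' + path.replace('"', '\\"') + '"'
```

Paths without any of the listed characters, such as c:\stuff\chu.doc, are returned unchanged.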

Tokens can be: + - alphanum\alphanum... ( ) "str" #
Tokens are separated by these characters: + - ( ) whitespace
The lexical analyzer consumes comments and does not return them.
Whitespace (unquoted) is not returned and is used to separate tokens.
Double-quoted strings are collected, taking \" into account, and returned as a single token.
Internally, Token is a pair: token-type, lexeme
Token-type: +,-,path (alnum w/ \), string, (,)
Optimizations are done by flagging "*" and "?" to signify possible wildcards
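
A lexer along these lines can be sketched in Python. This is a reconstruction from the notes above and not the actual vbackup lexer: the token-type names "path" and "string", the function name tokenize, and the treatment of a leading r, d or f as an ordinary path token are all assumptions, and the wildcard flagging is left out.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (token-type, lexeme)
SEPARATORS = "+-()"


def tokenize(text: str) -> List[Token]:
    tokens: List[Token] = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():              # unquoted whitespace only separates tokens
            i += 1
        elif ch == "#":               # a comment runs to the end of the line
            while i < n and text[i] != "\n":
                i += 1
        elif ch in SEPARATORS:        # + - ( ) are tokens in their own right
            tokens.append((ch, ch))
            i += 1
        elif ch == '"':               # quoted string; \" stands for a literal quote
            i += 1
            chars = []
            while i < n and text[i] != '"':
                if text[i] == "\\" and i + 1 < n and text[i + 1] == '"':
                    chars.append('"')
                    i += 2
                else:
                    chars.append(text[i])
                    i += 1
            if i >= n:
                raise ValueError("unterminated quoted string")
            i += 1                    # skip the closing quote
            tokens.append(("string", "".join(chars)))
        else:                         # bare path, up to the next separator
            start = i
            while i < n and not text[i].isspace() and text[i] not in SEPARATORS:
                i += 1
            tokens.append(("path", text[start:i]))
    return tokens
```

For instance, the line r-*.exe yields a path token r, a - token, and a path token *.exe; turning that token sequence into a rule is left to the parser.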

Hashing (digest)

SHA-1 can be used to compute the digest of a file. It produces a 160-bit hash for each file, which provides 80 bits of security against collisions according to http://csrc.nist.gov/encryption/tkhash.html

CRC32 can be used to compute the digest of a file. It produces a 32-bit hash for each file.

The SHA-1 and CRC32 code comes from the Crypto++ 5.0 library. Many thanks to Wei Dai and other contributors for making this public domain.

Copying/backup files

Copying matched files to a location will result in the same directory hierarchy being created at the location. The top level, however, will follow the rules as illustrated in the example below.

Ex. location = c:\backup
All matched files are copied to subdirectories of c:\backup in this directory hierarchy:

Various Implementation details

As the backup-spec is read, its contents are split into 2 trees, one for "+" rules and one for "-" rules.
The paths are stored as-is in a string.
As each path is interpreted, every matched file is recorded in memory under a directory tree (the full filespec is stored as a tree structure).

Order of operations

The order in which vbackup performs its operations:

  1. add files specified in the backup-spec to backup-set
  2. remove files specified in the backup-spec from backup-set
  3. match and remove unchanged files (checksum in incremental file same as checksum of file) from backup-set, if -i is specified
  4. generate list file from backup-set, if -l is specified
  5. output incremental file as a merge of backup-set and previous incremental file, if -i is specified
  6. copy files in backup-set to backup location, if -c is specified

Pseudo algorithm

The algorithm below is not 100% accurate. It is here to demonstrate how vbackup functions.

match files according to backup-spec.  All matches are stored in memory sorted by name.
if incremental specified
	read all of incremental file
	go through each line in the incremental file
		if a line matches a file already matched
			check the hash of the line with the hash of the file currently
			if hashes match, the file hasn't changed since last backup, remove that match file
		end if
end if
if generate-list-of-files specified, then
	if -fl and -i specified, output incremental file name to list-file
	output all matched files to specified list-file
end if
if incremental specified
	create/truncate incremental file.  Name comes from -io, or if -io is not present, use name from -i
	if -f specified, then output incremental file name to incremental file
	output matched files and their checksums
	output files that weren't matched but were in the incremental file in the beginning
end if
if copy-file specified
	if -f specified, then copy the incremental file to the backup location (root)
	copy matched files
end if
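
The incremental part of the pseudo algorithm can also be expressed as a short Python function over dictionaries. Like the pseudo algorithm, it is a simplified sketch and not vbackup's implementation: it assumes both sets of checksums were produced with the same hash method, it represents a missing checksum as None, and the name incremental_pass is made up.

```python
from typing import Dict, Optional, Tuple

Checksums = Dict[str, Optional[str]]  # path -> checksum, None if it could not be computed


def incremental_pass(current: Checksums, previous: Checksums) -> Tuple[Checksums, Checksums]:
    """Return (changed files, contents of the new incremental file).

    current:  checksums of the files matched by the backup-spec in this run.
    previous: checksums read from the input incremental file.
    """
    changed: Checksums = {}
    for path, digest in current.items():
        old = previous.get(path)
        if digest is not None and old is not None and digest == old:
            continue  # unchanged since the last backup: drop from the backup-set
        changed[path] = digest
    # Entries that were not matched this run are carried over unchanged,
    # and matched files get their fresh checksums recorded.
    new_incremental = dict(previous)
    new_incremental.update(current)
    return changed, new_incremental
```

A file with no checksum on either side is always treated as changed, which matches the behavior described for files recorded without a checksum.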

Incremental file

This file records the files in the backup-set and their corresponding checksums (as of the time of backup). This allows subsequent runs of vbackup to see if certain files have changed since the backup recorded by the incremental file. It does this by comparing the checksum recorded in the incremental file with the checksum of the file in the backup-set. If they differ, then the file has changed. After the backup, the new checksums are recorded back into the incremental file. If the checksum of a file cannot be computed, then the file is recorded without a checksum. A file recorded without a checksum in the incremental file is always retained in the backup-set. Any file recorded in the incremental file that is not in the backup-set (i.e. the file doesn't exist or is currently unavailable) is kept in the newly generated incremental file, so that if the file becomes available again, its checksum at the time of backup is still available.

File Format

hash-method-used (can be "crc" or "sha")
filepath|[hash]
...

If the hash is not present, a failure occurred when computing the hash for the file during backup, and the file is always backed up.

A comment must occupy a line by itself, and the first character of a comment line must be '#'. Incremental files are generated every time vbackup is run with -i. Even though comments are kept across incremental files, their positions are not guaranteed to stay the same since lines can be added to the incremental file.

Example of an incremental file:

# incremental file 2002/08/23
crc
# I have a comment
c:\stuff\journal.xls|a5b403e5
\\srvdev\d_inetpub\nothing.txt|d78e2922
\\srvdev\Inetpub\vv.txt|a84f97c1
c:\stuff\pgp\gnupg-w32-1.0.6\gnupg-w32-1.0.6-2.zip|74e5529d
c:\stuff\pgp\w2k-6.5.8\PGPcmdln_6.5.8_Win32_FW.zip|edd9c013
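
A reader for this format takes only a few lines of Python. The sketch below follows the format description above; skipping blank lines and splitting on the last | in a line are assumptions, and the name parse_incremental is hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def parse_incremental(text: str) -> Tuple[str, Dict[str, Optional[str]], List[str]]:
    """Return (hash method, {filepath: hash or None}, comment lines)."""
    method: Optional[str] = None
    entries: Dict[str, Optional[str]] = {}
    comments: List[str] = []
    for line in text.splitlines():
        if not line.strip():
            continue                      # assumption: blank lines are ignored
        if line.startswith("#"):
            comments.append(line)         # comments occupy a whole line
            continue
        if method is None:                # first non-comment line names the hash
            method = line.strip()
            if method not in ("crc", "sha"):
                raise ValueError("unknown hash method: " + method)
            continue
        path, sep, digest = line.rpartition("|")
        if not sep:
            raise ValueError("missing '|' in line: " + line)
        entries[path] = digest.strip() or None   # empty hash: always back up
    if method is None:
        raise ValueError("no hash method line found")
    return method, entries, comments
```

An entry whose hash is None corresponds to a file that vbackup could not checksum and will therefore always back up.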

Example of a backup-spec file

+path(
	d+abc
	-f*
	+vin\ins(
		+gorgoyle +doc\*.* -doc\*.tmp)
	d+cd
)
+doc() # this is allowed ! - but this adds nothing to the backup set since w/o details, nothing is added.
Another example:
r-c:\stuff\plant*.txt
+c:\stuff(
	r+plant*.txt
	+chu.doc
)
r+c:\stuff\p?p\*.zip
+c:\stuff\v*
+\\srvdev\*pub\*.txt
Yet another example:
+f:\proj(
	r+*
	d-_gohome
	d-_gowork
	# VC++ files
	r-debug
	r-release
	r-*.exe
	r-*.ncb
	r-*.opt
	f-COMTest\*.tl?
	# Borland C++ files
	r-*.tds
	r-*.lib
	d-"pcre-3.4\pcreposix___Win32_Debug"
	d-"pcre-3.4\testdata"
	# Individual files
	-tbc\tbc.i
	-tbc\msado15.tl?
)
# Visual source safe
r+f:\vss6\*
+f:\bcbprog(
	# Borland C++ files
	r+*
	r-*.exe
	r-*.zip
	r-*.tds
	r-*.lib
)
# Doc
r+f:\doc\*

Contact/Warranty/Legalities

I would be interested to know if you have found vbackup useful (or not useful). Please drop me an email at vbackup@chuclan.com should you have comments, questions, bug reports, wish-list items etc. To help filter spam, please include the word "vbackup" in the Subject of your emails. Emails without the word "vbackup" in the subject will be deleted automatically. I will try to diagnose and fix reported bugs, and will consider adding frequently requested features to vbackup, though I can only do this in my spare time, so don't expect lightning-fast responses.

Vbackup is a copyrighted work:
Copyright (c) 2002-2004 by Vincent Chu. All rights reserved.

Permission to use, copy, and distribute this software and documentation for any purpose, including commercial purposes, is hereby granted without fee, subject to the following restrictions:
1. The copyright notice in this software and documentation must not be altered in any way.
2. This software and documentation must not be altered in any way.

This software and documentation are provided "as-is". I make no representations or warranties, express or implied, including but not limited to, warranties of merchantability or fitness for any particular purpose or that the use of the software or documentation will not infringe any third party patents, copyrights, trademarks or other rights. I am under no obligation to provide any services, by way of maintenance, update, or otherwise on this software and documentation.
