.\" $NetBSD: bzip2.1,v 1.9 2010/05/14 16:43:34 joerg Exp $
.\"
.Dd May 14, 2010
.Dt BZIP2 1
.Os
.Sh NAME
.Nm bzip2 ,
.Nm bunzip2 ,
.Nm bzcat ,
.Nm bzip2recover
.Nd block-sorting file compressor
.Sh SYNOPSIS
.Nm bzip2
.Op Fl 123456789cdfkLqstVvz
.Op Ar filename ...
.Pp
.Nm bunzip2
.Op Fl fkLVvs
.Op Ar filename ...
.Pp
.Nm bzcat
.Op Fl s
.Op Ar filename ...
.Pp
.Nm bzip2recover
.Ar filename
.Sh DESCRIPTION
.Nm bzip2
compresses files using the Burrows-Wheeler block sorting
text compression algorithm, and Huffman coding.
Compression is generally considerably better than that achieved by
more conventional LZ77/LZ78-based compressors, and approaches the
performance of the PPM family of statistical compressors.
.Pp
.Nm bzcat
decompresses files to stdout, and
.Nm bzip2recover
recovers data from damaged bzip2 files.
.Pp
The command-line options are deliberately very similar to
those of
.Xr gzip 1 ,
but they are not identical.
.Pp
.Nm bzip2
expects a list of file names to accompany the command-line flags.
Each file is replaced by a compressed version of
itself, with the name
.Dq Pa original_name.bz2 .
Each compressed file has the same modification date, permissions, and,
when possible, ownership as the corresponding original, so that these
properties can be correctly restored at decompression time.
File name handling is naive in the sense that there is no mechanism
for preserving original file names, permissions, ownerships or dates
in filesystems which lack these concepts, or have serious file name
length restrictions, such as
.Tn MS-DOS .
.Nm bzip2
and
.Nm bunzip2
will by default not overwrite existing files.
If you want this to happen, specify the
.Fl f
flag.
.Pp
If no file names are specified,
.Nm bzip2
compresses from standard input to standard output.
In this case,
.Nm bzip2
will decline to write compressed output to a terminal, as this would
be entirely incomprehensible and therefore pointless.
.Pp
.Nm bunzip2
(or
.Nm bzip2 Fl d )
decompresses all specified files.
Files which were not created by
.Nm bzip2
will be detected and ignored, and a warning issued.
.Nm bzip2
attempts to guess the filename for the decompressed file
from that of the compressed file as follows:
.Bl -column "filename.tbz2" "becomes" -offset indent
.It Pa filename.bz2 Ta becomes Ta Pa filename
.It Pa filename.bz Ta becomes Ta Pa filename
.It Pa filename.tbz2 Ta becomes Ta Pa filename.tar
.It Pa filename.tbz Ta becomes Ta Pa filename.tar
.It Pa anyothername Ta becomes Ta Pa anyothername.out
.El
.Pp
If the file does not end in one of the recognised endings,
.Pa .bz2 ,
.Pa .bz ,
.Pa .tbz2 ,
or
.Pa .tbz ,
.Nm bzip2
complains that it cannot guess the name of the original file, and uses
the original name with
.Pa .out
appended.
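.Pp
As a minimal sketch of the naming rules above, assuming a throwaway file
with the hypothetical name
.Pa backup.tbz2
(its contents here are plain text rather than a real tar archive):
.Bd -literal -offset indent
echo data | bzip2 -c > backup.tbz2
bzip2 -d backup.tbz2    # replaces backup.tbz2 with backup.tar
.Ed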
.Pp
As with compression, supplying no filenames causes decompression from
standard input to standard output.
.Pp
.Nm bunzip2
will correctly decompress a file which is the concatenation of two or
more compressed files.
The result is the concatenation of the corresponding uncompressed
files.
Integrity testing
.Pq Fl t
of concatenated compressed files is also supported.
.Pp
You can also compress or decompress files to the standard output by
giving the
.Fl c
flag.
Multiple files may be compressed and decompressed like this.
The resulting outputs are fed sequentially to stdout.
Compression of multiple files in this manner generates a stream
containing multiple compressed file representations.
Such a stream can be decompressed correctly only by
.Nm bzip2
version 0.9.0 or later.
Earlier versions of
.Nm bzip2
will stop after decompressing
the first file in the stream.
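.Pp
The behaviour described above can be tried with two small files.
This is a sketch only: the names
.Pa a ,
.Pa b
and
.Pa ab.bz2
are hypothetical, and it assumes
.Nm bzip2
version 0.9.0 or later.
.Bd -literal -offset indent
echo alpha > a
echo beta > b
bzip2 -c a b > ab.bz2    # two compressed streams, back to back
bzip2 -t ab.bz2          # integrity test covers the whole file
bzip2 -dc ab.bz2 > ab    # ab holds "alpha" then "beta"
.Ed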
.Pp
.Nm bzcat
(or
.Nm bzip2 Fl dc )
decompresses all specified files to the standard output.
.Pp
Compression is always performed, even if the compressed file is
slightly larger than the original.
Files of less than about one hundred bytes tend to get larger, since
the compression mechanism has a constant overhead in the region of 50
bytes.
Random data (including the output of most file compressors) is coded
at about 8.05 bits per byte, giving an expansion of around 0.5%.
.Pp
As a self-check for your protection,
.Nm bzip2
uses 32-bit CRCs to make sure that the decompressed version of a file
is identical to the original.
This guards against corruption of the compressed data, and against
undetected bugs in
.Nm bzip2
(hopefully very unlikely).
The chance of data corruption going undetected is microscopic, about
one chance in four billion for each file processed.
Be aware, though, that the check occurs upon decompression, so it can
only tell you that something is wrong.
It can't help you recover the original uncompressed data.
You can use
.Nm bzip2recover
to try to recover data from
damaged files.
.Sh OPTIONS
.Bl -tag -width "XXrepetitiveXfastXX"
.It Fl Fl
Treats all subsequent arguments as file names, even if they start with
a dash.
This is so you can handle files with names beginning with a dash, for
example:
.Dl bzip2 -- -myfilename .
.It Fl 1 , Fl Fl fast
to
.It Fl 9 , Fl Fl best
Set the block size to 100 k, 200 k ... 900 k when compressing.
Has no effect when decompressing.
See
.Sx MEMORY MANAGEMENT
below.
The
.Fl Fl fast
and
.Fl Fl best
aliases are primarily for GNU
.Xr gzip 1
compatibility.
In particular,
.Fl Fl fast
doesn't make things significantly faster, and
.Fl Fl best
merely selects the default behaviour.
.It Fl c , Fl Fl stdout
Compress or decompress to standard output.
.It Fl d , Fl Fl decompress
Force decompression.
.Nm bzip2 ,
.Nm bunzip2 ,
and
.Nm bzcat
are really the same program, and the decision about what actions to
take is made on the basis of which name is used.
This flag overrides that mechanism, and forces
.Nm bzip2
to decompress.
.It Fl f , Fl Fl force
Force overwrite of output files.
Normally,
.Nm bzip2
will not overwrite existing output files.
Also forces
.Nm bzip2
to break hard links
to files, which it otherwise wouldn't do.
.Pp
.Nm bzip2
normally declines to decompress files which don't have the correct
magic header bytes.
If forced
.Pq Fl f ,
however, it will pass such files through unmodified.
This is how GNU
.Xr gzip 1
behaves.
.It Fl k , Fl Fl keep
Keep (don't delete) input files during compression
or decompression.
.It Fl L , Fl Fl license
Display the license terms and conditions.
.It Fl q , Fl Fl quiet
Suppress non-essential warning messages.
Messages pertaining to I/O errors and other critical events will not
be suppressed.
.It Fl Fl repetitive-fast
.It Fl Fl repetitive-best
These flags are redundant in versions 0.9.5 and above.
They provided some coarse control over the behaviour of the sorting
algorithm in earlier versions, which was sometimes useful.
0.9.5 and above have an improved algorithm which renders these flags
irrelevant.
.It Fl s , Fl Fl small
Reduce memory usage, for compression, decompression and testing.
Files are decompressed and tested using a modified algorithm which
only requires 2.5 bytes per block byte.
This means any file can be decompressed in 2300k of memory, albeit at
about half the normal speed.
During compression,
.Fl s
selects a block size of 200k, which limits memory use to around the
same figure, at the expense of your compression ratio.
In short, if your machine is low on memory (8 megabytes or less), use
.Fl s
for everything.
See
.Sx MEMORY MANAGEMENT
below.
.It Fl t , Fl Fl test
Check integrity of the specified file(s), but don't decompress them.
This really performs a trial decompression and throws away the result.
.It Fl V , Fl Fl version
Display the software version.
.It Fl v , Fl Fl verbose
Verbose mode: show the compression ratio for each file processed.
Further
.Fl v Ap s
increase the verbosity level, spewing out lots of information which is
primarily of interest for diagnostic purposes.
.It Fl z , Fl Fl compress
The complement to
.Fl d :
forces compression, regardless of the invocation name.
.El
.Ss MEMORY MANAGEMENT
.Nm bzip2
compresses large files in blocks.
The block size affects both the compression ratio achieved, and the
amount of memory needed for compression and decompression.
The flags
.Fl 1
through
.Fl 9
specify the block size to be 100,000 bytes through 900,000 bytes (the
default) respectively.
At decompression time, the block size used for compression is read
from the header of the compressed file, and
.Nm bunzip2
then allocates itself just enough memory to decompress the file.
Since block sizes are stored in compressed files, it follows that the
flags
.Fl 1
to
.Fl 9
are irrelevant to and so ignored during decompression.
.Pp
Compression and decompression requirements, in bytes, can be estimated
as:
.Bl -tag -width "Decompression:" -offset indent
.It Compression :
400k + ( 8 x block size )
.It Decompression :
100k + ( 4 x block size ), or 100k + ( 2.5 x block size ) with
.Fl s
.El
Larger block sizes give rapidly diminishing marginal returns.
Most of the compression comes from the first two or three hundred k of
block size, a fact worth bearing in mind when using
.Nm bzip2
on small machines.
It is also important to appreciate that the decompression memory
requirement is set at compression time by the choice of block size.
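.Pp
As a worked instance of the estimates above, the following shell arithmetic
reproduces the figures for
.Fl 9 .
The variable names are illustrative only, and all sizes are in units of k.
.Bd -literal -offset indent
level=9
block=$((level * 100))            # block size for -9
comp=$((400 + 8 * block))         # compression
decomp=$((100 + 4 * block))       # decompression
small=$((100 + 5 * block / 2))    # decompression with -s
echo "compress ${comp}k, decompress ${decomp}k, with -s ${small}k"
.Ed
.Pp
For
.Fl 9
this gives 7600k, 3700k and 2350k respectively.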
.Pp
For files compressed with the default 900k block size,
.Nm bunzip2
will require about 3700 kbytes to decompress.
To support decompression of any file on a 4 megabyte machine,
.Nm bunzip2
has an option to decompress using approximately half this amount of
memory, about 2300 kbytes.
Decompression speed is also halved, so you should use this option only
where necessary.
The relevant flag is
.Fl s .
.Pp
In general, try to use the largest block size memory constraints
allow, since that maximises the compression achieved.
Compression and decompression speed are virtually unaffected by block
size.
.Pp
Another significant point applies to files which fit in a single block
-- that means most files you'd encounter using a large block size.
The amount of real memory touched is proportional to the size of the
file, since the file is smaller than a block.
For example, compressing a file 20,000 bytes long with the flag
.Fl 9
will cause the compressor to allocate around 7600k of memory, but only
touch 400k + 20000 * 8 = 560 kbytes of it.
Similarly, the decompressor will allocate 3700k but only touch 100k +
20000 * 4 = 180 kbytes.
.Pp
Here is a table which summarises the maximum memory usage for different
block sizes.
Also recorded is the total compressed size for 14 files of the Calgary
Text Compression Corpus totalling 3,141,622 bytes.
This column gives some feel for how compression varies with block size.
These figures tend to understate the advantage of larger block sizes
for larger files, since the Corpus is dominated by smaller files.
.Bl -column "Flag" "Compression" "Decompression" "DecompressionXXs" "Corpus size"
.It Sy Flag Ta Sy Compression Ta Sy Decompression Ta Sy Decompression Fl s Ta Sy Corpus size
.It -1 Ta 1200k Ta 500k Ta 350k Ta 914704
.It -2 Ta 2000k Ta 900k Ta 600k Ta 877703
.It -3 Ta 2800k Ta 1300k Ta 850k Ta 860338
.It -4 Ta 3600k Ta 1700k Ta 1100k Ta 846899
.It -5 Ta 4400k Ta 2100k Ta 1350k Ta 845160
.It -6 Ta 5200k Ta 2500k Ta 1600k Ta 838626
.It -7 Ta 6100k Ta 2900k Ta 1850k Ta 834096
.It -8 Ta 6800k Ta 3300k Ta 2100k Ta 828642
.It -9 Ta 7600k Ta 3700k Ta 2350k Ta 828642
.El
.Ss RECOVERING DATA FROM DAMAGED FILES
.Nm bzip2
compresses files in blocks, usually 900 kbytes long.
Each block is handled independently.
If a media or transmission error causes a multi-block
.Pa .bz2
file to become damaged, it may be possible to recover data from the
undamaged blocks in the file.
.Pp
The compressed representation of each block is delimited by a 48-bit
pattern, which makes it possible to find the block boundaries with
reasonable certainty.
Each block also carries its own 32-bit CRC, so damaged blocks can be
distinguished from undamaged ones.
.Pp
.Nm bzip2recover
is a simple program whose purpose is to search for blocks in
.Pa .bz2
files, and write each block out into its own
.Pa .bz2
file.
You can then use
.Nm bzip2
.Fl t
to test the integrity of the resulting files, and decompress those
which are undamaged.
.Pp
.Nm bzip2recover
takes a single argument, the name of the damaged file, and writes a
number of files
.Dq Pa rec00001file.bz2 ,
.Dq Pa rec00002file.bz2 ,
etc., containing the extracted blocks.
The output filenames are designed so that the use of wildcards in
subsequent processing -- for example,
.Dl bzip2 -dc rec*file.bz2 \*[Gt] recovered_data
-- processes the files in the correct order.
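.Pp
Where some of the extracted blocks may themselves be damaged, a loop can
test each one and skip the failures.
The following is only a sketch: it builds an undamaged multi-block file,
.Pa sample.bz2 ,
as a stand-in for a damaged one, and it assumes that the
.Xr seq 1
utility is available.
.Bd -literal -offset indent
seq 1 100000 > sample
bzip2 -1 -k sample         # several 100k blocks; keeps sample
bzip2recover sample.bz2    # writes rec00001sample.bz2, ...
for f in rec*sample.bz2; do
    if bzip2 -t "$f"; then
        bzip2 -dc "$f"
    fi
done > recovered_data
.Ed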
.Pp
.Nm bzip2recover
should be of most use dealing with large
.Pa .bz2
files, as these will contain many blocks.
It is clearly futile to use it on damaged single-block files, since a
damaged block cannot be recovered.
If you wish to minimise any potential data loss through media or
transmission errors, you might consider compressing with a smaller
block size.
.Ss PERFORMANCE NOTES
The sorting phase of compression gathers together similar strings in
the file.
Because of this, files containing very long runs of repeated
symbols, like
.Dq aabaabaabaab...
(repeated several hundred times) may compress more slowly than normal.
Versions 0.9.5 and above fare much better than previous versions in
this respect.
The ratio between worst-case and average-case compression time is in
the region of 10:1.
For previous versions, this figure was more like 100:1.
You can use the
.Fl vvvv
option to monitor progress in great detail, if you want.
.Pp
Decompression speed is unaffected by these phenomena.
.Pp
.Nm bzip2
usually allocates several megabytes of memory to operate in, and then
charges all over it in a fairly random fashion.
This means that performance, both for compressing and decompressing,
is largely determined by the speed at which your machine can service
cache misses.
Because of this, small changes to the code to reduce the miss rate
have been observed to give disproportionately large performance
improvements.
I imagine
.Nm bzip2
will perform best on machines with very large caches.
.Sh ENVIRONMENT
.Nm bzip2
will read arguments from the environment variables
.Ev BZIP2
and
.Ev BZIP ,
in that order, and will process them before any arguments read from
the command line.
This gives a convenient way to supply default arguments.
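.Pp
For example, the following sketch makes
.Fl k
the default for one shell session; the file name
.Pa notes
is hypothetical:
.Bd -literal -offset indent
BZIP2=-k
export BZIP2
echo data > notes
bzip2 notes    # writes notes.bz2 and, because of -k, keeps notes
.Ed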
.Sh EXIT STATUS
0 for a normal exit, 1 for environmental problems (file not found,
invalid flags, I/O errors, etc.), 2 to indicate a corrupt compressed
file, 3 for an internal consistency error (e.g., bug) which caused
.Nm bzip2
to panic.
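.Pp
A script might branch on these values as follows; this is a sketch only,
and the file name
.Pa archive.bz2
is hypothetical:
.Bd -literal -offset indent
echo data | bzip2 -c > archive.bz2
bzip2 -t archive.bz2
status=$?
case $status in
0) echo "archive.bz2 is intact" ;;
1) echo "environmental problem" ;;
2) echo "archive.bz2 is corrupt" ;;
3) echo "internal consistency error" ;;
esac
.Ed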
.Sh AUTHORS
.An -nosplit
.An Julian Seward
.Aq jseward@bzip.org
.Pp
.Pa http://www.bzip.org
.Pp
The ideas embodied in
.Nm bzip2
are due to (at least) the following people:
.An Michael Burrows
and
.An David Wheeler
(for the block sorting transformation),
.An David Wheeler
(again, for the Huffman coder),
.An Peter Fenwick
(for the structured coding model in the original
.Nm bzip ,
and many refinements), and
.An Alistair Moffat ,
.An Radford Neal ,
and
.An Ian Witten
(for the arithmetic coder in the original
.Nm bzip ) .
I am much indebted for their help, support and advice.
See the manual in the source distribution for pointers to sources of
documentation.
Christian von Roques encouraged me to look for faster sorting
algorithms, so as to speed up compression.
Bela Lubkin encouraged me to improve the worst-case compression
performance.
Donna Robinson XMLised the documentation.
The bz* scripts are derived from those of GNU gzip.
Many people sent patches, helped with portability problems, lent
machines, gave advice and were generally helpful.
.Sh CAVEATS
I/O error messages are not as helpful as they could be.
.Nm bzip2
tries hard to detect I/O errors and exit cleanly, but the details of
what the problem is sometimes seem rather misleading.
.Pp
This manual page pertains to version 1.0.5 of
.Nm bzip2 .
Compressed data created by this version is entirely forwards and
backwards compatible with the previous public releases, versions
0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1, 1.0.2 and 1.0.3, but with the
following exception: 0.9.0 and above can correctly decompress multiple
concatenated compressed files.
0.1pl2 cannot do this; it will stop after decompressing just the first
file in the stream.
.Pp
.Nm bzip2recover
versions prior to 1.0.2 used 32-bit integers to represent bit
positions in compressed files, so they could not handle compressed
files more than 512 megabytes long.
Versions 1.0.2 and above use 64-bit ints on some platforms which
support them (GNU supported targets, and Windows).
To establish whether or not
.Nm bzip2recover
was built with such a limitation, run it without arguments.
In any event you can build yourself an unlimited version if you can
recompile it with MaybeUInt64 set to be an unsigned 64-bit integer.