mirror of
https://github.com/cmclark00/retro-imager.git
synced 2025-05-19 16:35:20 +01:00
248 lines
13 KiB
Text
248 lines
13 KiB
Text
|
LIBARCHIVE_INTERNALS(3) BSD Library Functions Manual LIBARCHIVE_INTERNALS(3)
|
||
|
|
||
|
NAME
|
||
|
libarchive_internals -- description of libarchive internal interfaces
|
||
|
|
||
|
OVERVIEW
|
||
|
The libarchive library provides a flexible interface for reading and
|
||
|
writing streaming archive files such as tar and cpio. Internally, it
|
||
|
follows a modular layered design that should make it easy to add new ar-
|
||
|
chive and compression formats.
|
||
|
|
||
|
GENERAL ARCHITECTURE
|
||
|
Externally, libarchive exposes most operations through an opaque, object-
|
||
|
style interface. The archive_entry(3) objects store information about a
|
||
|
single filesystem object. The rest of the library provides facilities to
|
||
|
write archive_entry(3) objects to archive files, read them from archive
|
||
|
files, and write them to disk. (There are plans to add a facility to
|
||
|
read archive_entry(3) objects from disk as well.)
|
||
|
|
||
|
The read and write APIs each have four layers: a public API layer, a for-
|
||
|
mat layer that understands the archive file format, a compression layer,
|
||
|
and an I/O layer. The I/O layer is completely exposed to clients who can
|
||
|
replace it entirely with their own functions.
|
||
|
|
||
|
In order to provide as much consistency as possible for clients, some
|
||
|
public functions are virtualized. Eventually, it should be possible for
|
||
|
clients to open an archive or disk writer, and then use a single set of
|
||
|
code to select and write entries, regardless of the target.
|
||
|
|
||
|
READ ARCHITECTURE
|
||
|
From the outside, clients use the archive_read(3) API to manipulate an
|
||
|
archive object to read entries and bodies from an archive stream. Inter-
|
||
|
nally, the archive object is cast to an archive_read object, which holds
|
||
|
all read-specific data. The API has four layers: The lowest layer is the
|
||
|
I/O layer. This layer can be overridden by clients, but most clients use
|
||
|
the packaged I/O callbacks provided, for example, by
|
||
|
archive_read_open_memory(3), and archive_read_open_fd(3). The compres-
|
||
|
sion layer calls the I/O layer to read bytes and decompresses them for
|
||
|
the format layer. The format layer unpacks a stream of uncompressed
|
||
|
bytes and creates archive_entry objects from the incoming data. The API
|
||
|
layer tracks overall state (for example, it prevents clients from reading
|
||
|
data before reading a header) and invokes the format and compression
|
||
|
layer operations through registered function pointers. In particular,
|
||
|
the API layer drives the format-detection process: When opening the ar-
|
||
|
chive, it reads an initial block of data and offers it to each registered
|
||
|
compression handler. The one with the highest bid is initialized with
|
||
|
the first block. Similarly, the format handlers are polled to see which
|
||
|
handler is the best for each archive. (Prior to 2.4.0, the format bid-
|
||
|
ders were invoked for each entry, but this design hindered error recov-
|
||
|
ery.)
|
||
|
|
||
|
I/O Layer and Client Callbacks
|
||
|
The read API goes to some lengths to be nice to clients. As a result,
|
||
|
there are few restrictions on the behavior of the client callbacks.
|
||
|
|
||
|
The client read callback is expected to provide a block of data on each
|
||
|
call. A zero-length return does indicate end of file, but otherwise
|
||
|
blocks may be as small as one byte or as large as the entire file. In
|
||
|
particular, blocks may be of different sizes.
|
||
|
|
||
|
The client skip callback returns the number of bytes actually skipped,
|
||
|
which may be much smaller than the skip requested. The only requirement
|
||
|
is that the skip not be larger. In particular, clients are allowed to
|
||
|
return zero for any skip that they don't want to handle. The skip call-
|
||
|
back must never be invoked with a negative value.
|
||
|
|
||
|
Keep in mind that not all clients are reading from disk: clients reading
|
||
|
from networks may provide different-sized blocks on every request and
|
||
|
cannot skip at all; advanced clients may use mmap(2) to read the entire
|
||
|
file into memory at once and return the entire file to libarchive as a
|
||
|
single block; other clients may begin asynchronous I/O operations for the
|
||
|
next block on each request.
|
||
|
|
||
|
Decompresssion Layer
|
||
|
The decompression layer not only handles decompression, it also buffers
|
||
|
data so that the format handlers see a much nicer I/O model. The decom-
|
||
|
pression API is a two stage peek/consume model. A read_ahead request
|
||
|
specifies a minimum read amount; the decompression layer must provide a
|
||
|
pointer to at least that much data. If more data is immediately avail-
|
||
|
able, it should return more: the format layer handles bulk data reads by
|
||
|
asking for a minimum of one byte and then copying as much data as is
|
||
|
available.
|
||
|
|
||
|
A subsequent call to the consume() function advances the read pointer.
|
||
|
Note that data returned from a read_ahead() call is guaranteed to remain
|
||
|
in place until the next call to read_ahead(). Intervening calls to
|
||
|
consume() should not cause the data to move.
|
||
|
|
||
|
Skip requests must always be handled exactly. Decompression handlers
|
||
|
that cannot seek forward should not register a skip handler; the API
|
||
|
layer fills in a generic skip handler that reads and discards data.
|
||
|
|
||
|
A decompression handler has a specific lifecycle:
|
||
|
Registration/Configuration
|
||
|
When the client invokes the public support function, the decom-
|
||
|
pression handler invokes the internal
|
||
|
__archive_read_register_compression() function to provide bid and
|
||
|
initialization functions. This function returns NULL on error or
|
||
|
else a pointer to a struct decompressor_t. This structure con-
|
||
|
tains a void * config slot that can be used for storing any cus-
|
||
|
tomization information.
|
||
|
Bid The bid function is invoked with a pointer and size of a block of
|
||
|
data. The decompressor can access its config data through the
|
||
|
decompressor element of the archive_read object. The bid func-
|
||
|
tion is otherwise stateless. In particular, it must not perform
|
||
|
any I/O operations.
|
||
|
|
||
|
The value returned by the bid function indicates its suitability
|
||
|
for handling this data stream. A bid of zero will ensure that
|
||
|
this decompressor is never invoked. Return zero if magic number
|
||
|
checks fail. Otherwise, your initial implementation should
|
||
|
return the number of bits actually checked. For example, if you
|
||
|
verify two full bytes and three bits of another byte, bid 19.
|
||
|
Note that the initial block may be very short; be careful to only
|
||
|
inspect the data you are given. (The current decompressors
|
||
|
require two bytes for correct bidding.)
|
||
|
Initialize
|
||
|
The winning bidder will have its init function called. This
|
||
|
function should initialize the remaining slots of the struct
|
||
|
decompressor_t object pointed to by the decompressor element of
|
||
|
the archive_read object. In particular, it should allocate any
|
||
|
working data it needs in the data slot of that structure. The
|
||
|
init function is called with the block of data that was used for
|
||
|
tasting. At this point, the decompressor is responsible for all
|
||
|
I/O requests to the client callbacks. The decompressor is free
|
||
|
to read more data as and when necessary.
|
||
|
Satisfy I/O requests
|
||
|
The format handler will invoke the read_ahead, consume, and skip
|
||
|
functions as needed.
|
||
|
Finish The finish method is called only once when the archive is closed.
|
||
|
It should release anything stored in the data and config slots of
|
||
|
the decompressor object. It should not invoke the client close
|
||
|
callback.
|
||
|
|
||
|
Format Layer
|
||
|
The read formats have a similar lifecycle to the decompression handlers:
|
||
|
Registration
|
||
|
Allocate your private data and initialize your pointers.
|
||
|
Bid Formats bid by invoking the read_ahead() decompression method but
|
||
|
not calling the consume() method. This allows each bidder to
|
||
|
look ahead in the input stream. Bidders should not look further
|
||
|
ahead than necessary, as long look aheads put pressure on the
|
||
|
decompression layer to buffer lots of data. Most formats only
|
||
|
require a few hundred bytes of look ahead; look aheads of a few
|
||
|
kilobytes are reasonable. (The ISO9660 reader sometimes looks
|
||
|
ahead by 48k, which should be considered an upper limit.)
|
||
|
Read header
|
||
|
The header read is usually the most complex part of any format.
|
||
|
There are a few strategies worth mentioning: For formats such as
|
||
|
tar or cpio, reading and parsing the header is straightforward
|
||
|
since headers alternate with data. For formats that store all
|
||
|
header data at the beginning of the file, the first header read
|
||
|
request may have to read all headers into memory and store that
|
||
|
data, sorted by the location of the file data. Subsequent header
|
||
|
read requests will skip forward to the beginning of the file data
|
||
|
and return the corresponding header.
|
||
|
Read Data
|
||
|
The read data interface supports sparse files; this requires that
|
||
|
each call return a block of data specifying the file offset and
|
||
|
size. This may require you to carefully track the location so
|
||
|
that you can return accurate file offsets for each read. Remem-
|
||
|
ber that the decompressor will return as much data as it has.
|
||
|
Generally, you will want to request one byte, examine the return
|
||
|
value to see how much data is available, and possibly trim that
|
||
|
to the amount you can use. You should invoke consume for each
|
||
|
block just before you return it.
|
||
|
Skip All Data
|
||
|
The skip data call should skip over all file data and trailing
|
||
|
padding. This is called automatically by the API layer just
|
||
|
before each header read. It is also called in response to the
|
||
|
client calling the public data_skip() function.
|
||
|
Cleanup
|
||
|
On cleanup, the format should release all of its allocated mem-
|
||
|
ory.
|
||
|
|
||
|
API Layer
|
||
|
XXX to do XXX
|
||
|
|
||
|
WRITE ARCHITECTURE
|
||
|
The write API has a similar set of four layers: an API layer, a format
|
||
|
layer, a compression layer, and an I/O layer. The registration here is
|
||
|
much simpler because only one format and one compression can be regis-
|
||
|
tered at a time.
|
||
|
|
||
|
I/O Layer and Client Callbacks
|
||
|
XXX To be written XXX
|
||
|
|
||
|
Compression Layer
|
||
|
XXX To be written XXX
|
||
|
|
||
|
Format Layer
|
||
|
XXX To be written XXX
|
||
|
|
||
|
API Layer
|
||
|
XXX To be written XXX
|
||
|
|
||
|
WRITE_DISK ARCHITECTURE
|
||
|
The write_disk API is intended to look just like the write API to
|
||
|
clients. Since it does not handle multiple formats or compression, it is
|
||
|
not layered internally.
|
||
|
|
||
|
GENERAL SERVICES
|
||
|
The archive_read, archive_write, and archive_write_disk objects all con-
|
||
|
tain an initial archive object which provides common support for a set of
|
||
|
standard services. (Recall that ANSI/ISO C90 guarantees that you can
|
||
|
cast freely between a pointer to a structure and a pointer to the first
|
||
|
element of that structure.) The archive object has a magic value that
|
||
|
indicates which API this object is associated with, slots for storing
|
||
|
error information, and function pointers for virtualized API functions.
|
||
|
|
||
|
MISCELLANEOUS NOTES
|
||
|
Connecting existing archiving libraries into libarchive is generally
|
||
|
quite difficult. In particular, many existing libraries strongly assume
|
||
|
that you are reading from a file; they seek forwards and backwards as
|
||
|
necessary to locate various pieces of information. In contrast,
|
||
|
libarchive never seeks backwards in its input, which sometimes requires
|
||
|
very different approaches.
|
||
|
|
||
|
For example, libarchive's ISO9660 support operates very differently from
|
||
|
most ISO9660 readers. The libarchive support utilizes a work-queue
|
||
|
design that keeps a list of known entries sorted by their location in the
|
||
|
input. Whenever libarchive's ISO9660 implementation is asked for the
|
||
|
next header, checks this list to find the next item on the disk. Direc-
|
||
|
tories are parsed when they are encountered and new items are added to
|
||
|
the list. This design relies heavily on the ISO9660 image being opti-
|
||
|
mized so that directories always occur earlier on the disk than the files
|
||
|
they describe.
|
||
|
|
||
|
Depending on the specific format, such approaches may not be possible.
|
||
|
The ZIP format specification, for example, allows archivers to store key
|
||
|
information only at the end of the file. In theory, it is possible to
|
||
|
create ZIP archives that cannot be read without seeking. Fortunately,
|
||
|
such archives are very rare, and libarchive can read most ZIP archives,
|
||
|
though it cannot always extract as much information as a dedicated ZIP
|
||
|
program.
|
||
|
|
||
|
SEE ALSO
|
||
|
archive_entry(3), archive_read(3), archive_write(3),
|
||
|
archive_write_disk(3), libarchive(3)
|
||
|
|
||
|
HISTORY
|
||
|
The libarchive library first appeared in FreeBSD 5.3.
|
||
|
|
||
|
AUTHORS
|
||
|
The libarchive library was written by Tim Kientzle <kientzle@acm.org>.
|
||
|
|
||
|
BSD January 26, 2011 BSD
|