Sparse files

Facts - Problems - Conclusions - Sources

Facts

File contents need not be represented physically one-to-one on disk. Examples are files on compressed filesystems and sparse files on Unix systems.

Most filesystems organize their files as a (not necessarily contiguous) sequence of disk blocks. On Unix filesystems, the indices of these blocks are kept in so-called inodes (and so-called indirect blocks). If one of these blocks consists entirely of "default" bytes (i.e. null bytes), the block need not be allocated on disk (its index in the inode can be set to null). Files with such unallocated blocks are called sparse files.

Typically, it's up to the applications (not the operating system) to produce and keep sparse files. At the Unix system call API level, this is achieved by skipping over null regions with lseek(2) instead of writing them out with write(2). Application implementations are free to use this feature (the GNU implementations of Unix commands are an example).
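
The lseek-instead-of-write technique can be sketched as follows. This is a minimal illustration in Python, whose os module wraps the same system calls; the function name make_sparse_file and the file name sparse.dat are made up for this example. Whether the skipped region really stays unallocated depends on the filesystem.

```python
import os
import tempfile

def make_sparse_file(path, size):
    """Create a file of `size` bytes in which only the last byte is written.

    Hypothetical helper for illustration, not taken from any existing tool.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        if size > 0:
            # Skip over the leading region instead of writing null bytes to it.
            os.lseek(fd, size - 1, os.SEEK_SET)
            # lseek alone does not extend the file; one real write sets the size.
            os.write(fd, b"\0")
    finally:
        os.close(fd)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sparse.dat")
        make_sparse_file(path, 10 * 1024 * 1024)
        st = os.stat(path)
        # st_blocks is commonly counted in 512-byte units on Unix systems.
        print("size:", st.st_size, "allocated:", st.st_blocks * 512)
```

On a filesystem that supports sparse files, the allocated figure printed by the demo should be far smaller than the size.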

Examples of applications that more or less rely on the sparse file feature are those that use the dbm database format library, and database applications that lseek to large offsets obtained from large hash values.

A sparse file can be recognized by comparing its size (ls -l) with the disk space it occupies (du -s or ls -s), or by displaying its disk layout using the filesystem debugger fsdb(1a). fsdb also reveals the locations of the unallocated blocks, which cannot otherwise be determined when a file contains both allocated and unallocated null blocks.
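
The size-versus-occupied-space comparison can also be done from a program. The sketch below assumes a Unix system where stat reports st_blocks in 512-byte units; the names allocation_info and looks_sparse are made up for this example. Treat the result as a heuristic only: compressing filesystems, or filesystems that delay block allocation, can also report fewer allocated bytes than the file size.

```python
import os
import sys

def allocation_info(path):
    """Return (logical size, allocated bytes) for `path`."""
    st = os.stat(path)
    # Assumption: st_blocks counts 512-byte units, as on most Unix systems.
    return st.st_size, st.st_blocks * 512

def looks_sparse(path):
    """Heuristic: fewer bytes allocated on disk than the file size claims."""
    size, allocated = allocation_info(path)
    return allocated < size

if __name__ == "__main__":
    for name in sys.argv[1:]:
        size, allocated = allocation_info(name)
        flag = "sparse?" if allocated < size else "dense"
        print(f"{name}: size={size} allocated={allocated} {flag}")
```

This corresponds to comparing the output of ls -l with that of ls -s or du for each file named on the command line. It does not show where the unallocated blocks are.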

Problems

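Sparse files may require much more media space and copy time when backed up, because a backup tool that is not aware of sparseness writes out every unallocated block as null bytes. Larger tapes must be used and filesystem downtime may grow too long.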

Sparse files may be expanded when copied (cp, cpio), moved (mv) or restored (tar, cpio, dd). This wastes disk space, and disks may be too small for the restores.

Executable loaders of some Unix implementations seem to have problems with sparse executable files.

It is not evident how much disk space an application will eventually occupy if disk space is reserved in the form of sparse files.

Network filesystems may not properly implement the lseek feature, or the target filesystem may not be able to create sparse files.

Conclusions

If you are a programmer, don't use (rely on) the sparse file feature.

As a user, apply tools that preserve the sparseness of such files.

Sources

A compact sparse file creation tool can be found here: sparsefile.c. Its core function 'sparse' may be built into your own tools.
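
The source of sparsefile.c is not reproduced here. As an independent sketch of what a sparseness-preserving copy might look like (in Python rather than C; the name sparse_copy and the 4096-byte block size are assumptions for this example), the idea is to read the input block by block and seek over blocks that contain only null bytes instead of writing them:

```python
import os

def sparse_copy(src, dst, block_size=4096):
    """Copy src to dst, seeking over all-zero blocks instead of writing them.

    Hypothetical helper for illustration; block_size is an assumed value and
    should ideally match the block size of the target filesystem.
    """
    zero_block = bytes(block_size)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            block = fin.read(block_size)
            if not block:
                break
            if block == zero_block[:len(block)]:
                # Null block: advance the output offset without writing.
                fout.seek(len(block), os.SEEK_CUR)
            else:
                fout.write(block)
        # If the input ended in null blocks, the seeks alone did not extend
        # the output file, so set its final size explicitly.
        fout.truncate(fout.tell())
```

A call such as sparse_copy("big.db", "big.db.copy") (file names made up) yields a byte-identical copy. Whether the skipped blocks actually remain unallocated depends on the target filesystem. Note that this also turns allocated null blocks of the source into holes, so the copy may end up sparser than the original.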

Some GNU implementations of Unix commands include sparse file handling, e.g. cp and tar at gnu.org (archive names: coreutils-/fileutils-, tar-).

The Safe/Fast I/O Library sfio at research.att.com also includes handling of sparse files.

Keywords: sparse file, sparsefile, sparse file tool, unallocated blocks, lseek


lr / Tue Jan 7 1997 / links checked Fri Apr 28 2006