Technical white paper | HP ScanJet solutions
File size considerations
In environments with high volumes of scanning, file size is an important part of controlling the amount of storage required.
There are many factors that need to be considered when optimizing file size:
Resolution
The higher the dots per inch (dpi), the more data generated. For example, a 200-dpi scan generates four times as
much data as a 100-dpi scan of the same page, because it has twice as many dots both horizontally and vertically.
Another consideration, however, is whether the scan will be processed with OCR. If so, a resolution of at least 150 dpi
is recommended for accurate text recognition (ideally 200 dpi or more, and 300 dpi for Asian languages).
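The effect of resolution on data volume can be checked with simple arithmetic. The following is a minimal sketch, not part of any HP tool: the function name `uncompressed_bytes`, the Letter page size, and the 8-bit grayscale mode are assumptions chosen for illustration, and the result is the raw size before any compression or file-format overhead.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_dot):
    """Raw (uncompressed) size in bytes of one scanned page.

    Hypothetical helper for illustration; ignores file headers and
    any row padding a real file format might add.
    """
    dots_wide = round(width_in * dpi)
    dots_high = round(height_in * dpi)
    return dots_wide * dots_high * bits_per_dot // 8


# Assumed example: Letter page (8.5 x 11 in) scanned in 8-bit grayscale.
size_100 = uncompressed_bytes(8.5, 11, 100, 8)
size_200 = uncompressed_bytes(8.5, 11, 200, 8)

print(size_100)             # 935000
print(size_200)             # 3740000
print(size_200 / size_100)  # 4.0
```

Doubling the resolution doubles the dot count in each direction, so the raw size grows by a factor of four regardless of page size or color mode.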
Color mode
Options for color mode include color, gray, and black-and-white (either two-tone or halftone). Each has an impact on data
and file size:
• Color scanning will usually produce 24 bits for every dot.
• Grayscale scanning will usually produce 8 bits for every dot.
• Black-and-white scanning will produce 1 bit for every dot.
Consequently, before compression, color scanning produces 24 times as much data as black-and-white scanning. This makes
black-and-white scanning very popular for archival.
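To put the bit depths above into concrete numbers, the sketch below compares the raw size of one page in each color mode. The names `BITS_PER_DOT` and `raw_page_bytes` are hypothetical, and the page dimensions (1700 x 2200 dots, roughly a Letter page at 200 dpi) are an assumption for illustration; sizes are before compression.

```python
# Bits stored per dot in each color mode, as listed above.
BITS_PER_DOT = {"color": 24, "gray": 8, "black-and-white": 1}


def raw_page_bytes(dots_wide, dots_high, mode):
    """Raw size in bytes of one page, rounded up to a whole byte."""
    bits = dots_wide * dots_high * BITS_PER_DOT[mode]
    return (bits + 7) // 8


# Assumed example page: 1700 x 2200 dots.
sizes = {mode: raw_page_bytes(1700, 2200, mode) for mode in BITS_PER_DOT}

for mode, size in sizes.items():
    print(f"{mode:>16}: {size:>10,} bytes")

# Ratio of color to black-and-white data for the same page.
print(sizes["color"] // sizes["black-and-white"])  # 24
```

The ratio holds only for uncompressed data; after compression, the difference between modes depends on the page content and the compression type.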
Compression
The objective here is to reduce the size of the file, but there are many ways of doing so.
Compression techniques can be divided into lossy (where some data is lost in favor of compression) and lossless (where
data is fully preserved):
• Lossy compression will affect image quality. The lower the compression, the better the image quality; the higher the
compression, the more image quality will be affected. High compression may also have a negative impact on scanning
performance (or speed).
• Lossless compression uses techniques that allow for size reduction without affecting image quality. The resulting sizes
are generally larger than with lossy compression, but if image quality is important, this is the best choice.
Compared to picture scanning, document scanning allows for more lossy compression before the effects on image quality
are visible. For this reason, it is possible to create smaller document files.
Types of compression include:
• JPEG: this is the most popular compression type and is typically used for color scanning. It is lossy compression, so file-
saving parameters will usually include a “quality” setting to configure the level of compression desired.
– JPEG compresses only continuous-tone (grayscale or color) data, so if a black-and-white scan is saved with JPEG
compression, it first needs to be converted (e.g., 1-bit data is converted to 8-bit grayscale or 24-bit color).
Consequently, JPEG is not ideal for black-and-white scanning.
• LZW: this is lossless compression that can be used for scanning in any color mode. This type is particularly effective in
compressing black-and-white documents.
• ZIP: another common lossless compression format. This is typically used for the compression and packaging of software,
but it can also be used for scanned document files.
• CCITT Group 3 and Group 4: these are lossless compression types, originally developed for fax, that are specifically
designed for black-and-white (1-bit) scanning.
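The defining property of lossless compression, that the original data is recovered exactly, can be demonstrated with a short sketch. It uses Python's standard `zlib` module, which implements Deflate, the algorithm commonly used inside ZIP files. The page here is synthetic (an assumption for illustration): a mostly white 1-bit page with regular bands of a repeating pattern standing in for text, so the compression ratio it shows is far better than a real scan would achieve.

```python
import zlib

ROW_BYTES = 213  # 1700 dots per row at 1 bit per dot, rounded up to whole bytes
ROWS = 2200

# Synthetic rows: all-white, or a repeating pattern standing in for text.
white_row = b"\xff" * ROW_BYTES
text_row = bytes(0x00 if i % 4 == 0 else 0xFF for i in range(ROW_BYTES))

# Ten "text" rows out of every forty; the rest of the page is white.
rows = [text_row if r % 40 < 10 else white_row for r in range(ROWS)]
page = b"".join(rows)

packed = zlib.compress(page, 9)    # 9 = strongest compression level
restored = zlib.decompress(packed)

print(len(page), len(packed))
print(restored == page)            # True: every bit is recovered
```

A lossy method such as JPEG offers no equivalent round-trip guarantee; it trades exact recovery for smaller output.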
File type
The choice of file format will significantly affect the size of a file.
Common formats include:
• Single-page formats: these file types can save only one page per file, so a multi-page document produces
multiple files.
– JPEG: the most common lossy-compression file format for pictures; it can be used for scanned pages as well.
– PNG: this lossless compression file type will maintain image quality but will usually be larger than the equivalent JPEG
file.
– BMP: like PNG, this format preserves image quality, but it is typically uncompressed, so a scanned page can result in
a relatively large file.
• Multi-page formats:
– TIFF: sometimes named TIFF for a single-page file and MTIFF for a multi-page file, TIFF is a “container” file format,
where the pages contained in it can use a variety of compression types. For example, a TIFF file can have JPEG-
compressed pages while another can be an LZW-compressed document. Given this capability, it is common to see TIFF