Compression

Introduction

Given the small memory capacity of Palm OS devices, it was decided that the text data should be compressed if possible. ZLib seemed an appropriate choice for several reasons:

The Python support for ZLib is very nice: simply pass a string of data to the zlib.compress function and the compressed data are returned.

The Palm OS port of ZLib has the same interface as the standard C version, and it was possible to develop a simpler interface to it.
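As a minimal sketch of the Python side, the round trip looks like this. The sample text is made up for illustration, and under Python 3 the data must be bytes rather than a text string.

```python
import zlib

# Hypothetical sample text. Python 3's zlib works on bytes, so encode first.
description = ("A perennial herb with creeping stems. " * 20).encode("utf-8")

# One call compresses the data, one call restores it.
compressed = zlib.compress(description)
restored = zlib.decompress(compressed)

print(len(description), "bytes before,", len(compressed), "bytes after")
```

Repetitive text like this sample compresses far better than real descriptions would, so the sizes printed here say nothing about the actual data.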

A description of how compression fits in with the Palm OS database is included in the PDB Format page.

Related

Some other references to ZLib in Poppi

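Specifications for ZLib and related standards are available in these RFCs:

RFC 1950: ZLIB Compressed Data Format Specification version 3.3

RFC 1951: DEFLATE Compressed Data Format Specification version 1.3

RFC 1952: GZIP file format specification version 4.3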

Statistics

Below is a comparison of the file sizes of several versions of the data from An Irish Flora.

File Type   Compression   File Size   %age Decrease
XML         -             18 549       0.0 (Original)
XML         GZip           5 582      69.9
PDB         -             13 412      27.7
PDB         ZLib          10 464      43.6
Compression Statistics for a portion of An Irish Flora.

The decrease in size between the compressed and uncompressed PDB format is 22.0%. This is an appreciable saving in space although there is reason to hope for even greater compression: In this portion of the data several items were added without descriptions. This increases the amount of uncompressed data without increasing the amount of compressed data.
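The percentages in the table and the 22.0% figure can be checked from the file sizes with a few lines of Python. The helper name percent_decrease is only for illustration.

```python
def percent_decrease(original_size, new_size):
    """Return the size reduction as a percentage of original_size."""
    return 100.0 * (original_size - new_size) / original_size


# File sizes taken from the table above.
XML_PLAIN = 18549
XML_GZIP = 5582
PDB_PLAIN = 13412
PDB_ZLIB = 10464

# Decrease relative to the original XML file.
xml_gzip_pct = round(percent_decrease(XML_PLAIN, XML_GZIP), 1)
pdb_plain_pct = round(percent_decrease(XML_PLAIN, PDB_PLAIN), 1)
pdb_zlib_pct = round(percent_decrease(XML_PLAIN, PDB_ZLIB), 1)

# Decrease of the compressed PDB relative to the uncompressed PDB.
pdb_vs_pdb_pct = round(percent_decrease(PDB_PLAIN, PDB_ZLIB), 1)

print(xml_gzip_pct, pdb_plain_pct, pdb_zlib_pct, pdb_vs_pdb_pct)
```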

The data for the GZipped XML file are included for comparison. GZip uses the same DEFLATE compression method as ZLib. One reason for the dramatic compression rate of almost 70% is that the XML file contains a lot of redundant data.

Another major reason is that the XML file is compressed as one stream using a single compression dictionary. This means that the compression algorithm can take advantage of redundancies in the whole file. When operating on the individual (c. 1000 byte) descriptions in the PDB, ZLib can only exploit the redundancy in that chunk of data. It is important that descriptions remain independent so that it is possible to uncompress just the amount of data needed for the current display.
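The difference between compressing one stream and compressing independent chunks can be seen in a small experiment. This is only a sketch: the records are invented, and they are much shorter and more uniform than the real descriptions, so the gap it shows is exaggerated.

```python
import zlib

# Hypothetical records standing in for the individual descriptions.
records = [
    ("<plant><name>Species %d</name><description>Perennial herb with "
     "creeping stems and small white flowers, found in damp meadows."
     "</description></plant>" % i).encode("utf-8")
    for i in range(50)
]

# One stream: redundancy across all records can be exploited.
whole_stream = len(zlib.compress(b"".join(records)))

# Independent chunks: each record only sees its own redundancy.
chunks = [zlib.compress(r) for r in records]
per_record = sum(len(c) for c in chunks)

print("single stream:", whole_stream, "bytes")
print("independent records:", per_record, "bytes")

# The benefit of independent chunks: any one record can be
# uncompressed without touching the others.
one_record = zlib.decompress(chunks[7])
```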

If it were possible to maintain a compression dictionary while generating the PDB, and then to store that data structure in the PDB to make it available to the Palm ZLib library, then better compression rates might be achievable.
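ZLib does provide a preset dictionary mechanism (deflateSetDictionary and inflateSetDictionary in the C API), which Python exposes through the zdict argument. The sketch below shows the idea on the Python side only. The dictionary contents and helper names are invented, and it is an assumption that the Palm OS port supports preset dictionaries in the same way.

```python
import zlib

# Hypothetical shared dictionary: byte strings expected to recur in records.
# In practice it would be built from the real data and stored in the PDB.
shared_dict = b"perennial herb creeping stems small white flowers damp meadows"


def compress_record(data, zdict):
    """Compress one record independently, primed with the shared dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress_record(blob, zdict):
    """Uncompress one record; the same dictionary must be supplied."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


record = b"perennial herb with creeping stems and small white flowers in damp meadows"

with_dict = compress_record(record, shared_dict)
without_dict = zlib.compress(record, 9)

print("without dictionary:", len(without_dict), "bytes")
print("with dictionary:", len(with_dict), "bytes")
```

Each record remains independent here, so a single description can still be uncompressed on its own as long as the shared dictionary is available.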

ZLib 1.1.4

Apparently there is a bug in ZLib 1.1.3, where a buffer is freed twice. This could present a security hole.

ZLib 1.1.4 has been released but the Palm OS port has not yet been updated. I have applied the Palm OS patch for version 1.1.3-7 to the 1.1.4 source tree and so have created a 1.1.4 Palm OS library. To the best of my knowledge, this is the only ZLib 1.1.4 library available for Palm OS.