PDB Format

Introduction to PDB

Palm OS applications use a data storage format known as a Palm Database (or PDB). Data is stored in discrete records which can be up to 64 KB in size. Each record can have associated with it an application-defined identifier and a category identifier. Poppi uses these features to allow easy lookup of records.

This storage system was designed to be efficient for palm-top applications such as address books or to-do lists. The data that these applications work with can easily be broken down into individual records: for example, an address book entry which consists of a name, address and telephone number for a particular person. While conceptually all the address book entries are part of a single database, in reality they can be distributed wherever space is available in the device's memory.

Record Formats

Poppi defines two types of records: taxonomic records for families, genera and species; and key pair records which aid the reader in identifying the plant under inspection in a choose-your-own-adventure fashion.

Taxonomic Records

The taxonomic records all have the following structure:

 short int  name_length
      char  name[name_length]
 short int  original_description_length
 short int  compressed_description_length
      char  compressed_description[compressed_description_length]

The length of this structure varies depending on the length of the name and the description. This is more flexible than having a “fixed-width” name field. Also, since these databases may get quite large, it is worth conserving space where possible.

If the uncompressed description has zero length, no compression is done and the compressed description length is set to 0. This avoids the small overhead of “compressing” an empty string with zlib.

Otherwise, the data is compressed using zlib. The original length and the compressed length are stored, followed by the compressed data. This allows the Poppi application on the Palm OS device to allocate just enough memory to hold the original and uncompress the description into this space.
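
As an illustration, the layout and the empty-description rule above might be packed and unpacked in Python as follows. This is a sketch, not the code used by poppi-pdb: the function names are hypothetical, and it assumes that each short int is an unsigned 16-bit big-endian value (the byte order conventionally used in PDB files) and that text is encoded as Latin-1.

```python
import struct
import zlib


def pack_taxonomic_record(name, description):
    """Serialise one taxonomic record in the layout listed above."""
    # Assumption: Latin-1 text and big-endian unsigned 16-bit lengths.
    name_bytes = name.encode("latin-1")
    desc_bytes = description.encode("latin-1")
    # An empty description is not compressed; both lengths are then 0.
    compressed = zlib.compress(desc_bytes) if desc_bytes else b""
    return (
        struct.pack(">H", len(name_bytes))
        + name_bytes
        + struct.pack(">HH", len(desc_bytes), len(compressed))
        + compressed
    )


def unpack_taxonomic_record(data):
    """Return (name, description) from a packed taxonomic record."""
    (name_length,) = struct.unpack_from(">H", data, 0)
    offset = 2
    name = data[offset:offset + name_length].decode("latin-1")
    offset += name_length
    original_length, compressed_length = struct.unpack_from(">HH", data, offset)
    offset += 4
    if compressed_length == 0:
        return name, ""
    raw = zlib.decompress(data[offset:offset + compressed_length])
    # The stored original length tells the reader how much space to allocate.
    if len(raw) != original_length:
        raise ValueError("description length mismatch")
    return name, raw.decode("latin-1")
```

Storing the original length ahead of the compressed data is what lets a reader size its output buffer before decompressing.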

Description

A description may contain many description items. Each description item has a title part and a body part. The title is separated from the body by a tab character ('\t') and description items are separated by a newline character ('\n').

For example, here is an XML extract:

<di><dt>Habitat</dt><dd>Roadsides, arable and waste ground</dd></di>

and its corresponding text form:

Habitat\tRoadsides, arable and waste ground\n
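
The conversion from the XML form to the text form can be sketched in Python as below. The function names are hypothetical, and the sketch assumes that titles and bodies never contain tab or newline characters themselves.

```python
import xml.etree.ElementTree as ET


def di_to_text(di_xml):
    """Convert one <di> element into 'title<TAB>body<NEWLINE>'."""
    di = ET.fromstring(di_xml)
    title = di.findtext("dt", "")
    body = di.findtext("dd", "")
    return f"{title}\t{body}\n"


def parse_description(text):
    """Split the text form back into a list of (title, body) pairs."""
    items = []
    for line in text.split("\n"):
        if line:  # skip the empty string after the final newline
            title, _, body = line.partition("\t")
            items.append((title, body))
    return items
```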

Key Pair Records

Each key pair is a pair of elements that have a description and a destination. The destinations are unique identifiers, as explained below. The descriptions are compressed and their lengths stored with the same methodology as used for taxonomic records.

identifier  destination_1
identifier  destination_2
 short int  original_description_length_1
 short int  original_description_length_2
 short int  compressed_description_length_1
 short int  compressed_description_length_2
      char  compressed_description_1[compressed_description_length_1]
      char  compressed_description_2[compressed_description_length_2]
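
A key pair record could be packed along the following lines. This is a sketch under stated assumptions rather than the poppi-pdb implementation: the function names are hypothetical, each identifier is assumed to occupy 4 bytes, each short int 2 bytes, all values are assumed big-endian and unsigned, and text is assumed to be Latin-1.

```python
import struct
import zlib

# Two 4-byte destinations followed by four 2-byte lengths (assumed sizes).
HEADER = ">IIHHHH"


def _compress(text):
    """Return (original_length, compressed_bytes); empty text stays empty."""
    raw = text.encode("latin-1")
    return len(raw), (zlib.compress(raw) if raw else b"")


def pack_key_pair(dest_1, desc_1, dest_2, desc_2):
    """Serialise one key pair record in the layout listed above."""
    orig_1, comp_1 = _compress(desc_1)
    orig_2, comp_2 = _compress(desc_2)
    header = struct.pack(
        HEADER, dest_1, dest_2, orig_1, orig_2, len(comp_1), len(comp_2)
    )
    return header + comp_1 + comp_2


def unpack_key_pair(data):
    """Return ((dest_1, desc_1), (dest_2, desc_2)) from a packed record."""
    dest_1, dest_2, orig_1, orig_2, clen_1, clen_2 = struct.unpack_from(
        HEADER, data, 0
    )
    # orig_1 and orig_2 would let a device size its output buffers.
    start = struct.calcsize(HEADER)
    comp_1 = data[start:start + clen_1]
    comp_2 = data[start + clen_1:start + clen_1 + clen_2]
    desc_1 = zlib.decompress(comp_1).decode("latin-1") if clen_1 else ""
    desc_2 = zlib.decompress(comp_2).decode("latin-1") if clen_2 else ""
    return (dest_1, desc_1), (dest_2, desc_2)
```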

Unique Identifiers

Each record in a PDB can be assigned a non-zero 24-bit unique identifier. In Poppi, this is used to give each record an identifier that encodes its place in the taxonomic hierarchy.

The structure used is as follows:

       int  reserved:8
       int  family:8
       int  genus:8
       int  is_kp:1
       int  species:7

This structure allows each database to have 255 families, each of which can have 255 genera, each of which can have 127 species. This is more than enough to handle all the data in An Irish Flora. Larger bodies of information can be divided between multiple databases, although there is not yet support for links between databases.
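
The bit layout can be expressed with shifts and masks. The sketch below uses hypothetical helper names; the bit positions (family in bits 16 to 23, genus in bits 8 to 15, is_kp in bit 7, species in bits 0 to 6) are inferred from the example identifiers given in this document.

```python
def make_uid(family, genus=0, species=0, is_kp=False):
    """Build a unique identifier from its fields."""
    if not 1 <= family <= 255:
        raise ValueError("family must be between 1 and 255")
    if not 0 <= genus <= 255:
        raise ValueError("genus must be between 0 and 255")
    if not 0 <= species <= 127:
        raise ValueError("species must be between 0 and 127")
    return (family << 16) | (genus << 8) | (int(is_kp) << 7) | species


def split_uid(uid):
    """Break a unique identifier back into its fields."""
    return {
        "family": (uid >> 16) & 0xFF,
        "genus": (uid >> 8) & 0xFF,
        "is_kp": bool(uid & 0x80),
        "species": uid & 0x7F,
    }
```

Because a family value of 0 is rejected, every identifier built this way is non-zero, as the PDB format requires.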

Taxonomic Records

The taxonomic records are assigned unique ids by the poppi-pdb program. Each family is assigned a unique id (from 1 to 255) and the genus and species fields are set to 0. The is_kp flag is set to 0.

Each genus within this family has its family field set to its parent family's identifier and is assigned a unique id (from 1 to 255) for its genus field. The species field and the is_kp flag are set to 0.

Each species within this genus has its family and genus fields set to the values used by the parent and is assigned a unique id (from 1 to 127) for its species field. The is_kp flag is set to 0.
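
The numbering scheme can be sketched as a walk over a nested structure. This is not the poppi-pdb code: the function name, the input shape and the sample data (including its ordering) are all hypothetical and chosen only for illustration.

```python
def assign_taxon_ids(flora):
    """Assign ids to families, genera and species.

    flora maps family name -> {genus name -> [species names]}.
    Returns a dict keyed by the path to each taxon.
    """
    if len(flora) > 255:
        raise ValueError("at most 255 families fit in one database")
    ids = {}
    for f, (family, genera) in enumerate(flora.items(), start=1):
        ids[(family,)] = f << 16  # genus and species fields stay 0
        if len(genera) > 255:
            raise ValueError("too many genera in " + family)
        for g, (genus, species_list) in enumerate(genera.items(), start=1):
            ids[(family, genus)] = (f << 16) | (g << 8)
            if len(species_list) > 127:
                raise ValueError("too many species in " + genus)
            for s, species in enumerate(species_list, start=1):
                ids[(family, genus, species)] = (f << 16) | (g << 8) | s
    return ids


# Illustrative sample only; not the contents or order of the real database.
sample = {
    "Family A": {"Genus A1": ["Species A1a"]},
    "Family B": {"Genus B1": ["Species B1a", "Species B1b", "Species B1c"]},
}
sample_ids = assign_taxon_ids(sample)
```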

Key Pair Records

The key pair records use this structure slightly differently: top-level keys are treated as family records in that they are assigned a unique id (from 1 to 255) for the family field and the genus and species fields are cleared. The is_kp flag is set to 1, to indicate that the record is a key pair.

Keys within a family, which lead to genera of that family, are structured as genus records of that family but with the is_kp flag set to 1. Similarly, keys within a genus, which lead to species of that genus, are like species records, but with the is_kp flag set to 1.

Note that the ids of ordinary records and key pairs can overlap, in the sense that it is legal to have both a family with the id 0x00420000 and a key with the id 0x00420080, where the only difference is that one has the is_kp bit clear and the other has it set.

Examples

In a version of the Poppi PDB the family Papaveraceae has the unique identifier 0x00020000, indicating that this is, in some sense, the second family in the database. The genus Papaver of this family has the id 0x00020100, as the first genus within this family. Papaver dubium, a species within Papaver, has the id 0x00020103. It is the third species in the genus.

The first key pair in Papaveraceae has the unique id 0x00020180. A key pair in Papaver has the unique id 0x00020184.

Categories

Palm OS has the notion of categories for records. Each PDB can have 16 categories and each record is marked as being in one of these categories. The Palm API provides support for retrieving records based on their category. It is easy to see how categories apply to, say, an address book: you could have a “Personal” category for friends and family; a “Business” category for business contacts; and an “Unfiled” category for those entries that haven't yet been put in a category. In this way the user can filter the display to only show personal contacts, or to show all contacts.

Poppi uses the categories “Family”, “Genus”, “Species” and “Key” to distinguish between the different types of record. This does not strictly follow the filtering concept given above as Poppi attempts to give the user a hierarchical view of the taxonomy.

In some ways, using both categories and unique identifiers is redundant as the unique identifiers can reveal whether a record is for a family, genus, species or key. The category mechanism is used on the assumption that the Palm OS API calls will be more efficient than my own identifier manipulation.
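
To illustrate the redundancy, the category of a record can be derived from its identifier alone. The sketch below uses a hypothetical function name and assumes the bit positions implied by the example identifiers in this document (is_kp in bit 7, species in bits 0 to 6, genus in bits 8 to 15).

```python
def record_kind(uid):
    """Return the Poppi category name implied by a unique identifier."""
    if uid & 0x80:            # is_kp flag set
        return "Key"
    if uid & 0x7F:            # non-zero species field
        return "Species"
    if (uid >> 8) & 0xFF:     # non-zero genus field
        return "Genus"
    return "Family"
```

For example, `record_kind(0x00020103)` yields "Species" and `record_kind(0x00020180)` yields "Key".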

Category Block Structure

The Python module used to create the PDB file, prc.py, did not provide support for creating the Category Block structure needed by Palm OS applications that use categories. A simple Python class to implement this structure was created (see poppidb.py) and is used in the PDB creation process.
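
For orientation, a category block might be built as in the sketch below. This is not the class from poppidb.py. It follows the layout commonly documented for the Palm OS AppInfoType structure (a 16-bit renamed-categories field, sixteen 16-byte labels, sixteen one-byte category ids, a last-id byte and a padding byte). The function name, the order of the labels and the values chosen for the id fields are assumptions.

```python
import struct

NUM_CATEGORIES = 16  # categories per database
LABEL_SIZE = 16      # bytes per label, including the terminating NUL


def pack_category_block(labels):
    """Build a 276-byte category block from up to 16 label strings."""
    if len(labels) > NUM_CATEGORIES:
        raise ValueError("a PDB holds at most 16 categories")
    block = struct.pack(">H", 0)  # renamed-categories bit field: none renamed
    for index in range(NUM_CATEGORIES):
        raw = labels[index].encode("latin-1") if index < len(labels) else b""
        if len(raw) >= LABEL_SIZE:
            raise ValueError("label too long: " + labels[index])
        block += raw.ljust(LABEL_SIZE, b"\x00")  # NUL-padded label slot
    block += bytes(range(NUM_CATEGORIES))        # category ids, assumed 0..15
    block += bytes([NUM_CATEGORIES - 1, 0])      # last id used, padding byte
    return block


# Label order here is an assumption, not taken from the Poppi source.
poppi_block = pack_category_block(["Family", "Genus", "Species", "Key"])
```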

Size Estimates

The size of the PDB file is an important consideration. If the user cannot access a sufficiently large body of information on their palm-top because of lack of space, they may well find a traditional book more useful.

These estimates indicate that it would be possible to make available the full body of information from An Irish Flora in a size suitable for most Palm OS devices.

It should be noted that these estimates are based on the size of the PDB as it exists on the PC's file system. When installed to the Palm its structure is converted to that used internally by the Data Manager routines, which may introduce some overhead.

Measurement  Number  Total  Size   Estimated Size  Notes
Pages        5       300    10 KB  600 KB          Includes key
Families     4       132    10 KB  350 KB          Ignores key

Estimates of Total PDB Size for An Irish Flora.
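
The extrapolation behind these figures appears to be a simple linear scaling. The sketch below assumes the table is read as follows: a sample of 5 pages out of 300 (or 4 families out of 132) produces a 10 KB PDB. The function name is hypothetical.

```python
def estimate_total_kb(sample_count, total_count, sample_size_kb):
    """Scale the size of a sample linearly to the whole book."""
    return sample_size_kb / sample_count * total_count


pages_estimate = estimate_total_kb(5, 300, 10)       # 600.0 KB
families_estimate = estimate_total_kb(4, 132, 10)    # 330.0 KB
```

Under this reading, the page-based figure matches the 600 KB in the table, while the family-based scaling gives 330 KB, close to the 350 KB listed.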

As larger XML files are produced, it will be possible to make a better estimate of the total file size.