Pokology - Padding and aligning data in GNU poke




Table of Contents
_________________

1. Esoteric and exoteric padding
2. Reserved fields
3. Payloads
4. Aligning struct fields
5. Padding array elements


It is often the case in binary formats that certain elements are
separated by some data that serves no meaningful purpose other than
occupying that space.  The reason for keeping that space varies from
case to case: sometimes it is reserved for future use, sometimes it
makes sure that the following data is aligned to some particular
alignment.  This is known as "padding".  There are several ways to
implement padding in GNU poke.  This article shows these techniques
and discusses their advantages and disadvantages.


1 Esoteric and exoteric padding
===============================

  So padding is the technique of keeping some amount of space between
  two different elements in some data stream.  GNU poke provides two
  different ways to express sequences of data elements: the fields of a
  struct type, which are defined one after the other, and elements in an
  array.

  We call adding space between two struct fields esoteric (or internal)
  padding.

  We call adding space between two array elements exoteric (or external)
  padding.

  Let's see some examples of the two kinds and how to better handle them
  in Poke.


2 Reserved fields
=================

  People designing binary encoded formats tend to be cautious and try to
  avoid future backward incompatibilities by keeping some unused fields
  that are reserved for future use.  This is the first kind of padding
  we will be looking at, and is particularly common in structures like
  headers.

  See for example the header used to characterize compressed section
  contents in ELF files:

  ,----
  | type Elf64_Chdr =
  |   struct
  |   {
  |     Elf_Word ch_type;
  |     Elf_Word ch_reserved;
  |     offset<Elf64_Xword,B> ch_size;
  |     offset<Elf64_Xword,B> ch_addralign;
  |   };
  `----


  where the ch_reserved field is reserved for future use.  When the time
  comes the space occupied by that field (32 bits in this case) will be
  used to hold additional data in the form of one or more fields.  The
  idea is that implementations of the older formats will still work.

  The most obvious way to handle this in Poke is using a named field
  like ch_reserved above.  This field will be decoded/encoded by poke
  when constructing/mapping/writing struct values of this type, and will
  be available to the user as chdr.ch_reserved.

  Sometimes reserved space is required to be filled with certain data
  values, such as zeroes.  This may be to simplify things, or to force
  data producers to initialize the memory in order to avoid potential
  leaking of sensitive information.  In these cases we can use Poke
  initial values:

  ,----
  | type Elf64_Chdr =
  |   struct
  |   {
  |     Elf_Word ch_type;
  |     Elf_Word ch_reserved = 0;
  |     offset<Elf64_Xword,B> ch_size;
  |     offset<Elf64_Xword,B> ch_addralign;
  |   };
  `----


  This makes poke check that ch_reserved is zero when mapping headers
  for compressed sections, raising a constraint violation exception
  otherwise.  It also makes poke initialize ch_reserved to 0 when
  constructing Elf64_Chdr struct values, and reject any other value:

  ,----
  | (poke) Elf64_Chdr { ch_reserved = 23 }
  | unhandled constraint violation exception
  `----


  An alternative way to characterize reserved space in Poke is to use
  anonymous fields.  For example:

  ,----
  | type Elf64_Chdr =
  |   struct
  |   {
  |     Elf_Word ch_type;
  |     Elf_Word;
  |     offset<Elf64_Xword,B> ch_size;
  |     offset<Elf64_Xword,B> ch_addralign;
  |   };
  `----


  Using Poke anonymous fields to implement reserved fields has at least
  two advantages.  First, the user can no longer tamper with the data
  in the reserved space in an easy way, i.e. chdr.ch_reserved = 666 is
  no longer valid.  Second, the printed representation of anonymous
  struct fields is more compact and better conveys that the involved
  space is not to be messed with:

  ,----
  | (poke) Elf64_Chdr {}
  | Elf64_Chdr {
  |   ch_type=0x0U,
  |   0x0U,
  |   ch_size=0x0UL#B,
  |   ch_addralign=0x0UL#B
  | }
  `----


  A disadvantage of using anonymous fields is that you cannot specify
  constraint expressions for them, nor initial values.  At some point we
  will probably add syntax to declare certain struct fields as
  read-only.

  At this point, it is important to note that anonymous fields are still
  encoded/decoded by poke every time the struct value is mapped or
  written, exactly like regular fields.  Therefore using them offers
  no advantage in terms of performance.


3 Payloads
==========

  The reserved fields discussed in the previous section are most often
  discrete units like words, double-words, and the like.  They are
  usually of some fixed size, and they are used to delimit some space
  that is not to be used.

  Another kind of padding happens when an entity contains space to be
  used to store some kind of payload whose contents are not determined.
  This would be such an example:

  ,----
  | type Packet =
  | struct
  | {
  |   offset<uint<32>,B> payload_size;
  |   byte[payload_size] payload;
  |   int flags;
  | };
  `----


  In this example we are using a payload field which is an array of
  bytes.  The size of the payload is determined by the packet header,
  and the contents are not determined.  Of course this assumes that the
  payload sizes are divisible in whole bytes; a bit-oriented format may
  need to use an array of bits instead.

  This approach of using a byte (or bit) array like in the example above
  has the advantage of providing a field with the bytes (or bits) to the
  user, for inspection and modification:

  ,----
  | (poke) packet.payload
  | [23UB, ...]
  | (poke) packet.payload[0] = 0
  `----


  The user can still map whatever payload structure in that space using
  the attributes of a mapped Packet.  For example, if the packet
  contains an array of ULEB128 numbers, we could do:

  ,----
  | (poke) var numbers = ULEB128[packet.payload'size] @ packet.payload'offset
  `----


  But this approach has a disadvantage: every time the packet structure
  is mapped or written the entire payload array gets decoded and
  encoded.  If the payloads are big enough (think about the data blocks
  of a file described by a filesystem i-node for example) this can be a
  big problem in terms of performance.

  Another problem with using byte (or bit) arrays for payloads is that
  the printed representation of the struct values includes the contents
  of the arrays, and most often the user won't be interested in seeing
  that:

  ,----
  | (poke) Packet { payload_size = 23#B }
  | Packet {
  |   payload_size=0x17U#B,
  |   payload=[0x0UB,0x0UB,0x0UB,0x0UB,0x0UB,...],
  |   flags=0x0
  | }
  `----


  Another alternative is to implement the padding implied by a payload
  using field labels:

  ,----
  | type Packet =
  | struct
  | {
  |   offset<uint<32>,B> payload_size;
  |   int flags @ OFFSET + payload_size;
  | };
  `----


  Note how a payload field no longer exists in the struct type, and the
  field flags is defined to start at offset OFFSET + payload_size.  This
  way no explicit array is encoded/decoded when manipulating Packet
  values:

  ,----
  | (poke) .set omaps yes
  | (poke) Packet { payload_size = 500#MB }
  | Packet {
  |   payload_size=500000000U#B @ 0UL#b,
  |   flags=0 @ 4000000032UL#b
  | } @ 0UL#b
  `----


  In this example we used the omaps option, which asks poke to print the
  offsets of the fields.  The offset of flags is 4000000032 bits, that
  is, 500 megabytes past the 32 bits occupied by payload_size:

  ,----
  | (poke) 4000000032UL #b/#MB
  | 500UL
  `----


  Mapping this new Packet involves reading and decoding only eight
  bytes: four for payload_size and four for flags.  This is clearly
  much faster and avoids unneeded IO.

  However, you may be wondering: if there is no explicit payload field,
  how do we access the payload space?  One way is to define a method in
  the struct that provides the payload attributes:

  ,----
  | type Packet =
  | struct
  | {
  |   offset<uint<32>,B> payload_size;
  |   var payload_offset = OFFSET;
  |   int flags @ OFFSET + payload_size;
  | 
  |   method get_payload_offset = off64:
  |   {
  |     return payload_offset;
  |   }
  | };
  `----


  Note how we captured the offset of the payload using a variable in the
  struct type definition.  Returning OFFSET in get_payload_offset
  wouldn't work: in the body of the method OFFSET evaluates to the end
  of flags in this case.

  Using this method you can easily access the payload (again as an array
  of ULEB128 numbers) like this:

  ,----
  | var numbers = ULEB128[packet.payload_size] @ packet.get_payload_offset
  `----


  Finally, using labels for this purpose makes the printed
  representation of the struct values more readable by not including the
  payload bytes in it:

  ,----
  | (poke) Packet {}
  | Packet {
  |   payload_size=0x0U#B,
  |   flags=0x0
  | }
  `----


4 Aligning struct fields
========================

  Another kind of esoteric padding happens when certain fields in
  entities are required to be aligned to some particular alignment.  For
  example, suppose that the flags field in the packets used in the
  previous sections is required to always be aligned to 4 bytes
  regardless of the size of the payload.  This would be a common
  requirement if the format is intended to be implemented in systems
  where data is to be accessed using its "natural" alignment.

  Using explicit fields for both the payload and the additional padding,
  we could come up with:

  ,----
  | type Packet =
  | struct
  | {
  |   offset<uint<32>,B> payload_size;
  |   byte[payload_size] payload;
  |   byte[alignto (OFFSET, 4#B)] padding;
  |   int flags;
  | };
  `----


  Where alignto is a little function defined in the Poke standard
  library, like this:

  ,----
  | fun alignto = (uoff64 offset, uoff64 to) uoff64:
  | {
  |   return (to - (offset % to)) % to;
  | }
  `----
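

  As a quick check of the arithmetic, the sketch below repeats the
  same computation under the hypothetical name padding_to (a name
  chosen here only to avoid redefining alignto) and applies it to a
  few offsets.  The values in the comments follow from the formula by
  hand; they are not captured poke output:

  ,----
  | /* Hypothetical helper, same body as alignto above.  */
  | fun padding_to = (uoff64 offset, uoff64 to) uoff64:
  | {
  |   return (to - (offset % to)) % to;
  | }
  | 
  | var p0 = padding_to (8#B, 4#B); /* 0 bytes: 8 is a multiple of 4.  */
  | var p1 = padding_to (5#B, 4#B); /* 3 bytes: 5 + 3 = 8.  */
  | var p2 = padding_to (6#B, 4#B); /* 2 bytes: 6 + 2 = 8.  */
  `----


  The outer modulus is what makes the already-aligned case yield zero
  instead of a full 4 bytes of padding.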


  Alternatively, using the labels approach (which is generally better as
  we discussed in the last section) the definition would become:

  ,----
  | type Packet =
  | struct
  | {
  |   offset<uint<32>,B> payload_size;
  |   var payload_offset = OFFSET;
  |   int flags @ OFFSET + payload_size + alignto (payload_size, 4#B);
  | 
  |   method get_payload_offset = off64:
  |   {
  |     return payload_offset;
  |   }
  | };
  `----


  In this case, the payload space is still completely characterized by
  the payload_size field and the get_payload_offset method.


5 Padding array elements
========================

  Up to now all the examples of padding we have shown are in the
  category of esoteric or internal padding, i.e. it was intended to add
  space between fields of some particular entity.

  However, sometimes we want to specify some padding between the
  elements of a sequence of entities.  In Poke this basically means an
  array.

  Suppose we have a simple filesystem that consists of a sequence of
  inodes.  The contents of the filesystem have the following form:

  ,----
  | +-----------------+
  | |      inode      |
  | +-----------------+
  | :                 :
  | :      data       :
  | :                 :
  | +-----------------+
  | |      inode      |
  | +-----------------+
  | :                 :
  | :      data       :
  | :                 :
  | +-----------------+
  | |      ...        |
  `----


  That is, each inode describes a block of data of variable size that
  immediately follows.  Then more pairs of inode-data follow until the
  end of the device.  However, a requirement is that each inode has to
  be aligned to 128 bytes.

  Let's start by writing a simple type definition for the inodes:

  ,----
  | type Inode =
  |   struct
  |   {
  |     string filename;
  |     int perms;
  |     offset<uint<32>,B> data_size;
  |   };
  `----


  This definition is simple enough, but it doesn't account for the data
  and padding that sit between consecutive inodes, so it doesn't allow
  us to just map an array of inodes like this:

  ,----
  | (poke) Inode[] @ 0#B
  `----


  We could of course add the data and padding explicitly to the inode
  structure:

  ,----
  | type Inode =
  |   struct
  |   {
  |     string filename;
  |     int perms;
  |     offset<uint<32>,B> data_size;
  |     byte[data_size] data;
  |     byte[alignto (data_size, 128#B)] padding;
  |   };
  `----


  Then we could just map Inode[] @ 0#B and we would get the expected
  result.

  But this is not a good idea.  On one hand because, as we know, this
  would imply mapping the full filesystem data byte by byte, and that
  would be very slow.  On the other hand, because the data is not
  part of the inode, conceptually speaking.

  A better solution is to use this idiom:

  ,----
  | type Inode =
  |   struct
  |   {
  |     string filename;
  |     int perms;
  |     offset<uint<32>,B> data_size;
  | 
  |     byte[0] @ OFFSET + data_size + alignto (data_size, 128#B);
  |   };
  `----


  This uses an anonymous field at the end of the struct type, of size
  zero, located at exactly the offset where the data plus padding would
  end in the version with explicit fields.

  This latter solution is fast and still allows us to get an array of
  inodes reflecting the whole filesystem with:

  ,----
  | (poke) var inodes = Inode[] @ 0#B
  `----


  Like in the previous sections, a method get_data_offset can be added
  to the struct type in order to allow accessing the data blocks
  corresponding to a given inode.
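
  A minimal sketch of that idea follows, mirroring the Packet example
  from the payloads section.  The names data_offset and
  get_data_offset are illustrative.  The usage line further assumes
  that OFFSET, like field labels, is relative to the beginning of the
  struct, so the offset of the inode itself is added when mapping the
  data; treat that as an assumption to verify rather than a given:

  ,----
  | type Inode =
  |   struct
  |   {
  |     string filename;
  |     int perms;
  |     offset<uint<32>,B> data_size;
  |     /* Capture where the data block starts (assumed relative to
  |        the start of this inode).  */
  |     var data_offset = OFFSET;
  | 
  |     byte[0] @ OFFSET + data_size + alignto (data_size, 128#B);
  | 
  |     method get_data_offset = off64:
  |     {
  |       return data_offset;
  |     }
  |   };
  `----


  With the inodes array mapped as above, the data block of the second
  inode could then be reached with something like:

  ,----
  | (poke) var data = byte[inodes[1].data_size] @ (inodes[1]'offset + inodes[1].get_data_offset)
  `----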