Pokology - Tips and Tricks


Pokology - a community-driven site around GNU poke
                             

     _____
 ---'   __\_______
            ______)         Tips and Tricks
            __)             
           __)
 ---._______)


Table of Contents
_________________

1. Sequence of bytes ended by two consecutive zero bytes
2. Sub-byte endianness swaps





1 Sequence of bytes ended by two consecutive zero bytes
=======================================================

  So today I was writing a pickle for the BER encoding of ASN-1 and came
  with an interesting problem: when the length of the contents of a BER
  data value is indefinite (not explicitly specified) we shall read
  bytes until we find an end marker that consists on two consecutive 0
  bytes.

  Sounds simple enough... but I really had to think for a while before
  coming to this:

  ,----
  | type BER_Variable_Contents =
  |   struct
  |   {
  |     type Datum =
  |       union
  |       {
  |         byte[2] pair : pair[1] != 0UB;
  |         byte single : single != 0UB;
  |       };
  | 
  |     Datum[] data;
  |     byte[2] end : end == [0UB,0UB];
  |  };
  `----


  How does this work?

  First, notice that there must be at least two bytes in the IO space,
  which would correspond to the two 0 bytes that terminate the sequence.
  Otherwise it would be an integrity error.

  When it reaches the data field of the struct, which is an unbounded
  array, poke will decode Datum=s and add them to the array =data until
  either a constraint fails or the end of the IO space is reached.  So
  it will first try to decode a pair of bytes in which the second byte
  is not zero.  If that fails (due to either a constraint error or EOF)
  then it will try to map a single byte that is not zero.  If that also
  fails, poke is done with data and then it expects the two zero bytes
  in end.

  This is how the resulting structures look like:

  ,----
  | (poke) .mem
  | (poke) byte[] @ 0#B = ['a','b','c','d','e']
  | (poke) BER_Variable_Contents @ 0#B
  | BER_Variable_Contents {
  |   data=[Datum {
  |     pair=[0x61UB,0x62UB]
  |   },Datum {
  |     pair=[0x63UB,0x64UB]
  |   },Datum {
  |     single=0x65UB
  |   }],
  |   end=[0x0UB,0x0UB]
  | }
  `----


  A suitable method could be added to the struct type in order to get an
  array with all the bytes:

  ,----
  | method get_bytes = byte[]:
  | {
  |   var bytes = byte[]();
  |   for (d in data)
  |     try bytes += d.pair;
  |     catch if E_elem { bytes += [d.single]; }
  |   return bytes;
  | }
  `----


  In action:

  ,----
  | (poke) (BER_Variable_Contents @ 0#B).get_bytes
  | [0x61UB,0x62UB,0x63UB,0x64UB,0x65UB]
  `----


2 Sub-byte endianness swaps
===========================

  Byte-endianness is typically handled quite transparently in poke by
  using integral types and integral structs.  However, endianness can
  have an impact on the ordering of C bit-fields in a structure, even
  when the bit-fields are smaller than one byte.  This ordering is
  indeed implementation specific.

  A good (or bad) example of this is the way registers are encoded in
  eBPF instructions.  eBPF is the in-kernel virtual machine of Linux,
  and features an ISA with ten general-purpose registers.  eBPF
  instructions generally use two registers, namely the source register
  and the destination register.  Each register is encoded using 4 bits,
  and the fields encoding registers are consecutive in the instructions.

  Typical.  However, the kernel hackers used the following C struct to
  denote BPF instructions:

  ,----
  | struct bpf_insn {
  |   __u8      code;           /* opcode */
  |   __u8      dst_reg:4;      /* dest register */
  |   __u8      src_reg:4;      /* source register */
  |   __s16     off;            /* signed offset */
  |   __s32     imm;            /* signed immediate constant */
  | };
  `----


  The ordering of the sub-byte bit fields (like dst_reg and src_reg) can
  change depending on the endianess, and it is implementation-specific.
  In the case of GCC the source and destination register fields are
  switched depending on the endianness.

  In big-endian systems the order is:

  ,----
  | dst:4 src:4
  `----


  Whereas in little-endian systems the order is:

  ,----
  | src:4 dst:4
  `----


  In Poke, the obvious way of representing data whose structure depends
  on some condition is using an union.  In this case, it could read like
  this:

  ,----
  | type BPF_Insn_Regs =
  |   union
  |   {
  |     struct
  |     {
  |       BPF_Reg src;
  |       BPF_Reg dst;
  |     } le : get_endian == ENDIAN_LITTLE;
  | 
  |     struct
  |     {
  |       BPF_Reg dst;
  |       BPF_Reg src;
  |     } be;
  |   };
  `----


  Note the call to the get_endian function (which takes no arguments and
  thus can be called Algol68-style, without specifying an empty argument
  list) in the constraint of the union alternative.  This way, the
  register fields will have the right order corresponding to the current
  endianness.

  Nifty.  However, there is an ever better way to denote the structure
  of these fields.  This is it:

  ,----
  | type BPF_Insn_Regs =
  |   struct
  |   {
  |     var little_p = (get_endian == ENDIAN_LITTLE);
  |    
  |     BPF_Reg src @ !little_p * 4#b;
  |     BPF_Reg dst @ little_p * 4#b;
  |   };
  `----


  This version, where the ordering of the fields is implemented using
  field labels, is not only more compact, but also has the virtue of not
  requiring additional "intermediate" fields like le and be above.  It
  also shows how convenient can be to declare variables inside structs.