Pokology - a community-driven site around GNU poke

_____ ---' __\_______ ______)

Tips and Tricks

__) __) ---._______) Table of Contents _________________ 1. Sequence of bytes ended by two consecutive zero bytes 2. Sub-byte endianness swaps 1 Sequence of bytes ended by two consecutive zero bytes ======================================================= So today I was writing a pickle for the BER encoding of ASN-1 and came with an interesting problem: when the length of the contents of a BER data value is indefinite (not explicitly specified) we shall read bytes until we find an end marker that consists on two consecutive 0 bytes. Sounds simple enough... but I really had to think for a while before coming to this: ,---- | type BER_Variable_Contents = | struct | { | type Datum = | union | { | byte[2] pair : pair[1] != 0UB; | byte single : single != 0UB; | }; | | Datum[] data; | byte[2] end : end == [0UB,0UB]; | }; `---- How does this work? First, notice that there must be at least two bytes in the IO space, which would correspond to the two 0 bytes that terminate the sequence. Otherwise it would be an integrity error. When it reaches the data field of the struct, which is an unbounded array, poke will decode Datum=s and add them to the array =data until either a constraint fails or the end of the IO space is reached. So it will first try to decode a pair of bytes in which the second byte is not zero. If that fails (due to either a constraint error or EOF) then it will try to map a single byte that is not zero. If that also fails, poke is done with data and then it expects the two zero bytes in end. This is how the resulting structures look like: ,---- | (poke) .mem | (poke) byte[] @ 0#B = ['a','b','c','d','e'] | (poke) BER_Variable_Contents @ 0#B | BER_Variable_Contents { | data=[Datum { | pair=[0x61UB,0x62UB] | },Datum { | pair=[0x63UB,0x64UB] | },Datum { | single=0x65UB | }], | end=[0x0UB,0x0UB] | } `---- A suitable method could be added to the struct type in order to get an array with all the bytes: ,---- | method get_bytes = byte[]: | { | var bytes = byte[](); | for (d in data) | try bytes += d.pair; | catch if E_elem { bytes += [d.single]; } | return bytes; | } `---- In action: ,---- | (poke) (BER_Variable_Contents @ 0#B).get_bytes | [0x61UB,0x62UB,0x63UB,0x64UB,0x65UB] `---- 2 Sub-byte endianness swaps =========================== Byte-endianness is typically handled quite transparently in poke by using integral types and integral structs. However, endianness can have an impact on the ordering of C bit-fields in a structure, even when the bit-fields are smaller than one byte. This ordering is indeed implementation specific. A good (or bad) example of this is the way registers are encoded in eBPF instructions. eBPF is the in-kernel virtual machine of Linux, and features an ISA with ten general-purpose registers. eBPF instructions generally use two registers, namely the source register and the destination register. Each register is encoded using 4 bits, and the fields encoding registers are consecutive in the instructions. Typical. However, the kernel hackers used the following C struct to denote BPF instructions: ,---- | struct bpf_insn { | __u8 code; /* opcode */ | __u8 dst_reg:4; /* dest register */ | __u8 src_reg:4; /* source register */ | __s16 off; /* signed offset */ | __s32 imm; /* signed immediate constant */ | }; `---- The ordering of the sub-byte bit fields (like dst_reg and src_reg) can change depending on the endianess, and it is implementation-specific. In the case of GCC the source and destination register fields are switched depending on the endianness. In big-endian systems the order is: ,---- | dst:4 src:4 `---- Whereas in little-endian systems the order is: ,---- | src:4 dst:4 `---- In Poke, the obvious way of representing data whose structure depends on some condition is using an union. In this case, it could read like this: ,---- | type BPF_Insn_Regs = | union | { | struct | { | BPF_Reg src; | BPF_Reg dst; | } le : get_endian == ENDIAN_LITTLE; | | struct | { | BPF_Reg dst; | BPF_Reg src; | } be; | }; `---- Note the call to the get_endian function (which takes no arguments and thus can be called Algol68-style, without specifying an empty argument list) in the constraint of the union alternative. This way, the register fields will have the right order corresponding to the current endianness. Nifty. However, there is an ever better way to denote the structure of these fields. This is it: ,---- | type BPF_Insn_Regs = | struct | { | var little_p = (get_endian == ENDIAN_LITTLE); | | BPF_Reg src @ !little_p * 4#b; | BPF_Reg dst @ little_p * 4#b; | }; `---- This version, where the ordering of the fields is implemented using field labels, is not only more compact, but also has the virtue of not requiring additional "intermediate" fields like le and be above. It also shows how convenient can be to declare variables inside structs.