Kaitai Struct bits int fuzzer

A simple fuzzing system intended to rigorously test methods for reading bit-sized integers in all runtime libraries that Kaitai Struct supports, namely:

List of read_bits_int_{be,le} implementations in all runtime libraries

C++/STL - cpp_stl_98 , cpp_stl_11
- read_bits_int_be
- read_bits_int_le
C# - csharp
- ReadBitsIntBe
- ReadBitsIntLe
Go - go
- ReadBitsIntBe
- ReadBitsIntLe
Java - java
- readBitsIntBe
- readBitsIntLe
JavaScript - javascript
- readBitsIntBe
- readBitsIntLe
Lua - lua
- read_bits_int_be
- read_bits_int_le
Nim - nim
- readBitsIntBe
- readBitsIntLe
Perl - perl
- read_bits_int_be
- read_bits_int_le
PHP - php
- readBitsIntBe
- readBitsIntLe
Python - python
- read_bits_int_be
- read_bits_int_le
Ruby - ruby
- read_bits_int_be
- read_bits_int_le

Project structure

gen.ipynb (Jupyter Notebook) - 1. Generating tests
test.sh - run generated tests in the specified language (Bash script) - 2. Running tests

How it works

1. Generating tests (`gen.ipynb`)

Bit layouts

Configuration:

Name	Description	Recommended value
`MAX_BITS`	Maximum number of bits for a single bit integer (i.e. max value of the argument to the `read_bits_int{be,le}` method).	`64` for most languages, `32` for JavaScript
`TOTAL_BITS`	Total size in bits of the entire layout. Must be divisible by 8 (i.e. 8, 16, 24, 32, 40, etc.)	`MAX_BITS + 8` (works well with `FIELDS = 3`)
`FIELDS`	Number of fields in the layout.	`3`

Function get_bit_layouts will enumerate all bit layouts matching the parameters above. However, there will usually be too many of these, much more than is needed to reliably reveal bugs in the bits int reading functions.

If we make an assumption that all spontaneous errors in the implementations only affect readings of ints with bit width close to the limit (mainly MAX_BITS, less often MAX_BITS - 1), we can substantially reduce the number of bit layouts by allowing only those that contain fields with potentially problematic sizes:

bit_layouts_all = get_bit_layouts(FIELDS, MAX_BITS, TOTAL_BITS)
bit_layouts = list(filter(lambda bl: (MAX_BITS in bl) or (MAX_BITS - 1 in bl), bit_layouts_all))

Using the recommended configuration, this will select 57 layouts from a total of 2593. These 57 layouts will generate 2688 tests¹. If this is too much, exclude layouts only containing the MAX_BITS - 1 field:

bit_layouts = list(filter(lambda bl: (MAX_BITS in bl), bit_layouts_all))

2688 tests (for both MAX_BITS and MAX_BITS - 1 fields) doesn't sound like much, but in compiled languages (C++/STL, Nim):

every test is compiled to a large executable or object file (C++/STL: ~0.6 MiB ".o" object file, Nim: ~0.2 MiB ".exe" executable on Windows), so thousands of tests require lots of disk memory (e.g. in C++/STL, so 2688 object files alone need 1.6 GiB),
take tens of minutes to compile - you'll notice if you wait 40 minutes (for 2688 tests) or 20 minutes (for 1248 tests) in C++/STL per endianness.

Excluding the MAX_BITS - 1 field reduces the number to 1248 tests (27 layouts), which keep these problematic factors within reasonable limits: building tests in C++/STL takes 18 minutes for a single bit endianness (double for both endians); object files are 730 MiB and the resulting executable ks_tests takes 166 MiB, so it fits into 1 GiB nicely.

However, it is important to eventually test bit layouts with the MAX_BITS - 1 field as well, because they can also detect bugs (especially in languages where signed bit shifts are used).

Fill patterns

A bit layout only says how to split TOTAL_BITS into several summands (number of summands is FIELDS); now we have to fill the fields with actual values. Some bugs may only occur when using a certain combination of fillings. Since it is not obvious what combinations will trigger bugs, all combinations of predefined fill patterns are generated. Default fillings look like this:

{0: [''],
 1: ['0', '1'],
 2: ['00', '01', '10', '11'],
 3: ['000', '011', '100', '111'],
 4: ['0000', '0111', '1000', '1111'],
 5: ['00000', '01111', '10000', '11111'],
 6: ['000000', '011111', '100000', '111111'],
 # ...
}

For illustration, there are 16 (= 4 * 4) filling combinations for the bit layout (5, 3):

Filling combinations for layout (5, 3)

[((0, '00000'), (0, '000')),
 ((0, '00000'), (1, '011')),
 ((0, '00000'), (2, '100')),
 ((0, '00000'), (3, '111')),

 ((1, '01111'), (0, '000')),
 ((1, '01111'), (1, '011')),
 ((1, '01111'), (2, '100')),
 ((1, '01111'), (3, '111')),

 ((2, '10000'), (0, '000')),
 ((2, '10000'), (1, '011')),
 ((2, '10000'), (2, '100')),
 ((2, '10000'), (3, '111')),

 ((3, '11111'), (0, '000')),
 ((3, '11111'), (1, '011')),
 ((3, '11111'), (2, '100')),
 ((3, '11111'), (3, '111'))]

Test generation

Please configure the target directory:

target_dir = Path("""R:\Temp""")

R:\Temp is just what works for me when using ImDisk Toolkit on Windows to create a 1 GiB ramdisk. You can set it to any existing directory.

The generated directory structure will look like this:

{target_dir}/
 ├── ksy/
 │   ├── be/
 │   |   ├── b0b8b64.ksy
 │   |   ├── b0b64b8.ksy
 │   |   ├── b1b7b64.ksy
 │   |   ├── ⋮
 │   |   └── b64b8b0.ksy
 │   └── le/
 │       ├── b0b8b64.ksy
 │       ├── ⋮
 │       └── b64b8b0.ksy
 ├── kst/
 │   ├── be/
 │   │   ├── b0b8b64_v0x0x0.kst
 │   │   ├── b0b8b64_v0x0x1.kst
 │   │   ├── ⋮
 │   │   ├── b0b8b64_v0x3x3.kst
 │   │
 │   │   ├── b0b64b8_v0x0x0.kst
 │   │   ├── ⋮
 │   │   ├── b0b64b8_v0x3x3.kst
 │   │
 │   │   ├── b1b7b64_v0x0x0.kst
 │   │   ├── ⋮
 │   │   ├── b1b7b64_v1x3x3.kst
 │   │
 │   │   ├── ⋮
 │   │   └── b64b8b0_v3x3x0.kst
 │   └── le/
 │       ├── b0b8b64_v0x0x0.kst
 │       ├── ⋮
 │       └── b64b8b0_v3x3x0.kst
 └── src/
     ├── 00_00_00_00_00_00_00_00_00.bin
     ├── 00_00_00_00_00_00_00_00_01.bin
     ├── 00_00_00_00_00_00_00_00_0f.bin
     ├── ⋮
     ├── ff_ff_ff_ff_ff_ff_ff_ff_fe.bin
     └── ff_ff_ff_ff_ff_ff_ff_ff_ff.bin

Directory ksy/ contains .ksy specifications of bit layouts (there are subfolders be/ and le/ for each meta/bit-endian value). For example, here are the contents of ksy/le/b0b8b64.ksy:

meta:
  id: b0b8b64
  bit-endian: le
seq:
  - id: a
    type: b0
  - id: b
    type: b8
  - id: c
    type: b64

Directory kst/ is intended for KST test specs that KST translator translates into unit test modules for a particular target language. A typical .kst spec looks like this (kst/be/b1b7b64_v1x2x1.kst):

id: b1b7b64
data: c0_7f_ff_ff_ff_ff_ff_ff_ff.bin
asserts:
  - actual: a
    expected: true
  - actual: b
    expected: 0x40
  - actual: c
    expected: 0x7fff_ffff_ffff_ffff

The data key specifies what binary file from the src/ directory to use. Names of src/ files exactly reflect their contents².

2. Running tests (`test.sh`)

First, configure the base directories (see the top of test.sh):

o_base=/r/Temp
l_base=/c/temp/kaitai_struct/tests

o_base is the output base directory, it must be in sync with what you set in fuzz.ipynb (section Test generation).
l_base is the location of the tests directory in the full https://github.com/kaitai-io/kaitai_struct repository checkout. In the tests repository, the branch generalmimon:ks-bits-fuzzer must be checked out.

Usage:

./test.sh <language>

Valid <language> identifiers are listed at the top of this README - expand the "list of runtime libraries" section.

The test.sh script compiles and runs tests generated previously by gen.ipynb in the specified language. This is done for each bit endianness individually. It happens in these steps:

Set up directory symlinks in l_base directory (kaitai_struct_tests repo) pointing to subdirectories of o_base.³

For target languages where testing depends on project configuration files or bootstrap files: if $l_base/spec/<language> is a regular directory (not a symlink), recursively copy any files other than test specs from here to $o_base/spec/<language> before replacing the directory with a symlink.
Use kaitai-struct-compiler to translate .ksy files to parsing modules into the $o_base/compiled/<language>/ directory.
Use KST translator to translate .kst specs ($o_base/kst/{be,le}/*.kst) to unit test modules, stored in $o_base/spec/<language>/.
Launch the ./ci-<language> script, which takes the source code of generated parsers, unit tests and the runtime library and finally builds (if source code needs to be compiled in the particular language) and runs the tests.

The auxiliary project files mentioned in step 0 sometimes require manual intervention. On the first run, everything should work, provided that the branch generalmimon:ks-bits-fuzzer is checked out in the $l_base/tests repository. However, if you have already run the tests and then deleted the directory with test specs, the project files are deleted too. At this point, if you want to run the tests again, you have to go to /tests directory and discard changes to spec/<language> folder using Git so that the project files are restored and can be used again.

BTW, this can be easily calculated. In these 57 layouts, 12 layouts have a 1-bit field (2 possible fillings: '0' and '1') and 12 layouts contain a 0-bit field (only 1 possible filling: '') :
```
>>> (57 - 12 - 12) * (4**3) + 12 * (2 * 4**2) + 12 * (1 * 4**2)
2688
```
↩
If KaitaiStream objects were directly created from byte arrays, the entire src/ folder could be eliminated (and the possible inefficiency of having to open and read from actual files too), but KST translator currently only supports stream instantiation from a file. It could be modified to support instantiation from literal byte arrays, but I decided to take the path of least resistance and reuse the existing infrastructure as much as possible with minimal modifications.

In some languages, I encountered "too many open files" errors when exceeding a certain number of tests, but these can and should be fixed by ensuring that files are closed as soon as they're no longer needed (this has been already done for Lua: https://github.com/generalmimon/kaitai_struct_tests/commit/eb49372d96c449328da730a0c130b7f06340beb4). Other than that, not even a hundred thousand files (required by ~125k tests) were a problem on a ramdisk. ↩
This is to allow having the output o_base directory on a ramdisk, while the main kaitai_struct_tests repository is on a persistent hard drive. Symlinks satisfy almost all test scripts and project files that rely on hardcoded paths and avoid most trouble related to changing the paths, having to refactor them to variables/arguments, etc. ↩

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
gen.ipynb		gen.ipynb
test.sh		test.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

gen.ipynb

gen.ipynb

test.sh

test.sh

Repository files navigation

Kaitai Struct bits int fuzzer

Project structure

How it works

1. Generating tests (`gen.ipynb`)

Bit layouts

Fill patterns

Test generation

2. Running tests (`test.sh`)

About

Releases

Packages

Languages

License

generalmimon/ks-bits-fuzzer

Folders and files

Latest commit

History

Repository files navigation

Kaitai Struct bits int fuzzer

Project structure

How it works

1. Generating tests (gen.ipynb)

Bit layouts

Fill patterns

Test generation

2. Running tests (test.sh)

Footnotes

About

Topics

Resources

License

Stars

Watchers

Forks

Languages

1. Generating tests (`gen.ipynb`)

2. Running tests (`test.sh`)