jordansissel edited this page May 15, 2011 · 2 revisions

RPM Internals

After many hours of reading librpm's code (which is pretty horrible, by the way) and then reading http://www.rpm.org/max-rpm/s1-rpm-file-format-rpm-file-format.html, I think I figured out how rpms are structured.

An rpm has three main sections: the lead, the header(s), and the payload.

The lead

This section of the file isn't really used much anymore, though it is required to exist. Much of the data in the lead is superseded by the headers (next section).

The lead is 96 bytes, starts at offset 0 of the file, and goes like this:

  • 4 bytes, the magic. hex values ED AB EE DB
  • 1 byte, number, the 'major' version of this rpm file
  • 1 byte, number, the 'minor' version of this rpm file
  • 2 bytes, short, the 'type' of this rpm (usually 0 == binary rpm, 1 == source rpm)
  • 2 bytes, short, architecture number. This isn't really used anymore. No idea what it means.
  • 66 bytes, string, the package name. Any unused bytes are nulls.
  • 2 bytes, short, 'os number', though I don't know what it really means. 0x01 probably means linux, though this is superseded by headers below.
  • 2 bytes, short, signature type. This is always 5 (0x05) if there is a signature (I haven't found an rpm without this section yet).
  • 16 bytes, reserved for future use (currently unused).
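
The layout above maps fairly directly onto a struct format string. Here is a minimal sketch of reading the lead in Python. It assumes all multi-byte numbers are big-endian (network byte order), which the list above doesn't state, and the names `parse_lead`, `LEAD_FORMAT`, and the dictionary keys are made up for illustration.

```python
import struct

# Assumption: big-endian, no padding. Fields in the same order as the list above.
LEAD_FORMAT = ">4sBBhh66shh16s"
LEAD_SIZE = struct.calcsize(LEAD_FORMAT)  # 4+1+1+2+2+66+2+2+16 = 96
LEAD_MAGIC = bytes([0xED, 0xAB, 0xEE, 0xDB])


def parse_lead(data):
    """Parse the 96-byte lead found at offset 0 of an rpm file."""
    if len(data) < LEAD_SIZE:
        raise ValueError("lead is truncated")
    (magic, major, minor, rpm_type, arch,
     name, os_num, sig_type, _reserved) = struct.unpack(LEAD_FORMAT, data[:LEAD_SIZE])
    if magic != LEAD_MAGIC:
        raise ValueError("bad lead magic, probably not an rpm")
    return {
        "major": major,
        "minor": minor,
        "type": rpm_type,
        "arch": arch,
        # The name is padded with nulls; keep only what comes before the first one.
        "name": name.split(b"\0", 1)[0].decode("ascii", "replace"),
        "os": os_num,
        "signature_type": sig_type,
    }
```

Typical use would be something like `parse_lead(open(path, "rb").read(96))`, where `path` is whatever rpm you have lying around.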

Headers

The header starts right after the lead. The header itself has three parts: the header-header (sigh), the "tags", and the "data".

I have seen RPMs with two header sections; why two? I think the first header is dedicated to signing/integrity checks (md5, dsa, sha1, etc.), and the second one carries the actual package metadata. Not sure... Anyway, you should expect two header sections, meaning two sections that each start with the 'magic' below.

A header section starts with a 16 byte ... header ...:

  • 8 bytes, the header magic value: 8E AD E8 01 00 00 00 00 (3 bytes of magic, a 1-byte version, and 4 reserved bytes)
  • 4 byte 'tag count'
  • 4 byte 'data length'
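
A rough sketch of reading this 16-byte header-header in Python. Assumptions: the two 4-byte values are unsigned and big-endian, and the magic constant uses 8E as its first byte, which is the value rpm itself writes. The function name `parse_header_header` is made up.

```python
import struct

# 3 bytes of magic (8E AD E8), a version byte (01), then 4 reserved zero bytes.
HEADER_MAGIC = bytes([0x8E, 0xAD, 0xE8, 0x01, 0x00, 0x00, 0x00, 0x00])
HEADER_HEADER_SIZE = 16


def parse_header_header(data, offset=0):
    """Return (tag_count, data_length) for the header section starting at offset."""
    if len(data) < offset + HEADER_HEADER_SIZE:
        raise ValueError("header-header is truncated")
    if data[offset:offset + 8] != HEADER_MAGIC:
        raise ValueError("bad header magic at offset %d" % offset)
    # Assumption: both values are unsigned 32-bit big-endian integers.
    tag_count, data_length = struct.unpack_from(">II", data, offset + 8)
    return tag_count, data_length
```

For the first header section, `offset` would be 96, since the lead takes up the first 96 bytes of the file.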

After this 16-byte header-header comes the 'tags' sequence. Each 'tag' is four 4-byte values (16 bytes total), laid out as listed below. The full length (in bytes) of the tags sequence is 'tag count' * 16.

  • 4 bytes, integer, tag id
  • 4 bytes, integer, data type
  • 4 bytes, integer, offset in data
  • 4 bytes, integer, count (instance count of 'data type')
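
Reading the tags is then just a loop over 16-byte records. A small sketch, again assuming unsigned big-endian integers; `Tag` and `parse_tags` are names I made up for this example.

```python
import struct
from collections import namedtuple

# One entry per tag, fields in the same order as the list above.
Tag = namedtuple("Tag", ["tag_id", "data_type", "offset", "count"])
TAG_SIZE = 16


def parse_tags(data, tag_count, offset=0):
    """Read tag_count 16-byte tag entries from data, starting at offset."""
    tags = []
    for i in range(tag_count):
        # Assumption: four unsigned 32-bit big-endian integers per tag.
        fields = struct.unpack_from(">IIII", data, offset + i * TAG_SIZE)
        tags.append(Tag(*fields))
    return tags
```

Here `offset` is where the tags sequence starts (right after the 16-byte header-header), and `tag_count` comes from the header-header.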

The 'data type' can be 1 (char), 2 (int8), 3 (int16), 4 (int32), 5 (int64), 6 (string, null terminated), 7 (binary), 8 (string array), or 9 (i18n string).

Count is how many values of the data type are in this tag. A data type of int32 and a count of 8 means there are 8 int32s for this tag. For a string array, it's how many strings to read. Strings are null-terminated, and otherwise you are given no hints about the length of each string.

The offset is an offset into the data segment of this header (not of the file). The data segment starts right after the tags sequence and is 'data length' bytes long.
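
Putting data type, count, and offset together, decoding one tag's value could look roughly like this. It's a sketch under a few assumptions of mine: integers are unsigned and big-endian, i18n strings are stored like string arrays (count null-terminated strings), and strings decode as UTF-8. `decode_value` and `read_cstrings` are made-up names, and `segment` is the data segment of one header as bytes.

```python
import struct

# data type number -> struct code (assumed unsigned)
INT_FORMATS = {2: "B", 3: "H", 4: "I", 5: "Q"}


def read_cstrings(segment, offset, count):
    """Read count null-terminated strings starting at offset."""
    strings = []
    for _ in range(count):
        end = segment.index(b"\0", offset)  # raises ValueError if no terminator
        strings.append(segment[offset:end].decode("utf-8", "replace"))
        offset = end + 1
    return strings


def decode_value(segment, data_type, offset, count):
    """Decode the value a tag points at, using its type, offset, and count."""
    if data_type in INT_FORMATS:
        fmt = ">%d%s" % (count, INT_FORMATS[data_type])
        return list(struct.unpack_from(fmt, segment, offset))
    if data_type in (1, 7):  # char and binary: count raw bytes
        return segment[offset:offset + count]
    if data_type == 6:  # a single null-terminated string
        return read_cstrings(segment, offset, 1)[0]
    if data_type in (8, 9):  # string array, i18n string
        return read_cstrings(segment, offset, count)
    raise ValueError("unknown data type %d" % data_type)
```

Note that string lengths are only discovered by scanning for the null byte, which matches the "no hints" point above.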

Payload

The payload is just that: the rest of the file. The headers contain two tags that describe its format. Tag id 1124 (payload format) can be "cpio", "tar", etc., and tag id 1125 (payload compressor) can be "gzip", "xz", etc.
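
Once you have the value of tag 1125, picking a decompressor is a simple lookup. A hedged sketch using only Python's standard library: it covers just the two compressor names mentioned above, and `decompress_payload` is a made-up helper, not anything from librpm. What you get back is still an archive in whatever format tag 1124 names (e.g. cpio), which this sketch does not unpack.

```python
import gzip
import lzma

# Only the compressor names mentioned above; other values exist but aren't handled here.
DECOMPRESSORS = {
    "gzip": gzip.decompress,
    "xz": lzma.decompress,
}


def decompress_payload(payload, compressor):
    """Decompress the raw payload bytes using the name found in tag 1125."""
    try:
        decompress = DECOMPRESSORS[compressor]
    except KeyError:
        raise ValueError("unsupported payload compressor: %r" % compressor)
    return decompress(payload)
```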