A Short Guide to the GGUF Format

As my recent FOSDEM talk suggests, I find the GGML ecosystem an effective way to introduce system programmers such as myself to the fancy new fashion of AI – meaning transformers and their uses.

For my FOSDEM demo, which ported GGML to my kernel library, I wrote a crude GPT-2 implementation based on GGML's official example.

This demo led me to focus on how real models — like those running on llama.cpp — are actually distributed. The answer is GGUF, GGML's attempt at creating a universal format for distributing models.

I spent a weekend hacking away at GGUF, and here’s what I figured out.

GGUF as a universal format

GGUF is the latest format understood by llama.cpp to load and run models.

Its history points to an organic evolution (GGML, GGMF, GGJT, and now GGUF), and the format is currently at version 3.

The first thing to notice about GGUF is that it is meant to be mmap'd into memory, and thus the data on disk appears in the same order as it does in memory.
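As a minimal sketch of what that means in practice (my own code, not llama.cpp's, with error handling kept short), mapping the file read-only already gives you the in-memory representation:

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a GGUF file read-only; the mapping itself is the in-memory
 * representation of the model. */
static void *map_gguf(const char *path, size_t *size)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }
    *size = (size_t)st.st_size;

    void *base = mmap(NULL, *size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping outlives the file descriptor */
    return base == MAP_FAILED ? NULL : base;
}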

What about endianness? Well, it is implicit. Version three, the current version of the GGUF format, lets data be stored in big-endian, but there is no flag whatsoever to signal this.

Interesting choice, but it does make sense: little-endian is what you expect in 2025. In my recent experience, big-endian machines are either embedded routers or hypotheses.

Having cleared up the on-disk encoding, let's have a look at what a GGUF file actually contains.

GGUF: An overview

In broad strokes, a GGUF file is composed of:

  1. A fixed-size header
  2. A key-value store
  3. A list of typed, named tensors

[Figure: GGUF structure]


Let's go through the file section by section and see how each is actually represented on disk.

The GGUF header

The GGUF header is a fixed-size data structure, and there are no surprises there.

struct gguf_header_t {
     char     magic[4];
     uint32_t version;
     uint64_t tensor_count;
     uint64_t metadata_kv_count;
};

The first four bytes contain the ASCII characters ‘G’, ‘G’, ‘U’, ‘F’ to identify the file as GGUF.

A 32-bit unsigned integer follows to indicate the version. Currently, the latest version is 3.
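As a sketch (mine, not llama.cpp's loader), and assuming a little-endian host, validating the header of an mmap'd file is just a matter of checking the magic and the version:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct gguf_header_t {          /* same layout as shown above */
    char     magic[4];
    uint32_t version;
    uint64_t tensor_count;
    uint64_t metadata_kv_count;
};

static bool gguf_header_ok(const void *base, size_t file_size)
{
    if (file_size < sizeof(struct gguf_header_t))
        return false;

    const struct gguf_header_t *h = base;
    return memcmp(h->magic, "GGUF", 4) == 0 && h->version == 3;
}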

Next come two important fields: tensor_count, the number of tensors stored in the file, and metadata_kv_count, the number of key-value pairs in the metadata section.

How to find tensors and metadata is the core of GGUF parsing, and will be the main topic of this post.

Let's start with the metadata, because it is placed right after the header.

The Key-Value Store (Metadata)

Following the header is a sequence of key-value data. The format is:

struct {
    struct gguf_string  key;
    uint32_t value_type;
    /* Value appended here. */
};

Each metadata entry is named by the string key, and its value has type value_type.

A GGUF string has this format:

struct gguf_string {
    uint64_t len;
    char str[0];
};

Here len is the number of bytes in the string, and str is the string data itself, which immediately follows the len field and is not NUL-terminated.
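A minimal sketch of reading one (the helper name is mine; it assumes a little-endian host and does no bounds checking):

#include <stdint.h>
#include <string.h>

/* Reads the GGUF string at p; returns a pointer just past it. */
static const uint8_t *gguf_read_string(const uint8_t *p,
                                       const char **str, uint64_t *len)
{
    memcpy(len, p, sizeof(*len));   /* 8-byte length prefix */
    p += sizeof(*len);
    *str = (const char *)p;         /* the bytes themselves, no NUL */
    return p + *len;
}

/* Since the string is not NUL-terminated, print it with a bounded
 * format, e.g.: printf("%.*s", (int)len, str); */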

The type is stored in a 32-bit unsigned integer, which currently can be one of the following:

enum gguf_metadata_value_type {
    GGUF_MVT_UINT8 = 0,
    GGUF_MVT_INT8 = 1,
    GGUF_MVT_UINT16 = 2,
    GGUF_MVT_INT16 = 3,
    GGUF_MVT_UINT32 = 4,
    GGUF_MVT_INT32 = 5,
    GGUF_MVT_FLOAT32 = 6,
    GGUF_MVT_BOOL = 7,
    GGUF_MVT_STRING = 8,
    GGUF_MVT_ARRAY = 9,
    GGUF_MVT_UINT64 = 10,
    GGUF_MVT_INT64 = 11,
    GGUF_MVT_FLOAT64 = 12,
};

Most of these types should be self-descriptive, but two need a bit more explanation: bool and array.

bool is a one-byte value, where zero is false.

array is what makes metadata parsing complicated.

When an element is described as an array, the following structure is appended:

struct gguf_array {
    uint32_t type;
    uint64_t len;
};
/* 'len' elements of type 'type' follow. */

type is, once again, described by the enum gguf_metadata_value_type above.

What is complicated about this, you ask? Well, the type of an array can be GGUF_MVT_ARRAY, so you can have multi-dimensional arrays described in the metadata.

Powerful, but requires care during parsing.
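As an illustration of that care, here is a sketch of a value-skipping helper (my own, not llama.cpp's; it reuses the enum above, assumes a little-endian host, and trusts the file to be well-formed). Scalars have fixed sizes, strings carry their length, and arrays recurse because the element type may itself be an array:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Skips the metadata value at p, given its gguf_metadata_value_type,
 * and returns a pointer just past it. */
static const uint8_t *skip_value(const uint8_t *p, uint32_t type)
{
    /* Sizes of the scalar types, indexed by the enum above. */
    static const size_t scalar_size[] = {
        1, 1,       /* UINT8, INT8 */
        2, 2,       /* UINT16, INT16 */
        4, 4, 4,    /* UINT32, INT32, FLOAT32 */
        1,          /* BOOL */
        0, 0,       /* STRING, ARRAY: handled below */
        8, 8, 8,    /* UINT64, INT64, FLOAT64 */
    };

    if (type == GGUF_MVT_STRING) {
        uint64_t len;
        memcpy(&len, p, sizeof(len));
        return p + sizeof(len) + len;
    }

    if (type == GGUF_MVT_ARRAY) {
        uint32_t elem_type;
        uint64_t n;
        memcpy(&elem_type, p, sizeof(elem_type));
        memcpy(&n, p + sizeof(elem_type), sizeof(n));
        p += sizeof(elem_type) + sizeof(n);
        for (uint64_t i = 0; i < n; i++)
            p = skip_value(p, elem_type);   /* recurses for nested arrays */
        return p;
    }

    return p + scalar_size[type];
}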

The number of metadata elements in the file is specified by the GGUF header field metadata_kv_count.

After this, the tensor store begins.

The Tensor Store

Tensors are stored in two separate structures: the Tensor Info Array and the Tensor Data.

The Tensor Info Array starts right at the end of the metadata, and is a sequence of entries, one per tensor.
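From my reading of the spec and of llama.cpp's loader, each entry has this layout (pseudo-C, since the length of the dimensions array depends on the preceding field):

struct gguf_tensor_info {
    struct gguf_string name;   /* the tensor's name */
    uint32_t n_dimensions;     /* number of dimensions */
    uint64_t dimensions[0];    /* n_dimensions entries */
    uint32_t type;             /* ggml tensor type (F32, F16, quantized, ...) */
    uint64_t offset;           /* offset into the tensor data section */
};

Note that offset is relative to the start of the tensor data section, not to the start of the file.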

And here lies the real surprise of the format: the tensor data alignment.

The Tensor Data

The tensor data does not start immediately after the tensor info: it can be aligned to a specific boundary. This is important, I believe, because aligned data speeds up certain instructions (AVX, for example) and in some cases may even be required. Aligning the data in the file lets it stay aligned when the file is mmap'd.

There's a caveat, though: the alignment value is stored in the metadata, as a 32-bit unsigned integer under the key general.alignment. If that key is not present, the default alignment is 32.

Tensor data starts at the next alignment boundary from the end of the tensor info array.

After this simple calculation, the rest is easy: each tensor info entry includes an offset, and this offset (which must itself be aligned) is added to the start of the tensor data to locate the actual tensor.
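Putting the arithmetic together, a minimal sketch (the names are my own: info_end is the file offset just past the tensor info array, alignment is general.alignment or its default of 32, and tensor_offset comes from the tensor's info entry):

#include <stdint.h>

/* Rounds offset up to the next multiple of alignment. */
static uint64_t align_up(uint64_t offset, uint64_t alignment)
{
    return offset + (alignment - offset % alignment) % alignment;
}

/* Address of a tensor inside the mmap'd file. */
static const void *tensor_ptr(const uint8_t *base, uint64_t info_end,
                              uint64_t alignment, uint64_t tensor_offset)
{
    uint64_t tensor_data_start = align_up(info_end, alignment);
    return base + tensor_data_start + tensor_offset;
}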

We have now described how to scan the header, read the metadata, and retrieve the tensors. This is all that there is in a GGUF file.

Conclusions

The GGUF format was in a way a pleasant surprise. It's a simple, binary format that is almost self-explanatory.

There are things I would have done differently, though. Here is what left me a bit puzzled:

- Tensor data alignment value in metadata. I think that putting a file's structural information inside a high-level data structure (general.alignment) rather than in a quickly accessible field is a suboptimal choice. But then again, I don't know how this file format evolved, and, as with everything in software engineering, certain choices make sense only when seen through the lens of historical evolution.

- Everything is serial. In order to find the tensor info array, or the start of the tensor data, we have to scan everything before it, including parsing the metadata.

The picture of the perfect GGUF variant in my head is something with this header format:

struct gguf_header_t {
     char     magic[4];
     uint32_t version;
     uint64_t tensor_offset;      /* NEW */
     uint64_t tensor_count;
     uint64_t metadata_kv_offset; /* NEW */
     uint64_t metadata_kv_count;
     uint64_t tensor_data_offset; /* NEW */
};

I.e., adding three offsets to the header, indicating the start of each section: tensor info, tensor data and metadata.

This would have allowed for arbitrary alignment to be implicitly supported without the need for special metadata keys. It would also allow any scanner to quickly find the section it is searching for.

Moreover, this would allow a more flexible layout regarding the ordering of the sections, without complicating the creation and saving of the file.

In any case, nothing is perfect in this world, and all in all I am pleasantly surprised by the simplicity and expandability of this format.