Updated: July 28th, 2025
Chris Rohlf
If you're not familiar with GGML, it is the tensor processing library
that powers inference in llama.cpp and other open source projects.
A tensor is a multi-dimensional array of numerical values that generalizes scalars (0 dimensions), vectors (1 dimension),
and matrices (2 dimensions) to an arbitrary number of dimensions (the GGML library supports up to 4). Some tensors are stored
on disk with the model weights (in GGUF format) and others are created on the fly during the tokenization and inference
process. GGML stores all of these tensors in a Directed Acyclic Graph (DAG) that it compiles statically. Most nodes in the
graph contain a GGML operation, for example GGML_OP_MUL_MAT for matrix multiplication. These operations are executed
across the graph using the various backends (CPU, Apple Metal, CUDA, etc.) that GGML supports. Some nodes on this graph might use the
CPU backend while others are intended for a CUDA or Metal backend. The built-in GGML RPC server allows you to distribute this
work across multiple backends. All of this is configurable using llama-cli or other llama.cpp frontends.
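To make the graph model concrete, here is a minimal sketch of building and computing a one-node GGML graph on the CPU backend. The shapes and memory size are arbitrary choices for illustration, and the header layout varies between GGML versions (recent trees declare ggml_graph_compute_with_ctx in ggml-cpu.h):

#include "ggml.h"
#include "ggml-cpu.h" // ggml_graph_compute_with_ctx lives here in recent trees

int main(void) {
    // Allocate a context backed by a fixed arena for tensor metadata and data
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // Two 2-dimensional tensors (matrices); ne[0] must match for mul_mat
    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 3);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 2);

    // Creates a graph node whose op is GGML_OP_MUL_MAT
    struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);

    // Build the DAG that ends at 'c' and execute it on the CPU backend
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/1);

    ggml_free(ctx);
    return 0;
}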
The GGML maintainers are aware of, and have documented, the
lack of security on the RPC server. At the time of this writing, security advisories are no longer being created for vulnerabilities
in the server until there is higher confidence in the security of the implementation. As a side project I thought it might
be fun to explore the (in)security of the server and mitigate the issues I find. Most of this research was assisted by
OpenAI's o4-mini model, which significantly sped up the analysis but missed even obvious security issues.
I am not the first person to look at this code from a security perspective. Ruikai "Patrick" Peng
published a great writeup on his work exploiting a remotely reachable
heap overflow in the implementation. There are other advisories
affecting the RPC server as well.
The code doesn't use a standard RPC library like gRPC or Thrift; instead it uses packed C structures with the
naming convention rpc_msg_<cmd_name>_req and rpc_msg_<cmd_name>_rsp. The majority of these
message types are simple integer fields, with the exception of rpc_tensor, which has a slightly more complex
structure. The protocol itself is rather simple on the wire: the first byte sent is always the command type, followed
by the message size, and then the message body. For fixed size commands there is validation that the size field matches
the size of the message type. For variable sized messages the size is trusted implicitly and is usually used to resize
a std::vector which will hold the data. RPC responses are very similar: a size followed by the response data, if any.
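As a concrete sketch of that framing, the struct below mirrors the shape of the real request messages in ggml-rpc.cpp, but the send_rpc_cmd helper is mine and is not part of llama.cpp:

#include <cstdint>
#include <unistd.h>

// RPC request: | rpc_cmd (1 byte) | msg_size (8 bytes) | msg_body (msg_size bytes) |
#pragma pack(push, 1)
struct rpc_msg_buffer_get_base_req {
    uint64_t remote_ptr; // buffer handle previously returned by the server
};
#pragma pack(pop)

// Hypothetical client-side helper that frames a single request on the wire
static bool send_rpc_cmd(int sockfd, uint8_t cmd, const void * body, uint64_t size) {
    if (write(sockfd, &cmd, 1) != 1) {
        return false;
    }
    if (write(sockfd, &size, sizeof(size)) != (ssize_t) sizeof(size)) {
        return false;
    }
    return size == 0 || write(sockfd, body, size) == (ssize_t) size;
}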
Overall, the serialization and deserialization of messages is simple, so the implementation is mostly sane. It is more likely
that there are security vulnerabilities deeper in the backend code which are reachable via RPC. It is the design
of the RPC server that is most insecure, as it lacks authentication, sandboxing, and other basic security controls.
After reading the RPC server code I wondered if anyone was fuzzing the implementation. There are llama.cpp
fuzzers in the OSSFuzz repository but none of them target the GGML RPC layer. After carefully reading the
rpc_serve_client function I realized that fuzzing this loop locally should be trivial. Fuzzing a
server that is designed to take inputs from a network protocol can be tricky depending on the design. The GGML RPC main
server loop has a convenient function prototype that takes a socket descriptor and enters a while loop that
calls send and recv on that socket. This means with some minor modifications we should be able to
call socketpair to generate a socket file descriptor for both ends of the connection, write to it, and then
pass the server end to rpc_serve_client where RPC commands will be processed.
The simplified fuzzing loop looks something like this (shortened for brevity and commented):
rpc_server_params params;
ggml_backend_t backend;
ggml_backend_reg_t reg;
void (*start_server_fn)(ggml_backend_t backend, const char *cache_dir,
                        sockfd_t sockfd, size_t free_mem, size_t total_mem);

extern "C" int LLVMFuzzerInitialize(int *argc, char ***argv) {
    // Load all GGML backends and create one for RPC
    ggml_backend_load_all();
    backend = create_backend(params);
    reg = ggml_backend_reg_by_name("RPC");
    // Get a function pointer to the main server loop
    // We create this stub function separately in ggml-rpc.cpp
    start_server_fn = (decltype(fuzz_rpc_serve_client)*)
        ggml_backend_reg_get_proc_address(reg, "fuzz_rpc_serve_client");
    return 0;
}

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
    int sv[2];
    // Create the local socket pair using AF_UNIX
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        return 0;
    }
    // Ensure the first command is the HELLO command per the server implementation
    uint64_t payload_len = 0;
    const int RPC_CMD_HELLO = 14;
    // Write the HELLO command and the fuzzed message according to the spec:
    // RPC request : | rpc_cmd (1 byte) | request_size (8 bytes) | request_data (request_size bytes) |
    // RPC response: | response_size (8 bytes) | response_data (response_size bytes) |
    write(sv[1], &RPC_CMD_HELLO, 1);
    write(sv[1], &payload_len, sizeof(payload_len));
    write(sv[1], Data, Size);
    // Call shutdown on the socket so that the last recv() sees EOF and doesn't block
    shutdown(sv[1], SHUT_WR);
    // Invoke the server loop and pass it one of the socket descriptors
    try {
        start_server_fn(backend, nullptr, sv[0], 0, 0);
    } catch (...) {
    }
    // Close the socket pair
    close(sv[0]);
    close(sv[1]);
    return 0;
}

Once the rpc_serve_client function enters the loop it will call recv and process the fuzzer-produced data
as RPC commands until it reaches EOF. This all happens within the same process, which allows libFuzzer to trace the code
branches and mutate the inputs, ensuring new code paths are discovered. The full commit, including the fuzzer and the
CMake changes required to compile it, can be found here.
The fuzzer lives in the rpc-server.cpp tool file that ships with llama.cpp, but it is only
conditionally compiled when -DGGML_SANITIZE_FUZZER=ON is passed to CMake. Compiling the fuzzer is simple if
your clang toolchain has libFuzzer support. On macOS you can install a version of LLVM/clang that has support
using Homebrew, and then run the following commands in the llama.cpp directory:
$ cmake -B build/ -DCMAKE_BUILD_TYPE=Debug -DLLAMA_SANITIZE_ADDRESS=ON \
    -DGGML_SANITIZE_FUZZER=ON -DLLAMA_RPC=1 -DGGML_METAL=OFF \
    -DCMAKE_C_COMPILER=/opt/homebrew/Cellar/llvm/20.1.8/bin/clang \
    -DCMAKE_CXX_COMPILER=/opt/homebrew/Cellar/llvm/20.1.8/bin/clang++
$ make -C build/ -j8
$ build/bin/rpc-server -device cpu
$ build/bin/llama-cli -m models/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf --rpc localhost:50052 -n 64 -ngl 99
You can adjust rpc_server_params to enable or disable different GGML backends as needed,
as long as you have the underlying hardware. I have Metal disabled in the command above because fuzzing with it enabled
resulted in my Mac locking up and rebooting several times.
Some commands, such as RPC_CMD_ALLOC_BUFFER,
are used to allocate a backend buffer for later use in tensor deserialization. Those buffers are passed back
to the client in a response and are intended to be included on future requests, such as those that include
an rpc_tensor field. If you don't include the correct buffer identifier then some commands, such as
RPC_CMD_SET_TENSOR, RPC_CMD_SET_TENSOR_HASH, RPC_CMD_GET_TENSOR, and RPC_CMD_COPY_TENSOR, will
crash on a NULL pointer dereference. I put up a pull request to fix these
here.
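The pattern behind these crashes looks roughly like the following simplified sketch (not the exact upstream code): the buffer lookup yields a null result for an unknown identifier, and the tensor path later dereferences it without a check:

// Simplified sketch of the crash pattern; names approximate ggml-rpc.cpp
ggml_backend_buffer_t buffer = reinterpret_cast<ggml_backend_buffer_t>(req_tensor->buffer);
if (buffers.find(buffer) == buffers.end()) {
    buffer = nullptr; // unknown identifier sent by the client
}
tensor->buffer = buffer;
...
// Commands like RPC_CMD_SET_TENSOR then write through the buffer without
// a nullptr check, crashing on a NULL pointer dereference
ggml_backend_tensor_set(tensor, data, offset, size);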
Below is the relevant code from alloc_buffer, reachable via the RPC_CMD_ALLOC_BUFFER command:
void rpc_server::alloc_buffer(const rpc_msg_alloc_buffer_req & request, rpc_msg_alloc_buffer_rsp & response) {
    ...
    response.remote_ptr = reinterpret_cast<uint64_t>(buffer);
    ...
}
And the corresponding code from buffer_get_base, reachable via the RPC_CMD_BUFFER_GET_BASE command:
bool rpc_server::buffer_get_base(const rpc_msg_buffer_get_base_req & request, rpc_msg_buffer_get_base_rsp & response) {
    ...
    ggml_backend_buffer_t buffer = reinterpret_cast<ggml_backend_buffer_t>(request.remote_ptr);
    if (buffers.find(buffer) == buffers.end()) {
        GGML_LOG_ERROR("[%s] buffer not found\n", __func__);
        return false;
    }
    void * base = ggml_backend_buffer_get_base(buffer);
    ...
}
Note the reinterpret_cast operator, which casts the raw buffer pointer to a uint64_t so
it can be sent to the client in the first call, and how the untrusted value sent by the client in
subsequent calls is cast back to a pointer and dereferenced. Disclosing a memory address like this could be used
to defeat Address Space Layout Randomization (ASLR), an important exploit mitigation, when exploiting other vulnerabilities
in the code. After spotting these suspicious casts I wanted to
see if o4-mini could spot them too. It unfortunately failed to identify them as memory address disclosures.
My fix swaps the unordered_set used for storing
buffers in the rpc_server class for an unordered_map<uint64_t, ggml_backend_buffer_t>. This new
data structure stores an opaque random ID as a handle the client can send in future commands, which
the server will use to look up the corresponding backend buffer. This should be performant enough, as both insertion and search
in unordered_map are O(1) in the average case. Unfortunately there are many other ASLR leaks in the
implementation still to fix around rpc_tensor handling in the data and view_src fields.
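A minimal sketch of that opaque-handle scheme, assuming a random_u64 CSPRNG helper; the actual pull request may differ in its details:

#include <cstdint>
#include <unistd.h>
#include <unordered_map>
#include "ggml-backend.h"

static std::unordered_map<uint64_t, ggml_backend_buffer_t> buffers;

// Assumed CSPRNG helper; getentropy() is available on Linux and macOS
static uint64_t random_u64() {
    uint64_t v = 0;
    getentropy(&v, sizeof(v));
    return v;
}

static uint64_t register_buffer(ggml_backend_buffer_t buffer) {
    uint64_t id;
    do {
        id = random_u64();
    } while (id == 0 || buffers.count(id) != 0);
    buffers[id] = buffer;
    return id; // opaque handle sent to the client instead of a raw pointer
}

static ggml_backend_buffer_t lookup_buffer(uint64_t id) {
    auto it = buffers.find(id);
    return it == buffers.end() ? nullptr : it->second; // callers must handle nullptr
}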
Some commands, such as RPC_CMD_INIT_TENSOR, call a specific version of the recv_msg
function designed to handle variable sized payloads. The function, shown below,
reads an untrusted uint64_t directly from the wire and uses it to resize the input vector.
static bool recv_msg(sockfd_t sockfd, std::vector<uint8_t> & input) {
    uint64_t size;
    if (!recv_data(sockfd, &size, sizeof(size))) {
        return false;
    }
    try {
        input.resize(size);
    } catch (const std::bad_alloc & e) {
        fprintf(stderr, "Failed to allocate input buffer of size %" PRIu64 "\n", size);
        return false;
    }
    return recv_data(sockfd, input.data(), size);
}
A size anywhere near INT32_MAX is likely to cause the vector resize to throw an
exception. Some of these issues are easily patched by hardcoding an upper bound on the size. These constraints
likely won't work in production but should enable the fuzzer to reach deeper code paths.
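For example, a fuzzing-only patch on recv_msg might look like this; MAX_RPC_MSG_SIZE is a bound I chose for illustration, not an upstream constant:

static bool recv_msg(sockfd_t sockfd, std::vector<uint8_t> & input) {
    uint64_t size;
    if (!recv_data(sockfd, &size, sizeof(size))) {
        return false;
    }
    // Fuzzing-only bound: reject absurd sizes before resize() can throw
    constexpr uint64_t MAX_RPC_MSG_SIZE = 1 << 20; // 1 MiB, arbitrary
    if (size > MAX_RPC_MSG_SIZE) {
        return false;
    }
    input.resize(size);
    return recv_data(sockfd, input.data(), size);
}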