Updated: July 28th, 2025
Chris Rohlf
If you're not familiar with GGML, it is the tensor processing library
that powers inference in llama.cpp and other open source projects.
A tensor is a multi-dimensional array of numerical values that generalizes scalars (0 dimensions), vectors (1 dimension),
and matrices (2 dimensions) to an arbitrary number of dimensions (the GGML library supports up to 4). Some tensors are stored
on disk with the model weights (in GGUF format) and others are created on the fly during the tokenization and inference
process. GGML stores all of these tensors in a Directed Acyclic Graph (DAG) that it compiles statically. Most nodes in the
graph contain a GGML operation, for example GGML_OP_MUL_MAT for matrix multiplication. These operations are executed
across the graph using the various backends (CPU, Apple Metal, CUDA, etc.) that GGML supports. Some nodes on this graph might use the
CPU backend while others are intended for a CUDA or Metal backend. The built-in GGML RPC server allows you to distribute this
work across multiple backends. All of this is configurable using llama-cli or other llama.cpp frontends.
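To make the graph model concrete, here is a minimal sketch of building and computing a one-node GGML graph on the CPU backend. The shapes and memory size are arbitrary choices for illustration, and the header layout varies between GGML versions (recent trees declare ggml_graph_compute_with_ctx in ggml-cpu.h):

#include "ggml.h"
#include "ggml-cpu.h" // ggml_graph_compute_with_ctx lives here in recent trees

int main(void) {
    // Allocate a context backed by a fixed arena for tensor metadata and data
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // Two 2-dimensional tensors (matrices); ne[0] must match for mul_mat
    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 3);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 2);

    // Creates a graph node whose op is GGML_OP_MUL_MAT
    struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);

    // Build the DAG that ends at 'c' and execute it on the CPU backend
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/1);

    ggml_free(ctx);
    return 0;
}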
The GGML maintainers are aware of, and have documented, the
lack of security on the RPC server. At the time of this writing, security advisories are no longer being created for vulnerabilities
in the server until there is higher confidence in the security of the implementation. As a side project I thought it might
be fun to explore the (in)security of the server and mitigate the issues I find. Most of this research was assisted by
OpenAI's o4-mini model, which significantly sped up the analysis but missed even obvious security issues.
I am not the first person to look at this code from a security perspective. Ruikai "Patrick" Peng
published a great writeup on his work exploiting a remotely reachable
heap overflow in the implementation. There are other advisories
affecting the RPC server as well.
The code doesn't use a standard RPC library like gRPC or Thrift; instead it uses packed C structures with the
naming convention rpc_msg_<cmd_name>_req and rpc_msg_<cmd_name>_rsp. The majority of these
message types are simple integer fields, with the exception of rpc_tensor, which has a slightly more complex
structure. The protocol itself is rather simple on the wire: the first byte sent is always the command type, followed
by the message size, and then the message body. For fixed size commands there is validation that the size field matches
the size of the message type. For variable sized messages the size is trusted implicitly and is usually used to resize
a std::vector which will hold the data. RPC responses are very similar: a size followed by the response data, if any.
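As a concrete sketch of that framing, the struct below mirrors the shape of the real request messages in ggml-rpc.cpp, but the send_rpc_cmd helper is mine and is not part of llama.cpp:

#include <cstdint>
#include <unistd.h>

// RPC request: | rpc_cmd (1 byte) | msg_size (8 bytes) | msg_body (msg_size bytes) |
#pragma pack(push, 1)
struct rpc_msg_buffer_get_base_req {
    uint64_t remote_ptr; // buffer handle previously returned by the server
};
#pragma pack(pop)

// Hypothetical client-side helper that frames a single request on the wire
static bool send_rpc_cmd(int sockfd, uint8_t cmd, const void * body, uint64_t size) {
    if (write(sockfd, &cmd, 1) != 1) {
        return false;
    }
    if (write(sockfd, &size, sizeof(size)) != (ssize_t) sizeof(size)) {
        return false;
    }
    return size == 0 || write(sockfd, body, size) == (ssize_t) size;
}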
Overall, the serialization and deserialization of messages is simple, so the implementation is mostly sane. It is more likely
that there are security vulnerabilities deeper in the backend code which are reachable via RPC. It is the design
of the RPC server that is most insecure, as it lacks authentication, sandboxing, and other basic security controls.
After reading the RPC server code I wondered if anyone was fuzzing the implementation. There are llama.cpp
fuzzers in the OSSFuzz repository but none of them target the GGML RPC layer. After carefully reading the
rpc_serve_client function I realized that fuzzing this loop locally should be trivial. Fuzzing a
server that is designed to take inputs from a network protocol can be tricky depending on the design. The GGML RPC main
server loop has a convenient function prototype that takes a socket descriptor and enters a while loop that
calls send and recv on that socket. This means with some minor modifications we should be able to
call socketpair to generate a socket file descriptor for both ends of the connection, write to it, and then
pass the server end to rpc_serve_client where RPC commands will be processed.
The simplified fuzzing loop looks something like this (shortened for brevity and commented):
rpc_server_params params;
ggml_backend_t backend;
ggml_backend_reg_t reg;
void (*start_server_fn)(ggml_backend_t backend, const char *cache_dir,
                        sockfd_t sockfd, size_t free_mem, size_t total_mem);

extern "C" int LLVMFuzzerInitialize(int *argc, char ***argv) {
    // Load all GGML backends and create one for RPC
    ggml_backend_load_all();
    backend = create_backend(params);
    reg = ggml_backend_reg_by_name("RPC");
    // Get a function pointer to the main server loop
    // We create this stub function separately in ggml-rpc.cpp
    start_server_fn = (decltype(fuzz_rpc_serve_client)*)
        ggml_backend_reg_get_proc_address(reg, "fuzz_rpc_serve_client");
    return 0;
}

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
    int sv[2];
    // Create the local socket pair using AF_UNIX
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        return 0;
    }
    // Ensure the first command is the HELLO command per the server implementation
    uint64_t payload_len = 0;
    const int RPC_CMD_HELLO = 14;
    // Write the HELLO command and the fuzzed message according to the spec:
    // RPC request : | rpc_cmd (1 byte) | request_size (8 bytes) | request_data (request_size bytes) |
    // RPC response: | response_size (8 bytes) | response_data (response_size bytes) |
    write(sv[1], &RPC_CMD_HELLO, 1);
    write(sv[1], &payload_len, sizeof(payload_len));
    write(sv[1], Data, Size);
    // Call shutdown on the socket so that the last recv() sees EOF and doesn't block
    shutdown(sv[1], SHUT_WR);
    // Invoke the server loop and pass it one of the socket descriptors
    try {
        start_server_fn(backend, nullptr, sv[0], 0, 0);
    } catch (...) {
    }
    // Close the socket pair
    close(sv[0]);
    close(sv[1]);
    return 0;
}

Once the rpc_serve_client function enters the loop it will call recv and process the fuzzer-produced data
as RPC commands until it reaches EOF. This all happens within the same process, which allows libFuzzer to trace the code
branches and mutate the inputs, ensuring new code paths are discovered. The full commit, including the fuzzer and the
CMake changes required to compile it, can be found here.
The fuzzer lives in the rpc-server.cpp tool file that ships with llama.cpp, but it is only
conditionally compiled when -DGGML_SANITIZE_FUZZER=ON is passed to CMake. Compiling the fuzzer is simple if
your clang toolchain has libFuzzer support. On macOS you can install a version of LLVM/clang that has support
using Homebrew, and then run the following commands in the llama.cpp directory:
$ cmake -B build/ -DCMAKE_BUILD_TYPE=Debug -DLLAMA_SANITIZE_ADDRESS=ON \
    -DGGML_SANITIZE_FUZZER=ON -DLLAMA_RPC=1 -DGGML_METAL=OFF \
    -DCMAKE_C_COMPILER=/opt/homebrew/Cellar/llvm/20.1.8/bin/clang \
    -DCMAKE_CXX_COMPILER=/opt/homebrew/Cellar/llvm/20.1.8/bin/clang++
$ make -C build/ -j8
$ build/bin/rpc-server -device cpu
$ build/bin/llama-cli -m models/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf --rpc localhost:50052 -n 64 -ngl 99
You can adjust rpc_server_params to enable or disable different GGML backends as needed,
as long as you have the underlying hardware. I have Metal disabled in the command above because fuzzing with it enabled
resulted in my Mac locking up and rebooting several times.
Some commands, such as RPC_CMD_ALLOC_BUFFER,
are used to allocate a backend buffer for later use in tensor deserialization. Those buffers are passed back
to the client in a response and are intended to be included on future requests, such as those that include
an rpc_tensor field. If you don't include the correct buffer identifier then some commands, such as
RPC_CMD_SET_TENSOR, RPC_CMD_SET_TENSOR_HASH, RPC_CMD_GET_TENSOR, and RPC_CMD_COPY_TENSOR, will
crash on a NULL pointer dereference. I put up a pull request to fix these
here.
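The pattern behind these crashes looks roughly like the following simplified sketch (not the exact upstream code): the buffer lookup yields a null result for an unknown identifier, and the tensor path later dereferences it without a check:

// Simplified sketch of the crash pattern; names approximate ggml-rpc.cpp
ggml_backend_buffer_t buffer = reinterpret_cast<ggml_backend_buffer_t>(req_tensor->buffer);
if (buffers.find(buffer) == buffers.end()) {
    buffer = nullptr; // unknown identifier sent by the client
}
tensor->buffer = buffer;
...
// Commands like RPC_CMD_SET_TENSOR then write through the buffer without
// a nullptr check, crashing on a NULL pointer dereference
ggml_backend_tensor_set(tensor, data, offset, size);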
Below is the relevant code from alloc_buffer, reachable via the RPC_CMD_ALLOC_BUFFER command:
void rpc_server::alloc_buffer(const rpc_msg_alloc_buffer_req & request, rpc_msg_alloc_buffer_rsp & response) {
    ...
    response.remote_ptr = reinterpret_cast<uint64_t>(buffer);
    ...
}
And the corresponding code from buffer_get_base, reachable via the RPC_CMD_BUFFER_GET_BASE command:
bool rpc_server::buffer_get_base(const rpc_msg_buffer_get_base_req & request, rpc_msg_buffer_get_base_rsp & response) {
    ...
    ggml_backend_buffer_t buffer = reinterpret_cast<ggml_backend_buffer_t>(request.remote_ptr);
    if (buffers.find(buffer) == buffers.end()) {
        GGML_LOG_ERROR("[%s] buffer not found\n", __func__);
        return false;
    }
    void * base = ggml_backend_buffer_get_base(buffer);
    ...
}
Note the reinterpret_cast operator, which casts the raw buffer pointer to a uint64_t so
it can be sent to the client in the first call, and how the untrusted value sent by the client in
subsequent calls is cast back to a pointer and dereferenced. Disclosing a memory address like this could be used
to defeat Address Space Layout Randomization (ASLR), an important exploit mitigation, when exploiting other vulnerabilities
in the code. After spotting these suspicious casts I wanted to
see if o4-mini could spot them too. It unfortunately failed to identify them as memory address disclosures.
My fix swaps the unordered_set used for storing
buffers in the rpc_server class for an unordered_map<uint64_t, ggml_backend_buffer_t>. This new
data structure stores an opaque random ID as a handle the client can send in future commands, which
the server will use to look up the corresponding backend buffer. This should be performant enough, as both insertion and search
in unordered_map are O(1) in the average case. Unfortunately there are many other ASLR leaks in the
implementation still to fix around rpc_tensor handling in the data and view_src fields.
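A minimal sketch of that opaque-handle scheme, assuming a random_u64 CSPRNG helper; the actual pull request may differ in its details:

#include <cstdint>
#include <unistd.h>
#include <unordered_map>
#include "ggml-backend.h"

static std::unordered_map<uint64_t, ggml_backend_buffer_t> buffers;

// Assumed CSPRNG helper; getentropy() is available on Linux and macOS
static uint64_t random_u64() {
    uint64_t v = 0;
    getentropy(&v, sizeof(v));
    return v;
}

static uint64_t register_buffer(ggml_backend_buffer_t buffer) {
    uint64_t id;
    do {
        id = random_u64();
    } while (id == 0 || buffers.count(id) != 0);
    buffers[id] = buffer;
    return id; // opaque handle sent to the client instead of a raw pointer
}

static ggml_backend_buffer_t lookup_buffer(uint64_t id) {
    auto it = buffers.find(id);
    return it == buffers.end() ? nullptr : it->second; // callers must handle nullptr
}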
Some commands, such as RPC_CMD_INIT_TENSOR, call a specific version of the recv_msg
function designed to handle variable sized payloads. The function, shown below,
reads an untrusted uint64_t directly from the wire and uses it to resize the input vector.
static bool recv_msg(sockfd_t sockfd, std::vector<uint8_t> & input) {
    uint64_t size;
    if (!recv_data(sockfd, &size, sizeof(size))) {
        return false;
    }
    try {
        input.resize(size);
    } catch (const std::bad_alloc & e) {
        fprintf(stderr, "Failed to allocate input buffer of size %" PRIu64 "\n", size);
        return false;
    }
    return recv_data(sockfd, input.data(), size);
}
A size anywhere near INT32_MAX is likely to cause the vector resize to throw an
exception. Some of these issues are easily patched by hardcoding an upper bound on the size. These constraints
likely won't work in production but should enable the fuzzer to reach deeper code paths.
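For example, a fuzzing-only patch on recv_msg might look like this; MAX_RPC_MSG_SIZE is a bound I chose for illustration, not an upstream constant:

static bool recv_msg(sockfd_t sockfd, std::vector<uint8_t> & input) {
    uint64_t size;
    if (!recv_data(sockfd, &size, sizeof(size))) {
        return false;
    }
    // Fuzzing-only bound: reject absurd sizes before resize() can throw
    constexpr uint64_t MAX_RPC_MSG_SIZE = 1 << 20; // 1 MiB, arbitrary
    if (size > MAX_RPC_MSG_SIZE) {
        return false;
    }
    input.resize(size);
    return recv_data(sockfd, input.data(), size);
}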