In the past few months, I have done some work writing optimized code for the Succinct SP1 zkVM. Tuning zkVM code is a remarkably fruitful area for finding ways to speed up or slow down your code by large factors. Here, I will list a few issues I investigated, each of which I think makes for an interesting performance story. From looking at the Jolt and RISC Zero codebases, it seems like other generic, RISC-V-based zkVMs deal with these issues too. Anyway, here is a listicle:

1. Input serialization/deserialization

A core part of a zkVM (and of zk-SNARKs as a whole) is being able to provide public/private inputs to the circuit. zkVMs typically represent this I/O within their own abstractions: there are “syscalls” for reading and writing data between the outside world and the zkVM.

All of the zkVMs that let you write Rust that I checked (RISC Zero, SP1, Jolt) let you pass arbitrary (serializable) structs from regular Rust code running on a CPU into the zkVM. To achieve this, they serialize the struct using a library like serde, pass the serialized bytes into the zkVM, and then deserialize the struct inside the zkVM. There is a problem, though: running deserialization inside the zkVM can be pretty expensive.

For the use case I was working on, we were passing a few hundred kilobytes of data into SP1, and deserializing the data took twice as long as the actual function we wanted to compute! The SP1 developers realized this was a potential performance bottleneck: if you are just passing bytes into the zkVM, you can pass them in directly without going through a (de)serialization step, by calling the specialized read_vec function rather than the generic read function. Using it made my SP1 code about 3x faster!
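For concreteness, here is a minimal sketch of the fast path in an SP1 guest program, assuming the sp1_zkvm crate’s io module as of the version I used (MyStruct is a stand-in for whatever you would otherwise deserialize):

```rust
// Minimal SP1 guest sketch: read_vec hands the input bytes over directly,
// while the generic read::<T>() would run serde deserialization in-circuit.
#![no_main]
sp1_zkvm::entrypoint!(main);

pub fn main() {
    // Expensive path (commented out): deserializes a struct inside the zkVM.
    // let input: MyStruct = sp1_zkvm::io::read();

    // Cheap path: raw bytes, no deserialization step.
    let input: Vec<u8> = sp1_zkvm::io::read_vec();

    // ... compute on `input` ...

    // Placeholder: commit whatever should become a public output.
    sp1_zkvm::io::commit(&input.len());
}
```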

There is not much of a trade-off here, just a combination of two issues. The first is that this pitfall was not well documented, so I had to read the source code myself. The second is that Rust doesn’t let you specialize generics the way C++ does. With specialization, the SP1 library could tell that you’re passing in raw bytes and know that it shouldn’t do an unnecessary (de)serialization step. This kind of specialization exists in nightly Rust but not in stable Rust. I wrote a pull request to update the SP1 documentation with this information.
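To illustrate the missing language feature, here is a purely hypothetical sketch (none of these names are SP1’s actual API) of how nightly-only specialization could let one generic read path skip serde when the caller just wants raw bytes:

```rust
// Hypothetical sketch: requires nightly Rust with the unstable
// `specialization` feature; not SP1's actual API.
#![feature(specialization)]

trait ReadInput: Sized {
    fn read_input(bytes: Vec<u8>) -> Self;
}

// Generic case: anything serde can deserialize goes through bincode.
impl<T: serde::de::DeserializeOwned> ReadInput for T {
    default fn read_input(bytes: Vec<u8>) -> Self {
        bincode::deserialize(&bytes).expect("deserialization failed")
    }
}

// Specialized case: Vec<u8> *is* the raw bytes, so skip serde entirely.
impl ReadInput for Vec<u8> {
    fn read_input(bytes: Vec<u8>) -> Self {
        bytes
    }
}
```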

2. Floating-Point Operations

At one point, I tried to write some SP1 code that heavily used floating-point operations, because we wanted to try a floating point-heavy algorithm. My zkVM code went from taking a few minutes to not terminating at all; I estimated it was running ~40 times slower. The reason for this is pretty simple: no zkVM that I could find implements floating-point instructions natively, since most RISC-V zkVMs implement RV32I, which has no floating-point instructions (those come from separate extensions).

Thus, when compiling my floating-point code to RISC-V, the compiler had to emulate floating-point arithmetic using integer instructions. Each floating-point operation turned into ~100 integer instructions to handle the complexities of floating point, which made the prover much, much slower. For our specific use case, we were able to switch to fixed-point numbers: replacing floating-point numbers with 64-bit fixed-point ones gave a ~7x speedup, at the cost of a small precision loss.
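For illustration, here is a minimal sketch of the kind of fixed-point arithmetic we swapped in, using a Q32.32 format over i64; the exact format and operations we used differed, but the idea is the same:

```rust
// Toy Q32.32 fixed-point type: value = raw / 2^32. No overflow checks,
// purely illustrative.
const FRAC_BITS: u32 = 32;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Fixed(i64);

impl Fixed {
    fn from_int(v: i64) -> Self {
        Fixed(v << FRAC_BITS)
    }
    fn add(self, other: Fixed) -> Fixed {
        Fixed(self.0 + other.0)
    }
    // Widen to i128 so the intermediate product doesn't overflow,
    // then shift back down to the fixed-point scale.
    fn mul(self, other: Fixed) -> Fixed {
        Fixed(((self.0 as i128 * other.0 as i128) >> FRAC_BITS) as i64)
    }
}

fn main() {
    let x = Fixed::from_int(3);
    let half = Fixed(1 << (FRAC_BITS - 1)); // 0.5 in Q32.32
    let y = x.mul(half).add(Fixed::from_int(1)); // 3 * 0.5 + 1 = 2.5
    println!("{}", y.0 as f64 / (1u64 << FRAC_BITS) as f64); // prints 2.5
}
```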

The trade-off here is pretty obvious: circuits for floating-point operations in zk are expensive, even with SOTA hand-optimized circuits (see here). Most applications of zkVMs seem to have no need for floating-point operations, so the smaller circuit sizes and lower complexity are worth more than the feature. However, if a zkVM wants to be fully generic, or wants to be useful for floating point-heavy workloads like computer graphics or machine learning, then it should be extended with custom circuits for floating-point operations. Until then, this is a performance pitfall that may surprise users who haven’t thought about instruction sets.

3. Generating commitments to outputs

At another point, we experimented with returning multiple kilobytes of output from our code. This also seemed to gum up our prover logic and use a ton of zkVM cycles.

After looking into the code, we learned that SP1 computes the SHA-256 hash of all the data it outputs. This is needed because the “wrapper” circuits for recursively verifying the zkVM’s proofs are hardcoded to publicly output 32 bytes, so all output from the zkVM needs to be hashed down into a single output hash. One use case SP1 is specializing for is creating proofs that can be verified in Ethereum contracts, where SHA-256 is particularly cheap (it has a precompiled contract), so SHA-256 is the hash they use for the output.
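The rough shape of the check, verifier-side, assuming the sha2 crate (this is the idea, not SP1’s exact code):

```rust
// Sketch of the output-hashing idea: however many bytes the guest commits,
// the wrapper circuit exposes only a 32-byte SHA-256 digest, so every
// committed byte must be absorbed into an in-circuit hash. Verifier-side,
// you recompute the digest over the claimed public values and compare.
use sha2::{Digest, Sha256};

fn public_values_digest(committed_bytes: &[u8]) -> [u8; 32] {
    Sha256::digest(committed_bytes).into()
}

fn check_public_values(claimed: &[u8], digest_from_proof: [u8; 32]) -> bool {
    public_values_digest(claimed) == digest_from_proof
}
```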

The trade-off here is that proofs can be verified efficiently on-chain and the “wrapper” circuits stay much simpler, while outputting data from the circuit carries a high in-circuit cost, and that cost is paid even by use cases that have nothing to do with Ethereum.

4. Access to randomness

Commit-and-prove techniques are a really powerful tool in zero-knowledge proofs: they give circuits the ability to access randomness in their computations. The high-level idea is that pseudorandom values are derived from a commitment to the prover’s private inputs to the circuit, so the prover cannot pick its inputs based on the randomness it will receive. Lots of super-efficient proof techniques become possible with access to commit-and-prove randomness, like really fast RAM constructions or verifying matrix multiplications (see the sketch below).
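As a concrete instance of the matrix-multiplication example: Freivalds’ algorithm checks a claimed product C = A·B with O(n²) work instead of recomputing it in O(n³), using a random vector that, inside a zkVM, would have to come from commit-and-prove style randomness. A toy sketch over u64 with wrapping arithmetic (a real version would work over the proof system’s field):

```rust
// Matrix-vector product with wrapping arithmetic, i.e. arithmetic mod 2^64.
fn mat_vec(m: &[Vec<u64>], v: &[u64]) -> Vec<u64> {
    m.iter()
        .map(|row| {
            row.iter()
                .zip(v)
                .fold(0u64, |acc, (a, b)| acc.wrapping_add(a.wrapping_mul(*b)))
        })
        .collect()
}

/// Freivalds' check: accept C = A * B if A(Br) == Cr for a random vector r.
/// Each call does O(n^2) work; repeating with fresh r drives down the error
/// probability.
fn freivalds_check(a: &[Vec<u64>], b: &[Vec<u64>], c: &[Vec<u64>], r: &[u64]) -> bool {
    let br = mat_vec(b, r);
    let abr = mat_vec(a, &br);
    let cr = mat_vec(c, r);
    abr == cr
}
```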

Unfortunately, the SP1 zkVM does not let you access pseudorandom values inside the zkVM, even though its own memory-checking argument uses exactly this kind of randomness. SP1 actually overrides Rust’s default RNG to make it deterministic. As far as I can tell, Jolt doesn’t offer an in-circuit RNG either, while RISC Zero offers one for precompiles.

I think there is an opportunity to really improve what you can do in zkVMs by overriding the default Rust RNG with a commit-and-prove based RNG, enabling various fast randomized tricks that wouldn’t otherwise be possible. On one hand, this is a bit of a power-user feature, so it’s probably not worth prioritizing too much. You might argue, along the lines of Vitalik’s “glue and coprocessor” model, that it’s fine for zkVM code to be inefficient, since it will only ever implement “glue code”. However, this commit-and-prove randomness is not accessible even in precompiles for the SP1 zkVM, so if I wanted to do power-user optimizations based on random sampling, there would be no way to do them at all while working with SP1.
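To make the feature request concrete, here is a purely hypothetical sketch of what a guest-side commit-and-prove RNG could look like; none of these names exist in SP1, and the SplitMix64 expansion is just for illustration:

```rust
// Hypothetical guest-side RNG: the seed is derived from a commitment to the
// prover's inputs, so the prover can't grind for favorable randomness after
// committing. In a real design the proof system would supply the commitment;
// here we just hash the committed input bytes.
use sha2::{Digest, Sha256};

struct CommitProveRng {
    state: u64,
}

impl CommitProveRng {
    fn from_committed_inputs(committed: &[u8]) -> Self {
        let digest = Sha256::digest(committed);
        let state = u64::from_le_bytes(digest[..8].try_into().unwrap());
        CommitProveRng { state }
    }

    // SplitMix64 step: deterministic given the seed, unpredictable to the
    // prover before the inputs are committed.
    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}
```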

Conclusion

To conclude, I think issues 2 and 3 represent genuinely interesting trade-offs in designing a zkVM, while 1 is more of an annoying pitfall and 4 is a potential future feature. One way to deal with these trade-offs is to make a super-configurable zkVM where you can choose which instruction set extensions to use, whether proofs should be optimized for on-chain verification, and so on for any other trade-offs that come up.

On the other hand, my last job had a team of people whose job was basically deciding which clang (C++ compiler) flags the company’s code would use. I think most large companies that heavily use C++ have such a team; it is a remarkably deep and complicated area, and a lot of very senior engineers worked on it. Some of my most hair-pulling moments involved pushing code that seemed to work, only for it to cause bugs in production because one obscure compiler-flag configuration didn’t play nice with what I wrote. Traditional compilers have gone in the direction of maximum configurability to deal with all of their potential performance quirks, but this has resulted in a ton of complexity in configuring and using them optimally. I hope zkVMs can come up with more user-friendly ways of dealing with configuration around performance trade-offs.