Automatic formal verification usually involves pushing up against what Leino calls "the decidability ceiling": pushing the tools beyond what they can be guaranteed to solve in some reasonable time, taking the risk that the tools will blow up, but often getting away with it. But what can we do when the toast lands butter-side-down? This is a summary of a quick investigation to find out whether anybody had any answers. Tl;dr: they do.

I found four relevant sources of information:

(Huge thanks to Sean Heelan for his really useful response to my Twitter request for suggestions, which led me to the last two sources; those two form the basis of most of the rest of this post.)

These are great resources and, if you want to learn more about this topic, I strongly recommend them. They have one minor problem though: none of them is about verifying Rust code, which is what this website is all about. To make things more Rusty, I decided to apply the KLEE profiling ideas to some code I had recently tried to verify and that was scaling horribly. The code called the Rust regex crate and, like a lot of string-processing code, it behaved so badly that I could only work with trivially short strings and carefully constrained regexes.

I created a trivial benchmark test harness that matches a sequence of digits and asserts that converting those digits to an integer gives a value less than some limit. This version should fail because one possible result is 11; I also tested a non-failing variant that asserts "i <= 11".

// imports: the regex crate and the verification-annotations crate,
// which provides the verifier API used below
use regex::Regex;
use verification_annotations as verifier;

#[test]
fn regex_should_fail() {
    // generate a symbolic string of length 2
    let a = verifier::verifier_nondet_ascii_string(2);

    // assume the string consists of 0's and 1's
    verifier::assume(Regex::new(r"[0-1]{2}").unwrap().is_match(&a));

    // convert to int
    let i: u32 = a.parse().unwrap();

    // check the maximum value
    verifier::assert!(i < 11);
}
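
The non-failing variant mentioned above is identical apart from the final assertion:

    // non-failing variant: 11 is the largest value that a string
    // matching [0-1]{2} can parse to, so this assertion always holds
    verifier::assert!(i <= 11);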

I expected that this would perform badly and I was not disappointed. Running our cargo verify tool to compile the code for verification and then verify it with KLEE took a distressingly long time for such a short string.1

$ cargo verify --test=regex_should_fail -v
Using Klee as backend
Checking regex
  Building regex for verification
  Patching LLVM file for initializers and feature tests
  Checking regex_should_fail
Running 1 test(s)
test regex_should_fail ... ASSERT_FAILED
test result: ASSERT_FAILED. 0 passed; 1 failed
Build 47.085s
Verify 37.907s
Total 84.992s
VERIFICATION_RESULT: ASSERT_FAILED

The problem was probably in the regex library; the question was whether I could pin down which lines of code were responsible.

The KLEE tutorial explains that KLEE generates a profile of the verification process that gives you various metrics for every line of code in your program (including any libraries that you linked against). And both Galea et al. and Bornholt and Torlak highlight the impact of path splitting on the cost of symbolic execution. So what I needed to do was run the kcachegrind tool on the profile file generated by KLEE.
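
To see why forks matter so much, consider a sketch like the following (my own illustration, not code from any of the sources). Every branch on a symbolic value can split the current path in two, so code that branches on each byte of a symbolic string can generate exponentially many paths.

fn count_ones(s: &[u8]) -> usize {
    let mut ones = 0;
    for &b in s {
        // when b is symbolic, the execution state can fork at this
        // branch, so n symbolic bytes can yield up to 2^n paths
        if b == b'1' {
            ones += 1;
        }
    }
    ones
}

A string matcher does something similar on every byte it inspects, which is why string-processing code is a classic victim of path explosion.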

Before I could do that though, there was one small problem to take care of. Looking inside the profile file, I could see that it referred to functions by their "mangled" names like _ZN5regex17regex_should_fail17h11a2866ac324f18dE instead of human-readable names like regex::regex_should_fail. To fix this, I wrote a small demangling tool, rust2calltree, that rewrites the profile file with demangled names. With this in place, I could demangle the profile and view it like this:

$ rust2calltree kleeout/regex_should_fail/run.istats -o callgrind.out
$ kcachegrind
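
For the curious, the core of such a demangling tool can be quite small. Here is a minimal sketch built on the rustc-demangle crate; it assumes mangled symbols appear after "fn=" markers in the profile, which is a simplification of what rust2calltree actually handles.

use rustc_demangle::demangle;
use std::fs;

fn main() -> std::io::Result<()> {
    let input = fs::read_to_string("kleeout/regex_should_fail/run.istats")?;
    let output: String = input
        .lines()
        .map(|line| match line.strip_prefix("fn=") {
            // "{:#}" formats the demangled name without the trailing hash,
            // e.g. _ZN5regex17regex_should_fail17h11a2866ac324f18dE
            //      becomes regex::regex_should_fail
            Some(name) => format!("fn={:#}\n", demangle(name)),
            None => format!("{}\n", line),
        })
        .collect();
    fs::write("callgrind.out", output)
}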

Opening the demangled profile in kcachegrind produced a window that looked like this (I have selected the "forks" metric and clicked on the most expensive function that is not my test harness):

[Figure: fork profile of regex_should_fail]

This profile shows that the function most responsible for the path explosion in my benchmark is aho_corasick::ahocorasick::AhoCorasick::find.

This was a surprise to me; I was expecting it to point at the part of the regex crate that my benchmark was calling. A little digging quickly revealed that the aho-corasick crate is a fast string-matching library that the regex crate uses for simple parts of regexes. Since my regex was quite simple, it seemed reasonable that this crate would be doing most of the work and could therefore be part of the problem. (I didn't confirm that this really was the problem. To do that, I would probably have to modify the aho-corasick crate to fix it, which I have not tried doing, yet.)
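
If I do pick this up, a natural first step would be a harness that drives aho-corasick directly on a symbolic string, to check whether the fork explosion reproduces without the regex crate in the way. A hypothetical sketch (the pattern set and assertion are my own, chosen to mirror the earlier test, and the constructor follows the current aho-corasick API):

use aho_corasick::AhoCorasick;
use verification_annotations as verifier;

#[test]
fn aho_corasick_should_fork() {
    // symbolic 2-byte ASCII string, as in the regex harness
    let a = verifier::verifier_nondet_ascii_string(2);

    // the four literals that the regex [0-1]{2} can match
    let ac = AhoCorasick::new(["00", "01", "10", "11"]).unwrap();

    // trivially true assertion (any match in a 2-byte haystack must
    // start at 0); the interesting output is the fork profile
    verifier::assert!(ac.find(&a).map_or(true, |m| m.start() == 0));
}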

Summary

Verification tools are very likely to hit performance bottlenecks because they are designed to push against the decidability ceiling.

There has not been a lot of work on profiling to help you identify which parts of your code are responsible for the performance problems you hit during verification, but the papers and tutorials that I found were very useful and are worth reading.

With the aid of a little script, I was able to use the instructions in the KLEE tutorial to profile a small benchmark. The results seemed entirely plausible, and they highlighted some code that I would probably not have looked at first.

The task of confirming that the code identified by profiling can be “fixed” is something that I leave for future work.


  1. You will notice that the compilation time is even worse than the verification time. Part of this is due to the time spent fetching and compiling the regex library, and part is due to our tool not making good use of cargo's incremental compilation. We plan to fix this, but it is not our priority at the moment.