Property-based testing #1: What is it anyway?
This is the first and introductory post in a series about property-based testing. This post explains what property-based testing is, and what a typical property-based test looks like. The rest of the series will dive deeper into how libraries for property-based testing are implemented - admittedly I'm biased as the maintainer of FsCheck but they are microcosms of interesting and widely applicable ideas.
Series:
Property-based Testing: What is it anyway? ← you are here
Property-based testing was introduced in 2000 by Koen Claessen and John Hughes via the Haskell library QuickCheck. It has gained popularity in a relatively short amount of time - these days almost all languages and platforms have some sort of property-based testing library available.
But what is property-based testing and more importantly what can you use it for?
In traditional unit testing, a test consists of an example input that is fed to the system under test, followed by an assertion that the output is what we expect. In property-based testing a test instead relates the input and the output of the system under test, by asserting that certain general properties hold on a ton of automatically generated inputs.
As an example, let's imagine a function sort_by_age
that takes as argument a list of Person
objects, which have various fields (like name
), and returns them sorted by age
. A unit test for our problem could be:
persons = [
Person("Max",12),
Person("Nemo",22),
Person("Eloise",15)
]
actual = sort_by_age(persons)
expected = [
Person("Max",12),
Person("Eloise",15),
Person("Nemo",22)
]
assert actual == expected
To instead write a property-based test for sort_by_age
, consider how to write a function is_valid(persons_in, persons_out) -> bool
which given the input and the output of sort_by_age
, returns whether that input-output pair is valid.
Go on, think about it for a moment. It's fine to come up with checks that seem naïve (I promise they won't be), and it's also fine to come up with checks that are insufficient, even pathologically so. We're just trying to find bugs here, we're not building a formal proof.
Chances are you've come up with a few things to check that at first blush are trivial. For example, is_valid
could check whether the two lists are of equal length - after all, sorting a list should not change its length:
def is_valid_length(persons_in, persons_out):
assert len(persons_in) == len(persons_out)
Another one could be checking that the list persons_out
is indeed sorted by age:
def is_valid_sorted(persons_in, persons_out):
assert all(
p[i].age <= p[i + 1].age
for i in range(len(persons_out)-1)
)
Lastly we could check that sorting Person
values doesn't change them - for brevity we'll just check that their name hasn't changed:
def is_valid_unchanged(persons_in, persons_out):
assert { p.name for p in persons_in } == { p.name for p in persons_out }
If you came up with one or more of these, then congratulations - you've just come up with a property and you're now a fully-fledged property-based tester!
Not convinced? Prepare to be! Given an is_valid
method a property-based test for sort_by_age
is easy to write:
for _ in range(ONE_GAZILLION):
persons_in = generate_persons()
persons_out = sort_by_age(persons_in)
is_valid_length(persons_in, persons_out)
is_valid_sorted(persons_in, persons_out)
is_valid_unchanged(persons_in, persons_out)
Compared to unit-testing - perhaps better characterized as example-based testing - where we come up with an example input and an expected output, here something (we'll come back to this soon) comes up with inputs, and we test that the input and output pair represent valid behavior of the system under test.
Since we no longer need to manually come up with example inputs, we can generate as many test cases as we want - typically 100s of them. For tests with a large combination of inputs to explore, we can even let the test run overnight and check in the morning whether it has found any bugs on millions or billions of inputs. Assuming that the generated inputs are sensible, this gives great confidence that the implementation is correct.
The power of this approach lies in combining multiple, individually simple properties. It's easy to come up with a trivially wrong implementation of sort_by_age
that passes one of the is_valid
properties - for example, to pass is_valid_unchanged
, sort_by_age
just needs to return the input list as-is. It's way harder to come up with an implementation that passes all three of them and that doesn't do exactly what we expect.
If you thought about writing is_valid
yourself a few paragraphs ago, you hopefully experienced that the thought process behind writing a property-based test is quite different from writing a unit test. Anecdotally, this is what many people struggle with when first exposed to property-based testing - "I have trouble coming up with properties" is an often heard complaint.
The rewards however are that we identified various aspects of the behavior of our function and made them clearly, separately and explicitly visible in the test code. While going through this process - and it is a process that can require some thought! - I typically learn something new about the implementation. For sort_by_age
perhaps you were wondering if the sort should be stable. According to the properties above it needn't be, but we could easily add that requirement if we wanted to. Such a concern would likely go unnoticed altogether with traditional unit testing. In an example-based test properties are all implicit, and anyone who reads the tests needs to infer them from the examples. A property-based test is like a lightweight specification, expressed directly in our programming language of choice.
That said, concrete examples are helpful, so in my experience the best test suites blend example- and property-based tests. Just like good documentation has a tutorial and how-to section (concrete, specific, "examples"), but also some background and reference material (abstract, general, "properties").
Now, how about the magic generate_persons()
- how does that work? And how can we be sure it's actually generating sensible values, and not a gazillion empty lists?
This is exactly where property-based testing libraries like QuickCheck and its many descendants come in. They provide functions that allow you to write value generators for your domain types (like Person
). They also have a number of basic generators for primitive values, lists, dictionaries, and so on, that can then be composed into generators for custom types. We'll go into the structure of that API in much more detail in the rest of the series - to give you an idea here's a generator written in a real property-based testing library for Python, Hypothesis:
# hypothesis calls generators "strategies":
from hypothesis import strategies as st
composite_generator = st.tuples(st.booleans(), st.text())
The composite generator (I'll keep calling it generator here for consistency, even though Hypothesis has a different naming convention) blends together three generators from the base library to create a generator that produces tuples of booleans and strings.
On top of that libraries provide ways to run tests and specify stopping conditions, get some statistics about the generated values, allow you to replay a particular failing test, and various other niceties.
Here's an example in Hypothesis again of a (trivial) test with the above generator:
@given(st.tuples(st.booleans(), st.text()))
def test_tuples(t):
assert len(t) == 2
assert isinstance(t[0], bool)
assert isinstance(t[1], str)
Hopefully that gives you a reasonable idea of what property-based tests look like. Undoubtedly you are left wondering how the values are actually generated.
The best known strategy is pseudo-random generation - this is the approach that QuickCheck pioneered. The idea is to generate values randomly, with a distribution skewed to values that typically cause bugs, like 0, -1, 1 for integers. This strategy is surprisingly effective in practice and relatively simple to implement. The downside is that it can generate very large values, and when a bug is found with a large value typically only a small part or aspect of the value triggers the bug. We don't know what this aspect is, which makes it hard to understand or debug the failure. To help with that, all modern property-based testing libraries attempt to "shrink" values if they fail the test - that is, when a test fails they'll try to find smaller values (e.g. smaller integers, shorter strings) that still fail the test. This process is comparable to running git bisect
to identify a commit that causes something to break.
Another strategy is exhaustive generation. There, all possible values for some type are generated in some well-defined order - typically from "small" to "large" values, and with some upper bound, as once you go past booleans the number of values for most types are (countably) infinite. For example, trying all the integers between -20 and 20 in "zig zag" order 0,1,-1,2,-2,.... SmallCheck for Haskell and SciFe for Scala do this, but this approach is not so well-known. It's a shame as random and exhaustive generation are complementary - if you think of generating values as exploring some large space to find failing tests, random generation is a serendipitous type of exploration, while exhaustive generation is diligently mapping out all the paths in some area.
The final strategy is what I'll call optimization-guided generation, where an algorithm tries to generate values such that some measure is maximized. In the approaches I know about, this measure is typically code coverage, and is achieved through smart code analysis. Effectively the code under test is symbolically executed, and a solver calculates input values that cause branches to go one way or the other. Pex for .NET is one of those - unfortunately no longer maintained and probably not usable anymore by now. I'm not convinced the complexity is worth it - clearly the generation strategy is orders of magnitude more computationally expensive, not to mention the development and maintenance cost of these libraries. This may explain why I mostly see them coming out of academia or research.
To conclude and reinforce what we've learned, let's see how the mother of property-based testing, QuickCheck, describes itself:
QuickCheck is a tool for testing Haskell programs automatically. The programmer provides a specification of the program, in the form of properties which functions should satisfy, and QuickCheck then tests that the properties hold in a large number of randomly generated cases. Specifications are expressed in Haskell, using combinators defined in the QuickCheck library. QuickCheck provides combinators to define properties, observe the distribution of test data, and define test data generators.
Hopefully that makes some sense now. A few remarks:
"QuickCheck [...] tests [...] programs automatically" (emphasis mine). This needs a charitable interpretation - the aspect that is automated is the generation of example inputs. As explained above and also in the subsequent sentence, users still need to write tests.
"QuickCheck provides combinators [...]". The term combinators has a rather specific meaning here, which we'll go into more detail in subsequent posts. Briefly, combinators are ordinary functions or methods that can be arbitrarily combined, i.e. they compose well.
That concludes the introduction to this series on property-based testing. In the next post I'll walk through a basic implementation of a property-based testing library that uses pseudo-random generation, and in the posts after that we'll dive into shrinking - it turns out that a lot of innovation in random property-based testing has been around shrinking strategies.
Until next time!
Acknowledgement
I stole the sorting example and learnt about this approach for explaining property-based testing from the paper Using Relational Problems to Teach Property-Based Testing by John Wrenna, Tim Nelsona, and Shriram Krishnamurthia. This made a nice change from the typical (and much maligned) "the reverse of the reverse of a list is the original list" example. Any errors and omissions are entirely my responsibility.