This is the fifth post in a series about property-based testing. This post describes "internal shrinking", a different implementation of shrinking that has some interesting advantages. Previous posts are:
The complete code for this post can be found on GitHub - in particular example.py and internal_shrink.py.
In the last two posts we tried shrinking values by directly changing values: we assumed an oracle that comes up with smaller candidate values, and searched those in the hope of finding some that still make the test fail.
We touched on one annoying problem with this approach: it doesn't deal well with mutable objects, because CandidateTree captures the randomly generated object as its root, and then lazily comes up with smaller candidates. If the object is mutated between capture and finding a smaller candidate, things go sideways. We can try to work around this problem by asking users to clone mutable values, but we can do better - by changing our approach.
To introduce the main idea behind the new approach, we'll need some context - another perspective on how random generation works.
It's all about making choices
Let's check where in our mini-PBT library we make random choices. All we need to do is look up where we use Python's random module. There's only one place where it's called:
def int_between(low: int, high: int) -> Random[int]:
    return Random(lambda: random.randint(low, high))
(You may remember that in the last post this was called random_int_between, to disambiguate it from tree_int_between. However, thanks to the implementation in this post, we won't need the tree_ functions anymore, so I've omitted the random_ prefix.)
When we generate a complex value like a list of Person objects, ultimately int_between chooses the length of the list, the letters of the persons' names, and their ages. The code of the generators makes that easy to see:
ages = int_between(0,100)
letters = map(chr, int_between(ord('a'), ord('z')))
simple_names = map("".join, list_of_length(6, letters))
persons = mapN(lambda a: Person(*a), (simple_names, ages))
lists_of_person = list_of(persons)
Reading bottom-up, i.e. in execution order, the random choices are:

1. list_of chooses a random length for the list.
2. persons generates the first Person. It first generates a simple_name by randomly choosing 6 letters; each letter is chosen by picking a random integer and mapping it to a letter.
3. A random age is chosen.
4. Steps 2 and 3 repeat for each element of the list.
All told we make 7 choices per Person, and one choice to figure out how many Person objects we should generate.
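To make that concrete, here is a hand-worked example with a made-up choice sequence. The numbers are hypothetical; they just follow the order of choices listed above:

```python
# A hypothetical choice sequence for a one-element list of Person:
# first the list length, then six letter codes, then the age.
choices = [1,                         # list_of: length 1
           98, 111, 98, 98, 111, 98,  # letters 'b','o','b','b','o','b'
           42]                        # age

# Replaying these choices by hand reconstructs the value:
length = choices[0]
name = "".join(chr(c) for c in choices[1:7])
age = choices[7]
print(length, name, age)  # -> 1 bobbob 42
```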
Exactly how many choices there are is not important - the point is that we can think of creating a list of Person objects as a sequence of random choices, represented by a sequence of integers generated by int_between. The process of generating any random value is then a function that takes a sequence of choices as input, and creates a corresponding value as output.
We can use this insight to reproduce any randomly generated value. While generating a random value we record the sequence of choices the generator makes. To re-create the value we simply run the generator again, but we replay the same choices we made randomly before. You can think of the sequence of choices as the DNA of the generated value: choice sequence and generator uniquely determine the value.
Hopefully you're with me so far. The last step is to ask: is it possible to manipulate the choice sequence to make the generated value smaller? If we could do that, we'd effectively have a way to shrink.
Looking through the example, this seems possible. For example, if we make the choice of list length smaller, we'd generate shorter lists. If we make the choice of letter smaller, we'd generate smaller letters and names. We can also try more interesting changes to the choice sequence - for example, we could generate a list of the same elements but in a different order by moving around the sub-sequences that determine each person. In principle, any change made directly to the value can also be made by changing the choice sequence instead.
It turns out this approach works well in practice. This method of shrinking was invented by David R. MacIver for the Python property-based testing library Hypothesis. He calls it "internal shrinking" as opposed to "external shrinking" which we've been discussing in the last two posts. Internal/external is relative to the random generation process: with external shrinking, we make some random choices and generate a value, which is then made smaller by directly manipulating the value. Shrinking is external to random generation. With internal shrinking on the other hand, we manipulate the sequence of random choices to indirectly get smaller values. Shrinking is internal to the random generation process, and runs the exact same generator code, just with a different choice sequence.
A simplified implementation
Let's look at this idea more concretely. We're not changing the way values are initially generated, but we do need to somehow remember the sequence of choices that led to each value, as that sequence is what we'll be using for shrinking.
We'll start by wrapping the call to random.randint so we can record the choice sequence. I've chosen to write a small class for this wrapper. There is more to it than I'm showing here, but we'll go piecemeal.
class ChoiceSeq:
    def __init__(self) -> None:
        self.history: list[int] = []

    def randint(self, low: int, high: int) -> int:
        result = random.randint(low, high)
        self.history.append(result)
        return result
ChoiceSeq makes a random choice, and keeps the result. Now we need to change the implementation of int_between to use this class instead of random.randint directly. To do that we need to change Random to pass an instance of ChoiceSeq around, so we can use it in int_between:
class Random(Generic[T]):
    def __init__(self, generator: Callable[[ChoiceSeq], T]):
        self._generator = generator

    def generate(self, choose: ChoiceSeq) -> T:
        return self._generator(choose)

def int_between(low: int, high: int) -> Random[int]:
    return Random(lambda choose: choose.randint(low, high))
Other functions are essentially unchanged; for example, here is map:
def map(func: Callable[[T], U], gen: Random[T]) -> Random[U]:
    return Random(lambda choose: func(gen.generate(choose)))
Since map doesn't have to make a new choice, it can just pass the ChoiceSeq through. The same is true for mapN and bind.
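To make that concrete, here is a sketch of what mapN and bind could look like under this scheme. The stand-in ChoiceSeq and Random classes below are trimmed to the recording case so the snippet runs on its own - this is an illustration, not the post's exact code:

```python
import random
from typing import Callable, Generic, TypeVar

T = TypeVar("T")
U = TypeVar("U")

class ChoiceSeq:
    # Recording-only stand-in for the post's class.
    def __init__(self) -> None:
        self.history: list[int] = []

    def randint(self, low: int, high: int) -> int:
        result = random.randint(low, high)
        self.history.append(result)
        return result

class Random(Generic[T]):
    def __init__(self, generator: Callable[[ChoiceSeq], T]):
        self._generator = generator

    def generate(self, choose: ChoiceSeq) -> T:
        return self._generator(choose)

def int_between(low: int, high: int) -> Random[int]:
    return Random(lambda choose: choose.randint(low, high))

def mapN(func: Callable[[tuple], U], gens: tuple) -> Random[U]:
    # Each sub-generator draws from the same ChoiceSeq, in order.
    return Random(lambda choose: func(tuple(g.generate(choose) for g in gens)))

def bind(func: Callable[[T], Random[U]], gen: Random[T]) -> Random[U]:
    # Run the outer generator, then hand the same ChoiceSeq to the inner one.
    return Random(lambda choose: func(gen.generate(choose)).generate(choose))

# Usage: a pair generator; both choices land in one shared history.
pair = mapN(lambda t: t, (int_between(0, 9), int_between(10, 19)))
choices = ChoiceSeq()
value = pair.generate(choices)
assert len(choices.history) == 2
```

The key point is visible in both functions: no new choices are made here - the same ChoiceSeq just threads through every sub-generator.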
The types are also unchanged from before, as is the implementation of for_all:
Gen = Random[T]
Property = Gen[TestResult]

def for_all(
    gen: Gen[T],
    property: Callable[[T], Union[Property, bool]]
) -> Property:
    ...  # unchanged from last post
However, test is where it gets interesting:
def test(property: Property):
    def do_shrink(choices: ChoiceSeq) -> None:
        ...

    for test_number in range(100):
        choices = ChoiceSeq()
        result = property.generate(choices)
        if not result.is_success:
            print(
                f"Fail: at test {test_number} with "
                f"arguments {result.arguments}."
            )
            do_shrink(choices)
            return
    print("Success: 100 tests passed.")
We start out much the same as always: by generating up to 100 random values. However, thanks to the ChoiceSeq class, we've now got a record of all the random choices that were made during the generation of each value. When all tests pass, we don't do anything with the choices, although you could imagine storing them in a database as a reproducible log of the executed tests - at least provided the generators don't change too much!
For this post, we only care about do_shrink. Given a choice sequence, we need to figure out how to change it so the resulting value gets smaller. In a real implementation, doing this well is a decent engineering challenge. I'll just demonstrate the principle, by re-using our list shrinking function from a couple of posts ago:
def shrink_candidates(choices: ChoiceSeq) -> Iterable[ChoiceSeq]:
    # this is part of the list shrinker from vintage.py!
    for i, elem in enumerate(choices.history):
        for smaller_elem in shrink_int(elem):
            smaller_history = list(choices.history)
            smaller_history[i] = smaller_elem
            yield ChoiceSeq(smaller_history)
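The shrink_int it relies on is the integer shrinker from the earlier posts. As a rough sketch (the candidate order in the actual vintage.py may differ), it could look like this:

```python
from typing import Iterable

def shrink_int(value: int) -> Iterable[int]:
    # Yield candidates smaller than value, biggest reductions first:
    # 0, then value minus progressively smaller halves, down to value - 1.
    if value == 0:
        return
    yield 0
    half = value // 2
    while half > 0:
        yield value - half
        half //= 2
```

For example, shrink_int(10) yields 0, 5, 8, 9 - trying the most aggressive shrink first, then backing off.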
The idea is that we take the history out of ChoiceSeq, which is a list of int, and create a new list with one of the elements made smaller using shrink_int. We'll need to change ChoiceSeq as well, which we'll do later. As far as do_shrink goes, an initial attempt looks as follows:
def do_shrink(choices: ChoiceSeq) -> None:
    for smaller_choice in shrink_candidates(choices):
        result = property.generate(smaller_choice)
        if not result.is_success:
            print(
                "Shrinking: found smaller arguments "
                f"{result.arguments}"
            )
            do_shrink(smaller_choice)
            break
    else:
        print("Shrinking: gave up")
do_shrink takes the initial failing ChoiceSeq, generates smaller choice sequences via shrink_candidates, and recursively shrinks if it finds a successful shrink.

What is going on in property.generate(smaller_choice)? In that call, we want the ChoiceSeq to replay the choices that were made previously, instead of randomly making new choices and appending them. We need to add replay behavior to ChoiceSeq:
class ChoiceSeq:
    def __init__(self, history: Optional[list[int]] = None):
        if history is None:
            self._replaying: Optional[int] = None
            self.history: list[int] = []
        else:
            self._replaying = 0
            self.history = history

    def randint(self, low: int, high: int) -> int:
        if self._replaying is None:
            # recording
            result = random.randint(low, high)
            self.history.append(result)
            return result
        else:
            # replaying
            if self._replaying >= len(self.history):
                raise InvalidReplay()
            value = self.history[self._replaying]
            self._replaying += 1
            if value < low or value > high:
                raise InvalidReplay()
            return value

    def replay(self) -> None:
        self._replaying = 0

    def replayed_prefix(self) -> ChoiceSeq:
        if self._replaying is None:
            raise InvalidOperation()
        return ChoiceSeq(self.history[:self._replaying])
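To see the two modes in action, here is a small self-contained demo. It re-declares a trimmed-down ChoiceSeq (without replay() and replayed_prefix()) so the snippet stands on its own:

```python
import random
from typing import Optional

class InvalidReplay(Exception):
    pass

class ChoiceSeq:
    def __init__(self, history: Optional[list[int]] = None):
        if history is None:
            self._replaying: Optional[int] = None   # recording mode
            self.history: list[int] = []
        else:
            self._replaying = 0                     # replaying mode
            self.history = history

    def randint(self, low: int, high: int) -> int:
        if self._replaying is None:
            result = random.randint(low, high)      # recording: choose and remember
            self.history.append(result)
            return result
        if self._replaying >= len(self.history):
            raise InvalidReplay()                   # ran out of recorded choices
        value = self.history[self._replaying]
        self._replaying += 1
        if value < low or value > high:
            raise InvalidReplay()                   # recorded choice out of bounds
        return value

# Record two random choices.
rec = ChoiceSeq()
pair = (rec.randint(0, 100), rec.randint(0, 100))

# Replaying the same history re-creates the same pair.
rep = ChoiceSeq(list(rec.history))
assert (rep.randint(0, 100), rep.randint(0, 100)) == pair

# A manipulated history that violates the bounds raises InvalidReplay.
bad = ChoiceSeq([101])
try:
    bad.randint(0, 100)
except InvalidReplay:
    print("invalid replay detected")
```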
There's a lot going on here, so bear with me. ChoiceSeq now has two modes: record and replay. When created without a history argument, it is in recording mode - any call to randint generates a new random int and appends it to the list, as before. However, if replay is called explicitly, or a history argument is given to __init__, then ChoiceSeq replays the history of choices without making any random choices. In other words, creating a new ChoiceSeq(choices.history) and passing that to generate exactly re-creates the original value. The premise of internal shrinking is that we can generate smaller values by making the list of choices smaller.
There is however a snag: by changing choices earlier in the sequence, we influence the behavior of the generator. For example, the generator may need more history than was recorded for the initial randomly generated value, or an integer in the history may be out of the desired bounds. In those cases, we fail with InvalidReplay - later on, we'll react to InvalidReplay by declaring the resulting value an unsuccessful shrink. Note that if this happens, we bail out quickly, in most cases before we even run the test.
Vice versa, thanks to our devious manipulations, we'll hopefully use fewer choices - that indicates we've actually managed to make the value smaller. For example, if we make the length of a list smaller, then it's likely we won't use the last part of the history. The method replayed_prefix extracts the part of the choice history that has been used so far, so we can keep shrinking just that part.
With all that, here is the final version of do_shrink:
def do_shrink(choices: ChoiceSeq) -> None:
    for smaller_choice in shrink_candidates(choices):
        try:
            result = property.generate(smaller_choice)
        except InvalidReplay:
            continue
        if not result.is_success:
            print(
                "Shrinking: found smaller arguments "
                f"{result.arguments}"
            )
            do_shrink(smaller_choice.replayed_prefix())
            break
    else:
        choices.replay()
        print(
            "Shrinking: gave up at arguments "
            f"{property.generate(choices).arguments}"
        )
There are three things to note here:
1. As explained, we react to InvalidReplay by considering the shrink as failed, and carry on with the next candidate.
2. If a shrink is successful, we carry on with the part of the choice sequence that's actually used, by calling replayed_prefix.
3. To show the value of our last and most successful shrink, we use the replay method to reproduce the last value for display purposes.
And that is the essence of the internal shrinking approach! Let's see it in action next. But first: coffee and biscuit time.
Putting it to the test
We don't need to change usage of our API at all - generators and properties are implemented the same as before. This is an amazing achievement: the original QuickCheck API from over 20 years ago has needed no changes, despite one big refactor and one re-implementation. There are not many APIs I've designed that I expect to stand the test of time that well.
Let's try our new implementation on our usual example, sort_by_age. If you need a refresher, see the first post in this series.
> test(prop_sort_by_age)
Success: 100 tests passed.
Hardly interesting. Shrinking is what we want to see!
> test(prop_wrong_sort_by_age)
Fail: at test 1 with arguments ([Person(name='pjahih', age=0), Person(name='iapxij', age=55), Person(name='mvrnyq', age=57), Person(name='qyzwko', age=94), Person(name='iuzotw', age=4)],).
Shrinking: found smaller arguments ([Person(name='pjahih', age=0), Person(name='iapxij', age=55), Person(name='mvrnyq', age=57), Person(name='qyzwko', age=94)],)
Shrinking: found smaller arguments ([Person(name='pjahih', age=0), Person(name='iapxij', age=55), Person(name='mvrnyq', age=57)],)
Shrinking: found smaller arguments ([Person(name='pjahih', age=0), Person(name='iapxij', age=55)],)
Shrinking: found smaller arguments ([Person(name='ojahih', age=0), Person(name='iapxij', age=55)],)
Shrinking: found smaller arguments ([Person(name='njahih', age=0), Person(name='iapxij', age=55)],)
...
Shrinking: found smaller arguments ([Person(name='abaaaa', age=0), Person(name='aaaaaa', age=2)],)
Shrinking: found smaller arguments ([Person(name='abaaaa', age=0), Person(name='aaaaaa', age=1)],)
Shrinking: gave up at arguments ([Person(name='abaaaa', age=0), Person(name='aaaaaa', age=1)],)
Remember, prop_wrong_sort_by_age doesn't sort a list of Person objects by age, but by name and age. This particular run had about 160 successful shrinks out of 20,000 tries. It's definitely putting in the effort, and the result is not too shabby: it has found a minimal counterexample. In my cursory testing it comes up with a two-person list with ages 0 and 1 most of the time (the names being some variant of aaaaaa and baaaaa). Sometimes it only finds a three-person list.
This is a good result! The occasional misses don't indicate a shortcoming, just that my implementation is simple-minded. By manipulating the choice sequence in smarter ways, it's possible to make this shrink to a single canonical example every time. If that doesn't work for some case, we need smarter strategies - Hypothesis has an internal mini-language to describe how to manipulate the choice sequence for various cases; apparently, shrinking floating-point values in particular needs special care.
Conclusion
It definitely seems like we have a winner on our hands. Compared to the previous approach, we keep the same benefit: we don't need to burden the user with writing shrink-candidate functions in addition to generator functions. Since shrinking is now internal to random generation, the question of how to integrate random generation and shrinking doesn't even come up - the two are integrated at the core. Because of this tighter integration, internal shrinking has additional advantages:
Works great for mutable objects. Each shrink re-creates the object from scratch from the replayed choice sequence, running the exact same generator code as during random generation. As a result, this approach is much better suited for imperative-minded languages like Python, C, C# and Java.
Can reproduce shrunk values. So far it's been easy to reproduce randomly generated values by storing the pseudo-random seed, but that doesn't work for any of the shrunk values. Now that we have the choice sequence, we can easily reproduce the shrunk values too.
Bind without remorse. In the last post, we noticed that if a generator is written using bind, the integrated approach loses a lot of information, because it needs to randomly re-generate values for the inner generator. For example, if you write a generator for a list by randomly choosing a length and then using bind to generate the values, every time it shrinks the length, the values of the list are randomly re-generated. With internal shrinking, this is not the case - the choice sequence can be changed so the exact same elements are re-generated. You can see this in the example above: the first few shrinks chop elements from the end of the list, while keeping the first elements the same. The end result is that internal shrinking is better at preserving information from the initial value that made the test fail.
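To illustrate at the level of the choice sequence (with made-up numbers), suppose a bind-based list generator draws the length first and then one choice per element:

```python
# Hypothetical choice sequence for a bind-based list generator:
# the first choice is the length, the rest are the elements.
original = [3, 17, 42, 8]          # generates [17, 42, 8]

# Internal shrinking edits only the length choice; replaying then
# consumes the length plus the first two element choices, leaving
# the surviving elements exactly as they were.
shrunk = [2] + original[1:]
length = shrunk[0]
elements = shrunk[1 : 1 + length]
print(elements)  # -> [17, 42]: same elements, just fewer of them
```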
All seems well. So, is this the final word in shrinking for property-based testing? Not entirely! I know of one more approach, which again has all the advantages of internal shrinking, with a simpler implementation. But for that, you'll have to wait for the next post...
As usual, comment below or discuss on Twitter @kurt2001.
Until next time!