Heisenbugs

In Castle, one can dynamically connect components and send “events” over those connections. Typically this is done as an action on an incoming message (see: The Actor Model). And depending on ‘The Machinery (ToDo)’, those events can be queued. It is this combination that can result in a beautiful Heisenbug.

First, let’s explain the Heisenbug, before we give an example. Then we analyze it, show how to improve the code, and finally formulate a requirement to prevent & detect this kind of bug in Castle.

What is a Heisenbug?

The heisenbug is named after the theoretical physicist Werner Heisenberg, whose name is associated with the “observer effect”: when you look closely, the behavior changes. The same can happen to software bugs: the behavior apparently changes when you study (or slightly adjust) the code. Often this is due to (small) changes in timing, possibly even in generated code. Therefore, old-fashioned sequential code on slow CPUs is less likely to have heisenbugs than concurrent code on fast multi-core systems. Heisenbugs are also common in threaded programs.

The Sieve goes wrong

My standard example, The Sieve demo (start/DRAFT), suffered from this issue. The initial version worked for years but failed horribly when another “machinery” was used. Once studied, the bug turned out to be simple and easy to fix.

There are two related timing issues that together result in the Heisenbug. First, we introduce them and then show how the combination may fail.

Event-order

Conceptually, the Generator sends (events with) integers to Sieve(2), which may forward them to Sieve(3), then to Sieve(5), etc. As shown in the Conceptual sidebar, we would probably like to assume that each integer is fully sieved before the next ’int’ starts: the classic “sequential view” we are used to.

However, that isn’t how it works. In Castle, only the order of events on a single connection is defined (one by one, sequential). And given the code, the integer sent by a Sieve always comes later than the incoming one. That is all we may assume.
The relative timing of events on different connections is not defined. That order may depend on The Machinery (ToDo) and many other factors. Do not, as a developer, assume any order – as I did!

As shown in the One-by-One sidebar diagram, this can result in the Generator outputting all its events first. Next, Sieve(2) filters out the even integers, then Sieve(3) processes all its input, then Sieve(5), etc.
Although this execution uses no concurrency, and it needs tremendous buffers when finding big primes, it does conceptually work. And so, it is an allowed execution [2].
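
To see why this extreme order is still correct, here is a minimal Python sketch (not Castle code; the queue layout and all names are invented, and the chain is assumed to be already built): every connection is modeled as a FIFO, the Generator flushes all its events first, and each Sieve then drains its whole input before the next one runs.

from collections import deque

LIMIT = 30
chain = [2, 3, 5]        # pretend Main already built Sieve(2) -> Sieve(3) -> Sieve(5)

# One FIFO queue per connection: queues[0] feeds Sieve(2), the last one the Finder.
queues = [deque() for _ in range(len(chain) + 1)]

# 1) The Generator outputs *all* its events first (the extreme one-by-one order).
for n in range(2, LIMIT):
    queues[0].append(n)

# 2) Then each Sieve drains its whole input queue before the next Sieve runs.
for i, p in enumerate(chain):
    while queues[i]:
        n = queues[i].popleft()
        # In the real demo, Sieve(p) only exists *after* p has passed, so keep p here.
        if n == p or n % p != 0:
            queues[i + 1].append(n)

# 3) Everything that reaches the last queue is what the Finder reports.
print(list(queues[-1]))   # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]: still correct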

Reconnecting

The chain of Sieves will grow as we find more primes. Whenever an int isn’t filtered out and reaches the Finder, a new prime is found. Then a new Sieve element is created and inserted into the chain.
Building the chain is done by the Main component (which is signaled by the Finder) [1].

Therefore, Main remembers the last_sieve and reconnects its output to the newly created Sieve, whose output is temporarily connected to the Finder.
For every newly found prime, this is repeated.

This detail is shown in the With Details sidebar diagram, where the Finder and Main, and all their messages, are shown too.

Assuming the initial “conceptual” order, you will see the same Sieves come alive (the “new” messages) and be appended to the end of the sieve chain. The integers still flow (now shown as “try(int)” messages) through each sieve.
You will also notice that the Finder does indeed find all primes.
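
A rough Python sketch of the reconnect logic described above (the names Sieve, Finder, Main, last_sieve, send and found are invented for illustration; this is not Castle code). Because the sketch uses plain synchronous calls, every integer is fully sieved before the next one starts, so this naive version produces the right primes:

class Sieve:
    """Drops multiples of `prime`, forwards everything else to `out`."""
    def __init__(self, prime, out):
        self.prime = prime
        self.out = out                        # temporarily: the Finder

    def send(self, n):
        if n % self.prime != 0:
            self.out.send(n)

class Finder:
    """Every int that reaches the Finder is a newly found prime."""
    def __init__(self):
        self.main = None                      # set after Main is constructed
        self.primes = []

    def send(self, n):
        self.primes.append(n)
        self.main.found(n)                    # signal Main to grow the chain

class Main:
    """Grows the chain: Generator -> Sieve(2) -> Sieve(3) -> ... -> Finder."""
    def __init__(self, finder):
        self.finder = finder
        self.first = finder                   # what the Generator is connected to
        self.last_sieve = None                # tail of the chain (None: no Sieve yet)

    def found(self, prime):
        new_sieve = Sieve(prime, out=self.finder)   # new tail feeds the Finder
        if self.last_sieve is None:
            self.first = new_sieve            # "reconnect" the Generator
        else:
            self.last_sieve.out = new_sieve   # reconnect the previous tail
        self.last_sieve = new_sieve

finder = Finder()
main = Main(finder)
finder.main = main
for n in range(2, 30):                        # the Generator: a simple loop
    main.first.send(n)                        # synchronous: no queues, no Heisenbug
print(finder.primes)                          # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]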

The combination

Now let us study how the sieve chain will grow with a “fast” Generator, when the one-by-one order of events is used. The diagram is shown below.

As we can see in the diagram, it goes dreadfully wrong. No proper chain is created, and we will “find” integers like 4 and 6, which are not prime.
With a (very) fast Generator, all integers are sent to the Finder before any Sieve is created, so every int is reported as prime. Besides, too many elements are added to the chain, as a Sieve component is created for each found “prime”. On top of that, no integer is ever sieved…

This is just an example. As we can’t predict (or assume) any order, we may find other results too. And when we add a debugging print statement (and look closer), we change the timing and will find yet other results. We have found the observer effect!

@startuml
'hide footbox
title Wrong (fast generation & slow creation)

participant  Main        as  M
participant  Finder      as  F
participant "Generator"  as  G
participant "Sieve(2)"   as S2
participant "Sieve(3)"   as S3
participant "Sieve(4)"   as S4 #Red\IndianRed
participant "Sieve(5)"   as S5
participant "Sieve(6)"   as S6 #Red\IndianRed
participant "Sieve(7)"   as S7

... Suppose, the Generator is active first ...

G -[#green]\\   F: try(2)
G -[#purple]\\  F: try(3)
G -[#purple]\\  F: try(4)
G -[#purple]\\  F: try(5)
G -[#purple]\\  F: try(6)
G -[#purple]\\  F: try(7)

... And then the Finder ...

F -[#green]\  M: found(2)
F -[#purple]\ M: found(3)
note right: Although 3 is a prime, it never passed //Sieve(2)//. <color:red>Wrong</color>
F -[#red]\    M: found(4)
note right  #IndianRed/Red: Found prime: <color:white>**4**</color> -- wrong result
F -[#purple]\ M: found(5)
note right: Again, 5 is a prime, but it's **not** //sieved//. <color:red>Wrong</color>
F -[#red]\    M: found(6)
note right  #IndianRed/Red: Found prime: <color:white>**6**</color> -- wrong result
F -[#purple]\ M: found(7)
note right: Again, 7 is a prime, but .... <color:red>wrong</color>

... And last: we create the  Sieves & reconnect ...

M -\ S2: new
activate S2
G \\--o M: "reconnected to Sieve(2)"
hnote right: This reconnect is too late, as several messages were already sent


M -[#purple]\ S3: new
activate S3
S2 \\--o M: "reconnected to Sieve(3)"
hnote right: This reconnect is too late, as several messages were already sent


M -[#red]\ S4: new
activate S4
note right: Another flaw: We shouldn't have **Sieve(4)**
S3 \\--o M: "reconnected to Sieve(4)"


M -[#purple]\ S5: new
activate S5
hnote right: Sieve(5) never gets input; or at least too little
S4 \\--o M: "reconnected to Sieve(5)"


M -[#red]\ S6: new
activate S6
note right: Another flaw: We shouldn't have **Sieve(6)**
S5 \\--o M: "reconnected to Sieve(6)"


M -[#red]\ S7: new
activate S7
S6 \\--o M: "reconnected to Sieve(7)"
hnote right: Again: too late

== etc ==

@enduml
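
The same failure can be reproduced with a few lines of Python (a sketch only; the queue and the names are invented, this is not Castle code): model the Generator’s output connection as a FIFO of pending events and let the “fast” Generator flush everything before the Finder and Main ever run.

from collections import deque

# Model the Generator's output connection as a FIFO of pending try(n) events.
# At start-up it is still connected to the Finder: no Sieve exists yet.
generator_out = deque()
found = []                                  # what the Finder reports as "prime"
sieves = []                                 # the Sieve components Main creates

# 1) A *fast* Generator queues all its events before anything else runs.
for n in range(2, 8):
    generator_out.append(n)                 # try(2) ... try(7)

# 2) Only then are the Finder and Main scheduled: every queued int reaches the
#    Finder unsieved, and Main creates a Sieve for each of them.
while generator_out:
    n = generator_out.popleft()
    found.append(n)                         # Finder: "a new prime!"
    sieves.append(f"Sieve({n})")            # Main: grow the chain (far too late)

print(found)    # [2, 3, 4, 5, 6, 7] -- 4 and 6 are not prime
print(sieves)   # includes 'Sieve(4)' and 'Sieve(6)', which should not exist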

Warning

It is not “the timing” that is wrong!
A concurrent program is only correct when it goes right for any possible timing.

As in all software engineering, we can prove code is buggy by showing that it goes wrong at least once. That is what we have shown. And so, the original code is wrong.

How to improve?

Finding this heisenbug triggered an investigation to improve Castle as well as ‘The Sieve demo (start/DRAFT)’. Our goal is not just to improve the sieve code, but all programs, and to give the programmer options to prevent heisenbugs.

As we will see, fixing the sieve is simple: use the SlowStart (base) protocol. It changes only three lines!
But we start with a new requirement, for a new tool in the Castle Workshop.

Simulation

As we will see below, the SlowStart protocol will remove our Heisenbug. But it does not abolish Heisenbugs in general!

The feature by no means prevents a developer from using other solutions with a comparable flaw. Besides, it’s hopeless to use testing to prove the absence of a Heisenbug: as we have seen above, all variants of all possible concurrent execution orders have to be correct, while we have very limited control over which variant is actually executed.

This led to a new requirement: Simulate & Verify all possi... (U_Tools_EventOrder). To assist the developer, the Castle Workshop will come with tools to detect such bugs. See Simulation/Verification (todo) for more on this.
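
To give an impression of what such a tool has to explore, here is a toy Python sketch (not the planned Castle tool, and far from a real model checker): it enumerates every global delivery order that respects the per-connection FIFO rule, so a verifier could replay the program under each of them and check its invariants.

from itertools import permutations

# Two connections, each with its own FIFO of pending events (toy example).
pending = {"gen->finder":  ["try(2)", "try(3)"],
           "finder->main": ["found(2)"]}

def respects_fifo(order):
    """Accept only global orders that keep each connection's events in FIFO order."""
    last = {}
    for conn, idx in order:
        if idx < last.get(conn, -1):
            return False
        last[conn] = idx
    return True

events = [(conn, i) for conn, evs in pending.items() for i in range(len(evs))]
for perm in filter(respects_fifo, permutations(events)):
    schedule = [pending[conn][idx] for conn, idx in perm]
    print(schedule)   # a verifier would replay the program for every schedule
                      # and check its invariants (e.g. "only primes are found")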

SlowStart

Castle (now) comes [3] with the parametric base protocol SlowStart, which is based on TCP slow start and contains a queueing model [4] to control the speed of an (event) connection. As the name suggests, initially the connection will be slow. Roughly, the parameter sets the maximal number of unhandled events on a connection.
The (improved) version of The Sieve demo (start/DRAFT) uses a SlowStart of 1. And Main will remove (or increase) that limit after reconnecting.

Initially, the Generator is only allowed to send one event, which is received by the Finder. Then Main creates the first Sieve (Sieve(2)), reconnects the Generator to that component, and increases the “speed” of the connection. As the connection Generator->Sieve(2) is stable, there is no need to limit that “queue”.

The Generator acts as before: it sends (events with) integers over its output. But now, the SimpleSieve protocol can slow down the Generator – no code changes are needed for this. This may happen when the second event is sent before the first one has been received (or “handled”) by the Finder, while the limit is still 1.
As this limit is removed when a Sieve component is inserted into the chain, only the start is slow…

The same happens to every Sieve: initially (when it is connected to the Finder) there is a limit of 1 event in the queue. But when that connection is reconnected –and the potential Heisenbug is gone– the limit is removed.
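
A small Python sketch of the idea behind such a limited connection (the class and method names are invented; the real SlowStart is a Castle base protocol, not this code): the connection refuses more than `limit` unhandled events, so a fast sender has to wait, and Main can remove the limit once the connection has been reconnected.

from collections import deque

class SlowStartConnection:
    """A connection with a maximum number of unhandled (queued) events."""
    def __init__(self, limit=1):
        self.limit = limit          # None means: no limit
        self.pending = deque()

    def can_send(self):
        return self.limit is None or len(self.pending) < self.limit

    def send(self, event):
        if not self.can_send():
            return False            # sender must wait: it is "slowed down"
        self.pending.append(event)
        return True

    def handle_one(self):
        return self.pending.popleft() if self.pending else None

# Start-up: Generator -> Finder with a SlowStart limit of 1.
conn = SlowStartConnection(limit=1)
print(conn.send("try(2)"))    # True : the first event is accepted
print(conn.send("try(3)"))    # False: refused until try(2) has been handled
print(conn.handle_one())      # 'try(2)' reaches the Finder; Main can now react
conn.limit = None             # Main reconnected the Generator: remove the limit
print(conn.send("try(3)"))    # True : full speed from here on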

Tip

Unlimited or better?

In this blog, we remove the limit of the SlowStart protocol completely, for simplicity. Then the Heisenbug is solved.

That is not the only option.
Given that the Generator is a simple loop, it can produce many integers fast, and so cause huge piles of queued events. That can be handled better with the same concept: a maximal queue size (again: just a model).

It’s up to the developer to optimize this. Some prefer a maximum queue length equal to the number of Sieve components, others relate the maximum queue length to the number of available cores. Some use static values, others adjust them during the run-time of the application.
That is all possible with a few lines in the Main component. But the Sieve component can also set this limit, both for incoming ports as well as for outgoing ports.
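
For illustration only, two such policies could look like this in Python pseudo-code (the function names are made up; in Castle this would be a few lines in Main or in the Sieve component itself):

import os

def limit_by_chain_length(number_of_sieves: int) -> int:
    """Allow one unhandled event per Sieve component in the chain."""
    return max(1, number_of_sieves)

def limit_by_cores() -> int:
    """Bound the backlog by the number of available CPU cores."""
    return os.cpu_count() or 1

print(limit_by_chain_length(4))   # 4: one queued event per Sieve in a 4-Sieve chain
print(limit_by_cores())           # machine dependent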


Footnotes
