.. (C) 2023,2024 Albert Mietus. Part of CCastle project .. _Castle-Heisenbug: ======================== A Heisenbug in the Sieve ======================== .. post:: 2024/03/27 :category: CastleBlogs, Castle, DesignStudy :tags: Castle, DRAFT In Castle, one can dynamically connect components and send *events* over those connections. Typically, this is an action on an incoming message (see: :ref:`CCC-Actors`). And depending on ‘:ref:`TheMachinery`’, those events can be queued. It is this combination that *can result* in a **beautiful Heisenbug**. First, let’s explain the Heisenbug, before we give an example. Then we analyze it, show how to improve the code, and finally formulate a *requirement* to prevent & detect this kind of bug in Castle. What is a Heisenbug? ==================== The `heisenbug `__ is named after the theoretical physics *Werner Heisenberg*, who described the *“observer effect”*: when you look closely, the behaviour changes. The same can happen to software (bugs). The behaviour apparently changes when you study -or slightly adjust- that code. Often, this is due to small changes in timing, possibly even in generated code. Therefore old (old-fashioned) sequential code on slow CPUs is less likely to have heisenbugs than concurrent code on fast multi-core systems. It’s also common in threaded programs. .. include:: ./Heisenbug-sidebar-Sequence.irst The Sieve goes wrong ==================== My standard example, :ref:`Castle-TheSieve`, suffered from this issue. The initial version did work for years but failed horribly when another “machinery” was used. After studying this, the bug is simple and easy to fix. There are two related timing issues that together result in the Heisenbug. First, we introduce them and then show how the combination may fail. Event-order ----------- Conceptually, the `Generator` sends (events with) integers to `Sieve(2)`, which may forward them to `Sieve(3)`, then to `Sieve(5)`, etc. As shown in the **Conceptual sidebar**, we probably assume that each integer is sieved before the next *’int’* *starts*: the classic “sequential view” we are used to. However, that isn’t how it works. In Castle, *only* the order of events on a single connection is defined (*one by one, sequential*). And given the code, the integer sent by a `Sieve` is always later than the incoming one. That is all we may assume. |BR| The timing of unrelated events on multiple connections is not defined. That order may depend on :ref:`TheMachinery` and many other factors. Do not assume any order -- as I did! The diagram in the **One-by-One** sidebar shows another (allowed) order. The Generator outputs all events first before Sieve(2) starts filtering. Then, the odd integers are passed to Sieve(3), which processes all its input, then Sieve(5), etc. |BR| Although we aren’t using concurrency, and it needs tremendous buffers when finding big primes, it does conceptually work. And so, it is an allowed execution [#ButImprove]_. As that order is “possible” we should have a design (and code) that handles it. But it doesn't ... (in the basic variant). Reconnecting ------------ The chain of `Sieve`\s will grow as we find more primes. Whenever an *int* isn’t filtered out and reaches the `Finder` a *new prime* is found. Then, a new Sieve element is created and inserted into the chain. |BR| Building the chain is done by the Main component (which is signalled by the Finder) [#orVariant]_. Therefore, `Main` remembers the ``last_sieve`` and reconnects that output to the newly created `Sieve`. Which output is temporally connected to the `Finder` |BR| For every newly found prime, this is repeated. This detail is shown in the “**with Details**” sidebar diagram; where the `Finder` and `Main` and all its messages are shown too. Assuming the initial “conceptual” order, you will see the same Sieve(s) become alive (“new” message), and are added to the end of the sieve chain. The integers still flow (now, shown as “try(`int`)” messages) by this sieve. |BR| You will also notice the `Finder` does indeed find all primes. The combination =============== Now let us study how the sieve chain will grow with a “fast generator”, and the one-by-one order of events is used. This diagram is shown below. As we can see in the picture, it goes dreadfully wrong. No proper chain is created, and we will find integers like **4** and **6**. This is wrong, they are not prime. |BR| With a (very) *fast Generator*, **all** integers are sent to the `Finder` --before any `Sieve` is created, and so any int is reported as prime. besides, too many elements are added to the chain as a Sieve component is created for each found “prime”. On top of that, no integer is ever sieved... This is just **an** example. As we can’t predict (or assume) any order, we can find other results too. And, when we add “debugging print statement” (and *look closer*), we change the timing and will find other results. We found the *observer effect*! .. uml:: ./sieve-Sequence-Wrong.puml .. warning:: It is not *“the timing”* that is wrong! |BR| A concurrent program is only correct when it goes right for **any** possible timing. As in all software engineering, we can prove it is buggy when it goes wrong at least once. That is what we have shown. And so, the original code is *wrong*. How to improve? =============== Finding this heisenbug triggered an investigation to improve Castle as well as ‘:ref:`Castle-TheSieve`’. Where our goal is not just to improve the sieve code, but all programs. And give the programmer options to prevent heisenbugs. As we will see fixing ‘:ref:`the sieve ` is simple: use the **SLowStart** (base) protocol. It changes only three lines! |BR| But we start with a new requirement, for a new tool in the :ref:`Castle Workshop `. Simulation ---------- As we will see below the `SlowStart` protocol will *remove* our Heisenbug. But it does not abolish the Heisenbug! The feature does by no means prevent a developer from using other solutions, with a comparable flaw. Besides, it’s hopeless to use testing to prove the absence of a Heisenbug. As we have seen above, all variants of all possible concurrent execution orders should be correct. Where we have very limited control over which variant is used. This led to a new requirement: :need:`U_Tools_EventOrder.` To assist the developer the Castle Workshop will come with tools to detect such bugs. See :ref:`simulation` for more on this. .. include:: ./sieve-protocols2-sidebar.irst SlowStart --------- Castle (now) comes [#KipEi]_ with the parametric *base* protocol :ref:`SlowStart `, which is based on `TCP Slow start `__ and contains a queueing model [#ModelOnly]_ to control the speed of an (event) connection. As the name suggests, initially the connections with be slow. Roughly, the parameter set the maximal number of unhandled events on a connection. |BR| The (improved) version of :ref:`Castle-TheSieve` uses a SlowStart of **1**. And `Main` will remove (or increase) that limit after reconnecting. Initially, the `Generator` is only *allowed* to send one event, which is received by the `Finder`. Then, `Main` will create the first Sieve (`Sieve(2)`), reconnect the Generator to that component, and increase the “speed” of the connection. As the connection **Generator->Sieve(2)** is stable, the is no need to limit the “queue”. The `Generator` acts as before: it sends (events with) integers over its output. But now, the SimpleSieve protocol can slow down the `Generator` --no code changes are needed for this. This may happen when the second event is sent before it is received (or “handled”) by the Finder, and the limit is set to **1**. |BR| As this limit is removed when a Sieve-component is inserted into the chain, only the start is slow... The same happens to every Sieve: initially (when it is connected to the Finder) there is a limit of 1 event in the queue. But when that connection is reconnected --and the potential Heisenbug is gone-- the limit is removed. .. tip:: Unlimited or better? In this blog, we remove the limit of the ``SlowStart protocol`` completely, for simplicity. Then the Heisenbug is solved. That is not the only option. |BR| Given the `Generator` is a simple loop it can produce many integers fast. And so causes huge piles of queued events. That can be done better, by the same concept: a maximal queue size. (again: just model). It’s up to the developer to optimize this. Some prefer the maximum queue length equal to the number of Sieve-components, others relate the maximum queue length to the number of available cores. Some use static values, others will adjust them over the run-time of the application. |BR| That is all possible with a few lines in the `Main` component. But also the Sieve component can set this limit, both for incoming ports as well for outgoing ports. ----- .. rubric:: Footnotes .. [#orVariant] Again, there are many variants of :ref:`Castle-TheSieve`. In some adding a `Sieve` to the chain is done by the `Finder`, in others by `Main`. In both alternatives, the same reconnect is needed. There is no real difference -- but the name of the component initiating the change. |BR| The Heisenbug will not (trivially) disappear when switching between hose variants .. [#ButImprove] Still, as language designers, we need to give the programmer more options to hint at a more optimal implementation. .. [#KipEi] This *SlowStart* base protocol is part of Castle; see :ref:`Protocol-SlowStart`. But the need for it follows --like this blog-- from the discovery of these :ref:`Castle-Heisenbug` in :ref:`Castle-TheSieve`. .. [#ModelOnly] Remember, this queue exists as a *model* **only** (like everything in Castle-code)! |BR| Depending on ‘:ref:`TheMachinery`, there may be no need to implement the queue (e.g.with DirectCall) at all; or there may only be a queue length and -maximum, or ... .. LocalWords: heisenbug heisenbugs