Concurrent Computing Concepts

Sooner as we realize, even embedded systems will have piles & heaps of cores, as I described in “Keep those cores busy!”. Castle should make it easy to write code for all of them: not to keep them busy, but to maximize speed up [useCase: In Castle is easy to use th... (U_ManyCore)]. I also showed that threads do not scale well for CPU-bound (embedded) systems. Last, I introduced some (more) concurrency abstractions. Some are great, but they often do not fit nicely in existing languages.
Still, as Castle is a new language, we have the opportunity to select such a concept and incorporate it into the language.

In this blog, we explore a bit of theory. I will focus on semantics and the possibilities to implement them efficiently. The exact syntax will come later.

Basic terminology

Many theories are available, as are some more practical expertise, regrettably hardly non of them share a common vocabulary. For that reason, I first describe some basic terms, and how they are used in these blogs. As always, we use Wikipedia as common ground and add links for a deep dive.
Again, we use ‘task’ as the most generic term for work-to-be-executed; that can be (in) a process, (on) a thread, (by) a computer, etc.


Concurrency is the ability to “compute” multiple tasks at the same time.
Designing concurrent software isn’t that complicated but; demands another mindset than when we write software that does one task after the other.

A typical example is a loop: suppose we have a sequence of numbers and we like to compute the square of each one. Most developers will loop over those numbers, get one number, calculate the square, store it in another list, and continue. It works, but we have also instructed the computer to do it in sequence. Especially when the task is a bit more complicated, the compiler does know whether the ‘next task’ depends on the current one, and can’t optimize it.

A better plan is to tell the compiler about different tasks. Most are independent: square a number. There is also one that has to be run at the end: combine the results into a new list. And one is a bit funny: distribute the elements over the “square tasks”. Clearly one has to start with this one, but it can be concurrent with many others too.
This is not a parallel algorithm. When not specifying the order, we allow parallel execution. We do not demand it, sequential execution is allowed too.


Parallelism is about executing multiple tasks (seemingly) at the same time. We will on focus running many multiple concurrent tasks (of the same program) on “as many cores as possible”. When we assume a thousand cores, we need a thousand independent tasks (at least) to gain maximal speed up. A thousand at any moment!
It’s not only about doing a thousand tasks at the same time (that is not too complicated, for a computer) but also — probably: mostly — about finishing a thousand times faster…

With many cores, multiple “program lines” can be executed at the same time, which can introduce unforeseen effects: changing the same variable, accessing the same memory, or competing for new, “free” memory. And when solving that, we introduce new hazards: like deadlocks and even livelocks.


A special form of parallelism is Distributed-Computing: computing on many computers. Many experts consider this an independent field of expertise. Still –as Multi-Core is basically “many computers on a chip”– it’s an available, adjacent [3] theory, and we should use it, to design our “best ever language”.

Communication Efficiently

When multiple tasks run concurrently, they have to communicate to pass data and control progress. Unlike in a sequential program – where the control is trivial, as is sharing data– this needs a bit of extra effort.
There are two main approaches: shared-data of message-passing; we will introduce them below.

Communication takes time, especially wall time [4] (or clock time), and may slow down computing. Therefore communication has to be efficient. This is an arduous problem and becomes harder when we have more communication, more concurrency, more parallelism, and/or those tasks are short living. Or better: it depends on the ratio of time-between-communications and the time-between-two-communications.

Shared Memory

In this model all tasks (usually threads or processes) have some shared/common memory; typically “variables”. As the access is asynchronous, the risk exists the data is updated “at the same time” by two or more tasks. This can lead to invalid data and so Critical-Sections are needed.
This is a very basic model which assumes that there is physical memory that can be shared. In distributed systems this is uncommon, but for threads it’s straightforward.

An advantage of shared memory is the fast communication-time. The wall-time and CPU-time are roughly the same: the time to write & read the variable added to the (overhead) time for the critical section – which is typically the bigger part.
The big disadvantage of this model is that is hazardous: The programmer needs to insert Critical_Sections into his code at all places that variable is used. Even a single access to a shared variable, that is not protected by a Critical-Section, can (will) break the whole system [5].


A more modern approach is Message-Passing: a task sends some information to another; this can be a message, some data, or an event. In all cases, there is a distinct sender and receiver – and apparently no common/shared memory– so no Critical-Sections [6] are needed; at least not explicitly. Messages are easier to use and more generic: they can be used in single-, multi-, and many-core systems. Even distributed systems are possible – then the message (and its data) is serialised, transmitted over a network, and deserialised.

As you may have noticed, there is an analogy between Message-Passing and Events (in an event-loop). They have separate histories but are quite similar in nature. Like a “message”, the “event” is also used to share data (& control) to isolated “tasks”.


Many people use the networking mental model when they think about Message-Passing, and wrongly assume there is always serialisation (and network) overhead. This is not needed for parallel cores as they typically have shared (physical) memory.

Then, we can use the message abstraction at developer-level, and let the compiler translate that into shared memory instructions for the processor level.
Notice: As the compiler will insert the (low level) Semaphores, the risk that a developer forgets one is gone!

Messaging Aspects

There are many variants of messaging, mostly combinations of some fundamental aspects. Let mentions some basic ones.


Synchronous messages resemble normal function calls. Typically a “question” is sent, the call awaits the answer message, and that answer is returned. This can be seen as a layer on top of the more fundamental send/receive calls. A famous example is RPC: the Remote Procedure Call.

Asynchronous messages are more basic: a task sends a message and continues. That message can be “data”, an “event”, a “command”, or a “query”. Only in the latter case, some response is essential. With async messages, there is no desire to get the answer immediately.

As an example: A task can send many queries (and/or other messages) to multiple destinations at once, then go into listen-mode, and handle the replies in the order they are received (which can be different than as sent). Typically, this speeds up (wall) time, and is only possible with async messages. Notice: the return messages need to carry an “ID” of the initial messages to keep track – often that is the query itself.


Despite it is not truly a characteristic of the message itself, messages can be buffered, or not. It is about the plumbing to transport the message: can this “connection” (see below) contain/save/store messages? When there is no storage at all the writer and reader need to rendezvous: send and receive at the same (wall) time.
With a buffer (often depicted as a queue) multiple messages may be sent before they need to be picked up by the receiver; the number depends on the size of the buffer.

Note: this is always asymmetric; messages need to be sent before they can be read.

Connected Channels (or not)

Messages can be sent over (pre-) connected channels or to freely addressable end-points. Some people use the term “connection-oriented” for those connected channels, others use the term “channel” more generic and for any medium that is transporting messages. I try to use “connected-channel” when is a pre-connected channel.

When using connected channels, one writes the message to the channel; there is no need to add the receiver to the message. Also when reading, the sender is clear.
Clearly, the channel has to be set up before it can be used.

Without connected channels, each message needs a recipient; often that receiver is added (“carried”) to the message itself.
A big advantage is, that one does not need to create channels and end-points first – which especially counts when a low number (possible one) of messages are sent to the same receiver, and/or many receivers exist (which would lead to a huge number of channels).

(Non-) Blocking

Both the writer and the reader can be blocking (or not); which is a facet of the function-call. A blocking reader it will always return when a message is available – and will pause until then. Equally, the write-call can block: pause until the message can be sent – e.g. the reader is available (rendezvous) or a message buffer is free.

When the call is non-blocking, the call will return without waiting and yield a flag whether it was successful or not. Then, the developer will commonly “cycle” to poll for a profitable call; and let the task do some other/background work as well.

Futures (or promises)

A modern variant of non-blocking makes use of “Futures”. The call will always return this opaque data structure immediately. It may be a blank – but the procedure can continue. Eventually, that data will be filled in “by the background”. It also contains a flag (like done), so the programmer can check (using an if) [8] whether the data is processed.

Uni/Bi-Directional, Many/Broad-cast

Messages can be sent to one receiver, to many, or even to everybody. Usually, this is modeled as a characteristic of the channel. At the same time, that channel can be used to send messages in one or two directions.

It depends on the context of the exact intent. For example in (TCP/IP) networking, ‘Broadcasting’ (all not point-to-point variants) focus on reducing the amount of data on the network itself. In distributed computing ‘Broadcasting’ is a parallel design pattern. Whereas the ‘Broadcast’ flag in TV steaming is completely different: is it allowed to save (record) a broadcast…

We use those teams with a functional aim. We consider the above-mentioned RCP connection as Unidirectional – even the channel can carry the answer. When both endpoints can take the initiative to send messages, we call it Bidirectional.
With only 2 endpoints, we call the connection Point-to-Point (p2p). When more endpoints are concerned, it’s Broadcast when a message is sent to all others (on that channel), and Manycast when the user (the programmer) can (somehow) select a subset.

Reliability & Order

Especially when studying “network messages”, we have to consider Reliability too. Many developers assume that a send message is always received and that when multiple messages are sent, they are received in the same order. In most traditional –single-core– applications this is always the chase. With networking applications, this is not always true. Messages can get lost, received out of order, or even read twice. Although it is always possible to add a “reliability layer”.
Such a layer makes writing the application easier but introduces overhead too. And therefore not always the right solution.

In Castle, we have “active components”: many cores are running parallel, all doing a part of the overall (concurrent) program. This resembles a networking application – even while there is no real network – where at least three nodes are active.

This is a bit more complicated, so let us start with an example. Say, we have 3 components A, B1, and B2. All are connected to all others. We assume that messages are unbuffered, non-blocking, never got lost, and that two messages over the same channel are never out-of-order. Sound simple, isn’t it?
Now state that A send a message (m1) to B1 and then one (m2) to B1. The “B components” will –on receiving a message from A – send a short message to the other one (m3 and m4). And that message triggers (again both in B1 and B2) to send an answer to A; so m5 and m6.

Now the question is: in which order those answers (in A) are received? The real answer is: You don’t know!
It’s clear that A will get m5 and m6 – given that all messages (aka channels) are reliable. But there are many ways those messages may receive in the opposite order. Presumably, even in more ways, than you can imagine. For example, B1 might process m4 before it processes m1! This can happen when channel A->B1 is slow, or when B2 gets CPU-time before B1, or…

When we add buffering, more connected components, etc this “network” acts less reliable than we might aspect (even though each message is reliable). When we add some real-time demands (see below), the ability to use/model a solution using an unreliable message becomes attractive …
It’s not that you should always favor unreliable, out-of-order messages. Without regard, No! We are designing a new language, however –one that should run efficiently on thousands of core, in a real-time embedded system– then the option to utilize them may be beneficial.


As a simple example to demonstrate the advantage of an “unreliable connection”, let us consider an audio (bidirectional) connection, that is not 100% reliable.
When we use it “as is”, there will be a bit of noise, and even some hick-ups. For most people, this is acceptable, when needed they will use phrases such as “Can you repeat that?”.

To make that connection reliable, we need checksums, low-level confirmation messages, and once in a while have to send a message again. This implies some buffering (at both sides), and so the audio stream will have a bit of delay. This is a common solution for unidirectional PODcasts, and such.

For a bidirectional conversation, however, this buffering is not satisfactory. It makes the slow, people have to wait on each other and will interrupt one other.
Then, a faster conversation with a bit of noise is commonly preferred.

Process calculus & more

After studying many concurrent concepts, we need to address one more, before we can design the Castle language. That is “How do we determine what is ‘best’ (ever)”? Can we calculate the performance of every aspect? The answer is no; but there are formal systems that can help: Process-Calculus (or -Algebra).
Unfortunately, there are many of them. And I like to avoid the recursions-trap: study them all, find a meta-calculus to determine the best, etc.

So let us give a quick overview. And recall, the term ‘process’ is pretty general: it denotes the behavior of a system, not the more limited practice most software developers use.

Mathematical models

Many Process-Calculuses are invented around 1980. As often, those traditional ones focus on the issues that were current back then. And although they are still useful, they might be unaware of modern aspects of computing – like huge code bases, and over a thousand cores.

Petri Ne

Probably the oldest mathematical model to describe distributed systems is the Petri-Net, invented in 1962 – some claim it even started in 1939(!). In its graphical representation, it looks like a network of circles (‘places’) and bars, (‘actions’) connected by arrows (‘arcs’). It also contains ‘tokens’ – zero or more dots inside a circle. They can flow through the network and kind of represent the (global) state.
There is an extensive, complex mathematical model behind this. Which makes Petri-Nets very powerful.

A drawback however is, that all tokens have to move to the next place at the same time. When using Petri-Nets as a calculus, this is not an issue. But it becomes impossible to execute that in a distributed (or Multi-Core) environment, or a base of a language.

Communicating sequential processes

CSP is probably the best-known formal language to describe (patterns in) concurrent systems. It started in 1978 as a kind of programming language and has evolved since then. Occam –the language to program the once famous Transputer– is based on CSP.

Also ‘Go’ (the language) is influenced by CSP. A sign the CSP isn’t too old.

Calculus of Communicating Systems

CCS is also quite old (1980) and quite useful to calculate deadlocks and livelocks

Algebra of Communicating Processes

Also, ACP dates back from 1982 and is a real algebra – it probably contrived the general term Process-Calculus. Like any algebra, it has (transformation) “rules” [10].
By those rules, one can convert (“transform”) a process of concurrent and/or sequential actions into other, equivalent processes – and prove they are the same (or nor). And look for patterns that should not (never) happen; like deadlocks and livelocks.

I like this algebra aspect, as we can use it inside some Castle Tools (ToDo) to validate the design is sound.


Pi-Calculus is a more recent (1992) addition to the family of Process-Calculuses. It allows some dynamic behavior; like transmitting the names of the channel – which facilitates the growth & reconfiguration of the network during execution.
That expect, for example, is needed for the Sieving Prime components (ToDo).
It also shows some of the shortcomings of “traditional” models, as hinted above.

As it is a small (but expressive) “language”, that resembles λ-calculus a bit, it has some omissions too: no numbers, no functions, not even an if-statement (all quite fundamental for a programming language). It is based on names, which mirror both variables and channels.

The Actor Model

The Actor-Model is strictly speaking not a Process-Calculus, but it has many similarities. A big dissimilarity is its inspiration; where a Process-Calculus are based on mathematics, the Actor-Model is inspirited by physics. See Calculus-vs-Actors for more on their (dis)similarities.
The Actor-Model began in 1973, matured in the ’80s, and become fashionable when “cloud computing” started. There are many “bold-on” actor packages for almost all popular languages. They focus mostly on robust, scalable, distributed applications; less on the speed-up of a single program. Still, the ”Many-Core” concept we use for Castle is closely related.

Being inspired by physics, which is concurrent by nature, the perception differs. An actor is local, “active”, and independent. It can only act on the messages that it receives, sent new messages, and/or create new actors. It (typically) has an internal state, but that is completely internal (or private, as developers call it).
There is no global state, no central synchronisation, no “shared memory”, and no (overall) orchestration. Everything is decentral.

One can model many well-known software systems as an Actor-Model: like email, SOAP, and other web services. Also, interrupt-handling can be modeled with actors: An extern message triggers the “interrupt-handler actor” –async of the main code; another actor– which has to send data (aka a message) to the main actor.

Another interesting dissimilarity is that the Actor-Model, and the Actor-Model-Theory, are also influent by SW-Engineering and their languages. This probably made is also convenient to design new programming languages on this theory.

Resolution: Castle will use Actors as the main Concurrent Computing Concept R_CCC-Actors ../../_images/arrow-right-circle.svg
links outgoing: U_ManyCore


Unlike Process-Calculuses, there is only one Actor-Model



comments powered by Disqus