Concurrent Computing Concepts¶
Sooner than we realize, even embedded systems will have piles & heaps of cores, as I described in “Keep those cores busy!”. Castle should make it easy to write code for all of them: not just to keep them busy, but to maximize the speed-up [useCase: In Castle is easy to use th... (U_ManyCore)]. I also showed that threads do not scale well for CPU-bound (embedded) systems. Lastly, I introduced some (more) concurrency abstractions. Some are great, but they often do not fit nicely into existing languages.
Still, as Castle is a new language, we have the opportunity to select such a concept and incorporate it into the
language.
In this blog, we explore a bit of theory. I will focus on semantics and the possibilities to implement them efficiently. The exact syntax will come later.
Basic terminology¶
Many theories are available, as is some more practical expertise; regrettably, hardly any of them share a common vocabulary. For that reason, I first describe some basic terms and how they are used in these blogs. As always, we use Wikipedia as common ground and add links for a deep dive.
Again, we use ‘task’ as the most generic term for work-to-be-executed; that can be (in) a process, (on) a thread, (by) a
computer, etc.
Concurrent¶
Concurrency is the ability to “compute” multiple tasks at the same time.
Designing concurrent software isn’t that complicated, but it demands another mindset than writing software that does one task after the other.
A typical example is a loop: suppose we have a sequence of numbers and we like to compute the square of each one. Most developers will loop over those numbers: get one number, calculate the square, store it in another list, and continue. It works, but we have also instructed the computer to do it in sequence. Especially when the task is a bit more complicated, the compiler does not know whether the ‘next task’ depends on the current one, and can’t optimize it.
A better plan is to tell the compiler about the different tasks. Most are independent: square a number. There is also one that has to run at the end: combine the results into a new list. And one is a bit funny: distribute the elements over the “square tasks”. Clearly, we have to start with this one, but it can run concurrently with many others too.
This is not a parallel algorithm. By not specifying the order, we allow parallel execution; we do not demand it, as sequential execution is allowed too.
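To make this concrete, here is a minimal sketch of such a decomposition, in Go (as a stand-in, since Castle’s syntax does not exist yet); all names are ours, not Castle’s:

```go
// The squaring example as concurrent tasks: a "distribute" task feeds the
// numbers to independent "square" tasks, and one final task combines the
// results. Parallel execution is allowed, not demanded.
package main

import "fmt"

func main() {
	numbers := []int{1, 2, 3, 4, 5}
	results := make(chan int, len(numbers))

	// Independent tasks: square one number each; they may run in any order.
	for _, n := range numbers {
		go func(n int) { results <- n * n }(n)
	}

	// The combining task: conceptually "at the end".
	squares := make([]int, 0, len(numbers))
	for range numbers {
		squares = append(squares, <-results)
	}
	fmt.Println(squares) // note: the order is not guaranteed
}
```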
Parallelism¶
Parallelism is about executing multiple tasks (seemingly) at the same time. We will focus on running many concurrent tasks (of the same program) on “as many cores as possible”. When we assume a thousand cores, we need (at least) a thousand independent tasks to gain the maximal speed-up. A thousand at any moment!
It’s not only about doing a thousand tasks at the same time (that is not too complicated, for a computer) but also —
probably: mostly — about finishing a thousand times faster…
With many cores, multiple “program lines” can be executed at the same time, which can introduce unforeseen effects: changing the same variable, accessing the same memory, or competing for new, “free” memory. And when solving that, we introduce new hazards: like deadlocks and even livelocks.
Distributed¶
A special form of parallelism is Distributed-Computing: computing on many computers. Many experts consider this an independent field of expertise. Still – as Multi-Core is basically “many computers on a chip” – it’s an available, adjacent [3] theory, and we should use it to design our “best ever language”.
Communicating Efficiently¶
When multiple tasks run concurrently, they have to communicate to pass data and to control progress. Unlike in a sequential program – where the control is trivial, as is sharing data – this needs a bit of extra effort.
There are two main approaches: shared-data and message-passing; we will introduce them below.
Communication takes time, especially wall time [4] (or clock time), and may slow down computing. Therefore communication has to be efficient. This is an arduous problem and becomes harder when we have more communication, more concurrency, more parallelism, and/or when those tasks are short-living. Or better: it depends on the ratio of the time a communication takes to the time between two communications.
Messages¶
A more modern approach is Message-Passing: a task sends some information to another; this can be a message, some data, or an event. In all cases, there is a distinct sender and receiver – and apparently no common/shared memory– so no Critical-Sections [6] are needed; at least not explicitly. Messages are easier to use and more generic: they can be used in single-, multi-, and many-core systems. Even distributed systems are possible – then the message (and its data) is serialised, transmitted over a network, and deserialised.
As you may have noticed, there is an analogy between Message-Passing and Events (in an event-loop). They have separate histories but are quite similar in nature. Like a “message”, an “event” is also used to share data (& control) between isolated “tasks”.
Warning
Many people use the networking mental model when they think about Message-Passing, and wrongly assume there is always serialisation (and network) overhead. This is not needed for parallel cores as they typically have shared (physical) memory.
Then we can use the message abstraction at the developer level, and let the compiler translate that into shared-memory instructions at the processor level.
Notice: as the compiler will insert the (low-level) semaphores, the risk that a developer forgets one is gone!
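A Go sketch of that idea: the message is not serialised – only a pointer is handed over shared (physical) memory, and the channel supplies the synchronisation that would otherwise need explicit semaphores.

```go
// Message-passing over shared memory: no copying, no network, and the
// runtime inserts the synchronisation for us.
package main

import "fmt"

type Result struct{ Value int }

func main() {
	ch := make(chan *Result) // messages are just pointers

	go func() {
		ch <- &Result{Value: 42} // "send": the runtime synchronises access
	}()

	r := <-ch // "receive": safe to use r, no explicit Critical-Section
	fmt.Println(r.Value)
}
```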
Messaging Aspects¶
There are many variants of messaging, mostly combinations of some fundamental aspects. Let us mention some basic ones.
(A)Synchronous¶
Synchronous messages resemble normal function calls. Typically a “question” is sent, the call awaits the answer message, and that answer is returned. This can be seen as a layer on top of the more fundamental send/receive calls. A famous example is RPC: the Remote Procedure Call.
Asynchronous messages are more basic: a task sends a message and continues. That message can be “data”, an “event”, a “command”, or a “query”. Only in the latter case is some response essential. With async messages, there is no need to get the answer immediately.
As an example: a task can send many queries (and/or other messages) to multiple destinations at once, then go into listen-mode, and handle the replies in the order they are received (which can differ from the order in which they were sent). Typically, this saves (wall) time, and is only possible with async messages. Notice: the return messages need to carry an “ID” of the initial message to keep track – often that is the query itself.
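A sketch of that pattern in Go; the `Query`/`Reply` types and the ID field are our own illustration, not a fixed API:

```go
// Async queries: send them all, then listen; the ID ties each reply
// to the query that caused it, whatever the arrival order.
package main

import "fmt"

type Query struct {
	ID   int
	Text string
}
type Reply struct {
	ID     int
	Answer string
}

func worker(in <-chan Query, out chan<- Reply) {
	for q := range in {
		out <- Reply{ID: q.ID, Answer: "answer to " + q.Text}
	}
}

func main() {
	in := make(chan Query, 4)
	out := make(chan Reply, 4)
	go worker(in, out)
	go worker(in, out) // multiple destinations, sharing one work queue

	// Send all queries at once (async), then go into listen-mode.
	for id, text := range []string{"a", "b", "c", "d"} {
		in <- Query{ID: id, Text: text}
	}
	close(in)

	for i := 0; i < 4; i++ {
		r := <-out                  // replies may arrive in any order ...
		fmt.Println(r.ID, r.Answer) // ... the ID keeps track
	}
}
```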
(Un)Buffered¶
Although it is not truly a characteristic of the message itself, messages can be buffered, or not. It is about the plumbing that transports the message: can this “connection” (see below) contain/save/store messages? When there is no storage at all, the writer and reader need to rendezvous: send and receive at the same (wall) time.
With a buffer (often depicted as a queue) multiple messages may be sent before they need to be picked up by the
receiver; the number depends on the size of the buffer.
Note: this is always asymmetric; messages need to be sent before they can be read.
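In Go this distinction is directly visible in the channel’s capacity; a minimal illustration:

```go
// Unbuffered vs buffered channels: rendezvous vs a small queue.
package main

import "fmt"

func main() {
	// Unbuffered: send and receive must rendezvous (same wall time).
	rendezvous := make(chan string)
	go func() { rendezvous <- "hello" }() // blocks until someone receives
	fmt.Println(<-rendezvous)

	// Buffered: up to 2 messages may be sent before any is picked up.
	buffered := make(chan string, 2)
	buffered <- "first"  // does not block
	buffered <- "second" // does not block; a third send would
	fmt.Println(<-buffered, <-buffered)
}
```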
Connected Channels (or not)¶
Messages can be sent over (pre-)connected channels or to freely addressable end-points. Some people use the term “connection-oriented” for those connected channels; others use the term “channel” more generically, for any medium that transports messages. I try to use “connected-channel” when I mean a pre-connected channel.
When using connected channels, one writes the message to the channel; there is no need to add the receiver to the
message. Also when reading, the sender is clear.
Clearly, the channel has to be set up before it can be used.
Without connected channels, each message needs a recipient; often that receiver is added (“carried”) to the message itself.
A big advantage is that one does not need to create channels and end-points first – which especially counts when a low number (possibly one) of messages is sent to the same receiver, and/or many receivers exist (which would lead to a huge number of channels).
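The contrast, sketched in Go; the `Letter` type and the “post office” queue are our own hypothetical illustration of free addressing:

```go
// Connected channel vs freely addressed messages.
package main

import "fmt"

// Free addressing: the recipient is carried in the message itself.
type Letter struct {
	To   string
	Body string
}

func main() {
	// Connected channel: set up first; no addresses in the messages.
	pipe := make(chan string)
	go func() { pipe <- "no address needed" }()
	fmt.Println(<-pipe)

	// Free addressing, sketched as one shared "post office" queue.
	post := make(chan Letter, 1)
	post <- Letter{To: "B2", Body: "one-off message"}
	l := <-post
	fmt.Println(l.To, l.Body) // a dispatcher would route on l.To
}
```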
(Non-) Blocking¶
Both the writer and the reader can be blocking (or not); this is a facet of the function call. A blocking read will only return when a message is available – and will pause until then. Equally, the write call can block: pause until the message can be sent – e.g. when the reader is available (rendezvous) or a message buffer is free.
When the call is non-blocking, it will return without waiting and yield a flag indicating whether it was successful or not. Then, the developer will commonly “cycle” to poll for a successful call, and let the task do some other/background work as well.
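In Go, a `select` with a `default` branch gives exactly this non-blocking behavior, for both reads and writes:

```go
// Non-blocking send and receive: return immediately, with a success path
// and a "would block" path.
package main

import "fmt"

func main() {
	ch := make(chan int, 1)

	// Non-blocking read.
	select {
	case v := <-ch:
		fmt.Println("got", v)
	default:
		fmt.Println("no message yet; do background work and poll again")
	}

	// Non-blocking write: succeeds only if buffer space (or a reader) is free.
	select {
	case ch <- 42:
		fmt.Println("sent")
	default:
		fmt.Println("would block; try again later")
	}
}
```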
Futures (or promises)¶
A modern variant of non-blocking makes use of “Futures”. The call will always return this opaque data structure immediately. It may be blank – but the procedure can continue. Eventually, that data will be filled in “by the background”. It also contains a flag (like done), so the programmer can check (using an if) [8] whether the data is processed.
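Go has no built-in futures, so here is a tiny hand-rolled sketch; the `Future` type, `Done`, `Get`, and `asyncSquare` are all our own hypothetical names:

```go
// A minimal future: the caller receives the still-blank Future
// immediately; a background goroutine fills it in.
package main

import "fmt"

type Future struct {
	done  chan struct{}
	value int
}

// Done reports whether the value has been filled in (the 'done' flag).
func (f *Future) Done() bool {
	select {
	case <-f.done:
		return true
	default:
		return false
	}
}

// Get blocks until the value is available, then returns it.
func (f *Future) Get() int { <-f.done; return f.value }

func asyncSquare(n int) *Future {
	f := &Future{done: make(chan struct{})}
	go func() {
		f.value = n * n // filled in "by the background"
		close(f.done)   // raise the 'done' flag
	}()
	return f
}

func main() {
	f := asyncSquare(7) // returns the (blank) future immediately
	if f.Done() {       // the programmer can check, using an if
		fmt.Println("ready:", f.Get())
	} else {
		fmt.Println("not ready yet; waiting:", f.Get())
	}
}
```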
Uni/Bi-Directional, Many/Broad-cast¶
Messages can be sent to one receiver, to many, or even to everybody. Usually, this is modeled as a characteristic of the channel. At the same time, that channel can be used to send messages in one or two directions.
The exact intent depends on the context. For example, in (TCP/IP) networking, ‘Broadcasting’ (and the other non-point-to-point variants) focuses on reducing the amount of data on the network itself. In distributed computing, ‘Broadcasting’ is a parallel design pattern. Whereas the ‘Broadcast’ flag in TV streaming is about something completely different: is it allowed to save (record) a broadcast…
We use those terms with a functional aim. We consider the above-mentioned RPC connection as Unidirectional – even though the channel can carry the answer. When both endpoints can take the initiative to send messages, we call it Bidirectional.
With only 2 endpoints, we call the connection Point-to-Point (p2p). With more endpoints, it’s Broadcast when a message is sent to all others (on that channel), and Manycast when the user (the programmer) can (somehow) select a subset.
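Sketched in Go as a simple fan-out (the `broadcast` helper is ours, not a library function):

```go
// Broadcast as a fan-out: every endpoint on the "channel" gets a copy.
// A manycast sends to a selected subset only.
package main

import "fmt"

func broadcast(msg string, endpoints []chan string) {
	for _, ep := range endpoints {
		ep <- msg
	}
}

func main() {
	endpoints := make([]chan string, 3)
	for i := range endpoints {
		endpoints[i] = make(chan string, 2)
	}

	broadcast("hello, all", endpoints)      // all three receive this
	broadcast("hello, some", endpoints[:2]) // manycast: a selected subset

	fmt.Println(<-endpoints[0], "/", <-endpoints[0]) // both messages
	fmt.Println(<-endpoints[2])                      // only the broadcast
}
```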
Reliability & Order¶
Especially when studying “network messages”, we have to consider Reliability too. Many developers assume that a sent message is always received, and that when multiple messages are sent, they are received in the same order. In most traditional – single-core – applications this is always the case. With networking applications, this is not always true. Messages can get lost, arrive out of order, or even be read twice. Although it is always possible to add a “reliability layer”.
Such a layer makes writing the application easier but introduces overhead too, and is therefore not always the right solution.
In Castle, we have “active components”: many cores are running parallel, all doing a part of the overall (concurrent) program. This resembles a networking application – even while there is no real network – where at least three nodes are active.
This is a bit more complicated, so let us start with an example. Say we have 3 components: A, B1, and B2. All are connected to all others. We assume that messages are unbuffered, non-blocking, never get lost, and that two messages over the same channel are never out-of-order. Sounds simple, doesn’t it?
Now suppose that A sends a message (m1) to B1 and then one (m2) to B2. The “B components” will – on receiving a message from A – send a short message to the other one (m3 and m4). And that message triggers (in both B1 and B2) an answer to A; so m5 and m6.
Now the question is: in which order are those answers received (in A)? The real answer is: you don’t know!
It’s clear that A will get m5 and m6 – given that all messages (aka channels) are reliable. But there are many ways those messages may arrive in the opposite order. Probably in more ways than you can imagine. For example, B1 might process m4 before it processes m1! This can happen when channel A->B1 is slow, or when B2 gets CPU-time before B1, or…
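You can simulate this in Go; a sketch of the example, where small buffers keep the sketch free of rendezvous deadlocks:

```go
// The A/B1/B2 example: every channel is reliable and keeps its order,
// yet the order in which A receives m5 and m6 is undefined.
package main

import "fmt"

func b(name string, fromA <-chan string, toPeer chan<- string,
	fromPeer <-chan string, toA chan<- string) {
	for {
		select {
		case m := <-fromA: // m1 or m2
			toPeer <- "short message after " + m // m3 or m4
		case m := <-fromPeer: // m3 or m4
			toA <- name + " answers (" + m + ")" // m5 or m6
		}
	}
}

func main() {
	aToB1, aToB2 := make(chan string), make(chan string)
	b1ToB2 := make(chan string, 1) // small buffers avoid a deadlock when
	b2ToB1 := make(chan string, 1) // both Bs send to each other at once
	toA := make(chan string, 2)

	go b("B1", aToB1, b1ToB2, b2ToB1, toA)
	go b("B2", aToB2, b2ToB1, b1ToB2, toA)

	aToB1 <- "m1"
	aToB2 <- "m2"

	// In which order do the answers arrive? You don't know; run it twice.
	fmt.Println(<-toA)
	fmt.Println(<-toA)
}
```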
When we add buffering, more connected components, etc., this “network” acts less reliably than we might expect (even though each message is reliable). When we add some real-time demands (see below), the option to use/model a solution with unreliable messages becomes attractive…
It’s not that you should always favor unreliable, out-of-order messages. Certainly not! We are, however, designing a new language – one that should run efficiently on thousands of cores, in a real-time embedded system – and then the option to utilize them may be beneficial.
Hint
As a simple example to demonstrate the advantage of an “unreliable connection”, let us consider a (bidirectional) audio connection that is not 100% reliable.
When we use it “as is”, there will be a bit of noise, and even some hiccups. For most people this is acceptable; when needed, they will use phrases such as “Can you repeat that?”.
To make that connection reliable, we need checksums, low-level confirmation messages, and once in a while we have to send a message again. This implies some buffering (at both sides), and so the audio stream will have a bit of delay. This is a common solution for unidirectional podcasts, and such.
For a bidirectional conversation, however, this buffering is not satisfactory. It makes the conversation slow; people have to wait on each other and will interrupt one another.
Then, a faster conversation with a bit of noise is commonly preferred.
Process calculus & more¶
After studying many concurrent concepts, we need to address one more before we can design the Castle language. That is: “How do we determine what is ‘best’ (ever)?” Can we calculate the performance of every aspect? The answer is no; but there are formal systems that can help: Process-Calculus (or -Algebra).
Unfortunately, there are many of them. And I’d like to avoid the recursion-trap: study them all, find a meta-calculus to determine the best, etc.
So let us give a quick overview. And recall, the term ‘process’ is pretty general: it denotes the behavior of a system, not the more limited notion most software developers use.
Sidebar: some basic rules (all formulas from Wikipedia):
- Deadlock: the process that performs no further actions.
- Creation of a name: create the name x that acts as a (new) communication channel; this is done within process P.
Mathematical models¶
Many Process-Calculuses are invented around 1980. As often, those traditional ones focus on the issues that were current back then. And although they are still useful, they might be unaware of modern aspects of computing – like huge code bases, and over a thousand cores.
Petri Nets¶
Probably the oldest mathematical model to describe distributed systems is the Petri-Net, invented in 1962 – some claim it even started in 1939(!). In its graphical representation, it looks like a network of circles (‘places’) and bars (‘transitions’) connected by arrows (‘arcs’). It also contains ‘tokens’ – zero or more dots inside a circle. They can flow through the network and kind of represent the (global) state.
There is an extensive, complex mathematical model behind this, which makes Petri-Nets very powerful.
A drawback, however, is that all tokens have to move to the next place at the same time. When using Petri-Nets as a calculus, this is not an issue. But it makes them impossible to execute in a distributed (or Multi-Core) environment, or to use as the basis of a language.
Communicating sequential processes¶
CSP is probably the best-known formal language to describe (patterns in) concurrent systems. It started in 1978 as a kind of programming language and has evolved since then. Occam –the language to program the once famous Transputer– is based on CSP.
Also ‘Go’ (the language) is influenced by CSP – a sign that CSP isn’t too old.
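That heritage is easy to see in Go itself: communication is a basic action, and `select` mirrors CSP’s guarded choice between events. A minimal illustration:

```go
// CSP-style guarded choice: proceed with whichever communication
// (from the "left" or "right" process) is ready first.
package main

import "fmt"

func main() {
	left, right := make(chan string), make(chan string)

	go func() { left <- "from the left process" }()
	go func() { right <- "from the right process" }()

	for i := 0; i < 2; i++ {
		select {
		case m := <-left:
			fmt.Println(m)
		case m := <-right:
			fmt.Println(m)
		}
	}
}
```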
Calculus of Communicating Systems¶
CCS is also quite old (1980) and quite useful to calculate deadlocks and livelocks.
Algebra of Communicating Processes¶
ACP also dates back to 1982 and is a real algebra – it probably coined the general term Process-Calculus. Like any algebra, it has (transformation) “rules” [10].
With those rules, one can convert (“transform”) a process of concurrent and/or sequential actions into other, equivalent processes – and prove they are the same (or not). And one can look for patterns that should never happen, like deadlocks and livelocks.
I like this algebra aspect, as we can use it inside some Castle Tools (ToDo) to validate that a design is sound.
Π-calculus¶
Pi-Calculus is a more recent (1992) addition to the family of Process-Calculuses. It allows some dynamic behavior, like transmitting the names of channels – which facilitates the growth & reconfiguration of the network during execution. That aspect, for example, is needed for the Sieving Prime components (ToDo).
It also shows some of the shortcomings of “traditional” models, as hinted above.
As it is a small (but expressive) “language” that resembles λ-calculus a bit, it has some omissions too: no numbers, no functions, not even an if-statement (all quite fundamental for a programming language). It is based on names, which mirror both variables and channels.
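For a flavor of how small it is, here is one common presentation of the core grammar (following Wikipedia; variants exist):

```latex
% Core pi-calculus syntax: names are the only data.
P ::= 0                            % the stopped process (no actions)
    \mid x(y).P                    % receive a name y over channel x, then run P
    \mid \bar{x}\langle y \rangle.P % send the name y over channel x, then run P
    \mid P \mid P                  % parallel composition
    \mid (\nu x)P                  % create a new name x, scoped to P
    \mid !P                        % replication: as many copies of P as needed
```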
The Actor Model¶
The Actor-Model is strictly speaking not a Process-Calculus, but it has many similarities. A big dissimilarity is its inspiration: where Process-Calculuses are based on mathematics, the Actor-Model is inspired by physics. See Calculus-vs-Actors for more on their (dis)similarities.
The Actor-Model began in 1973, matured in the ’80s, and became fashionable when “cloud computing” started. There are many “bolt-on” actor packages for almost all popular languages. They focus mostly on robust, scalable, distributed applications; less on the speed-up of a single program. Still, the “Many-Core” concept we use for Castle is closely related.
Being inspired by physics, which is concurrent by nature, the perception differs. An actor is local, “active”, and independent. It can only act on the messages it receives, send new messages, and/or create new actors. It (typically) has an internal state, but that is completely internal (or private, as developers call it).
There is no global state, no central synchronisation, no “shared memory”, and no (overall) orchestration. Everything is
decentral.
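A minimal actor, sketched in Go (the `Msg` type and `counterActor` are our own hypothetical names): a goroutine with a mailbox, whose state is reachable only by message.

```go
// An actor as a goroutine with a mailbox: private state, and the only
// way in or out is a message.
package main

import "fmt"

type Msg struct {
	Add   int
	Reply chan int // the sender's own mailbox for the answer
}

func counterActor(mailbox <-chan Msg) {
	count := 0 // internal state: no one else can touch this
	for m := range mailbox {
		count += m.Add
		if m.Reply != nil {
			m.Reply <- count // act by sending new messages
		}
	}
}

func main() {
	mailbox := make(chan Msg, 8)
	go counterActor(mailbox) // "create a new actor"

	mailbox <- Msg{Add: 2}
	mailbox <- Msg{Add: 3}

	reply := make(chan int)
	mailbox <- Msg{Add: 0, Reply: reply}
	fmt.Println("count =", <-reply) // prints: count = 5
	close(mailbox)
}
```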
One can model many well-known software systems as an Actor-Model: email, SOAP, and other web services. Also, interrupt handling can be modeled with actors: an external message triggers the “interrupt-handler actor” – asynchronous to the main code, another actor – which has to send data (aka a message) to the main actor.
Another interesting dissimilarity is that the Actor-Model, and the Actor-Model-Theory, are also influent by SW-Engineering and their languages. This probably made is also convenient to design new programming languages on this theory.
Note: This is only conceptual. As stated in Castle is agnostic to the i... (RC_Agnostic_Concurrency), for the programmer all details on how concurrency is implemented are just that: details – which can differ in various environments!
Tip
Unlike Process-Calculuses, there is only one Actor-Model!
Footnotes
There are two (main) differences between Distributed-Computing and Multi-Core. Firstly, all “CPUs” in Distributed-Computing are active, independent, and asynchronous. There is no option to share a “core” (as is commonly/occasionally done in multi-process/threaded programming); nor is there “shared memory” (one can only send messages over a network).
Secondly, collaboration with (network-based) messages is a few orders of magnitude slower than (shared-)memory communication. This makes it harder to speed up: the delay of messaging shouldn’t be bigger than the acceleration gained by doing things in parallel.
But that condition applies to Multi-Core too, although the (timing) numbers do differ.
As a reminder: We speak about CPU-time when we count the cycles that make a core busy; so when a core is waiting, no CPU-time is used. And we use wall-time when we time according to “the clock on the wall”.
The brittleness of Critical-Sections can be reduced by embedding the (shared) variable in an OO abstraction. By using getters and setters that control the access, the biggest risk is (mostly) gone. That does not, however, prevent deadlocks or livelocks.
And still, all developers have to be disciplined enough to use that abstraction… always.
This is not completely correct; Message-Passing can be implemented on top of shared memory. Then, the implementation of this (usually OO) abstraction contains the Critical-Sections; a bit as described in the footnote above.
And the overhead will grow when we add more cores: firstly because more “others” have to wait (or spin), and secondly because the number of communications grows with the number of cores too. As described in the sidebar within Keep those cores busy!, solving this can cost more overhead than the speed-up we are aiming for.
Remember: to be able to “fill in” that Future-object “by the background”, some other thread (or similar) is needed. And so, a Critical-Section is needed. For the SW developer the interface is simple: read a flag (e.g. .done()). But using that too often can result in a slow system.
Broadcasting is primarily known for “network messages”; where it has many variants – mostly related to the physical network abilities, and the need to save bandwidth. As an abstraction, they can be used in “software messages” (aka message passing) too.
Those ‘rules’ resemble the boolean algebra that most developers know: NOT(x OR y) == NOT(x) AND NOT(y). See Wikipedia for examples of ACP.