Distributed Erlang

People say Erlang is a distributed programming language.

They say it with a straight face, like distribution is just… there. Built in. Transparent.

You send a message to a process on another machine the same way you send one to a process on your laptop. Pid is local or across the continent… magic!!

This is technically true. It is also the most dangerous sentence shilling Erlang you might commonly run into.

Erlang’s distribution is not a feature. It’s a scar.

A beautiful, fascinating, deeply educational scar from the 1980s that will teach you things about distributed systems you didn’t know you needed to learn.

Usually at 3am :)

To understand why, you have to understand what the BEAM actually is and how it got this way. Interest compounds.

A Tiny OS

Erlang was built at Ericsson in the late 1980s for telephone switches.

The requirements were brutal:

Handle millions of concurrent calls.
Never go down.
Hot-swap code while the system was running.
Survive hardware failures without dropping a connection.

You can’t bolt these things onto a language after the fact. They have to be baked into the runtime.

So the BEAM was designed from scratch as a tiny operating system.

Each Erlang process is not an OS thread. It’s a lightweight green thread managed entirely by the BEAM. Creating one takes microseconds and a few hundred bytes of memory. You can run millions on a laptop.

1> spawn(fun() -> io:format("Hello~n") end).
Hello
<0.41.0>

Processes don’t share memory either. Nothing is shared. If you want to communicate, you send a message instead.

Pid ! {self(), hello},
receive
  {Pid, Response} -> io:format("Got: ~p~n", [Response])
end.

This is the actor model. Each process is a tiny independent computer with its own state, its own mailbox, and no way to touch anything else.

The BEAM preemptively schedules them, giving each a fair slice of CPU time. Infinite loops in one process won’t freeze other processes. Magic.

The BEAM is not just a VM. It’s a guarantee.

No shared state. No global locks. No mutexes. Every process is an island, and messages are the only boats.

This design decision compounds.

Because processes are isolated, crashes are isolated. A process can die without taking down its neighbors.

Because the BEAM knows exactly which processes are linked to which, it can restart them automatically.

You kinda get fault tolerance for free because the runtime already has the information it needs to rebuild the system.

spawn_link(fun() -> exit(reason) end).
% the linked process dies too, and you catch the exit signal

Link enough processes together in a hierarchy and you get a supervision tree. As an emergent property, the machine puts itself back together while you watch.

You can kill -9 a BEAM process mid-computation and the system doesn’t care. It restarts the process. It retries the message. It keeps going. This is not a library. It’s the runtime.

Network Transparency

The BEAM already thinks in terms of isolated processes sending messages. A process on node A sending a message to a process on node B is… the same operation. The syntax is identical.

The Pid just looks different. You don’t have to care. The runtime handles serialization, transport, and delivery totally transparently.

% This works whether Pid is local or on a machine in another datacenter
RemotePid ! {self(), ping},
receive
  {RemotePid, pong} -> ok
end.

This is genuinely beautiful. Location transparency is a real thing and it works exactly as advertised.

People see this and think “oh, Erlang is a distributed language. Distribution is solved.”

Until they build a system on it and discover what Ericsson learned in the 1980s: transparency is a lie.

The BEAM makes distribution look like message passing. The network is not just an implementation detail. You have to worry about the network. You have to worry about the topology. You have to worry about the failure modes. You have to worry about physics.

The Mesh

Distributed Erlang forms a fully connected mesh. Every node connects to every other node.

With N nodes, that’s N-1 connections per node, for a total of N*(N-1)/2 connections. Heartbeats travel across every single one to verify liveness. As the cluster grows, heartbeat traffic grows quadratically.

I learned this one the hard way. I had maybe eight nodes in a cluster and things were fine. Added a ninth. Still fine. Added a tenth. Suddenly nodes started getting marked as down for no apparent reason.

I spent an evening staring at logs before I realized the heartbeat traffic had quietly crossed some threshold and the mesh was eating itself.

Historical wisdom says Erlang clusters top out around 50-100 nodes. WhatsApp reportedly ran clusters of 1,000+, but they used hidden node topologies to tame the mesh. You can’t just keep adding nodes and expect it to work.

You also can’t control which nodes connect to which. There’s no built-in way to say “these two should talk, these two shouldn’t.” You can’t throttle traffic between them either. A single noisy process floods the entire cluster and there’s no circuit breaker. The mesh is democratic to a fault.

The Mailbox

Locally, every process gets its own mailbox. Messages queue up per-process and the BEAM schedules delivery. This is one of the core design decisions. Isolation all the way down.

Across the network, it’s completely different.

All messages to a remote node hit a single FIFO queue: the node’s distribution-layer mailbox. One queue for the entire machine.

Think about what that means. You have a thousand processes on node A, all sending messages to node B. Every single one of those messages lands in the same queue.

If process 847 on node B is slow to drain its messages, process 12’s heartbeat gets stuck behind it. The heartbeat times out. Node A marks node B as down.

Node B is alive. It’s just… busy.

This is genuinely insane. The BEAM goes to extraordinary lengths to isolate every process from every other process locally, and then funnels all remote communication through a single choke point. The design that makes local Erlang bulletproof is the opposite of how distributed Erlang works.

I found out about this one after a deploy. A new feature was sending slightly larger messages than before. Not huge. Just bigger. And suddenly one node in the cluster kept getting marked as down every few minutes.

The logs showed heartbeats failing. The node itself was at 2% CPU. It took me hours to trace it back to a single slow process on that node whose mailbox was backing up the distribution queue for everyone.

Libraries like gen_rpc swap the built-in transport for a custom TCP-based protocol, sidestepping the single-mailbox problem entirely.

If you’re building anything serious on distributed Erlang, you should know about it.

The Partition

A network partition splits your cluster into isolated groups. This is not an “if.” It’s a “when.”

The cluster can’t tell whether a process on the other side of the partition crashed or is just unreachable. Both sides think they’re the only side. Both sides keep running.

If you have a single global process managing state (a common pattern in Erlang, because local singletons are safe and easy) you now have a split-brain. Two instances of your “singleton” running simultaneously. Each convinced the other is dead. Each making decisions on stale state.

Orders get processed twice. Balances diverge. Nobody notices until a customer calls.

Reconciling state after a partition is entirely your problem. Libraries like Swarm and Horde help manage process placement. Raft can help with consensus. Consistent hashing can pin a process to one node.

None of it solves reconciliation for you. The data diverged while the partition was active and you have to decide which version is truth.

Partitions are my least favorite distributed systems problem because they don’t feel real until they happen to you.

The system works perfectly for months. Then a network switch reboots and suddenly you’re explaining to your boss why orders got duplicated.

The fix isn’t code. The fix is understanding that the network is not reliable and designing for it from the start.

BOOM: Physics

After all of that, you still have to contend with the actual laws of the universe.

Messages take time to travel. On your laptop, latency is a rounding error. Across continents, it’s not. Your message is moving at the speed of light through fiber and it takes as long as it takes. There is no optimization for this.

Clocks drift. Every machine has its own idea of what time it is. You can reach for NTP, and you should, but designing for clock skew from the start helps more.

Causal ordering across nodes is genuinely hard and distributed Erlang gives you zero help with it. If event A happened before event B on different machines, you have to figure that out yourself.

Nodes need to find each other. The built-in discovery mechanism was designed in the 1980s for static IPs in a server rack.

In modern cloud environments with ephemeral IPs and auto-scaling groups, you’ll need libcluster with Kubernetes DNS, Consul, or similar strategies. The runtime still thinks it’s 1987.

The cookie that authenticates nodes is a plaintext file at ~/.erlang.cookie. There’s no built-in mechanism to distribute it securely. You handle that. Securing inter-node communication means layering TLS on top with something like gen_rpc.

It’s humbling, honestly. You spend all this time learning the BEAM’s elegant abstractions: processes, messages, supervisors… and then you hit the network layer and realize the elegance stops at the machine boundary.

Past that point, you’re on your own. The BEAM hands you a packet and a prayer.

A Distributed Language

“Is Erlang a distributed language?” Yes. And no.

The BEAM gives you the primitives for distribution:

Processes.
Message passing.
Links between processes.
Process monitoring.
Nodes that can talk to eachother.

These are genuinely powerful and they compose in ways that other languages can only dream of.

But distribution is not free. It’s not transparent. It’s not solved.

The BEAM gives you a head start and then steps back and lets you deal with the mesh, the mailbox, the partitions, and the speed of light on your own.

The scars from the 1980s are still there. They’re not bugs. They’re the design.

Know the sharp edges before they cut you.

Take heed the fallacies of distributed computing (which are all true). Read them. Then read them again.

Erlang’s distribution is the BEAM’s most honest feature.

It shows you exactly how the runtime thinks about the world, and then it shows you exactly where that worldview breaks.

On this page

I'm Offline

Lurkers

I'm a two hit wonder