# Running on Multiple Machines
Pathway pipelines can be distributed across multiple machines. Each machine runs a process, and together they form a single logical computation. Workers on different machines communicate over TCP, exchanging data and progress information the same way co-located processes do.
This is useful when:
- The dataset or working state does not fit in the memory of a single machine.
- The computation is CPU-bound enough to saturate all cores on one host.
- You want to co-locate workers with partitioned data sources (e.g., Kafka brokers) to reduce network transfer.
## How It Works
Every Pathway worker — regardless of which machine it runs on — executes the same dataflow on a different shard of the data. The workers discover each other through a fixed list of host:port addresses provided at startup. All processes must be started before any of them begins processing data: the pipeline waits until the full cluster is assembled. While the pipeline waits for all of its workers, you will see a "Preparing Pathway computation" log message.
This is different from the default single-machine multi-process mode (pathway spawn -n N), where Pathway automatically assigns ports on 127.0.0.1 and launches all processes itself. In the multi-machine mode, you are responsible for starting one process per machine and telling each process where all the others are.
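Pathway's actual partitioning is internal and not user-visible, but the core idea above, every worker running the same dataflow on a different shard of the data, can be illustrated with a plain-Python sketch. The hashing scheme here is purely illustrative and is not Pathway's real assignment logic:

```python
import zlib

# Illustrative only: a deterministic key -> worker assignment, sketching the
# idea behind one dataflow running over N workers. Pathway's real partitioning
# is internal and may differ.
def owner(key: str, n_workers: int) -> int:
    # A stable hash, so every process independently agrees on the assignment.
    return zlib.crc32(key.encode()) % n_workers

records = ["user-1", "user-2", "user-3", "user-4"]
n_workers = 4  # e.g. 2 machines x 2 threads

# Each worker processes only the records it owns; together they cover all data.
shards = {w: [r for r in records if owner(r, n_workers) == w]
          for w in range(n_workers)}
```

Because the assignment is a pure function of the key, no coordination is needed at runtime: every worker can compute, for any record, who owns it.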
## Setting Up

### 1. Decide on addresses
Choose a `host:port` pair for each process. The port must be reachable from all other machines in the cluster. For example, with two machines:
| Process | Address |
|---|---|
| 0 | 192.168.1.10:9000 |
| 1 | 192.168.1.11:9000 |
### 2. Start the process on each machine

On machine 0:

```shell
pathway spawn \
  --addresses 192.168.1.10:9000,192.168.1.11:9000 \
  --process-id 0 \
  python pipeline.py
```
On machine 1:

```shell
pathway spawn \
  --addresses 192.168.1.10:9000,192.168.1.11:9000 \
  --process-id 1 \
  python pipeline.py
```
Both commands receive the same `--addresses` list. The `--process-id` flag tells each process which entry in that list belongs to it: process 0 binds to `192.168.1.10:9000`, process 1 binds to `192.168.1.11:9000`.
The two commands can be started in any order. The process that starts first will wait for the others to connect before beginning computation.
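The relationship between `--process-id` and `--addresses` is a simple index lookup; a minimal pure-Python sketch (illustrative, not Pathway's code):

```python
# Illustrative: --process-id selects this process's own entry in --addresses.
addresses = ["192.168.1.10:9000", "192.168.1.11:9000"]

def bind_address(addresses: list[str], process_id: int) -> str:
    # Every process receives the full list; its id picks the address it binds.
    if not 0 <= process_id < len(addresses):
        raise ValueError("process id must be in range 0..len(addresses)-1")
    return addresses[process_id]
```

All remaining entries in the list are the peers the process will connect to.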
Note that a single machine can host more than one process. In that case, use the same host with different ports for each process on that machine:
```shell
pathway spawn \
  --addresses 192.168.1.10:9000,192.168.1.10:9001,192.168.1.11:9000 \
  --process-id 0 \
  python pipeline.py

pathway spawn \
  --addresses 192.168.1.10:9000,192.168.1.10:9001,192.168.1.11:9000 \
  --process-id 1 \
  python pipeline.py

pathway spawn \
  --addresses 192.168.1.10:9000,192.168.1.10:9001,192.168.1.11:9000 \
  --process-id 2 \
  python pipeline.py
```
Here processes 0 and 1 both run on 192.168.1.10, listening on ports 9000 and 9001 respectively, while process 2 runs on 192.168.1.11.
Keep in mind that, because of how inter-worker communication is established, the address list must appear in the same order in every launch command. Only `--process-id` varies, taking each value from 0 through the length of the list minus one.
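These two invariants, an identical ordered address list in every command and process ids covering 0..N-1 exactly once, can be checked up front. A pure-Python sketch of such a sanity check (illustrative; `validate_launch` is a hypothetical helper, not part of Pathway):

```python
# Illustrative sanity check for a planned cluster launch. Each planned command
# is (address_list, process_id). All commands must share the same ordered
# address list, and the ids must be exactly 0..N-1 with no gaps or duplicates.
def validate_launch(commands: list[tuple[list[str], int]]) -> None:
    addr_lists = [addrs for addrs, _ in commands]
    if any(a != addr_lists[0] for a in addr_lists):
        raise ValueError("all commands must use the same ordered --addresses list")
    ids = sorted(pid for _, pid in commands)
    if ids != list(range(len(addr_lists[0]))):
        raise ValueError("--process-id values must cover 0..len(addresses)-1")

addrs = ["192.168.1.10:9000", "192.168.1.10:9001", "192.168.1.11:9000"]
validate_launch([(addrs, 0), (addrs, 1), (addrs, 2)])  # passes
```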
### 3. Use threads for intra-machine parallelism
The `--threads` flag works independently of `--addresses`. To run two threads per machine with the two-machine setup above, add `--threads 2` to both commands. This gives four total workers: two on each machine.
```shell
pathway spawn \
  --addresses 192.168.1.10:9000,192.168.1.11:9000 \
  --process-id 0 \
  --threads 2 \
  python pipeline.py
```
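The total parallelism is simply the product of the process count (the length of the address list) and the per-process thread count:

```python
# Total workers = number of processes (length of --addresses) x --threads.
addresses = ["192.168.1.10:9000", "192.168.1.11:9000"]
threads_per_process = 2

total_workers = len(addresses) * threads_per_process  # 2 processes x 2 threads
```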
### 4. Add persistence (recommended)
When running across machines, data persistence is strongly recommended. If any process crashes, the whole cluster must be restarted. Persistence ensures the pipeline resumes from the last checkpoint rather than replaying from the beginning:
```python
import pathway as pw

persistence_config = pw.persistence.Config(
    backend=pw.persistence.Backend.s3(
        bucket_name="my-bucket",
        root_path="pathway-state/",
    ),
)

pw.run(persistence_config=persistence_config)
```
It is important to use shared storage (S3, GCS, Azure Blob, NFS) so that all machines can read and write the same state.
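If every machine mounts the same NFS share, a filesystem backend pointing at that mount can serve the same purpose. A sketch, assuming Pathway's `pw.persistence.Backend.filesystem` helper and a hypothetical mount point:

```python
import pathway as pw

# Sketch: persist to a shared NFS mount instead of object storage.
# "/mnt/shared/pathway-state" is a hypothetical path; any directory
# visible to all machines works.
persistence_config = pw.persistence.Config(
    backend=pw.persistence.Backend.filesystem("/mnt/shared/pathway-state"),
)

pw.run(persistence_config=persistence_config)
```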
## License
Running Pathway on multiple machines requires a Pathway Scale or Pathway Enterprise license. You can obtain a free Pathway Scale license here. The page contains instructions for getting the license and using it in your pipeline.
## Limitations
- **No dynamic scaling.** The `--addresses` flag defines a fixed worker pool. Pathway's autoscaling mechanism (described in Dynamic Worker Scaling) is not available when a fixed address list is used; the number of processes equals the length of the `--addresses` list and cannot change at runtime.
- **All processes must start for the pipeline to begin.** If one machine fails to start or takes too long, the others will wait indefinitely. There is no partial startup or degraded mode.
- **At-least-once delivery.** As with all Pathway deployments, recovery after a crash replays data from the last committed checkpoint, so records written after the last checkpoint but before the crash may be processed again. Exactly-once semantics are available in the enterprise edition.
- **Same binary on all machines.** All machines must run the same version of Pathway and the same pipeline code. Mismatched versions will cause a connection failure or undefined behavior.
- **Firewall and networking.** Each machine must be able to reach all others on the specified ports. Pathway does not support NAT traversal or proxies between workers.
## Conclusion
To run a Pathway pipeline across multiple machines:
- Choose one `host:port` per process and ensure the ports are mutually reachable.
- Start each process independently using `pathway spawn --addresses <list> --process-id <N>`.
- Use shared persistent storage to enable fast recovery after restarts.
- Do not mix `--addresses` with `--processes`; the process count is derived from the address list.
If you have any questions, feel free to reach out on Discord or open an issue on our GitHub.