Multi-Node Networking
godotz.ai fleets communicate over a Tailscale WireGuard mesh. Every node gets a stable 100.x.x.x address regardless of physical location — home lab, cloud VM, or co-lo. This guide covers Tailscale setup, node registration, ACL hardening, and cross-node orchestration.
1. Tailscale Installation
Install Tailscale on each node before adding it to the fleet.
# Linux (NixOS nodes get this via flake — skip manual install)
curl -fsSL https://tailscale.com/install.sh | sh
# Authenticate with a pre-auth key (non-interactive provisioning)
sudo tailscale up \
--authkey="${TAILSCALE_AUTHKEY}" \
--hostname="${NODE_NAME}" \
--accept-routes \
--ssh
On NixOS, declare it in the flake:
# modules/tailscale.nix
services.tailscale = {
enable = true;
authKeyFile = config.age.secrets.tailscale-authkey.path;
extraUpFlags = [ "--ssh" "--accept-routes" ];
};
Verify connectivity from the control plane:
tailscale status
# 100.64.0.1 control-01 linux -
# 100.64.0.2 worker-01 linux active; direct
# 100.64.0.3 worker-02 linux active; relay
2. Node Registration
Once Tailscale is up, register each node in the godotz.ai fleet manifest. This tells the control plane what roles each node can fulfill and what resources it has.
# fleet/nodes.yml
nodes:
- name: control-01
tailscale_ip: 100.64.0.1
roles: [control-plane, litellm-proxy, temporal, langfuse]
resources:
cpu: 8
ram_gb: 16
- name: worker-01
tailscale_ip: 100.64.0.2
roles: [glm-worker, kg-writer]
resources:
cpu: 4
ram_gb: 8
- name: worker-02
tailscale_ip: 100.64.0.3
roles: [glm-worker, mnemopi-replica]
resources:
cpu: 4
ram_gb: 8
Apply the manifest:
bd node sync --manifest fleet/nodes.yml
# Registered: control-01 (3 roles)
# Registered: worker-01 (2 roles)
# Registered: worker-02 (2 roles)
3. Deny-by-Default ACL Policy
The default Tailscale ACL grants every node access to every other node. For godotz.ai production fleets, lock this down to least-privilege.
{
"acls": [
{
"comment": "Control plane can reach all workers",
"action": "accept",
"src": ["tag:control-plane"],
"dst": ["tag:worker:*"]
},
{
"comment": "Workers can reach LiteLLM proxy only",
"action": "accept",
"src": ["tag:worker"],
"dst": ["tag:control-plane:7233,5432,6379,4000"]
},
{
"comment": "Operators can SSH to all nodes",
"action": "accept",
"src": ["group:operators"],
"dst": ["*:22"]
}
],
"tagOwners": {
"tag:control-plane": ["group:operators"],
"tag:worker": ["group:operators"]
},
"defaultAllow": false
}
Apply via Tailscale admin console or tailscale acl set:
# Export current policy for audit
tailscale acl get > acl-backup-$(date +%F).json
# Apply new deny-by-default policy
tailscale acl set < fleet/acl-policy.json
4. Port Reference
| Service | Port | Who Can Access |
|---|---|---|
| LiteLLM Proxy | 4000 | tag:worker, tag:control-plane |
| Temporal | 7233 | tag:control-plane only |
| Postgres | 5432 | tag:control-plane only |
| Redis | 6379 | tag:control-plane only |
| Langfuse | 3000 | group:operators |
| beads API | 8080 | tag:control-plane, tag:worker |
| ntfy (local) | 8090 | tag:control-plane only |
5. Cross-Node Orchestration
Supervisors (running on the control plane) dispatch tasks to worker nodes through the NATS message bus. Workers pull tasks using their registered roles as subscriptions.
# On worker node — start the OMP agent runtime
omp agent start \
--node-name worker-01 \
--control-plane 100.64.0.1:4000 \
--roles glm-worker,kg-writer \
--nats-url nats://100.64.0.1:4222
Supervisor dispatch example (LangGraph):
from langgraph.prebuilt import create_react_agent
supervisor = create_react_agent(
model="claude-opus-4-6", # Antigravity orchestrator
tools=[dispatch_to_worker], # Sends task to NATS subject
)
# Task gets routed to worker-01 or worker-02 based on role
result = supervisor.invoke({
"task": "summarize documents in batch",
"required_role": "glm-worker",
"preferred_model": "glm-4.5-air"
})
6. Failure Handling
Tailscale nodes that go offline are still reachable via relay (DERP) for up to 30 minutes. Configure godotz.ai to detect and reroute:
# config/fleet.yml
fleet:
node_timeout_seconds: 30
retry_on_node_failure: true
failover_strategy: round_robin # or: least_loaded, random
alert_on_node_down: true # fires ntfy urgent alert
Check current node health:
bd fleet health --watch
# Refreshes every 5s, highlights degraded nodes in red
Next Steps
- Fleet Setup — Control plane bootstrap
- Model Routing — Route tasks to the right model
- Security Gates — ACL and secret hardening