Multi-Node Networking

godotz.ai fleets communicate over a Tailscale WireGuard mesh. Every node gets a stable 100.x.x.x address regardless of physical location — home lab, cloud VM, or co-lo. This guide covers Tailscale setup, node registration, ACL hardening, and cross-node orchestration.


1. Tailscale Installation

Install Tailscale on each node before adding it to the fleet.

# Linux (NixOS nodes get this via flake — skip manual install)
curl -fsSL https://tailscale.com/install.sh | sh

# Authenticate with a pre-auth key (non-interactive provisioning)
sudo tailscale up \
  --authkey="${TAILSCALE_AUTHKEY}" \
  --hostname="${NODE_NAME}" \
  --accept-routes \
  --ssh

On NixOS, declare it in the flake:

# modules/tailscale.nix
services.tailscale = {
  enable = true;
  authKeyFile = config.age.secrets.tailscale-authkey.path;
  extraUpFlags = [ "--ssh" "--accept-routes" ];
};

Verify connectivity from the control plane:

tailscale status
# 100.64.0.1   control-01    linux  -
# 100.64.0.2   worker-01     linux  active; direct
# 100.64.0.3   worker-02     linux  active; relay

2. Node Registration

Once Tailscale is up, register each node in the godotz.ai fleet manifest. This tells the control plane what roles each node can fulfill and what resources it has.

# fleet/nodes.yml
nodes:
  - name: control-01
    tailscale_ip: 100.64.0.1
    roles: [control-plane, litellm-proxy, temporal, langfuse]
    resources:
      cpu: 8
      ram_gb: 16

  - name: worker-01
    tailscale_ip: 100.64.0.2
    roles: [glm-worker, kg-writer]
    resources:
      cpu: 4
      ram_gb: 8

  - name: worker-02
    tailscale_ip: 100.64.0.3
    roles: [glm-worker, mnemopi-replica]
    resources:
      cpu: 4
      ram_gb: 8

Apply the manifest:

bd node sync --manifest fleet/nodes.yml
# Registered: control-01 (3 roles)
# Registered: worker-01 (2 roles)
# Registered: worker-02 (2 roles)

3. Deny-by-Default ACL Policy

The default Tailscale ACL grants every node access to every other node. For godotz.ai production fleets, lock this down to least-privilege.

{
  "acls": [
    {
      "comment": "Control plane can reach all workers",
      "action": "accept",
      "src": ["tag:control-plane"],
      "dst": ["tag:worker:*"]
    },
    {
      "comment": "Workers can reach LiteLLM proxy only",
      "action": "accept",
      "src": ["tag:worker"],
      "dst": ["tag:control-plane:7233,5432,6379,4000"]
    },
    {
      "comment": "Operators can SSH to all nodes",
      "action": "accept",
      "src": ["group:operators"],
      "dst": ["*:22"]
    }
  ],
  "tagOwners": {
    "tag:control-plane": ["group:operators"],
    "tag:worker": ["group:operators"]
  },
  "defaultAllow": false
}

Apply via Tailscale admin console or tailscale acl set:

# Export current policy for audit
tailscale acl get > acl-backup-$(date +%F).json

# Apply new deny-by-default policy
tailscale acl set < fleet/acl-policy.json

4. Port Reference

ServicePortWho Can Access
LiteLLM Proxy4000tag:worker, tag:control-plane
Temporal7233tag:control-plane only
Postgres5432tag:control-plane only
Redis6379tag:control-plane only
Langfuse3000group:operators
beads API8080tag:control-plane, tag:worker
ntfy (local)8090tag:control-plane only

5. Cross-Node Orchestration

Supervisors (running on the control plane) dispatch tasks to worker nodes through the NATS message bus. Workers pull tasks using their registered roles as subscriptions.

# On worker node — start the OMP agent runtime
omp agent start \
  --node-name worker-01 \
  --control-plane 100.64.0.1:4000 \
  --roles glm-worker,kg-writer \
  --nats-url nats://100.64.0.1:4222

Supervisor dispatch example (LangGraph):

from langgraph.prebuilt import create_react_agent

supervisor = create_react_agent(
    model="claude-opus-4-6",     # Antigravity orchestrator
    tools=[dispatch_to_worker],   # Sends task to NATS subject
)

# Task gets routed to worker-01 or worker-02 based on role
result = supervisor.invoke({
    "task": "summarize documents in batch",
    "required_role": "glm-worker",
    "preferred_model": "glm-4.5-air"
})

6. Failure Handling

Tailscale nodes that go offline are still reachable via relay (DERP) for up to 30 minutes. Configure godotz.ai to detect and reroute:

# config/fleet.yml
fleet:
  node_timeout_seconds: 30
  retry_on_node_failure: true
  failover_strategy: round_robin  # or: least_loaded, random
  alert_on_node_down: true        # fires ntfy urgent alert

Check current node health:

bd fleet health --watch
# Refreshes every 5s, highlights degraded nodes in red

Next Steps