Multi-Node Networking

godotz.ai fleets communicate over a Tailscale WireGuard mesh. Every node gets a stable 100.x.x.x address regardless of physical location — home lab, cloud VM, or co-lo. This guide covers Tailscale setup, node registration, ACL hardening, and cross-node orchestration.

1. Tailscale Installation

Install Tailscale on each node before adding it to the fleet.

# Linux (NixOS nodes get this via flake — skip manual install)
curl -fsSL https://tailscale.com/install.sh | sh

# Authenticate with a pre-auth key (non-interactive provisioning)
sudo tailscale up \
  --authkey="${TAILSCALE_AUTHKEY}" \
  --hostname="${NODE_NAME}" \
  --accept-routes \
  --ssh

On NixOS, declare it in the flake:

# modules/tailscale.nix
services.tailscale = {
  enable = true;
  authKeyFile = config.age.secrets.tailscale-authkey.path;
  extraUpFlags = [ "--ssh" "--accept-routes" ];
};

Verify connectivity from the control plane:

tailscale status
# 100.64.0.1   control-01    linux  -
# 100.64.0.2   worker-01     linux  active; direct
# 100.64.0.3   worker-02     linux  active; relay

2. Node Registration

Once Tailscale is up, register each node in the godotz.ai fleet manifest. This tells the control plane what roles each node can fulfill and what resources it has.

# fleet/nodes.yml
nodes:
  - name: control-01
    tailscale_ip: 100.64.0.1
    roles: [control-plane, litellm-proxy, temporal, langfuse]
    resources:
      cpu: 8
      ram_gb: 16

  - name: worker-01
    tailscale_ip: 100.64.0.2
    roles: [glm-worker, kg-writer]
    resources:
      cpu: 4
      ram_gb: 8

  - name: worker-02
    tailscale_ip: 100.64.0.3
    roles: [glm-worker, mnemopi-replica]
    resources:
      cpu: 4
      ram_gb: 8

Apply the manifest:

bd node sync --manifest fleet/nodes.yml
# Registered: control-01 (3 roles)
# Registered: worker-01 (2 roles)
# Registered: worker-02 (2 roles)

3. Deny-by-Default ACL Policy

The default Tailscale ACL grants every node access to every other node. For godotz.ai production fleets, lock this down to least-privilege.

{
  "acls": [
    {
      "comment": "Control plane can reach all workers",
      "action": "accept",
      "src": ["tag:control-plane"],
      "dst": ["tag:worker:*"]
    },
    {
      "comment": "Workers can reach LiteLLM proxy only",
      "action": "accept",
      "src": ["tag:worker"],
      "dst": ["tag:control-plane:7233,5432,6379,4000"]
    },
    {
      "comment": "Operators can SSH to all nodes",
      "action": "accept",
      "src": ["group:operators"],
      "dst": ["*:22"]
    }
  ],
  "tagOwners": {
    "tag:control-plane": ["group:operators"],
    "tag:worker": ["group:operators"]
  },
  "defaultAllow": false
}

Apply via Tailscale admin console or tailscale acl set:

# Export current policy for audit
tailscale acl get > acl-backup-$(date +%F).json

# Apply new deny-by-default policy
tailscale acl set < fleet/acl-policy.json

4. Port Reference

Service	Port	Who Can Access
LiteLLM Proxy	4000	`tag:worker`, `tag:control-plane`
Temporal	7233	`tag:control-plane` only
Postgres	5432	`tag:control-plane` only
Redis	6379	`tag:control-plane` only
Langfuse	3000	`group:operators`
beads API	8080	`tag:control-plane`, `tag:worker`
ntfy (local)	8090	`tag:control-plane` only

5. Cross-Node Orchestration

Supervisors (running on the control plane) dispatch tasks to worker nodes through the NATS message bus. Workers pull tasks using their registered roles as subscriptions.

# On worker node — start the OMP agent runtime
omp agent start \
  --node-name worker-01 \
  --control-plane 100.64.0.1:4000 \
  --roles glm-worker,kg-writer \
  --nats-url nats://100.64.0.1:4222

Supervisor dispatch example (LangGraph):

from langgraph.prebuilt import create_react_agent

supervisor = create_react_agent(
    model="claude-opus-4-6",     # Antigravity orchestrator
    tools=[dispatch_to_worker],   # Sends task to NATS subject
)

# Task gets routed to worker-01 or worker-02 based on role
result = supervisor.invoke({
    "task": "summarize documents in batch",
    "required_role": "glm-worker",
    "preferred_model": "glm-4.5-air"
})

6. Failure Handling

Tailscale nodes that go offline are still reachable via relay (DERP) for up to 30 minutes. Configure godotz.ai to detect and reroute:

# config/fleet.yml
fleet:
  node_timeout_seconds: 30
  retry_on_node_failure: true
  failover_strategy: round_robin  # or: least_loaded, random
  alert_on_node_down: true        # fires ntfy urgent alert

Check current node health:

bd fleet health --watch
# Refreshes every 5s, highlights degraded nodes in red

Next Steps

Fleet Setup — Control plane bootstrap
Model Routing — Route tasks to the right model
Security Gates — ACL and secret hardening