Session Monitoring Runbook

Procedure for monitoring an active streaming session in real time. Use it during manual testing, incident investigation, or when validating infrastructure changes. It can be run by a human or handed to a Claude Code agent.

When to Use

Manual mic/audio testing on a body
Validating a new deployment (hydrabody, hydravoice, hydraneckwebrtc)
Investigating stream quality issues or session drops
Post-change verification of streaming pipeline

Setup

NECKWEBRTC_TOKEN="f467a5338adf17903131e525af422a6b7df31c3a2a864f3b13026a14974dfc35"
BODYSTATUS_TOKEN="b191130cb189b0b74337532e766220d9cda223a38bc7282a67f976167b9dfaeb"
HYDRACLUSTER_TOKEN="c21ff820b95c59c5301d797b58fa262240a127c81b45262f314022647623b76d"
RELAY="https://hydraneckwebrtc.experiencenet.com"
HYDRACLUSTER_BIN="/home/claude-user/hydracluster/bin/hydracluster"
HYDRACLUSTER_SERVER="https://hydracluster.experiencenet.com"
STREAMING_MONITOR="https://hydrastreamingmonitor.experiencenet.com"

Set these per session:

SESSION_ID="<from session creation>"
NODE_ID="<from hydracluster>"
BODY_NAME="<from hydracluster>"

Monitoring Loop

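Poll every 10 seconds. Report body telemetry, relay session status, mic activity, ffplay stability, and anomalies. The loop below runs 60 iterations (about 10 minutes); the ffplay and mic checks run on every third iteration (every 30 seconds).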

LAST_FFPLAY_PID=""

for i in $(seq 1 60); do
    sleep 10

    # Body telemetry
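    # gpu_mismatch_sec is projected too so the alert check below can read it
    # (assumes it is a top-level field on the body object).
    BODY=$(curl -sf -H "Authorization: Bearer $BODYSTATUS_TOKEN" \
      https://hydrabodystatus.experiencenet.com/api/v1/bodies | \
      jq -c ".[] | select(.name==\"$BODY_NAME\") | {gpu: .gpu_utilization_pct, vram: .gpu_memory_used_mb, streams: .stream_count, gpu_mismatch_sec}")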

    # Relay session status
    SESS=$(curl -sf -H "Authorization: Bearer $NECKWEBRTC_TOKEN" \
      "$RELAY/api/v1/sessions" | \
      jq -c "[.[].sessions[]? | {id: .id[0:8], status}]")

    # Anomalies
    ANOM=$(curl -sf "$STREAMING_MONITOR/api/v1/body-anomalies" | \
      jq "[.[] | select(.body_name==\"$BODY_NAME\")] | length")
    ANOM=${ANOM:-0}  # a failed fetch yields an empty string, which breaks the numeric test below

    echo "[$(date +%H:%M:%S)] $BODY relay=$SESS anomalies=$ANOM"

    # ffplay stability check (every 30s)
    if [ $((i % 3)) -eq 0 ]; then
        FFPLAY_PID=$($HYDRACLUSTER_BIN exec "$NODE_ID" \
          'Get-Process ffplay -ErrorAction SilentlyContinue | Select-Object -ExpandProperty Id' \
          --server "$HYDRACLUSTER_SERVER" --admin-token "$HYDRACLUSTER_TOKEN" \
          --timeout 10s --json 2>/dev/null | jq -r '.stdout' | tr -d '\r\n ')
        if [ -n "$LAST_FFPLAY_PID" ] && [ "$FFPLAY_PID" != "$LAST_FFPLAY_PID" ]; then
            echo "  *** ALERT: ffplay PID changed $LAST_FFPLAY_PID -> $FFPLAY_PID (crash/restart) ***"
        fi
        LAST_FFPLAY_PID="$FFPLAY_PID"
    fi

    # Mic log check (every 30s)
    if [ $((i % 3)) -eq 0 ]; then
        MIC=$(ssh root@46.225.220.240 \
          "journalctl -u hydraneckwebrtc.service --since '30 sec ago' --no-pager 2>/dev/null" 2>/dev/null | \
          grep -i "mic" | tail -1 | sed 's/.*hydraneckwebrtc\[.*\]: //')
        [ -n "$MIC" ] && echo "  mic: $MIC"
    fi

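    # Alert conditions
    # Floor to integers and default to 0: [ -gt ] rejects floats and empty strings.
    GPU=$(echo "$BODY" | jq -r '(.gpu // 0) | floor')
    STREAMS=$(echo "$BODY" | jq -r '.streams // 0')
    MISMATCH=$(echo "$BODY" | jq -r '(.gpu_mismatch_sec // 0) | floor')
    GPU=${GPU:-0}; STREAMS=${STREAMS:-0}; MISMATCH=${MISMATCH:-0}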
    if [ "$MISMATCH" -gt 0 ]; then
        echo "  *** ALERT: GPU mismatch active for ${MISMATCH}s — hydrabody will kill at 180s ***"
    elif [ "$GPU" -gt 30 ] && [ "$STREAMS" -eq 0 ] && [ "$i" -gt 6 ]; then
        echo "  *** ALERT: GPU active with 0 streams — possible orphan ***"
    fi
    if [ "$ANOM" -gt 0 ]; then
        echo "  *** ALERT: body anomalies detected ***"
    fi
    if ! echo "$SESS" | grep -q "streaming"; then
        echo "  *** ALERT: session no longer streaming ***"
    fi
done
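
The loop interpolates $BODY_NAME directly into the jq program text, which breaks if a body name ever contains a quote or backslash. A more defensive variant passes the name through jq's --arg option. The sketch below is illustrative only: the function name body_summary is hypothetical, and it assumes the /api/v1/bodies response is a JSON array of objects with the fields used in the loop.

```shell
# Hypothetical helper: select one body by name from the bodies array on stdin.
# $1: body name. Passing it via --arg keeps it out of the jq program text.
body_summary() {
    jq -c --arg name "$1" \
      '.[] | select(.name == $name)
           | {gpu: .gpu_utilization_pct, vram: .gpu_memory_used_mb, streams: .stream_count}'
}
```

Usage would be along the lines of: curl -sf -H "Authorization: Bearer $BODYSTATUS_TOKEN" https://hydrabodystatus.experiencenet.com/api/v1/bodies | body_summary "$BODY_NAME"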

Post-Session Teardown Monitoring

Run after the user ends their session (or after deleting via API).

# Delete session if still active
curl -sf -X DELETE -H "Authorization: Bearer $NECKWEBRTC_TOKEN" \
    "$RELAY/api/v1/sessions/$SESSION_ID" \
  && echo "Session deleted" \
  || echo "DELETE failed (session may already be gone)"

# Monitor for 60s
echo "Monitoring for orphans and respawns (60s)..."
for check in 15 30 45 60; do
    sleep 15
    BODY=$(curl -sf -H "Authorization: Bearer $BODYSTATUS_TOKEN" \
      https://hydrabodystatus.experiencenet.com/api/v1/bodies | \
      jq -c ".[] | select(.name==\"$BODY_NAME\") | {gpu: .gpu_utilization_pct, streams: .stream_count}")
    ANOM=$(curl -sf "$STREAMING_MONITOR/api/v1/body-anomalies" | \
      jq "[.[] | select(.body_name==\"$BODY_NAME\" and .type==\"orphan_stream\")] | length")
    echo "[${check}s] $BODY orphan_anomalies=$ANOM"
done

# Final state
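# Floor the value so the integer comparison works even if the API returns a float.
GPU=$(curl -sf -H "Authorization: Bearer $BODYSTATUS_TOKEN" \
  https://hydrabodystatus.experiencenet.com/api/v1/bodies | \
  jq -r ".[] | select(.name==\"$BODY_NAME\") | (.gpu_utilization_pct // 0) | floor")
if [ -z "$GPU" ]; then
    echo "FAIL: could not read GPU utilization for $BODY_NAME"
elif [ "$GPU" -lt 10 ]; then
    echo "PASS: GPU idle"
else
    echo "FAIL: GPU at ${GPU}%"
fi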

Alert Conditions

| Condition | Meaning | Action |
|-----------|---------|--------|
| ffplay PID changed | ffplay crashed and hydravoice restarted it | Check audio device access, VB-Cable status |
| gpu_mismatch_sec > 0 | GPU active with no tracked session; hydrabody watchdog counting down | Resolves automatically at 180s; kill sooner via hydracluster exec if needed |
| GPU > 30% with 0 streams (no gpu_mismatch_sec) | Orphan on older hydrabody (<v1.11.42) | Kill via hydracluster exec, upgrade hydrabody |
| Body anomalies > 0 | Streaming monitor detected relaunch or orphan | Check recent_cleanups in hydrabodystatus |
| Session not streaming | Relay lost the session | Check worker health, moonlight-web-stream process |
| Mic connection closed | WebRTC mic dropped | User may need to refresh browser; check relay logs |
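
The GPU-related rows of this table reduce to a small decision function. The following is a minimal sketch, not part of the runbook's tooling: the name classify_gpu_state and its output labels are hypothetical, and the 30% threshold is taken from the monitoring loop.

```shell
# Hypothetical helper: map GPU telemetry to the GPU-related alert rows.
# $1: GPU utilization percent (integer)
# $2: stream count (integer)
# $3: gpu_mismatch_sec (integer, 0 when the field is absent)
classify_gpu_state() {
    local gpu="$1" streams="$2" mismatch="$3"
    if [ "$mismatch" -gt 0 ]; then
        # hydrabody's watchdog has noticed the mismatch and is counting down.
        echo "gpu_mismatch"
    elif [ "$gpu" -gt 30 ] && [ "$streams" -eq 0 ]; then
        # No watchdog field but GPU is busy: likely an older hydrabody.
        echo "orphan"
    else
        echo "ok"
    fi
}
```

The watchdog check comes first because a non-zero gpu_mismatch_sec means hydrabody is already handling the condition, so the orphan alert would be redundant.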

Stream Page (provider_status errors)

The /stream page body cards surface provider health at a glance. When a body has a non-empty provider_status that is not "running" (e.g. sunshine_api_unreachable), the card renders:

An amber provider error badge replacing the normal idle badge
An amber warning message showing the raw status string (e.g. ⚠ Provider unreachable — sunshine_api_unreachable)
An amber left border on the card instead of the default grey

This tells operators the body cannot serve streams before they attempt to assign one. The card still shows the log expander for investigation.

If the degraded body is also streaming (unusual, but possible after a partial failure), the streaming badge is preserved and the provider error badge appears alongside it.
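
The badge rules above can be summarised as a small function. This is only an illustration of the described behaviour, not the page's actual rendering code: the function name card_badges and the badge labels it prints are hypothetical.

```shell
# Hypothetical helper: print the badges a body card would show.
# $1: provider_status (may be empty)
# $2: "true" if the body is currently streaming, anything else otherwise
card_badges() {
    local provider_status="$1" streaming="$2" badges=""
    if [ "$streaming" = "true" ]; then
        badges="streaming"
    fi
    if [ -n "$provider_status" ] && [ "$provider_status" != "running" ]; then
        # Degraded provider: the error badge replaces idle, or sits next to streaming.
        badges="${badges:+$badges }provider-error"
    elif [ -z "$badges" ]; then
        badges="idle"
    fi
    echo "$badges"
}
```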

| Alert | Meaning | Action |
|-------|---------|--------|
| provider error badge on idle card | Sunshine API not reachable from hydrabody | Check Sunshine service on body; use hydracluster exec to restart if needed |
| provider error badge on streaming card | Body is streaming but Sunshine API is unresponsive | Session may be in a degraded state; monitor closely and consider stopping the stream |

Sessions Page

The /sessions table shows active and historical sessions with the following columns:

Body — body name, with an amber Sunshine unreachable badge when provider_status is sunshine_api_unreachable
Head — resolved head node name (e.g. ipad-head-2); falls back to raw node ID if name not available
Experience, Started At, Duration, Body HB, Head HB, Client, Status, Logs

The page auto-refreshes every 5 s when at least one active session exists, and every 30 s when idle. The current rate is shown in the top-right corner.
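
A script that polls alongside the page could mirror the same cadence. The sketch below only encodes the rule stated above; the function name sessions_refresh_interval is hypothetical.

```shell
# Hypothetical helper: refresh interval in seconds for a given active-session count.
# $1: number of active sessions (integer)
sessions_refresh_interval() {
    if [ "$1" -gt 0 ]; then
        echo 5    # at least one active session: refresh quickly
    else
        echo 30   # idle: refresh slowly
    fi
}
```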

Each row has a logs link that opens /sessions/{id}/logs, which fetches the session log payload from hydracluster (GET /api/v1/sessions/{id}/logs) and renders body and head log lines side-by-side. Use this when investigating a specific session without needing to exec onto the body.
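
The same payload can in principle be fetched from the command line. This is a sketch under assumptions: it assumes the endpoint accepts the same bearer token as the other hydracluster calls in this runbook, and because the response shape is not documented here, the payload is printed as-is rather than parsed. The function names are hypothetical.

```shell
# Build the logs URL for a session ID (path taken from the text above).
# Reads HYDRACLUSTER_SERVER from the Setup section.
session_logs_url() {
    printf '%s/api/v1/sessions/%s/logs\n' "$HYDRACLUSTER_SERVER" "$1"
}

# Fetch the raw log payload for a session.
# Assumption: bearer auth with HYDRACLUSTER_TOKEN works for this endpoint.
fetch_session_logs() {
    curl -sf -H "Authorization: Bearer $HYDRACLUSTER_TOKEN" "$(session_logs_url "$1")"
}
```

Usage: fetch_session_logs "$SESSION_ID"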

Orphaned Stream Recovery

An orphaned body is one that reports stream_status=streaming but has no head connected. It blocks the district — new heads cannot pair with it until the stream is cleared.

In most cases no manual action is needed. hydrabody self-heals in two stages:

Grace period (~1 min): session ended cleanly but stream state not yet cleared.
GPU mismatch watchdog (~3 min): GPU still active with no Moonlight session — hydrabody kills the orphan at 180 s.

The /bodies page shows these in a "Self-Healing" panel at the top (blue, informational) with the expected auto-resolve time. Each body row also has an always-visible Force Stop button in the Actions column — use it only if the body is still stuck beyond the expected window, or follow the API steps below for reliable feedback during events.
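
To judge whether a body is "beyond the expected window", the remaining watchdog time can be estimated from gpu_mismatch_sec. A minimal sketch, assuming the 180 s limit stated above; the function and variable names are hypothetical.

```shell
WATCHDOG_LIMIT_SEC=180   # hydrabody kills the orphan at this point (per the text above)

# Hypothetical helper: seconds left before the GPU mismatch watchdog fires.
# $1: current gpu_mismatch_sec (integer)
watchdog_remaining() {
    local elapsed="$1"
    local remaining=$((WATCHDOG_LIMIT_SEC - elapsed))
    if [ "$remaining" -lt 0 ]; then
        remaining=0   # already past the limit: the kill should have happened
    fi
    echo "$remaining"
}
```

A result of 0 while the body still reports streaming suggests the watchdog did not clear the orphan and a manual Force Stop is warranted.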

Symptoms

A head (iPad, kiosk) is stuck on the experience selection screen and never starts streaming, even though a body is online in the district
The streaming monitor /bodies page shows a "Self-Healing" panel with the body name
The body row in the table shows an orange orphan chip (grace period) or yellow gpu orphan chip (GPU watchdog active) with an estimated auto-clear time; a Force Stop button is visible directly in the Actions column without expanding the row
Tap/click the row to expand it for more detail: the Active Session, GPU Watchdog, and Recent Cleanups blocks give full context inline
The district has no other available bodies (all others are offline or also orphaned)

Identify

Note: $HYDRACLUSTER_TOKEN is set in the Setup section at the top of this runbook.

curl -sf -H "Authorization: Bearer $HYDRACLUSTER_TOKEN" \
  "https://hydracluster.experiencenet.com/api/v1/nodes" | \
  jq '[.[] | select(.stream_status=="streaming") | {name, id, district}]'

Cross-reference with the streaming monitor /bodies page — the "Self-Healing" panel shows only bodies where stream_status=streaming and no head is actively connected. Each body row has a Force Stop button visible without expansion; tap the row to expand it for session detail, GPU watchdog countdown, and recent cleanup history.

Fix

NODE_ID="<node-id from above>"
curl -s -X DELETE \
  -H "Authorization: Bearer $HYDRACLUSTER_TOKEN" \
  "https://hydracluster.experiencenet.com/api/v1/nodes/$NODE_ID/stream"

Expected: {"status":"ok","exec":"exec succeeded","output":"{\"status\":\"ok\"}"}

If exec says failed or timed out: the DELETE cleared the cluster's in-memory state, but the signal did not reach hydrabody. The body's next heartbeat (within 30s) will overwrite the state back to streaming — try the DELETE again. To verify exec channel health:

$HYDRACLUSTER_BIN exec "$NODE_ID" "echo ok" \
  --server "$HYDRACLUSTER_SERVER" --admin-token "$HYDRACLUSTER_TOKEN"
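
The "try the DELETE again" step can be wrapped in a small retry helper. This is a sketch, not an official tool: the function names, the default of 3 attempts, and the 5 s pause are assumptions, and success is detected by matching the "exec":"exec succeeded" field from the expected response shown above.

```shell
# Return 0 when a DELETE response body reports a successful exec.
# $1: raw JSON response text
exec_succeeded() {
    printf '%s' "$1" | grep -Eq '"exec"[[:space:]]*:[[:space:]]*"exec succeeded"'
}

# Retry the stream DELETE until exec succeeds.
# $1: node ID, $2: max attempts (assumed default 3), $3: pause in seconds (assumed default 5)
force_stop_with_retry() {
    local node_id="$1" attempts="${2:-3}" pause="${3:-5}" resp n
    for n in $(seq 1 "$attempts"); do
        resp=$(curl -s -X DELETE -H "Authorization: Bearer $HYDRACLUSTER_TOKEN" \
            "$HYDRACLUSTER_SERVER/api/v1/nodes/$node_id/stream")
        if exec_succeeded "$resp"; then
            echo "attempt $n: exec succeeded"
            return 0
        fi
        echo "attempt $n: exec did not succeed: $resp" >&2
        sleep "$pause"
    done
    return 1
}
```

Usage: force_stop_with_retry "$NODE_ID". If every attempt fails, check exec channel health with the hydracluster exec command above before retrying further.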

Verify

Wait 30 seconds (one hydrabody heartbeat cycle), then:

curl -sf -H "Authorization: Bearer $HYDRACLUSTER_TOKEN" \
  "https://hydracluster.experiencenet.com/api/v1/nodes/$NODE_ID" | \
  jq '{stream_status}'

Expected: stream_status: "idle". The /bodies page also auto-refreshes every 30s.

Still streaming after 60 seconds? The exec channel may be down. The session watchdog will self-heal automatically once the body's next heartbeat lands (up to 60s after hydracluster detects silence). If the body is unreachable on the exec channel for more than 5 minutes, escalate to check network/WireGuard between the body and hydracluster.
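
The wait-then-check step can be scripted as a bounded poll. The sketch below is illustrative: the function names are hypothetical, and node_stream_status simply wraps the Verify call shown above (it needs jq and network access).

```shell
# Poll a status command until it prints "idle".
# $1: name of a command or function that prints the current stream_status
# $2: maximum number of checks
# $3: seconds to sleep between checks
poll_until_idle() {
    local fetch="$1" attempts="$2" interval="$3" status n
    for n in $(seq 1 "$attempts"); do
        status=$("$fetch")
        if [ "$status" = "idle" ]; then
            echo "idle after $n check(s)"
            return 0
        fi
        sleep "$interval"
    done
    echo "still '$status' after $attempts check(s)" >&2
    return 1
}

# Status fetcher matching the Verify call; uses NODE_ID and the Setup variables.
node_stream_status() {
    curl -sf -H "Authorization: Bearer $HYDRACLUSTER_TOKEN" \
        "$HYDRACLUSTER_SERVER/api/v1/nodes/$NODE_ID" | jq -r '.stream_status'
}
```

Usage: poll_until_idle node_stream_status 2 30 checks at 30 s and 60 s, matching the two thresholds in this section. A non-zero exit is the cue to investigate the exec channel.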

Notes

The Force Stop button (visible per row in the Actions column, and in the "Self-Healing" panel) does the same DELETE call. Prefer the direct API call during events — it shows the exec result immediately.
The Force Stop button redirects immediately, so the page may still show the body as orphaned until the next 30s auto-refresh, even if the fix worked.
Clicking Force Stop is safe to retry: if exec fails, the body's own heartbeat keeps the state consistent.
Use the Create Issue button (per row in the Actions column, or File Issue in the "Self-Healing" panel) to log the incident on issues.experiencenet.com.

Agent Handoff

To hand this to a Claude Code agent, provide:

Monitor the active streaming session on [BODY_NAME]. Session ID is [SESSION_ID]. Follow the session monitoring runbook at hydrastreamingmonitor/docs/runbooks/session-monitoring.md. Poll for [duration]. Alert on ffplay PID changes, orphans, mic drops, and session status changes. When the user says they're done, run the post-session teardown monitoring.
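
To avoid hand-editing the placeholders, the prompt can be rendered from shell variables. A minimal sketch; the function name handoff_prompt is hypothetical, and the text is copied from the template above.

```shell
# Hypothetical helper: fill in the handoff prompt.
# $1: body name, $2: session ID, $3: polling duration (free text, e.g. "10 minutes")
handoff_prompt() {
    local body="$1" session="$2" duration="$3"
    cat <<EOF
Monitor the active streaming session on $body. Session ID is $session. Follow the session monitoring runbook at hydrastreamingmonitor/docs/runbooks/session-monitoring.md. Poll for $duration. Alert on ffplay PID changes, orphans, mic drops, and session status changes. When the user says they're done, run the post-session teardown monitoring.
EOF
}
```

Usage: handoff_prompt "$BODY_NAME" "$SESSION_ID" "10 minutes"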