Monitoring & health

A sign is healthy when the kiosk is connected, the assigned URL is reachable, and the device telemetry is within normal ranges. The dashboard surfaces all three on every sign detail page; this page is the reference for what each indicator means.

Heartbeats

Every claimed sign sends a heartbeat to the backend over the WebSocket every 5 seconds. Each heartbeat carries:

status — online / offline / error / maintenance
currentUrl — what the kiosk is currently displaying
deviceInfo — platform, OS, MAC, IP, Tailscale IP, hostname, screen resolution, free disk, RAM, CPU
appVersion — desktop sign version
uptime — seconds since the sign app launched
cacheState — whether content is cached locally for offline operation

The backend writes the latest heartbeat into Redis (for fast reads) and a rolling window into PostgreSQL (for the uptime timeline).
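As a rough sketch, the heartbeat payload listed above could be typed like this. The top-level field names come from the list; the nested `deviceInfo` names, the value types (for example `cacheState` as a boolean), and the `looksLikeHeartbeat` helper are assumptions for illustration, not the actual wire format.

```typescript
// Hypothetical shape of one heartbeat. Nested names and value types are assumed.
type SignStatus = "online" | "offline" | "error" | "maintenance";

interface DeviceInfo {
  platform: string;
  os: string;
  mac: string;
  ip: string;
  tailscaleIp: string; // blank when Tailscale is not installed
  hostname: string;
  screenResolution: string; // assumed format, e.g. "1920x1080"
  freeDiskBytes: number;
  ramBytes: number;
  cpuPercent: number;
}

interface Heartbeat {
  status: SignStatus;
  currentUrl: string;
  deviceInfo: DeviceInfo;
  appVersion: string;
  uptime: number; // seconds
  cacheState: boolean; // true when content is cached for offline operation
}

// Checks only the top-level fields; deviceInfo is not inspected in depth.
function looksLikeHeartbeat(value: unknown): value is Heartbeat {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Partial<Record<keyof Heartbeat, unknown>>;
  const statuses: string[] = ["online", "offline", "error", "maintenance"];
  return (
    typeof v.status === "string" &&
    statuses.indexOf(v.status) !== -1 &&
    typeof v.currentUrl === "string" &&
    typeof v.deviceInfo === "object" &&
    v.deviceInfo !== null &&
    typeof v.appVersion === "string" &&
    typeof v.uptime === "number" &&
    typeof v.cacheState === "boolean"
  );
}
```

A guard like this is mainly useful if you consume heartbeats in your own tooling and want to reject malformed messages early.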

Online / offline transitions

The dashboard transitions a sign between Online and Offline based on heartbeat freshness:

State	Trigger
Online	Heartbeat received within the last 15 seconds
Offline	No heartbeat for 15+ seconds (3 missed in a row)
Online (recovered)	Was offline, now received a heartbeat — flagged briefly as "recently reconnected"

Why 15 seconds? Heartbeats fire every 5 seconds, so 15 seconds is exactly 3 missed in a row — strong enough to filter transient packet loss, fast enough that you find out about a real disconnect within the venue's typical "is something wrong?" reaction time.
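The freshness rule can be sketched in a few lines. The constant and function names here are hypothetical, and treating an age of exactly 15 s as offline is an assumption about the boundary.

```typescript
const HEARTBEAT_INTERVAL_MS = 5000;
const MISSED_BEATS_BEFORE_OFFLINE = 3;
// 3 missed heartbeats at 5 s each gives the 15 s threshold.
const OFFLINE_AFTER_MS = HEARTBEAT_INTERVAL_MS * MISSED_BEATS_BEFORE_OFFLINE;

type Connectivity = "online" | "offline";

// A sign counts as online while its newest heartbeat is younger than the threshold.
function connectivity(lastHeartbeatAtMs: number, nowMs: number): Connectivity {
  return nowMs - lastHeartbeatAtMs < OFFLINE_AFTER_MS ? "online" : "offline";
}
```

For example, `connectivity(0, 14999)` yields `"online"` and `connectivity(0, 15000)` yields `"offline"`.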

The backend stores each heartbeat in Redis with a 15 s TTL and runs a background sweep every 10 s to mark expired sign records offline. A true offline transition can therefore take up to ~25 s to surface: the heartbeat goes silent at T=0, the Redis entry expires at T+15 s, and the next sweep runs by T+25 s. Notifications are deferred a further 60 seconds to give the sign a chance to reconnect; see Notifications for why.
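To make the timing concrete, here is a small worked sketch of the timeline above. It assumes the entry expires 15 s after the last heartbeat, that a sweep running exactly at the expiry instant already sees the entry as expired, and that sweeps run on a fixed 10 s grid; the function names are hypothetical.

```typescript
const EXPIRY_AFTER_LAST_HEARTBEAT_S = 15; // entry expires at T+15 s
const SWEEP_INTERVAL_S = 10;
const NOTIFICATION_DEFERRAL_S = 60;

// Seconds after T=0 at which the first sweep at or after expiry runs.
// sweepPhaseS is the time of any one sweep tick relative to T=0.
function offlineMarkedAt(sweepPhaseS: number): number {
  const phase =
    ((sweepPhaseS % SWEEP_INTERVAL_S) + SWEEP_INTERVAL_S) % SWEEP_INTERVAL_S;
  const ticks = Math.ceil(
    (EXPIRY_AFTER_LAST_HEARTBEAT_S - phase) / SWEEP_INTERVAL_S
  );
  return phase + Math.max(0, ticks) * SWEEP_INTERVAL_S;
}

// Earliest moment a notification can go out once the sign is marked offline.
function notificationAt(markedAtS: number): number {
  return markedAtS + NOTIFICATION_DEFERRAL_S;
}

// Upper bound on detection: expiry plus one full sweep interval.
const WORST_CASE_MARKED_S = EXPIRY_AFTER_LAST_HEARTBEAT_S + SWEEP_INTERVAL_S; // 25
```

With sweeps landing at T+5 s, T+15 s, and so on, the sign is marked offline at T+15 s, the best case. With sweeps at T+4 s, T+14 s, T+24 s, it is marked at T+24 s, close to the 25 s bound, and the notification follows no earlier than 60 s after that.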

Sign states

A sign is in exactly one state at a time. The dashboard renders each with a consistent color:

State	Color	Meaning
Online	Green	Connected, heartbeating, content displaying
Offline	Yellow	Heartbeat is stale (no signal for 15+ s), likely a network or kiosk problem
Error	Red	Sign reported an explicit failure (e.g., couldn't load assigned URL)
Maintenance	Blue	Operator-controlled state. `Ctrl+Shift+Q` on the kiosk exits the sign app for maintenance (the watchdog won't relaunch while the `.maintenance` sentinel is present). The maintenance state on the dashboard is set by the dashboard itself, not by the kiosk's heartbeat. See Crash recovery.
Unlinked	Grey	Sign record exists but no physical device is linked yet

The color is consistent across the dashboard sign grid, the sign detail page, the mobile app, and notification badges.

State transitions are written to the Audit tab so you can answer "when did this sign go offline?" without grepping logs.

The "Monitoring" badge

A sign in monitoring mode shows a Monitoring badge on its dashboard card alongside the state color. The orthogonal mode field on the heartbeat carries 'monitoring' or 'active' — see Sign states → Orthogonal mode field. When the badge is showing:

The sign is healthy (heartbeat is current; sign is online)
The wall is intentionally dark — display hidden, audio muted
This is not a failure to escalate; the operator put the sign in this mode

Toggle off via the dashboard's Exit Monitoring button or `Ctrl+Shift+M` at the kiosk. See Remote control → Monitoring mode and Hotkeys.
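Because the mode is orthogonal to the state, tooling that reads sign data should decide on the two fields separately. The sketch below illustrates that; the `SignSnapshot` shape and both function names are hypothetical, and only the `'monitoring'` / `'active'` values come from the text above.

```typescript
type CardState = "online" | "offline" | "error" | "maintenance" | "unlinked";
type SignMode = "monitoring" | "active";

interface SignSnapshot {
  state: CardState;
  mode: SignMode;
}

// The badge depends only on the mode, never on the state.
function showsMonitoringBadge(sign: SignSnapshot): boolean {
  return sign.mode === "monitoring";
}

// Escalation depends only on the state: a dark wall in monitoring mode
// is intentional and is not a failure by itself.
function needsAttention(sign: SignSnapshot): boolean {
  return sign.state === "offline" || sign.state === "error";
}
```

An online sign in monitoring mode therefore shows the badge and needs no attention, while an offline sign needs attention whichever mode it last reported.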

Uptime tracking

The dashboard computes uptime two ways:

Per-sign uptime % for the lifetime of the event: online time / event time
Per-event uptime %: average across all signs in the event

You'll see both on the event detail page. A few patterns worth recognizing:

>99% is normal for a properly-deployed event
95-99% typically reflects venue Wi-Fi flapping rather than kiosk failure — the wall is up, the dashboard just sees the connection drop briefly
Below 95% suggests genuine trouble — either a network problem you can fix or a sign in a flaky state

Uptime resets at the start of an event, so historical events keep their stats and a new event starts fresh.
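The two uptime figures and the bands described above can be sketched as follows. The function names are hypothetical, returning 0 for an empty or not-yet-started event is an assumed convention, and placing exactly 99% in the middle band follows the ">99%" wording.

```typescript
// Per-sign uptime as a percentage of the event's duration.
function signUptimePercent(onlineSeconds: number, eventSeconds: number): number {
  if (eventSeconds <= 0) return 0; // assumed convention
  return (onlineSeconds / eventSeconds) * 100;
}

// Per-event uptime: the plain average across the event's signs.
function eventUptimePercent(signPercents: number[]): number {
  if (signPercents.length === 0) return 0; // assumed convention
  const total = signPercents.reduce((sum, p) => sum + p, 0);
  return total / signPercents.length;
}

type UptimeBand = "normal" | "likely Wi-Fi flapping" | "genuine trouble";

function uptimeBand(percent: number): UptimeBand {
  if (percent > 99) return "normal";
  if (percent >= 95) return "likely Wi-Fi flapping";
  return "genuine trouble";
}
```

Note that the per-event figure is an unweighted average, so one badly behaved sign in a small event moves it a lot: two signs at 100% and 50% average to 75%.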

Device info

Every heartbeat carries device telemetry. The dashboard surfaces it on the Device info card:

Field	Source
Platform / OS version	`os.platform()` + `os.release()`
Hostname	`os.hostname()` — useful when you set custom names like `LOBBY-SIGN-01`
MAC	First non-internal NIC at first boot (stored, doesn't change)
Local IP	Current primary interface IP
Tailscale IP	`tailscale ip -4` if installed, blank otherwise
Screen resolution	Per the primary display
CPU / RAM / Free disk	Snapshot at heartbeat time
Sign app version	Build version of the desktop sign
Uptime	Seconds since the sign app launched (reset by Reboot app or Reboot device)

Telemetry is for triage, not surveillance: use it to answer "is this sign stuck?" or "did somebody reboot the device an hour ago?", not to build performance dashboards.
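For orientation, here is a minimal sketch of how some of these fields can be read with Node's built-in `os` module. It is not the sign app's actual code: `firstExternalMac` and `collectDeviceInfo` are hypothetical names, the result shape is assumed, and persisting the MAC at first boot, the Tailscale lookup, screen resolution, and free disk are left out.

```typescript
import * as os from "os";

interface NicLike {
  mac: string;
  internal: boolean;
}

// Returns the MAC of the first non-internal interface, or "" if there is none.
function firstExternalMac(nics: Record<string, NicLike[] | undefined>): string {
  for (const name of Object.keys(nics)) {
    for (const nic of nics[name] ?? []) {
      if (!nic.internal) return nic.mac;
    }
  }
  return "";
}

// Snapshot of the values available from the os module at call time.
function collectDeviceInfo() {
  return {
    platform: os.platform(),
    osVersion: os.release(),
    hostname: os.hostname(),
    mac: firstExternalMac(os.networkInterfaces()),
    totalRamBytes: os.totalmem(),
    freeRamBytes: os.freemem(),
    cpuCount: os.cpus().length,
  };
}
```

Running `collectDeviceInfo()` on a workstation is a quick way to see what kind of values the Device info card is built from.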

Content reachability

Independent of the kiosk's connection to us, the kiosk monitors whether the assigned URL is reachable by doing an HTTP HEAD every 60 seconds. The dashboard reports this on a per-sign and per-event basis:

Reachable — last HEAD succeeded (2xx or 3xx)
Unreachable — last HEAD failed (timeout, 4xx, 5xx, DNS failure)

Reachability state is independent of the sign's online/offline state:

A sign can be Online but with Unreachable content — the kiosk reaches us, but its content origin is down
A sign can be Offline with Reachable content (last known) — the kiosk lost its WebSocket but the content URL was working at last check

When content goes Unreachable, the kiosk continues displaying the cached version and notifies subscribers (see Notifications). The wall doesn't blank — you have time to fix the content side without an audience seeing the failure.
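The classification rule above reduces to a status-code check. In this sketch, `null` stands for a request that produced no HTTP response at all (timeout, DNS failure); that representation and the function name are assumptions.

```typescript
type Reachability = "reachable" | "unreachable";

// status is the HTTP status of the last HEAD request, or null when the
// request failed before any response arrived.
function classifyHead(status: number | null): Reachability {
  return status !== null && status >= 200 && status < 400
    ? "reachable"
    : "unreachable";
}
```

For example, `classifyHead(301)` is `"reachable"`, while `classifyHead(404)`, `classifyHead(503)`, and `classifyHead(null)` are all `"unreachable"`. The result is tracked separately from the online/offline state, so neither value should be derived from the other.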

Local diagnostics on the kiosk

Sometimes you want to look at health from the sign's side rather than the dashboard's. With keyboard access to the kiosk, press `Ctrl+Shift+S` to open the Status Dashboard overlay on the kiosk itself:

Connection status (WebSocket state, last heartbeat sent, last command received)
Sign ID, short code, MAC
Backend URL, WebSocket URL
IP addresses (LAN + Tailscale)
Cache status (items cached, size, last sync)
Recent error count

Press Esc to dismiss. Techs also open this overlay when triaging a misbehaving sign in person: it answers "is this device even reaching the backend?" without leaving the venue.

When to escalate

A few patterns and what to do about them:

Pattern	What it means	Action
One sign offline for >2 minutes	Kiosk-specific — the others are fine	Troubleshoot offline
Multiple signs offline at once	Network or backend issue	Check venue Wi-Fi first. Multiple signs across multiple venues going offline at the same time usually means a backend incident — we'll email an alert if so
All signs online, content unreachable	Your content URL is down	Fix content side, or assign a fallback URL
Sign cycling online/offline rapidly	Wi-Fi flapping or kiosk DNS issues	Network resilience
Sign in Error state	Content failed to load	Fetch logs, look for the failed URL or HTTP error

Reach for Troubleshooting for symptom-by-symptom playbooks.
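If you script your own checks against sign data, the table above can be approximated as a simple decision function. This is a sketch under stated assumptions: the `SignHealth` shape, the pattern labels, and the order in which patterns are tested are all hypothetical, and it ignores durations (the ">2 minutes" condition) and rapid online/offline cycling.

```typescript
interface SignHealth {
  state: "online" | "offline" | "error";
  contentReachable: boolean;
}

type Pattern =
  | "multiple-offline" // check venue Wi-Fi first, then suspect a backend incident
  | "single-offline" // troubleshoot that kiosk
  | "sign-error" // fetch logs, look for the failed URL or HTTP error
  | "content-down" // fix the content side or assign a fallback URL
  | "ok";

function triage(signs: SignHealth[]): Pattern {
  const offline = signs.filter((s) => s.state === "offline").length;
  if (offline > 1) return "multiple-offline";
  if (offline === 1) return "single-offline";
  if (signs.some((s) => s.state === "error")) return "sign-error";
  if (signs.length > 0 && signs.every((s) => !s.contentReachable)) {
    return "content-down";
  }
  return "ok";
}
```

Connectivity problems are tested first because an offline sign's reachability value is only the last known one.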

What's next

Notifications — push-based alerts for the state transitions covered here
Remote control — the commands you'll typically pair with monitoring (fetch logs first, then refresh or reboot)
Reference → Sign states — exhaustive state transition table