fix(web): make deploys safe for the native better-sqlite3 dependency

Restore the dependency-change guard that got overwritten on main, and
harden the deploy + worker shutdown so a flaky better-sqlite3 rebuild can
no longer take the site down.

Root cause of the recurring outages: tssbot-web is the only stack with a
native module (better-sqlite3) that must be downloaded/compiled on every
`npm ci`. The deploy ran `npm ci` unconditionally (the skip guard had been
reverted), with no timeout, and `npm ci` deletes node_modules first -- so a
single hung/failed native rebuild left the site unstartable, and a PM2
cluster restart on top wedged the daemon.

webhook.cjs:
- Restore the npm-ci skip guard: only reinstall when package.json /
  package-lock.json actually changed (previousHead captured before the
  pull), so code-only pushes never rebuild better-sqlite3. Defaults to
  installing on any uncertainty, and still installs if node_modules is
  incomplete.
- Add per-step timeouts to run() (DEPLOY_STEP_TIMEOUT_MS, and a tighter
  DEPLOY_INSTALL_TIMEOUT_MS for npm ci) so a stalled step is killed instead
  of hanging for hours with node_modules already deleted.
- Gate the deploy on better-sqlite3 actually loading (child-process load,
  not just require.resolve): force a reinstall when its native binary is
  missing, and abort before pm2 reload if it is still broken after install.

server.cjs:
- On shutdown, closeIdleConnections() + delayed closeAllConnections() so a
  worker stop/reload can't hang the full kill_timeout on idle keep-alives
  or a stuck upstream request.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
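
Illustrative note: the webhook.cjs hunk is not shown in this view, so the following is only a minimal sketch of how the skip guard and the native-module gate described above could fit together. The helper names `moduleLoads` and `shouldInstall`, their parameters, and the 15-second default are assumptions made for illustration, not the code in this commit.

```javascript
'use strict'
const { spawnSync } = require('node:child_process')

// Manifests named in the commit message: a change to either one means the
// installed dependencies may no longer match.
const MANIFESTS = new Set(['package.json', 'package-lock.json'])

// Hypothetical helper: returns true only if `name` can actually be required.
// The load happens in a child process so a crashing or hanging native binary
// cannot take down the deploy script itself. On timeout, spawnSync kills the
// child and reports a null status, which is treated as a failed load.
function moduleLoads(name, timeoutMs = 15000) {
  const result = spawnSync(
    process.execPath,
    ['-e', `require(${JSON.stringify(name)})`],
    { stdio: 'ignore', timeout: timeoutMs }
  )
  return result.status === 0
}

// Hypothetical helper: decides whether `npm ci` has to run.
// `changedFiles` is the list of paths changed between previousHead and the new
// HEAD, or null when that diff could not be computed.
function shouldInstall({ changedFiles, nodeModulesComplete, nativeLoads }) {
  // Any uncertainty about what changed: install.
  if (!Array.isArray(changedFiles)) return true
  // Incomplete node_modules or a native binary that will not load: install.
  if (!nodeModulesComplete || !nativeLoads) return true
  // Otherwise install only when a manifest actually changed.
  return changedFiles.some((file) => MANIFESTS.has(file))
}
```

Under these assumptions, a deploy would call `moduleLoads('better-sqlite3')` once before deciding whether to install and once more after `npm ci`, aborting before `pm2 reload` if the second check still fails.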
2026-07-01 22:23:57 +00:00
parent 2c34e9ad4a
commit 1fee214785
2 changed files with 105 additions and 5 deletions
@@ -3508,6 +3508,15 @@ function shutdown() {
    process.exit(0)
  })

+  // server.close() only fires its callback once every socket is gone, and idle
+  // HTTP keep-alive sockets (held open by nginx/Cloudflare) never close on
+  // their own — so without this the worker hangs the full kill_timeout on every
+  // stop/reload, which is what wedges the PM2 cluster daemon. Close idle sockets
+  // immediately, let in-flight requests finish for a short grace period, then
+  // force the rest so shutdown completes well inside kill_timeout.
+  server.closeIdleConnections()
+  setTimeout(() => server.closeAllConnections(), 3000).unref()
+
  setTimeout(() => {
    console.error('Graceful shutdown timed out')
    process.exit(1)