fix(web): make deploys safe for the native better-sqlite3 dependency

Restore the dependency-change guard that got overwritten on main, and harden
the deploy + worker shutdown so a flaky better-sqlite3 rebuild can no longer
take the site down.

Root cause of the recurring outages: tssbot-web is the only stack with a
native module (better-sqlite3) that must be downloaded/compiled on every
`npm ci`. The deploy ran `npm ci` unconditionally (the skip guard had been
reverted), with no timeout, and `npm ci` deletes node_modules first -- so a
single hung/failed native rebuild left the site unstartable, and a PM2
cluster restart on top wedged the daemon.

webhook.cjs:
- Restore the npm-ci skip guard: only reinstall when package.json /
  package-lock.json actually changed (previousHead captured before the pull),
  so code-only pushes never rebuild better-sqlite3. Defaults to installing on
  any uncertainty, and still installs if node_modules is incomplete.
- Add per-step timeouts to run() (DEPLOY_STEP_TIMEOUT_MS, and a tighter
  DEPLOY_INSTALL_TIMEOUT_MS for npm ci) so a stalled step is killed instead of
  hanging for hours with node_modules already deleted.
- Gate the deploy on better-sqlite3 actually loading (child-process load, not
  just require.resolve): force a reinstall when its native binary is missing,
  and abort before pm2 reload if it is still broken after install.
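
A minimal sketch of the three webhook.cjs checks above, not the actual
implementation. The helper names (needsInstall, runStep, nativeModuleLoads),
the argument shapes, and the 15-second default are hypothetical; it assumes
the changed-file list comes from something like `git diff --name-only`
between previousHead and the new HEAD, with paths relative to the repo root.

```javascript
const { spawnSync } = require('node:child_process')

// Hypothetical skip guard: reinstall only when the manifest or lockfile
// changed. Anything uncertain (no file list, incomplete node_modules)
// falls back to installing.
function needsInstall(changedFiles, nodeModulesComplete) {
  if (!nodeModulesComplete) return true
  if (!Array.isArray(changedFiles)) return true
  return changedFiles.some(
    (file) => file === 'package.json' || file === 'package-lock.json'
  )
}

// Hypothetical step runner: kill the child once timeoutMs elapses instead
// of letting a stalled step hang the deploy.
function runStep(cmd, args, timeoutMs, cwd) {
  const result = spawnSync(cmd, args, {
    cwd,
    timeout: timeoutMs,
    killSignal: 'SIGKILL',
    encoding: 'utf8',
  })
  if (result.error) {
    // spawnSync reports a timeout as an error with code ETIMEDOUT.
    return { ok: false, reason: result.error.code || result.error.message }
  }
  if (result.status !== 0) {
    return { ok: false, reason: `exit ${result.status ?? result.signal}` }
  }
  return { ok: true, reason: null }
}

// Hypothetical load gate: require() the module in a child process so a
// missing or broken native binary fails there, not in the deploy script.
function nativeModuleLoads(moduleName, cwd, timeoutMs = 15000) {
  const script = `require(${JSON.stringify(moduleName)})`
  return runStep(process.execPath, ['-e', script], timeoutMs, cwd).ok
}
```

Under these assumptions the deploy would run the install step when
needsInstall(...) is true or nativeModuleLoads('better-sqlite3', appDir) is
false, then re-check nativeModuleLoads before reloading and abort if it
still fails.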

server.cjs:
- On shutdown, call closeIdleConnections() immediately and
  closeAllConnections() after a short delay, so a worker stop/reload can't
  hang for the full kill_timeout on idle keep-alives or a stuck upstream
  request.
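
The shutdown sequence can be sketched in isolation as follows. This is an
illustration only: gracefulClose and its graceMs parameter are hypothetical
names, and the real handler in server.cjs also exits the process. It relies
on http.Server's closeIdleConnections() and closeAllConnections(), which
require Node.js 18.2 or later.

```javascript
const http = require('node:http')

// Hypothetical helper: stop accepting connections, drop idle keep-alive
// sockets right away, and force-close whatever is still open after graceMs.
function gracefulClose(server, graceMs, onClosed) {
  // The close callback only fires once every socket is gone.
  server.close(onClosed)
  // Idle keep-alive sockets would otherwise hold the server open.
  server.closeIdleConnections()
  // unref() so this timer alone never keeps the process alive.
  setTimeout(() => server.closeAllConnections(), graceMs).unref()
}
```

The grace period gives in-flight requests a chance to finish; it should stay
well below PM2's kill_timeout so the worker exits on its own terms.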

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 22:23:57 +00:00
parent 2c34e9ad4a
commit 1fee214785
2 changed files with 105 additions and 5 deletions
@@ -3508,6 +3508,15 @@ function shutdown() {
process.exit(0)
})
// server.close() only fires its callback once every socket is gone, and idle
// HTTP keep-alive sockets (held open by nginx/Cloudflare) never close on
// their own — so without this the worker hangs the full kill_timeout on every
// stop/reload, which is what wedges the PM2 cluster daemon. Close idle sockets
// immediately, let in-flight requests finish for a short grace period, then
// force the rest so shutdown completes well inside kill_timeout.
server.closeIdleConnections()
setTimeout(() => server.closeAllConnections(), 3000).unref()
setTimeout(() => {
console.error('Graceful shutdown timed out')
process.exit(1)