Restore the dependency-change guard that got overwritten on main, and harden
the deploy + worker shutdown so a flaky better-sqlite3 rebuild can no longer
take the site down.

Root cause of the recurring outages: tssbot-web is the only stack with a
native module (better-sqlite3) that must be downloaded/compiled on every
`npm ci`. The deploy ran `npm ci` unconditionally (the skip guard had been
reverted) and with no timeout. Because `npm ci` deletes node_modules before
installing, a single hung or failed native rebuild left the site unable to
start, and a PM2 cluster restart on top of that wedged the daemon.
webhook.cjs:
- Restore the npm-ci skip guard: only reinstall when package.json /
package-lock.json actually changed (previousHead captured before the pull),
so code-only pushes never rebuild better-sqlite3. Defaults to installing on
any uncertainty, and still installs if node_modules is incomplete.
- Add per-step timeouts to run() (DEPLOY_STEP_TIMEOUT_MS, and a tighter
DEPLOY_INSTALL_TIMEOUT_MS for npm ci) so a stalled step is killed instead of
hanging for hours with node_modules already deleted.
- Gate the deploy on better-sqlite3 actually loading (child-process load, not
just require.resolve): force a reinstall when its native binary is missing,
and abort before pm2 reload if it is still broken after install.
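
A minimal sketch of the three webhook.cjs checks above, not the actual
implementation. The helper names (changedFiles, needsInstall, sqliteLoads),
the default timeout values, and the use of spawnSync are assumptions made for
illustration; only the environment variable names come from this change.

```javascript
const { spawnSync } = require('node:child_process');

// Assumed defaults: the commit names these variables but not their values.
const STEP_TIMEOUT_MS = Number(process.env.DEPLOY_STEP_TIMEOUT_MS) || 5 * 60 * 1000;
const INSTALL_TIMEOUT_MS = Number(process.env.DEPLOY_INSTALL_TIMEOUT_MS) || 3 * 60 * 1000;

// Run one deploy step. A step that exceeds timeoutMs is killed and reported
// as failed instead of hanging the deploy.
function run(cmd, args, { cwd = process.cwd(), timeoutMs = STEP_TIMEOUT_MS } = {}) {
  const res = spawnSync(cmd, args, {
    cwd,
    encoding: 'utf8',
    timeout: timeoutMs,
    killSignal: 'SIGKILL',
  });
  const timedOut = Boolean(res.error && res.error.code === 'ETIMEDOUT');
  return { ok: !res.error && res.status === 0, timedOut, stdout: res.stdout || '' };
}

// Files changed since previousHead, or null when the diff cannot be computed.
function changedFiles(previousHead, cwd) {
  if (!previousHead) return null;
  const res = run('git', ['diff', '--name-only', previousHead, 'HEAD'], { cwd });
  return res.ok ? res.stdout.split('\n').filter(Boolean) : null;
}

const DEP_FILES = new Set(['package.json', 'package-lock.json']);

// Skip guard: install on any uncertainty, skip only for a known code-only push.
function needsInstall(files, nodeModulesComplete) {
  if (!nodeModulesComplete) return true; // incomplete tree: always install
  if (!Array.isArray(files)) return true; // diff unknown: always install
  return files.some((f) => DEP_FILES.has(f));
}

// Open an in-memory database in a child process, so a missing or broken
// native binary fails the child rather than the webhook itself.
function sqliteLoads(cwd) {
  const script = "new (require('better-sqlite3'))(':memory:').close()";
  return run(process.execPath, ['-e', script], { cwd }).ok;
}
```

In this sketch a deploy would call run('npm', ['ci'], { timeoutMs:
INSTALL_TIMEOUT_MS }) when needsInstall(...) or !sqliteLoads(cwd) is true, and
skip pm2 reload if sqliteLoads(cwd) is still false afterwards. spawnSync
blocks the event loop and kills only the direct child, so a real webhook may
prefer an asynchronous spawn with process-group cleanup.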
server.cjs:
- On shutdown, call closeIdleConnections() immediately and
closeAllConnections() after a short delay, so a worker stop/reload can't
hang for the full kill_timeout on idle keep-alives or a stuck upstream
request.
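
The shutdown sequence above could look roughly like this. The function name
shutdown and the 3000 ms grace period are assumptions for illustration, not
values taken from server.cjs; closeIdleConnections() and closeAllConnections()
require Node 18.2 or later.

```javascript
const http = require('node:http');

// Hypothetical grace period; it should stay below PM2's kill_timeout.
const FORCE_CLOSE_AFTER_MS = 3000;

// Stop accepting connections, drop idle keep-alive sockets right away, and
// force-close whatever is still open once the grace period has elapsed.
function shutdown(server, forceAfterMs = FORCE_CLOSE_AFTER_MS) {
  return new Promise((resolve) => {
    const timer = setTimeout(() => server.closeAllConnections(), forceAfterMs);
    timer.unref(); // the timer alone should not keep the process alive
    server.close(() => {
      clearTimeout(timer);
      resolve();
    });
    server.closeIdleConnections();
  });
}
```

A worker would typically call shutdown(server) from its stop-signal handler
and exit once the returned promise resolves.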
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>