[FW] emails all at once

Foran, Will FORAN at pitt.edu
Mon Sep 8 17:41:11 EDT 2025


Sorry for the giant spam of emails.
I just restarted flywheel and it caught up with all the scans.

This is separate from whatever UPMC/Pitt cross-talk network hangup last prevented messages from sending.

________________________________

For posterity (find this message in the list archive<https://list.pitt.edu/pipermail/flywheelgearlist/>)

Postmortem

Zeus (one of the computers in the cold room next to prisma 1) had some unplanned downtime (power issue?). It hosts the virtual machines that run flywheel. Everything came back up Friday, but a flywheel bug currently requires manual intervention (via /raidzeus/src/hpc-env/flywheel/reset-for-pending-jobs.bash<https://github.com/NPACore/hpc-env/blob/master/flywheel/reset-for-pending-jobs.bash>) to get jobs out of a "pending" state. I tweaked cron on zeus so this hopefully won't happen again.

uptime

Gyrus, zeus, and cortex were all part of the same downtime? But cerebro2 stayed up.

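# Print each host's boot time (uptime -s), sorted by boot date.
( for host in mrrc-{zeus,cerebro2,cortex}; do
  # \$(...) is escaped so uptime runs on the remote host, not locally
  ssh "$host" -- echo -e "${host/mrrc-/} \$(uptime -s)"
done
# gyrus2 uses password auth: pass(1) supplies it and sshpass -e reads it from SSHPASS
SSHPASS=$(pass gyrus2) sshpass -e ssh mrrc-gyrus2 -- echo -e "gyrus2 \$(uptime -s)" ) |
sort -k2 -t' '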


host    uptime_date     uptime_time
cerebro2        2025-08-01      04:09:44
zeus    2025-09-05      04:14:02
gyrus2  2025-09-05      04:15:14
cortex  2025-09-05      08:11:57
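
To pull out just the hosts that rebooted on the outage day, the sorted output can be filtered on the date column. This is only a sketch: booted_since is a made-up helper name, and it assumes the "host date time" layout that the loop above prints (no header line).

```shell
# booted_since CUTOFF: read "host date time" lines on stdin and print the
# hosts whose boot date is on or after CUTOFF (YYYY-MM-DD).
# ISO dates sort correctly as plain strings, so a string comparison is enough.
booted_since() {
  awk -v cutoff="$1" '($2 "") >= (cutoff "") { print $1 }'
}

# Sample input copied from the table above.
printf '%s\n' \
  'cerebro2 2025-08-01 04:09:44' \
  'zeus 2025-09-05 04:14:02' \
  'gyrus2 2025-09-05 04:15:14' \
  'cortex 2025-09-05 08:11:57' |
  booted_since 2025-09-05
```

With the sample rows this lists zeus, gyrus2, and cortex and leaves out cerebro2.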

automated fix

Cron should now run what's needed a few minutes after the system comes back online. 🤞 that 10 minutes is enough time to give the system before restarting FW.

ssh foranw@mrrc-zeus crontab -l | grep pending-jobs

@reboot sleep 600 && /raidzeus/src/hpc-env/flywheel/reset-for-pending-jobs.bash
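
If the fixed 10 minute sleep ever turns out to be too short (or needlessly long), one possible alternative is to poll until the system is ready and only then run the reset script. This is not what the cron entry above does. It is a rough sketch, and both wait_for and fw_is_ready are hypothetical names. What counts as "ready" for flywheel is left open here.

```shell
# wait_for TRIES DELAY CMD...: run CMD until it succeeds, at most TRIES times,
# sleeping DELAY seconds after each failed attempt.
# Returns 0 as soon as CMD succeeds, 1 if every attempt fails.
wait_for() {
  tries="$1"; delay="$2"; shift 2
  i=0
  while [ "$i" -lt "$tries" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Hypothetical usage (fw_is_ready stands in for a real readiness check):
# wait_for 60 10 fw_is_ready && /raidzeus/src/hpc-env/flywheel/reset-for-pending-jobs.bash
```

The upside over a plain sleep is that the reset runs as soon as the check passes and is skipped if the system never comes up. The downside is needing a trustworthy readiness check in the first place.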
