[FW] emails all at once
Foran, Will
FORAN at pitt.edu
Mon Sep 8 17:41:11 EDT 2025
Sorry for the giant spam of emails.
I just restarted flywheel and it caught up with all the scans.
This is separate from whatever UPMC/Pitt cross-talk network hangup prevented messages from sending last.
________________________________
For posterity (find this message in the list archive<https://list.pitt.edu/pipermail/flywheelgearlist/>)
Postmortem
Zeus (one of the computers in the cold room next to prisma 1) had some unplanned downtime (power issue?). It hosts the virtual machines that run flywheel. It all came back up Friday, but a flywheel bug currently requires manual intervention (via /raidzeus/src/hpc-env/flywheel/reset-for-pending-jobs.bash<https://github.com/NPACore/hpc-env/blob/master/flywheel/reset-for-pending-jobs.bash>) to get jobs out of a "pending" state. I tweaked cron on zeus so this hopefully won't happen again.
uptime
Were gyrus2, zeus, and cortex all part of the same downtime? cerebro2 stayed up.
( for host in mrrc-{zeus,cerebro2,cortex}; do
ssh $host -- echo -e "${host/mrrc-/} \$(uptime -s)"
done
SSHPASS=$(pass gyrus2) sshpass -e ssh mrrc-gyrus2 -- echo -e "gyrus2 \$(uptime -s)" ) |
sort -k2 -t' '
host      uptime_date  uptime_time
cerebro2  2025-08-01   04:09:44
zeus      2025-09-05   04:14:02
gyrus2    2025-09-05   04:15:14
cortex    2025-09-05   08:11:57
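To make the "same downtime?" question easier to eyeball, here is a small sketch that groups hosts by boot date. It is not part of hpc-env: the function name `hosts_by_boot_date` and the `boot_table` variable are made up for illustration, and the input is just the table above pasted in.

```shell
# Hypothetical helper (not in hpc-env): group hosts by the date they last booted.
# Input lines look like "host YYYY-MM-DD HH:MM:SS", i.e. the uptime table above.
hosts_by_boot_date() {
  awk '{
         if ($2 in hosts) hosts[$2] = hosts[$2] "," $1   # append host to an existing date
         else             hosts[$2] = $1                 # first host seen for this date
       }
       END { for (d in hosts) print d, hosts[d] }' | sort
}

# The table from this message, minus the header line.
boot_table='cerebro2 2025-08-01 04:09:44
zeus 2025-09-05 04:14:02
gyrus2 2025-09-05 04:15:14
cortex 2025-09-05 08:11:57'

printf '%s\n' "$boot_table" | hosts_by_boot_date
# prints:
# 2025-08-01 cerebro2
# 2025-09-05 zeus,gyrus2,cortex
```

Grouping by date alone lumps cortex in with zeus and gyrus2 even though it came up about four hours later, so it only hints at a shared event.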
automated fix
Cron should run what's needed a few minutes after the system comes back online. Hopefully (🤞) 10 minutes is enough time to give the system before restarting FW.
ssh foranw@mrrc-zeus crontab -l | grep pending-jobs
@reboot sleep 600 && /raidzeus/src/hpc-env/flywheel/reset-for-pending-jobs.bash
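If a fixed 10-minute sleep turns out to be too short (or needlessly long), one possible alternative is to poll until the system looks ready, with a timeout. This is only a sketch, not what is currently in cron: `wait_until` is a hypothetical helper, and the readiness check it would be given (something that succeeds once the flywheel VMs are up) is an assumption that does not exist in hpc-env as far as this message goes.

```shell
# Hypothetical helper: retry a command until it succeeds or a timeout passes.
# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 if it never does within the timeout.
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  local waited=0
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1                      # gave up: COMMAND never succeeded
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Example with a placeholder check ("some_readiness_check" is made up):
# wait_until 600 15 some_readiness_check &&
#   /raidzeus/src/hpc-env/flywheel/reset-for-pending-jobs.bash
```

The trade-off is that a polling wrapper needs a trustworthy readiness check, whereas `sleep 600` needs nothing but patience.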