How to run 40,000 HPC jobs on Gizmo overnight
From Dirk Petersen, via the scicomp-announce mailing list:
11:30 am—1:30 pm Monday, March 28th
During this brownbag we will go through an example we recently worked on with a research group at Fred Hutch. The task was to break up an R script into 40,000 small jobs and then distribute these jobs across the Gizmo cluster using more than 1,000 cores.
This tutorial will cover the following topics:
- How to use the ‘restart’ queue/partition
- How to use a scratch file system effectively
- How to have the system automatically re-run jobs that fail
- How to build an analysis pipeline where multiple jobs depend on each other
- Demo of the centipede (sce) shell tool for simplified access to the Gizmo cluster
These techniques will make it easier to run jobs in the large-capacity restart queue on Gizmo, and they will also prepare you for running jobs on the low-cost Amazon Web Services spot market in the future.
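As a rough, minimal sketch of how several of these pieces fit together (this is not the tutorial's or centipede's actual code), a SLURM batch script for a preemptible restart partition might look like the following. The partition name, time limit, scratch paths, and the `analysis.R` script name are assumptions for illustration only.

```bash
#!/bin/bash
# Hypothetical sketch, not the tutorial's actual script.
# Submits a large job array to a preemptible "restart" partition and asks SLURM
# to requeue tasks automatically if a node is reclaimed or fails.
#SBATCH --partition=restart      # large-capacity, preemptible queue (name assumed)
#SBATCH --requeue                # re-run the task automatically after preemption or node failure
#SBATCH --array=1-40000%1000     # 40,000 tasks, at most 1,000 running at once
#SBATCH --time=00:30:00          # short per-task time limit (assumed)
#SBATCH --cpus-per-task=1

# Work in node-local scratch so heavy I/O stays off the shared file system,
# then copy only the final result back. Paths below are placeholders.
SCRATCH_DIR="${TMPDIR:-/tmp}/${SLURM_JOB_ID}_${SLURM_ARRAY_TASK_ID}"
mkdir -p "$SCRATCH_DIR"
cd "$SCRATCH_DIR"

# Each array task processes one chunk of the split-up R analysis (script name assumed).
Rscript /path/to/analysis.R "$SLURM_ARRAY_TASK_ID" > "result_${SLURM_ARRAY_TASK_ID}.txt"

# Copy the result back to shared storage (destination path is a placeholder).
cp "result_${SLURM_ARRAY_TASK_ID}.txt" /path/to/results/
```

For pipeline stages that depend on one another, SLURM's dependency flag can chain submissions, for example `sbatch --dependency=afterok:<jobid> merge_results.sh` (script name hypothetical), so a downstream merge step starts only once the earlier jobs have completed successfully.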
Audience: Researchers who are familiar with writing code in R or Python and who have used Gizmo before.
Register via Eventbrite here. If the class is full, there will be an option to add yourself to the waiting list — please do so if you’re still interested!
You can try the code and tutorial now. The presentation is also available.
To clone the repository:
git clone https://github.com/FredHutch/slurm-examples