Efficient Resource Utilization on FASSE#
As members of our research group, we share the responsibility to ensure that our computational resources on the Slurm cluster are used efficiently. To promote fair and effective use, please take a moment to review the following guidelines on resource requests and usage.
Monitoring Job Efficiency#
It is the responsibility of each lab member to ensure their computations are efficient. Your fairshare usage is charged based on the resources you request, not the resources actually used; only the time component is charged based on actual run time. Consider the following scenario:
A request is made for 250 GB of memory and 24 CPU cores for a duration of 10 hours.
The application consumes 16 GB of memory and 1 CPU core over 1.5 hours.
Your lab will be charged for:
250 GB of memory and 24 CPU cores for 1.5 hours.
As you can see, you are not charged for the full 10 hours; however, the long requested duration is likely to lower your job's scheduling priority. In this scenario, requesting 2 hours could get the job scheduled sooner than a 10-hour request.
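For reference, a request like the one in this scenario typically comes from sbatch directives along these lines (a minimal sketch; the job name and workload are illustrative):
#!/bin/bash
#SBATCH --job-name=example_analysis   # illustrative job name
#SBATCH --mem=250G                    # memory requested; charged even if unused
#SBATCH --cpus-per-task=24            # CPU cores requested; charged even if unused
#SBATCH --time=10:00:00               # requested wall time; only elapsed time is charged
Rscript analysis.R                    # hypothetical workload
Keeping --mem, --cpus-per-task, and --time close to what the job actually needs is what keeps the charged amount close to actual usage.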
Each partition's nodes have a different unit price. Please read the Trackable RESources (TRES) section of the Fairshare and Job Accounting documentation.
A direct method to assess the efficiency of a job is to use the seff command. When the job has completed, run:
seff <job id>
This reports your job's efficiency as the percentage of the requested CPU and memory that was actually used. Low values indicate an overestimation of the resources needed, which not only ties up resources unnecessarily but also increases wait times for the lab's upcoming jobs.
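In the scenario above, for example, memory efficiency would be roughly 16 GB / 250 GB ≈ 6% and CPU efficiency roughly 1 core / 24 cores ≈ 4%, both clear signs that the request should be cut down for the next run.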
Lab moderators can use the sreport command to see the team's resource usage over a specific period of time, for example:
sreport cluster AccountUtilizationByUser account=dominici_lab Start=2024-03-21 End=2024-03-28
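sreport reports usage in minutes by default; if hours are easier to read, the time unit can be switched (a sketch using the same account and date range; adjust as needed):
sreport -t Hours cluster AccountUtilizationByUser account=dominici_lab Start=2024-03-21 End=2024-03-28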
Best Practices in Using FASSE#
FASSE is for L3 data
Use FASSE exclusively for handling sensitive, Level 3 (L3) data. For other computations, such as those involving simulated data, please use Cannon instead.
Understanding your needs
Ensure you fully understand the resource requirements of your job before submission.
Conduct small-scale tests or pilot runs to assess the CPU and memory requirements. For instance:
For a bootstrap analysis, initially run 10 iterations instead of 100; afterwards, use the seff <job-id> command to evaluate job efficiency.
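As a sketch of what such a pilot might look like (script name, partition, and resource values are illustrative assumptions, not prescriptions):
#!/bin/bash
#SBATCH --job-name=bootstrap_pilot   # illustrative name
#SBATCH --partition=test             # small test run on the test partition
#SBATCH --cpus-per-task=1
#SBATCH --mem=8G
#SBATCH --time=00:30:00
# run only 10 bootstrap iterations to estimate per-iteration cost
Rscript bootstrap.R --iterations 10   # hypothetical script and flag
Once the pilot finishes, running seff on it gives a basis for scaling the memory and time request for the full job.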
Monitoring your jobs
To monitor resource usage in real time for a running job, ssh to the compute node where it is running (ssh node_name) and run htop.
Integrate timing into the different segments of your code and use logging so that you can review code efficiency more effectively.
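For example, the stages of a batch script can be timed and logged with plain bash (a minimal sketch; the stage commands are placeholders):
start=$SECONDS
Rscript preprocess.R                                  # hypothetical stage 1
echo "preprocess: $((SECONDS - start)) s" >> timings.log
start=$SECONDS
Rscript fit_models.R                                  # hypothetical stage 2
echo "fit_models: $((SECONDS - start)) s" >> timings.log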
Begin with lower memory allocations if unsure of requirements. Jobs exceeding memory limits will be terminated by SLURM.
Profile your code's CPU and memory usage by starting an interactive session with salloc or through Open OnDemand, then ssh to the allocated compute node and run htop for real-time performance monitoring.
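Started from a login node, that workflow might look like this (partition, resources, and node name are illustrative):
salloc --partition=test --cpus-per-task=2 --mem=8G --time=01:00:00
squeue -u $USER            # note the node your interactive job landed on
ssh <node_name>            # from a second terminal
htop -u $USER              # live CPU and memory usage of your processes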
More resources for R: Measuring Performance
More resources for Python: The Python Profilers
Array jobs
If running an array-type job (for example, a bootstrap or multiple models), consider limiting the maximum number of tasks that run simultaneously. E.g.
--array=1-100%20
will limit the array to 20 tasks running at once.
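As a sketch, a throttled array submission could look like this (script name and per-task logic are illustrative):
#!/bin/bash
#SBATCH --job-name=bootstrap_array    # illustrative name
#SBATCH --array=1-100%20              # 100 tasks, at most 20 running at a time
#SBATCH --cpus-per-task=1
#SBATCH --mem=8G
#SBATCH --time=01:00:00
# each task handles one replicate, indexed by the array task ID
Rscript bootstrap_one.R --replicate "$SLURM_ARRAY_TASK_ID"    # hypothetical script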
Test partitions
Use the test partition for testing code and for profiling activities. It does not affect your fairshare.
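For example, a pilot script can be directed to the test partition at submission time without editing the script (the script name is illustrative):
sbatch --partition=test pilot.sbatch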
Communication
If you anticipate a large or unusual resource request, consider discussing it with the group. This can help ensure that your needs are met without adversely impacting others.
Other resources
The FASSE (and Cannon) documentation provides a wealth of information on best practices and available resources. Familiarize yourself with it to ensure efficient utilization.
Fairshare documentation: https://docs.rc.fas.harvard.edu/kb/fairshare/
FASSE partitions: https://docs.rc.fas.harvard.edu/kb/fasse/#SLURM_and_Partitions