This tutorial provides step-by-step instructions for using a server with GPU resources, managed by SLURM, to run scripts requiring GPU acceleration.
Prerequisites
- Access to the Server: Ensure you have login credentials and access to the server.
- Installed Software: Check for the required software, such as Python, CUDA, and any dependencies for your script.
- Environment Setup: Be familiar with the basics of the SLURM workload manager, which handles job scheduling on the server.
- SSH Client: Install an SSH client (e.g., PuTTY, OpenSSH) to access the server.
Step 1: Connect to the Server
Use SSH to log into the server:
ssh your_username@server_address
Replace your_username with your username and server_address with the server’s hostname or IP.
Step 2: Load Required Modules
Many servers use a module system to load software environments. Check available modules:
module avail
Load the required modules for your script, such as Python or CUDA:
module load python/3.8
module load cuda/11.7
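After loading modules, it can help to confirm that the tools actually resolve on your PATH. The following is a minimal sketch; check_tools is a hypothetical helper name, and python and nvcc are assumed to be the executables that the modules above provide.

```shell
# Report where each required tool resolves on PATH, or flag it as missing.
# check_tools is a hypothetical helper, not part of SLURM or the module system.
check_tools() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: $(command -v "$tool")"
    else
      echo "$tool: not found"
    fi
  done
}

# Assumption: the python and cuda modules expose these two executables.
check_tools python nvcc
```

If a tool is reported as not found, run module list to see which modules are currently loaded.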
Step 3: Prepare Your Script
Transfer your script and data files to the server. You can use scp or an SFTP client:
scp your_script.py your_username@server_address:/path/to/destination
Navigate to the directory containing your script:
cd /path/to/destination
Step 4: Create a SLURM Batch Script
A SLURM batch script specifies the resources and commands required to run your job. Create a file, e.g., run_gpu_job.sh.
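As a minimal sketch, a batch script of this kind might look like the following. The job name, partition name, and resource amounts are assumptions and must be adapted to your server; the module versions and your_script.py come from the earlier steps, and the log name matches the output_<jobID>.log file used in Step 6. The heredoc simply writes the file from the command line; you can equally paste the lines between the EOF markers into a text editor.

```shell
# Write a hypothetical batch script to run_gpu_job.sh.
cat > run_gpu_job.sh <<'EOF'
#!/bin/bash
# Job name shown in squeue (hypothetical).
#SBATCH --job-name=gpu_job
# Partition to submit to (assumption: replace with one listed by sinfo).
#SBATCH --partition=gpu
# Request one GPU.
#SBATCH --gres=gpu:1
# CPU cores and memory (assumptions: adjust to your workload).
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
# Wall-clock time limit (HH:MM:SS).
#SBATCH --time=01:00:00
# Log file; %j expands to the job ID.
#SBATCH --output=output_%j.log

# Load the same modules as in Step 2.
module load python/3.8
module load cuda/11.7

# Run the script transferred in Step 3.
python your_script.py
EOF
```

SLURM reads the #SBATCH lines as options until it reaches the first command, so all directives should stay at the top of the file.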
Save the script and make it executable:
chmod +x run_gpu_job.sh
Step 5: Submit the Job
Submit the batch script to SLURM:
sbatch run_gpu_job.sh
SLURM will return a job ID. You can check the job’s status using:
squeue -u your_username
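To avoid copying the job ID by hand, it can be captured in a shell variable at submission time. The sketch below assumes sbatch prints its usual "Submitted batch job <id>" message; parse_job_id is a hypothetical helper name, and the submission itself runs only where sbatch is installed.

```shell
# Extract the numeric job ID from sbatch's "Submitted batch job <id>" message.
# parse_job_id is a hypothetical helper name.
parse_job_id() {
  awk '/^Submitted batch job/ {print $4}'
}

# Submit only when sbatch is available (i.e., on the cluster).
if command -v sbatch >/dev/null 2>&1; then
  job_id=$(sbatch run_gpu_job.sh | parse_job_id)
  echo "Submitted job $job_id"
  # Show the status of just this job.
  squeue -j "$job_id"
fi
```

The job_id variable can then be reused with tail, squeue, or scancel in the following steps.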
Step 6: Monitor the Job
While the job is running, you can monitor its progress:
- View Logs:
tail -f output_<jobID>.log
- Check GPU Usage:
squeue -u your_username
nvidia-smi
Alternatively, use watch to rerun a command at a fixed interval, which also keeps your terminal session active:
watch -n 1 squeue -u your_username
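If you would rather block until the job finishes than watch the queue, a small polling loop is one option. This is a sketch under assumptions: job_is_active and wait_for_job are hypothetical helper names, and a job is treated as finished once squeue no longer lists it.

```shell
# Succeed while squeue still lists the job (pending or running).
# -h suppresses the header line; -j filters by job ID.
job_is_active() {
  [ -n "$(squeue -h -j "$1" 2>/dev/null)" ]
}

# Poll until the job leaves the queue.
# $1 = job ID, $2 = polling interval in seconds (default 30).
wait_for_job() {
  while job_is_active "$1"; do
    sleep "${2:-30}"
  done
  echo "Job $1 is no longer in the queue"
}

# Usage on the cluster:
#   wait_for_job <jobID>
```

A longer polling interval places less load on the scheduler than refreshing every second.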
Step 7: Cancel a Job (if needed)
To cancel a job, use:
scancel <jobID>
Step 8: Retrieve Results
Once the job is complete, retrieve results from the server to your local machine:
scp your_username@server_address:/path/to/results /local/destination
Tips for Efficient Usage
- Check Available Resources: Run sinfo to display the available partitions and their status.
- Test Locally First: Test your script locally before submitting it to SLURM to minimize debugging time.
- Use Virtual Environments: Maintain a Python virtual environment to manage dependencies.
- Use GPUs Wisely: Request only the necessary number of GPUs to avoid over-allocation.
By following this tutorial, you can efficiently use a GPU-enabled server with SLURM to run your computationally intensive scripts. Always consult your server’s documentation for partition-specific settings and policies.