Using a GPU-Enabled Server for Running Scripts
This tutorial provides step-by-step instructions for using a server with GPU resources, managed by SLURM, to run scripts requiring GPU acceleration.
Prerequisites
- Access to the Server: Ensure you have login credentials and access to the server.
- Installed Software: Check for the required software, such as Python, CUDA, and any dependencies for your script (a quick check is shown after this list).
- Environment Setup: Familiarity with SLURM workload manager for job scheduling.
- SSH Client: Install an SSH client (e.g., PuTTY, OpenSSH) to access the server.
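Once you are connected (Step 1), you can quickly verify that the basic tools are available. On module-based systems you may first need to load the corresponding modules (Step 2), and the exact versions will depend on your server:

```bash
python --version   # Python interpreter
nvcc --version     # CUDA compiler toolkit, if installed
nvidia-smi         # GPU driver and visible GPUs
```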
Step 1: Connect to the Server
Use SSH to log into the server:
```bash
ssh your_username@server_address
```
Replace `your_username` with your username and `server_address` with the server's hostname or IP.
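Optionally, you can add an entry to your local `~/.ssh/config` so you do not have to retype the address each time; the alias `gpu-server` below is just an example name:

```
Host gpu-server
    HostName server_address
    User your_username
```

With this in place, `ssh gpu-server` is enough to connect.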
Step 2: Load Required Modules
Many servers use a module system to load software environments. Check available modules:
```bash
module avail
```
Load the required modules for your script, such as Python or CUDA:
```bash
module load python/3.8
```
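Module names and versions are site-specific, so pick the ones listed by `module avail`. For example, many clusters provide a CUDA module that is loaded the same way (the version below is illustrative), and `module list` shows what is currently loaded:

```bash
module load cuda/11.8   # illustrative version; use one shown by `module avail`
module list             # confirm the loaded modules
```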
Step 3: Prepare Your Script
Transfer your script and data files to the server. You can use `scp` or an SFTP client:
```bash
scp your_script.py your_username@server_address:/path/to/destination
```
Navigate to the directory containing your script:
```bash
cd /path/to/destination
```
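If you want to confirm that your job can actually see a GPU, a minimal `your_script.py` could look like the sketch below. It assumes PyTorch is installed, which this tutorial does not otherwise require; any CUDA-aware framework would work similarly.

```python
# Minimal GPU visibility check (assumes PyTorch is available)
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU found:", torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print("No GPU visible; falling back to CPU")

# Small computation on the selected device to confirm it works
x = torch.randn(1000, 1000, device=device)
print("Checksum:", (x @ x).sum().item())
```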
Step 4: Create a SLURM Batch Script
A SLURM batch script specifies the resources and commands required to run your job. Create a file, e.g., `run_gpu_job.sh`:
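The exact directives depend on your cluster. The following is a minimal sketch that assumes a partition named `gpu`, a `python/3.8` module, and a script called `your_script.py`; adjust the partition, module names, time limit, and GPU count to match your site's configuration.

```bash
#!/bin/bash
#SBATCH --job-name=gpu_job          # name shown in the queue
#SBATCH --partition=gpu             # partition with GPU nodes (site-specific)
#SBATCH --gres=gpu:1                # request one GPU
#SBATCH --cpus-per-task=4           # CPU cores for data loading, etc.
#SBATCH --mem=16G                   # system memory
#SBATCH --time=01:00:00             # wall-clock limit (HH:MM:SS)
#SBATCH --output=output_%j.log      # log file; %j expands to the job ID

# Load the software environment (module names are site-specific)
module load python/3.8

# Run the script on the allocated GPU
python your_script.py
```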
Save the script and make it executable:
```bash
chmod +x run_gpu_job.sh
```
Step 5: Submit the Job
Submit the batch script to SLURM:
```bash
sbatch run_gpu_job.sh
```
SLURM will return a job ID. You can check the job's status using:
```bash
squeue -u your_username
```
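A successful submission prints the assigned job ID, for example (the number will differ on your system):

```
Submitted batch job 123456
```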
Step 6: Monitor the Job
While the job is running, you can monitor its progress:
- View Logs:
```bash
tail -f output_<jobID>.log
```
- Check GPU Usage (see the note after this list):
```bash
squeue -u your_username
nvidia-smi
```
- Watch the Queue: alternatively, use `watch` to keep the job status refreshing and your terminal active:
```bash
watch -n 1 squeue -u your_username
```
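Note that `nvidia-smi` only reports the GPUs on the machine where it runs, so executing it on the login node will not show your job's GPUs. One way to check GPU usage inside a running job is to launch `nvidia-smi` within the job's allocation; this is a sketch that assumes a reasonably recent SLURM (the `--overlap` flag may not exist on older versions):

```bash
srun --jobid=<jobID> --overlap nvidia-smi
```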
Step 7: Cancel a Job (if needed)
To cancel a job, use:
```bash
scancel <jobID>
```
Step 8: Retrieve Results
Once the job is complete, retrieve results from the server to your local machine:
```bash
scp your_username@server_address:/path/to/results /local/destination
```
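If `rsync` is available on both machines, it is a convenient alternative for large or repeated transfers because it only copies files that have changed:

```bash
rsync -avz your_username@server_address:/path/to/results/ /local/destination/
```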
Tips for Efficient Usage
- Check Available Resources: Use `sinfo` to display available partitions and their status:
```bash
sinfo
```
- Test Locally First: Run your script locally before submitting it to SLURM to minimize debugging time.
- Use Virtual Environments: Maintain a Python virtual environment to manage dependencies (see the sketch after this list).
- Use GPUs Wisely: Request only the necessary number of GPUs to avoid over-allocation.
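As a minimal sketch of the virtual-environment tip above (the environment path `~/envs/gpu_project` and the `requirements.txt` file are hypothetical; adjust them to your project):

```bash
# One-time setup on the server
module load python/3.8
python -m venv ~/envs/gpu_project          # create the environment (hypothetical path)
source ~/envs/gpu_project/bin/activate     # activate it
pip install -r requirements.txt            # install your script's dependencies

# In run_gpu_job.sh, activate the same environment before running the script:
#   source ~/envs/gpu_project/bin/activate
#   python your_script.py
```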
By following this tutorial, you can efficiently use a GPU-enabled server with SLURM to run your computationally intensive scripts. Always consult your server’s documentation for partition-specific settings and policies.