Using a GPU-Enabled Server for Running Scripts

This tutorial provides step-by-step instructions for using a server with GPU resources, managed by SLURM, to run scripts requiring GPU acceleration.


Prerequisites

  1. Access to the Server: Ensure you have login credentials and access to the server.
  2. Installed Software: Check for the required software, such as Python, CUDA, and any dependencies for your script.
  3. Environment Setup: Familiarity with the SLURM workload manager for job scheduling.
  4. SSH Client: Install an SSH client (e.g., PuTTY, OpenSSH) to access the server.

Step 1: Connect to the Server

Use SSH to log into the server:

ssh your_username@server_address

Replace your_username with your username and server_address with the server’s hostname or IP.
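
If you connect to this server often, an optional shortcut is to add a host alias to the SSH configuration on your local machine; the alias name and key path below are placeholder examples:

# ~/.ssh/config on your local machine (hypothetical alias)
Host gpu-server
    HostName server_address
    User your_username
    IdentityFile ~/.ssh/id_ed25519

# afterwards you can simply run:
ssh gpu-server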


Step 2: Load Required Modules

Many servers use a module system to load software environments. Check available modules:

module avail

Load the required modules for your script, such as Python or CUDA:
module load python/3.8
module load cuda/11.7
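
Module names and versions differ between clusters; python/3.8 and cuda/11.7 above are only examples. A quick sanity check after loading (assuming an Environment Modules or Lmod setup, which provides module list):

module list        # show the modules currently loaded
python --version   # confirm the Python version provided by the module
nvcc --version     # confirm the CUDA toolkit version (if the cuda module includes nvcc)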


Step 3: Prepare Your Script

Transfer your script and data files to the server. You can use scp or an SFTP client:

scp your_script.py your_username@server_address:/path/to/destination

Navigate to the directory containing your script:
cd /path/to/destination
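
If you need to copy a whole project directory (or resume an interrupted transfer), rsync is often more convenient than scp; the paths below are placeholders:

# -a preserves permissions and timestamps, -v is verbose, -z compresses during transfer
rsync -avz /local/path/to/project/ your_username@server_address:/path/to/destination/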


Step 4: Create a SLURM Batch Script

A SLURM batch script specifies the resources and commands required to run your job. Create a file, e.g., run_gpu_job.sh:

#!/bin/bash
#SBATCH --job-name=my_gpu_job # Job name
#SBATCH --output=output_%j.log # Standard output log
#SBATCH --error=error_%j.log # Error log
#SBATCH --partition=amd_a100n # GPU partition name
#SBATCH --gres=gpu:1 # Number of GPUs
#SBATCH --ntasks=1 # Number of tasks
#SBATCH --cpus-per-task=4 # Number of CPU cores per task
#SBATCH --time=01:00:00 # Runtime (HH:MM:SS)
#SBATCH --mem=16G # Memory per node

# Load modules
module load python/3.8
module load cuda/11.7

# Activate virtual environment (if needed)
source /path/to/your/venv/bin/activate

# Run your script
python your_script.py

Save the script and, optionally, make it executable (sbatch does not require the execute bit, but it lets you run the script directly):

chmod +x run_gpu_job.sh
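
A quick way to confirm that the job really sees a GPU is to add a short check to run_gpu_job.sh just before the python your_script.py line. The PyTorch one-liner is only an example; replace it with the equivalent call from whichever framework your script uses:

# optional sanity check inside run_gpu_job.sh
echo "Allocated GPU(s): $CUDA_VISIBLE_DEVICES"         # set by SLURM when --gres=gpu is requested
nvidia-smi --query-gpu=name,memory.total --format=csv  # list the GPUs visible to the job
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"  # assumes PyTorch is installed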


Step 5: Submit the Job

Submit the batch script to SLURM:

sbatch run_gpu_job.sh

SLURM will return a job ID. You can check the job’s status using:

squeue -u your_username
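
If you want to script around the submission, sbatch can print just the numeric job ID via its --parsable flag, which makes follow-up commands easier:

JOBID=$(sbatch --parsable run_gpu_job.sh)        # capture the job ID
squeue -j "$JOBID"                               # show the status of this job only
sacct -j "$JOBID" --format=JobID,State,Elapsed   # accounting info once the job has started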

Step 6: Monitor the Job

While the job is running, you can monitor its progress:

  • View Logs:
    tail -f output_<jobID>.log
  • Check GPU Usage (nvidia-smi reports the GPUs of the machine it runs on, so it must be run on the compute node; see the example after this list):
    squeue -u your_username
    nvidia-smi
  • Use watch to refresh the queue view every second, which also keeps your connection active:
    watch -n 1 squeue -u your_username
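
On the login node, nvidia-smi only reports the login node's own GPUs (if any). To inspect a running job's allocation, query SLURM directly; scontrol is standard, while srun --overlap requires a reasonably recent SLURM version:

scontrol show job <jobID>                   # full job details, including the allocated node
srun --jobid=<jobID> --overlap nvidia-smi   # run nvidia-smi inside the job's allocation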

Step 7: Cancel a Job (if needed)

To cancel a job, use:

scancel <jobID>
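
scancel also accepts filters, which is convenient when several jobs are queued:

scancel -u your_username      # cancel all of your jobs
scancel --name=my_gpu_job     # cancel jobs by job name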


Step 8: Retrieve Results

Once the job is complete, retrieve results from the server to your local machine:

scp your_username@server_address:/path/to/results /local/destination
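
If the results are a directory rather than a single file, add the recursive flag (or reuse rsync as in Step 3):

scp -r your_username@server_address:/path/to/results /local/destination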


Tips for Efficient Usage

  1. Check Available Resources:
    sinfo
    This displays the available partitions and their status; a GPU-focused view is shown after this list.
  2. Test Locally First: Test your script locally before submitting it to SLURM to minimize debugging time.
  3. Use Virtual Environments: Maintain a Python virtual environment to manage dependencies.
  4. Use GPUs Wisely: Request only the necessary number of GPUs to avoid over-allocation.
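
For the GPU-focused view mentioned in tip 1, sinfo's output can be customized with a format string (%P partition, %G generic resources such as GPUs, %D node count, %t state); the partition name below is the example used earlier:

sinfo -o "%P %G %D %t"    # partition, GRES (GPUs), node count, state
squeue -p amd_a100n       # jobs currently queued or running on the GPU partition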

By following this tutorial, you can efficiently use a GPU-enabled server with SLURM to run your computationally intensive scripts. Always consult your server’s documentation for partition-specific settings and policies.