I will diagnose and fix your hpc cluster performance problems

India

I speak English
As an HPC Solutions Architect, I’ve configured seven HPC systems across India, integrating cutting-edge hardware and software for high-demand computational tasks. I specialize in optimizing Slurm for ...
About this Gig

Most HPC clusters run at 3040% of their actual capability.


Not because the hardware is wrong. Because the configuration was never tuned for the actual workload.


I've diagnosed this exact problem at research institutions, AI labs, and engineering teams. The fixes are almost always in software and configuration not hardware.


What the audit covers:


Slurm configuration gaps (DefMemPerCPU, cgroup, fairshare)

InfiniBand fabric health and link speed validation  

Storage throughput (Lustre/BeeGFS/NFS stripe config)

MPI process binding and NUMA topology

HPL efficiency vs theoretical peak

Node health and silent hardware fault detection


What you receive:


Written diagnosis with severity rating per finding

Exact fix for each issue commands included Before/after benchmark numbers

Priority order: what to fix first for maximum impact


What I need from you: SSH access to login node,  your cluster specs, and 2 hours of low-activity time for benchmarking.


Turnaround: 24-48 hours after access is granted.

Device:

Server

Operating system:

Linux

Other Support & IT Services I Offer

Related tags