I will diagnose and fix your hpc cluster performance problems
About this Gig
Most HPC clusters run at 3040% of their actual capability.
Not because the hardware is wrong. Because the configuration was never tuned for the actual workload.
I've diagnosed this exact problem at research institutions, AI labs, and engineering teams. The fixes are almost always in software and configuration not hardware.
What the audit covers:
Slurm configuration gaps (DefMemPerCPU, cgroup, fairshare)
InfiniBand fabric health and link speed validation
Storage throughput (Lustre/BeeGFS/NFS stripe config)
MPI process binding and NUMA topology
HPL efficiency vs theoretical peak
Node health and silent hardware fault detection
What you receive:
Written diagnosis with severity rating per finding
Exact fix for each issue commands included Before/after benchmark numbers
Priority order: what to fix first for maximum impact
What I need from you: SSH access to login node, your cluster specs, and 2 hours of low-activity time for benchmarking.
Turnaround: 24-48 hours after access is granted.
Device:
Server
Operating system:
Linux
