I will design professional Grafana dashboards for Kubernetes, Linux, and HPC
High-Performance Computing (HPC) and Linux Systems Engineer
About this Gig
Optimize your infrastructure! Get enterprise-grade visibility with custom Grafana dashboards designed by an AI & HPC Expert.
In AI and High-Performance Computing, performance is everything. I build advanced observability stacks for complex environments. Whether you manage an AI training cluster, Kubernetes (K8s), or a Linux HPC system, I provide the real-time insights you need.
What I Offer:
- HPC & AI Monitoring: Deep metrics for GPU utilization (NVIDIA/AMD), Slurm jobs, and InfiniBand.
- Kubernetes Observability: Full monitoring for K8s (GKE, EKS, AKS) focusing on resource health and scaling.
- Linux Mastery: Detailed dashboards for CPU, RAM, Disk I/O, and network throughput.
- Smart Alerting: Setup of Slack or Email alerts to catch bottlenecks early.
- Advanced PromQL: Efficient Prometheus queries that keep dashboards responsive, even on large metric sets.
Why Choose Me?
- AI Specialist: I understand LLM training and AI inference workloads.
- HPC Performance: Dashboards optimized for massive data points.
- Modern Tech: Expert in Prometheus, Loki, and OpenTelemetry.
Let's transform your raw metrics into actionable performance insights today!
FAQ
Can you monitor GPU usage for AI model training?
Yes! I specialize in tracking NVIDIA and AMD GPU metrics, including memory usage, temperature, and power consumption. This is essential for optimizing AI training clusters and ensuring your hardware is running efficiently.
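As a rough illustration of what a GPU overview might look like under the hood, here is a minimal Python sketch that builds Grafana time-series panels for the four metrics mentioned above. It assumes the metrics come from NVIDIA's dcgm-exporter and carry a `gpu` label; AMD GPUs and other exporters use different names. The helper `build_gpu_panels` and the dashboard title are hypothetical, chosen only for this example.

```python
import json

# Assumption: metric names as exposed by NVIDIA's dcgm-exporter.
# Other exporters (including AMD tooling) use different names.
GPU_METRICS = {
    "GPU utilization (%)": "DCGM_FI_DEV_GPU_UTIL",
    "Framebuffer memory used (MiB)": "DCGM_FI_DEV_FB_USED",
    "GPU temperature (C)": "DCGM_FI_DEV_GPU_TEMP",
    "Power draw (W)": "DCGM_FI_DEV_POWER_USAGE",
}


def build_gpu_panels(metrics=GPU_METRICS, width=12, height=8):
    """Return Grafana time-series panel dicts laid out two per row.

    Hypothetical helper for illustration; Grafana's grid is 24 columns
    wide, so a width of 12 places two panels side by side.
    """
    panels = []
    for index, (title, metric) in enumerate(metrics.items()):
        panels.append({
            "id": index + 1,
            "type": "timeseries",
            "title": title,
            "gridPos": {
                "x": (index % 2) * width,
                "y": (index // 2) * height,
                "w": width,
                "h": height,
            },
            "targets": [{
                # Assumption: each series carries a `gpu` label.
                "expr": f"avg by (gpu) ({metric})",
                "legendFormat": "GPU {{gpu}}",
            }],
        })
    return panels


dashboard = {"title": "GPU overview (sketch)", "panels": build_gpu_panels()}
print(json.dumps(dashboard, indent=2))
```

A real dashboard would also need a data source reference, variables for cluster and node selection, and units per panel; those are left out to keep the sketch short.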
Which data sources do you support?
I work with a wide range of data sources, including Prometheus, VictoriaMetrics, InfluxDB, Loki (for logs), and cloud-native tools like AWS CloudWatch and Google Cloud Monitoring (formerly Stackdriver). I can also integrate custom AI/ML metric exporters.
Can you set up alerts for Slack or Email?
Absolutely. I configure intelligent alerting rules so you are notified immediately of high CPU/GPU load, pod crashes in Kubernetes, or job failures in your HPC cluster. I can also set up on-call routing.
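To give a sense of what such rules can look like, here is a minimal Python sketch that assembles two Prometheus-style alerting rules, one for high CPU load and one for crashing pods. It assumes node_exporter and kube-state-metrics are already being scraped; the thresholds, durations, rule names, and the `build_alert_rule` helper are illustrative assumptions, not recommended values. Delivery to Slack or email is handled separately by Alertmanager and is not shown.

```python
import json


def build_alert_rule(name, expr, duration, severity, summary):
    """Return one Prometheus-style alerting rule as a plain dict.

    Hypothetical helper for illustration only.
    """
    return {
        "alert": name,
        "expr": expr,
        "for": duration,
        "labels": {"severity": severity},
        "annotations": {"summary": summary},
    }


rules = [
    # Assumption: node_exporter is scraped; the 90% threshold is an example.
    build_alert_rule(
        "HighCpuLoad",
        '100 - (avg by (instance) '
        '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90',
        "10m",
        "warning",
        "CPU usage above 90% on {{ $labels.instance }}",
    ),
    # Assumption: kube-state-metrics is scraped; 3 restarts is an example.
    build_alert_rule(
        "PodCrashLooping",
        "increase(kube_pod_container_status_restarts_total[15m]) > 3",
        "5m",
        "critical",
        "Pod {{ $labels.pod }} is restarting repeatedly",
    ),
]

rule_group = {"groups": [{"name": "infrastructure-sketch", "rules": rules}]}
print(json.dumps(rule_group, indent=2))
```

The right thresholds depend on the workload: a training cluster that runs at full load by design needs very different limits from a web service.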
Do you support HPC schedulers like Slurm?
Yes. I can build dashboards that visualize Slurm job queues, node availability, and partition health. This provides HPC administrators and researchers with a clear view of their cluster's utilization.
Do I need to provide the server for Grafana?
Not necessarily. I can work with your existing setup or help you deploy a new instance on AWS, GCP, Azure, or bare metal. I also support Grafana Cloud if you prefer a managed solution.

