I will provide AIOps and SRE consulting for DevOps and cloud reliability

United States

I speak English

GPU Infrastructure and LLMOps Engineer | NVIDIA, Kubernetes, Neo Cloud

I build scalable NVIDIA GPU infrastructure for AI training and inference. I specialize in Kubernetes GPU clusters, LLM training/inference, and GPU observability. Services: • GPU cluster setup • Kube...
About this Gig

Are you shipping LLM products but struggling with GPU infrastructure, scaling, and reliability? I help teams build production-grade GPU platforms end-to-end.

What you get:
• Neo cloud GPU setup and cluster hardening
• Kubernetes GPU scheduling and autoscaling for LLM training and inference (vLLM/Ollama/Triton)
• MLOps/LLMOps CI/CD for models and data pipelines
• GPU monitoring and alerts using NVIDIA DCGM + Prometheus + Grafana
• Cost optimization, capacity planning, and observability best practices

Depending on the package tier, deliverables can include an architecture review, a deployment plan, and hands-on implementation.

Tools:

Docker

GitLab

Jenkins

GitHub

CircleCI

Frameworks:

Terraform

Ansible

Cloud Provider:

Amazon Web Services

Microsoft Azure

Programming language:

Bash

Python

Golang

Expertise:

Installation

Migration

Configuration