Senior DevOps / Platform Engineer
About the Role
Build and maintain production Kubernetes infrastructure, CI/CD pipelines, and cloud-native deployment systems. Own the reliability and scalability of our multi-service platform.
Key Responsibilities
Kubernetes & Container Orchestration
Design, deploy, and maintain production Kubernetes clusters across cloud environments (GKE, EKS, AKS).
Implement Helm charts and Kubernetes manifests for complex multi-service deployments.
Configure and optimize Kubernetes resources (deployments, services, ingress, secrets, configmaps).
Implement auto-scaling, resource limits, and capacity planning for production workloads.
Manage stateful workloads including databases, message queues, and persistent storage.
ML & Inference Infrastructure
Operate KubeRay clusters (RayCluster, RayJob, RayService) for distributed training and inference workloads.
Manage GPU scheduling, node pools, and driver stacks; integrate gang scheduling (Kueue, NVIDIA KAI Scheduler) where needed.
Support model-serving deployments running vLLM, Ray Serve, or TGI on Kubernetes with horizontal autoscaling.
Design GPU utilization dashboards and cost-per-token metrics for platform and AI teams.
Partner with ML and AI Full-Stack engineers on capacity planning and model rollout automation.
CI/CD & Automation
Build and maintain CI/CD pipelines using GitHub Actions, ArgoCD, or similar tools.
Implement GitOps workflows for infrastructure and application deployments.
Automate testing, security scanning, and quality gates in deployment pipelines.
Design blue-green, canary, and rolling deployment strategies for zero-downtime releases.
Build developer tooling and self-service infrastructure provisioning.
Observability & Reliability
Implement monitoring and alerting with Prometheus, Grafana, and related tooling.
Build distributed tracing and logging infrastructure (OpenTelemetry, ELK stack, or similar).
Define and monitor SLOs/SLIs for critical services and infrastructure.
Lead incident response, post-mortems, and reliability improvements.
Implement backup, disaster recovery, and business continuity procedures.
Infrastructure & Security
Manage cloud infrastructure using Infrastructure-as-Code (Terraform, Pulumi, or similar).
Implement network policies, security groups, and access controls.
Configure secrets management and secure credential handling.
Ensure compliance with security best practices and audit requirements.
Optimize cloud costs and resource utilization across environments.
Qualifications
Must-Have Technical Expertise
5+ years of DevOps/SRE/Platform engineering experience in production environments.
Expert-level Kubernetes knowledge including cluster administration and troubleshooting.
Strong experience with Helm, Kustomize, or similar Kubernetes templating tools.
Production experience with CI/CD systems (GitHub Actions, GitLab CI, Jenkins, ArgoCD).
Proficiency with Infrastructure-as-Code tools (Terraform, Pulumi, CloudFormation).
Proficiency with AI-assisted development tools (Cursor, Claude Code, GitHub Copilot, or similar).
Cloud & Systems Skills
Experience with major cloud providers (GCP, AWS, or Azure) and their managed services.
Strong Linux systems administration and troubleshooting skills.
Experience with container runtimes, networking, and storage systems.
Understanding of database operations (PostgreSQL, Redis) in containerized environments.
Knowledge of security best practices for cloud-native applications.
Preferred/Bonus
Hands-on experience with KubeRay, Ray on GKE, or equivalent distributed-compute platforms on Kubernetes.
Experience deploying data orchestration platforms (Dagster, Airflow) on Kubernetes.
Knowledge of service mesh technologies (Istio, Linkerd).
Familiarity with FinOps practices and cloud cost optimization, especially GPU spend.
Experience with multi-tenant infrastructure and environment isolation.
Strong Vietnamese and English communication skills.
Benefits
Competitive salary and performance incentives
Work on production infrastructure at scale
Advanced training in cloud-native technologies
Flexible work arrangements
A collaborative, innovative engineering team environment
Ready to Join Our Team?
We're excited to meet passionate engineers who want to build the future of AI. Apply now and let's create something amazing together.