Engineering

Senior DevOps / Platform Engineer

Hybrid · Ho Chi Minh City, Vietnam
Full Time
5+ years

Build and maintain production Kubernetes infrastructure, CI/CD pipelines, and cloud-native deployment systems. Own the reliability and scalability of our multi-service platform.

About the Role

We are seeking an experienced DevOps/Platform Engineer to build and maintain our production infrastructure. You will own Kubernetes deployments, CI/CD automation, observability systems, and cloud infrastructure across multiple environments, and you will be responsible for the reliability, security, and scalability of a complex multi-service platform. You'll work closely with backend and data engineers to ensure smooth deployments, implement Infrastructure-as-Code, and build the operational foundation that lets teams develop and ship quickly.

Key Responsibilities

Kubernetes & Container Orchestration

Design, deploy, and maintain production Kubernetes clusters across cloud environments (GKE, EKS, AKS).

Implement Helm charts and Kubernetes manifests for complex multi-service deployments.

Configure and optimize Kubernetes resources (deployments, services, ingress, secrets, configmaps).

Implement auto-scaling, resource limits, and capacity planning for production workloads.

Manage stateful workloads including databases, message queues, and persistent storage.

ML & Inference Infrastructure

Operate KubeRay clusters (RayCluster, RayJob, RayService) for distributed training and inference workloads.

Manage GPU scheduling, node pools, and driver stacks; integrate gang scheduling (Kueue, NVIDIA KAI) where needed.

Support model-serving deployments running vLLM, Ray Serve, or TGI on Kubernetes with horizontal autoscaling.

Design GPU utilization dashboards and cost-per-token metrics for platform and AI teams.

Partner with ML and AI Full-Stack engineers on capacity planning and model rollout automation.

CI/CD & Automation

Build and maintain CI/CD pipelines using GitHub Actions, ArgoCD, or similar tools.

Implement GitOps workflows for infrastructure and application deployments.

Automate testing, security scanning, and quality gates in deployment pipelines.

Design blue-green, canary, and rolling deployment strategies for zero-downtime releases.

Build developer tooling and self-service infrastructure provisioning.

Observability & Reliability

Implement comprehensive monitoring and alerting with Prometheus and Grafana.

Build distributed tracing and logging infrastructure (OpenTelemetry, ELK stack, or similar).

Define and monitor SLOs/SLIs for critical services and infrastructure.

Lead incident response, post-mortems, and reliability improvements.

Implement backup, disaster recovery, and business continuity procedures.

Infrastructure & Security

Manage cloud infrastructure using Infrastructure-as-Code (Terraform, Pulumi, or similar).

Implement network policies, security groups, and access controls.

Configure secrets management and secure credential handling.

Ensure compliance with security best practices and audit requirements.

Optimize cloud costs and resource utilization across environments.

Qualifications

Must-Have Technical Expertise

5+ years of DevOps/SRE/Platform engineering experience in production environments.

Expert-level Kubernetes knowledge including cluster administration and troubleshooting.

Strong experience with Helm, Kustomize, or similar Kubernetes templating tools.

Production experience with CI/CD systems (GitHub Actions, GitLab CI, Jenkins, ArgoCD).

Proficiency with Infrastructure-as-Code tools (Terraform, Pulumi, CloudFormation).

Proficiency with AI-assisted development tools (Cursor, Claude Code, GitHub Copilot, or similar).

Cloud & Systems Skills

Experience with major cloud providers (GCP, AWS, or Azure) and their managed services.

Strong Linux systems administration and troubleshooting skills.

Experience with container runtimes, networking, and storage systems.

Understanding of database operations (PostgreSQL, Redis) in containerized environments.

Knowledge of security best practices for cloud-native applications.

Preferred/Bonus

Hands-on experience with KubeRay, Ray on GKE, or equivalent distributed-compute platforms on Kubernetes.

Experience deploying data orchestration platforms (Dagster, Airflow) on Kubernetes.

Knowledge of service mesh technologies (Istio, Linkerd).

Familiarity with FinOps practices and cloud cost optimization, especially GPU spend.

Experience with multi-tenant infrastructure and environment isolation.

Strong Vietnamese and English communication skills.

Benefits

Competitive salary and performance incentives

Work on production infrastructure at scale

Advanced training in cloud-native technologies

Flexible work arrangements

A collaborative, innovative engineering team environment

Ready to Join Our Team?

We're excited to meet passionate engineers who want to build the future of AI. Apply now and let's create something amazing together.