AI Engineer / Researcher
About the Role
Train, fine-tune, and optimize open-source LLMs and multimodal models for domain-specific retrieval, reasoning, and workflow automation in production environments.
Key Responsibilities
Model Development & Fine-tuning
Research, train, and fine-tune open-source LLMs (LLaMA, Mistral, Qwen, Gemma, etc.) for domain-specific tasks.
Train and adapt BERT and other Transformer encoder models (RoBERTa, DeBERTa, MPNet) for classification, retrieval, and embedding workloads.
Implement supervised fine-tuning (SFT), instruction tuning, and preference-based alignment (RLHF, DPO, ORPO).
Run distributed training jobs on Ray / KubeRay clusters with DeepSpeed or Hugging Face Accelerate.
Develop efficient data pipelines for model training: data cleaning, tokenization, chunking, and labeling.
Optimize models for RAG pipelines, grounding responses in canonical data and metadata.
Evaluation, Serving & Deployment
Evaluate models with LangSmith, custom benchmarks, and human-in-the-loop feedback.
Deploy optimized models to production environments (cloud, on-prem, or air-gapped setups) using vLLM, SGLang, or TGI.
Route and govern model traffic with LiteLLM for multi-model, multi-provider serving patterns.
Collaborate with platform engineers on inference infrastructure: tensor parallelism, continuous batching, KV-cache tuning.
Maintain experiment tracking and ensure reproducibility across training runs.
Qualifications
Must-Have Technical Expertise
5+ years in applied ML/AI research or engineering, with at least 2 years focused on LLM or Transformer model work.
Strong background in PyTorch, Hugging Face Transformers, and tokenizers. Hands-on experience with both decoder (LLaMA-family) and encoder (BERT-family) architectures.
Proven ability to adapt open-source models to real-world, production-grade tasks.
Experience with distributed training using Ray / Ray Train, DeepSpeed, or Hugging Face Accelerate.
Working knowledge of an inference-serving stack (vLLM, SGLang, TGI) and of LiteLLM for multi-provider routing.
Deep understanding of training efficiency tradeoffs among memory, throughput, and cost.
Proficiency with AI-assisted development tools (Cursor, Claude Code, GitHub Copilot, or similar).
Research & Execution
Ability to balance rapid prototyping with rigorous benchmarking and reproducibility.
Strong analytical skills for experiment design and result interpretation.
Excellent skills in documenting and communicating research findings.
Preferred/Bonus
Parameter-efficient fine-tuning techniques: LoRA, QLoRA, adapters.
Quantization techniques: 4-bit/8-bit inference, AWQ/GPTQ, GGUF/GGML optimizations.
Experience training or distilling domain-specific embedding models on top of BERT / MPNet / E5 backbones.
Multimodal training experience (vision + text for document understanding).
Experience running Ray jobs on KubeRay with gang scheduling and GPU-aware autoscaling.
Familiarity with continuous batching, PagedAttention, and tensor-parallel inference in vLLM.
Experiment tracking with Weights & Biases, MLflow, or similar tools.
Serving stacks: vLLM, TGI, TensorRT-LLM, Ray Serve, SGLang.
Strong Vietnamese and English communication skills.
Benefits
Competitive salary and performance incentives
Work on cutting-edge AI research with real-world applications
Access to compute resources for model training and experimentation
Advanced training and conference attendance opportunities
Flexible work arrangements
A collaborative environment that values research rigor and practical impact
Ready to Join Our Team?
We're excited to meet passionate engineers who want to build the future of AI. Apply now and let's create something amazing together.