Write your own page
in the DatVietVAC story

Senior Site Reliability Engineer (SRE)

Full-time

222 Pasteur, Xuan Hoa Ward, Ho Chi Minh City

VieON

Job description:

  • Own and improve SLOs, SLIs, and error budgets for critical services across playback, login, subscription, recommendation, and API layers.
  • Build and maintain observability stacks (Prometheus, Grafana, OpenTelemetry, Datadog) to proactively detect and resolve issues.
  • Drive incident management, root cause analysis (RCA), and postmortem culture for service outages and performance degradation.
  • Automate repetitive operational tasks via IaC (Terraform), CI/CD (GitHub Actions), and scripting (Python/Bash/Golang).
  • Collaborate with backend, frontend, and data teams to design fault-tolerant, scalable infrastructure (GKE, Cloud Run, Cloud CDN, etc.).
  • Work closely with security and platform teams to ensure system hardening, compliance, and zero-trust principles.
  • Continuously assess infrastructure cost and performance trade-offs to optimize cloud spend (GCP preferred).
  • Contribute to the evolution of our deployment strategy (blue/green, canary, A/B), especially during high-traffic events (e.g., livestreams, premieres).

Job requirements:

  • 5+ years of experience as an SRE, DevOps Engineer, or Production Engineer in large-scale environments.
  • Strong knowledge of Linux internals, networking, and systems performance tuning.
  • Deep experience with Kubernetes, containers, and service mesh technologies (Istio or Linkerd).
  • Proficiency with cloud platforms (preferably GCP), including IAM, Compute, GKE, Cloud CDN, Cloud Logging.
  • Solid experience with monitoring, logging, and alerting stacks (e.g., Prometheus, Grafana, ELK, Loki, Datadog).
  • Strong scripting or programming skills in Python, Go, or Bash.
  • Familiarity with CI/CD, IaC, and GitOps tools (Terraform, Helm, ArgoCD, Cloud Build).
  • Clear communication skills and a calm, analytical approach to solving complex problems in high-pressure environments.

Nice to have:

  • Experience supporting real-time media systems or video streaming platforms.
  • Knowledge of multi-region HA, failover, and edge optimization strategies (especially for Asia-Pacific markets).
  • Familiarity with error budgets, chaos engineering, and resiliency testing.
  • Background in supporting platform services for experimentation (A/B), personalization, or user engagement.

Contact information:

Recruitment Team - recruitment@datvietvac.vn
