The Role
We're looking for a Senior DevOps Engineer who has operated production systems at serious scale and can now set the bar for how DevOps is practiced across a portfolio of consumer products. You will own the design and reliability of critical infrastructure spanning GCP/GKE, Azure DevOps and on-prem IDC, and you'll do it in an AI-native way, using LLMs as a genuine engineering collaborator across architecture, automation and incident response. This is a hands-on role. You will still write Terraform, debug production incidents and review pipelines, but you will also set patterns, mentor engineers across PODs, and influence how 16+ products ship software.
What You'll Do
• Own the architecture and reliability of core platform services — GKE clusters, CI/CD, observability, edge delivery — across one or more PODs
• Design multi-region, zero-trust infrastructure on GCP (Mumbai + Delhi) with well-reasoned trade-offs on cost, latency and blast radius
• Lead production incident response, write sharp postmortems and drive the systemic fixes that keep the same incident from recurring
• Build and evolve Infrastructure-as-Code — Terraform modules, Helm charts, Argo CD workflows — that other engineers adopt by default
• Raise the CI/CD bar across Azure DevOps pipelines — build time, test coverage gates, progressive delivery, artifact governance
• Drive the observability strategy in New Relic and GCP — SLO/SLI definition, error budgets, meaningful alerts (not noise)
• Partner with security to keep zero-trust, IAM, secrets and WAF policies honest as the platform evolves
• Champion AI-native workflows — codify how the team uses LLMs for code review, RCA, runbook generation and documentation
• Mentor mid-level and junior engineers; set review standards and raise the technical judgment of the POD
• Contribute to hiring — interview candidates, calibrate loops, and help grow the team responsibly
What You Bring
• 7–10 years of DevOps / SRE / Platform Engineering experience, with at least 3 years operating production systems at consumer scale (10M+ users or equivalent throughput)
• Deep expertise in Kubernetes in production (not just kubectl): upgrades, autoscaling, multi-tenant cluster design, Istio/service mesh, GKE specifics
• Strong command of GCP: IAM, VPC design, Workload Identity, Cloud SQL, Pub/Sub, Cloud Operations; equivalent AWS/Azure depth acceptable with demonstrable GCP fluency
• Expert-level Terraform: module design, state management, drift handling, policy-as-code
• Production CI/CD experience on Azure DevOps (strongly preferred) or equivalent (GitHub Actions, GitLab CI, Jenkins)
• Solid programming ability in at least one of Python, Go or TypeScript; you write code that others can maintain
• Strong Linux internals, networking (BGP, DNS, TLS, L7 load balancing) and troubleshooting skills
• Observability fluency (New Relic, Prometheus, OpenTelemetry) and the instinct for what to measure vs what to ignore
• Excellent written communication: you can write a crisp design doc, a useful postmortem and a readable runbook
• A bias for automation and a low tolerance for toil
Nice to Have
• OTT / video streaming background — CDN, ABR, origin shielding, live vs VOD trade-offs
• Experience with DragonflyDB, Redis at scale, or high-throughput caching tiers
• Exposure to on-prem / hybrid cloud operations (IDC, colocation, private interconnect)
• Hands-on experience embedding LLMs into engineering workflows — Claude, MCP servers, spec-driven development, agentic CI/CD
• Security certifications (CKS, GCP Professional Cloud Security) or equivalent practical depth