Sai Krishna Pendurthi

Staff SRE & Technical FinOps Engineer for remote-first infrastructure teams

I help distributed engineering teams run reliable cloud, virtualisation, and Kubernetes platforms, reduce cloud spend, and turn observability into faster incident response.

Open to 100% remote roles globally
8+ years in infrastructure and reliability
India-based, async-friendly collaboration

Experience across

Circles.co, Zzazz, Broadcom, VMware

About Me

I am a Staff SRE and FinOps-focused infrastructure engineer with 8+ years of experience across Circles.co, Zzazz, Broadcom, VMware, and startup environments. I specialize in Kubernetes, cloud cost visibility, observability, incident response, and operational automation for teams that need systems to be reliable without slowing product delivery.

My edge is combining deep infrastructure work with product and business context. After completing ISB's Product Management program, I have become especially interested in reliability work that connects engineering decisions to customer experience, cloud spend, team velocity, and business outcomes. I work well with async, remote-first teams and bring clear communication to ambiguous operational problems.

Recruiter Snapshot

Current Role: Staff SRE, Technical FinOps at Circles.co

Location: India, remote-first

Target Roles: SRE, Platform Engineering, Infrastructure, FinOps

Core Strengths: AWS, GCP, Terraform, Docker, Linux, Python, Kubernetes, Observability, Cloud Cost, Incident Response

Remote Fit: Async collaboration across distributed teams

Business Impact: MTTR reduction, cost savings, migrations, toil reduction

Selected Impact

65%

MTTR Reduction

Defined runbooks, alert flows, and Slack-integrated incident response for faster recovery.

27%

Cloud Cost Savings

Implemented OpenCost and Kubernetes optimization to improve cost visibility and control.

3x

Release Velocity

Led a zero-downtime Kubernetes migration that helped teams ship more confidently.

80%

Toil Reduction

Automated repeatable operational work so engineers could spend more time delivering product value.

Success Stories

Incident Recovery

Recovered Business-Critical MongoDB Systems

At Zzazz, led the recovery of MongoDB systems after critical indexes were deleted during a security incident. Rebuilt the deleted indexes, restored application stability, coordinated the recovery under pressure, and helped the business regain access to affected systems.

Restored database performance and application availability
Owned hands-on debugging and recovery during a high-pressure incident
Converted the incident into stronger operational safeguards

Cloud Migration

Led DigitalOcean to AWS Platform Migration

Headed the cloud-to-cloud migration from DigitalOcean to AWS at Zzazz, moving the platform toward stronger reliability, scalability, and long-term infrastructure control.

Planned and executed migration work across cloud environments
Improved platform scalability and operational maturity
Aligned infrastructure decisions with reliability and cost goals

More Case Studies

Production Saves Available to Discuss

I have a deeper set of stories about outages, migrations, platform rebuilds, cost savings, and reliability wins that show how I operate when business-critical systems need an owner.

Skills & Technologies

Infrastructure

AWS / GCP
Kubernetes / Karpenter
Terraform
Docker, CI/CD
Linux Internals

FinOps & Cost

Cost Engineering
Kubernetes Cost Visibility
OpenCost
Capacity & Spend Optimization

Observability

Prometheus & Grafana
OpenTelemetry
Grafana Tempo & Loki
Incident Response (MTTR reduction)
Runbooks / On-call Practices

Experience

Staff SRE, FinOps

Circles.co | May 2026 - Present

Working on SRE and FinOps in a global telecom technology environment, focusing on reliable platforms, cloud cost visibility, operational efficiency, and collaboration across distributed engineering teams.

Site Reliability Engineer / Platform Lead

Zzazz | February 2025 - April 2026

Joined as the first SRE and led platform engineering work for a remote engineering team. Built reliability practices from the ground up and reduced MTTR by 65% through runbooks, alert hygiene, and Slack-integrated incident workflows. Built an observability stack with Prometheus, Grafana, Loki, Tempo, and OpenTelemetry. Cut cloud costs by 27% using OpenCost and Kubernetes optimization, and led a zero-downtime Kubernetes migration that improved release velocity by 3x while reducing operational toil by 80%.

Software Engineer 3

Broadcom Software | December 2023 - August 2024

Worked on reliability, debugging, and infrastructure support for enterprise software environments after Broadcom's acquisition of VMware. Partnered with engineering and support teams to analyze complex production issues, improve system behavior, and keep customer-facing platforms stable through organizational and platform change.

Member of Technical Staff 3

VMware | February 2023 - December 2023

Provided technical leadership for complex VMware vCenter and ESXi environments, focusing on root cause analysis, debugging of escalated cases, and high-severity customer issues. Translated deep infrastructure findings into clear action plans for customers and internal teams.

Member of Technical Staff 2 / Support Engineer Level 2

VMware | June 2021 - February 2023

Handled second-line escalations for vCenter and ESXi, using log analysis, reproduction, and systems debugging to resolve complex infrastructure problems. Built a strong foundation in enterprise reliability, customer communication, and incident ownership.