to lead the reliability and operational excellence agenda for our Enterprise Data Platforms spanning GCP cloud-native systems. This strategic leadership role will help instill Google's SRE principles across diverse data engineering teams, uplift our platform reliability posture, and spearhead the creation of a Centre of Excellence (CoE) for SRE. The ideal candidate will possess a deep understanding of modern SRE practices, demonstrate a proven ability to scale SRE capabilities in large enterprises, and evangelise a data-driven approach to resilience engineering.
Key Responsibilities:
Define and drive SRE strategy for enterprise data platforms on GCP, aligning with business goals and reliability needs.
Act as a trusted advisor to platform teams, embedding SRE mindset, best practices, and golden signals into their SDLC and operational processes.
Set up and lead a Site Reliability Engineering CoE, delivering reusable tools, runbooks, blueprints, and platform accelerators to scale SRE adoption across the organisation.
Partner with product and platform owners to prioritise and structure SRE backlogs, formulate roadmaps, and help teams move from reactive ops to proactive reliability engineering.
Define and track SLIs, SLOs, and error budgets across critical data services, enabling data-driven decision making around availability and performance.
Drive incident response maturity, including chaos engineering, incident retrospectives, and blameless postmortems.
Foster a reliability culture through coaching, workshops, and cross-functional forums.
Build strategic relationships across engineering, data governance, security, and architecture teams to ensure reliability is baked in, not bolted on.
Required Qualifications:
Bachelor's or Master's degree in Computer Science, Engineering, or related discipline.
3+ years in SRE leadership or SRE strategy roles.
Strong familiarity with Google SRE principles and practical experience applying them in complex enterprise settings.
Proven track record in establishing and scaling SRE teams.
Experience with GCP services like Cloud Build, GCS, Cloud SQL, Cloud Functions, and GCP logging & monitoring.
Deep experience with observability stacks such as Prometheus, Grafana, Splunk, and GCP native solutions.
Skilled in Infrastructure as Code using tools like Terraform, and working knowledge of automation in CI/CD environments.
Key Competencies & Skills:
Strong leadership, influence without authority, and mentoring capabilities.
Hands-on scripting and automation skills in Python, with secondary languages like Go or Java a plus.
Familiarity with incident and problem management frameworks in enterprise environments.
Ability to define and execute a platform-wide reliability roadmap in alignment with architectural and business objectives.
Nice to Have:
Exposure to secrets management tools (e.g., HashiCorp Vault).
Experience with tracing and APM tools like Google Cloud Trace or Honeycomb.
Background in data governance, data pipelines, and security standards for data products.
Job Type: Permanent
Pay: ₹4,000,000.00 - ₹5,000,000.00 per year
Schedule:
Day shift
Monday to Friday
Experience:
SRE leadership or SRE strategy: 3 years (Required)
Google SRE principles: 3 years (Required)
GCP services: 3 years (Required)
Infrastructure as Code using tools like Terraform: 5 years (Required)
Python: 2 years (Required)
Work Location: Remote