Advanced Engineer Site Reliability

Bangalore, Karnataka, India

Job Description

About Albertsons Companies India
Albertsons Companies is a leading food and drug retailer in the United States. As of February 22, 2025, the Company operated 2,270 retail stores with 1,728 in-store pharmacies, 405 associated fuel centers, 22 dedicated distribution centers and 19 manufacturing facilities. Albertsons Companies India is a vital extension of the Albertsons Companies Inc. workforce and is important to the next phase of the company's technology journey, supporting millions of customers' lives every day.

Position Title: Advanced Engineer Site Reliability
Key Responsibilities:

  • Design and maintain scalable monitoring solutions using Prometheus and Grafana.
  • Implement and manage Mimir for long-term metrics storage and high-availability monitoring.
  • Configure and optimize Loki for centralized log aggregation and Tempo for distributed tracing.
  • Develop custom dashboards and alerts to monitor application health, performance, and SLAs.
  • Write and optimize queries using PromQL for metrics analysis and LogQL for log exploration.
  • Perform root cause analysis and incident investigations using telemetry data from metrics, logs, and traces.
  • Develop automation scripts and tools using Python to support observability, alerting, and incident response workflows.
  • Integrate observability tools into CI/CD pipelines and deployment workflows.
  • Deploy and manage observability components on Microsoft Azure, leveraging services such as Azure Kubernetes Service (AKS), Azure Monitor, and Azure Storage.
  • Collaborate with cloud and DevOps teams to ensure observability is embedded across all environments.
  • Participate in on-call rotations, incident response, and post-incident reviews.
  • Define and track SLOs, SLIs, and error budgets to drive service reliability improvements.
Must-Have Skills:
  • 3-5 years of experience in SRE, DevOps, or Infrastructure Engineering roles.
  • Strong hands-on experience with:
    • Grafana for visualization and alerting.
    • Prometheus for metrics collection and storage.
    • Python, Bash, and PowerShell for scripting and automation.
    • Microsoft Azure services (including AKS, Azure Monitor, and related infrastructure components) or GCP.
  • Solid understanding of monitoring architectures, incident management, and performance tuning.
  • SQL
Good-to-Have Skills:
  • Experience with Infrastructure-as-Code (IaC) tools like Terraform or Bicep.
  • Exposure to DevOps practices and integrating observability into CI/CD pipelines.
  • Knowledge of distributed systems, microservices, and cloud-native architectures.
  • Experience with chaos engineering and resilience testing.
  • Loki for log aggregation and Tempo for distributed tracing.
  • PromQL and LogQL for querying metrics and logs.


Job Detail

  • Job Id
    JD3772968
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type
    Full Time
  • Salary
    Not mentioned
  • Employment Status
    Permanent
  • Job Location
    Bangalore, Karnataka, India
  • Education
    Not mentioned
  • Experience
    Not mentioned