About Company:
Our client is a leading information technology company that develops hardware and software designed to work closely together. It is known for creating complete systems rather than just individual devices. Its products run operating systems designed to be secure, stable, and easy to use.
Our client plays an important role in software development and computing. It provides tools that developers use to create apps, and it manages a large digital platform that supports millions of applications worldwide.
The company is also strong in IT security and privacy. Its systems use encryption, secure authentication, and regular software updates to protect user data. Additionally, it offers cloud computing services for data storage, backup, and synchronization across devices.
An Ideal Candidate:
We are seeking an experienced DevOps and Site Reliability Engineer to strengthen our engineering ecosystem that powers hundreds of continuously integrated and deployed projects.
This role is critical in maintaining the stability, scalability, and reliability of our systems while driving automation and standardization across environments and pipelines. You will be part of a core team that builds and maintains CI/CD pipelines (built on proprietary frameworks similar to Jenkins) integrated with Docker, Kubernetes, and cloud-native environments across AWS and GCP.
Alongside DevOps automation, you will own the production reliability charter, ensuring that our infrastructure, deployments, and runtime environments operate with the highest levels of availability and performance.
Advanced knowledge of CI/CD systems (Jenkins, GitHub Actions, GitLab CI, or equivalents).
Deep hands-on expertise with Docker, Kubernetes, Helm, and service mesh frameworks.
Proficiency in IaC tooling (Terraform, Pulumi) and cloud-native deployment stacks.
Multi-cloud experience with AWS and GCP, including EKS, GKE, Cloud Run, Lambda, and network/load-balancing components.
Scripting and automation skills in Python, Go, or Bash.
Experience with an observability stack: Prometheus, Grafana, ELK, OpenTelemetry, or Datadog.
Strong understanding of SRE principles: error budgets, SLO/SLI definition, incident management, and reliability modeling.
Familiarity with configuration management (Ansible, Chef) and secrets management (Vault, AWS Secrets Manager).
Key Competencies:
Design, build, and maintain highly scalable CI/CD pipelines for code integration, container builds, automated testing, and artifact publishing.
Standardize pipeline templates, branch protection, and GitOps workflows across the organization to ensure consistent quality and compliance.
Automate multi-environment deployments in Kubernetes and hybrid-cloud setups (AWS, GCP).
Integrate CI/CD processes with IaC frameworks such as Terraform and Pulumi to ensure complete environment reproducibility and security.
Champion operational excellence by defining and maintaining runbooks, operational guides, and production playbooks for all services.
Implement best practices in service reliability: incident response automation, post-mortem analysis, and change and release management workflows.
Design and manage Kubernetes clusters supporting blueprints, multi-zone replication, and global server load balancing for high availability.
Establish site reliability indicators and drive adherence to SLOs, SLIs, and SLAs across infrastructure and application layers.
Collaborate with engineering teams to embed observability (metrics, logs, traces, and alerts) into services so that performance degradations are identified proactively.
Lead capacity planning, scaling strategies, and environment audits to anticipate demand spikes and operational bottlenecks.
Build and maintain secure infrastructure following DevSecOps principles and compliance requirements.
Embed continuous security scans, dependency checks, and image hardening into CI/CD pipelines.
Manage secrets, credentials, and policies through centralized identity and access management systems.
Support zero-downtime deployments through canary and blue-green rollout strategies.
Partner with development, product, and platform teams to drive a culture of reliability, automation, and operational transparency.
Conduct operational readiness reviews and ensure all new services meet reliability and monitoring standards before production launch.
Mentor and guide teams on DevOps, SRE, and GitOps principles, fostering a developer-first reliability culture.
Continuously refine processes to reduce mean time to recovery (MTTR) and increase system uptime.