Senior DevOps & Site Reliability Engineer (DevOps + SRE)
About the Role
We are seeking a highly experienced Senior DevOps & Site Reliability Engineer to support and scale our cloud-native, containerized IoT platform built on AWS. You will work closely with the Technical Manager to automate infrastructure, build CI/CD pipelines, manage large-scale deployments, and ensure the platform's reliability, security, and performance.
This role requires deep hands-on expertise in AWS, Docker/Kubernetes, serverless workflows, infrastructure automation, scripting (Python), and IoT-scale distributed systems reliability.
Key Responsibilities
DevOps Responsibilities
Design, implement, and maintain CI/CD pipelines using GitHub Actions, AWS CodePipeline, or GitLab CI.
Develop and automate deployment workflows following DevOps strategy and best practices.
Manage Docker containerization, including multi-stage builds, optimization, and image security.
Orchestrate containers using Kubernetes (EKS) or AWS ECS (Fargate/EC2).
Manage and optimize ECR for image storage and versioning.
Implement Infrastructure-as-Code using AWS CDK, Terraform, or CloudFormation.
Build automated workflows for backend, microservices, and IoT services deployment.
Support serverless architectures using AWS Lambda, Step Functions, EventBridge, etc.
Implement secure secrets management using AWS IAM, KMS, and Secrets Manager.
Handle configuration, environment management, and zero-downtime deployment strategies.
Site Reliability Engineering (SRE) Responsibilities
Build and maintain monitoring, logging, and tracing pipelines using CloudWatch, Grafana, Prometheus, X-Ray, and OpenTelemetry.
Define and implement SLIs, SLOs, error budgets, and reliability dashboards.
Ensure high availability, resilience, and performance of all production systems.
Conduct incident management, root cause analysis, and post-incident reviews.
Optimize cost, compute utilization, autoscaling policies, and failover strategies.