to lead the installation, automation, and operational reliability of a modern open-source data and integration platform. The platform underpins business-critical data pipelines and integrations built on technologies such as Apache Airflow, Apache NiFi, Apache Spark, Kafka, PostgreSQL, MQTT brokers, Docker, and Kubernetes.
This is a hands-on, senior individual contributor role with ownership across infrastructure, reliability, security, automation, and operational excellence, supporting deployments on both private and public cloud environments.
# What is the role about?
Key Responsibilities
------------------------
Platform Installation, Configuration & Operations
Install, configure, upgrade, and maintain distributed open-source components including:
+ Apache Airflow, Apache NiFi, Apache Spark
+ Apache Kafka and its ecosystem
+ PostgreSQL
+ MQTT brokers
Ensure platform stability, scalability, high availability, and fault tolerance.
Perform capacity planning, performance tuning, and lifecycle management of all components.
Containerization & Orchestration
Design, deploy, and operate containerized workloads using Docker.
Build and manage production-grade Kubernetes clusters.
Implement Kubernetes best practices for networking, storage, scaling, and security.
Package and manage platform services using Helm or equivalent tooling.
Infrastructure as Code & Automation
Design and maintain Infrastructure as Code (IaC) using Terraform for cloud and on-prem environments.
Build configuration management and automation workflows using Ansible.
Enable repeatable, environment-agnostic deployments across development, staging, and production.
Automate provisioning, configuration, upgrades, scaling, and recovery processes.
Cloud, Hybrid & Private Infrastructure
Deploy and operate workloads on public cloud platforms (AWS, Azure, GCP) and private/on-prem infrastructure.
Design hybrid architectures with secure connectivity between environments.
Optimize infrastructure design for resilience, performance, and cost efficiency.
Observability, Reliability & Incident Management
Design and implement comprehensive monitoring, logging, and alerting for infrastructure and applications.
Define, measure, and maintain SLAs, SLIs, and SLOs for critical platform services.
Own incident response, root cause analysis, and post-incident reviews.
Proactively identify risks, bottlenecks, and failure modes before they impact users.
Security & Secrets Management
Implement infrastructure and platform security best practices across containers, Kubernetes, and networks.
Manage secrets and credentials using tools such as Vault, Kubernetes Secrets, or cloud-native solutions.
Own certificate lifecycle management, including rotation and renewal.
Design and enforce network security controls, access policies, and zero-trust principles where applicable.
Support compliance with internal security and governance requirements.
Backup, Disaster Recovery & Data Protection
Design and implement automated backup strategies for Kafka, PostgreSQL, and other stateful services.
Own disaster recovery planning and testing, including restore validation.
Support multi-cluster or cross-region strategies where required.
Ensure data durability, integrity, and recoverability.
Cost & Resource Optimization
Implement infrastructure cost monitoring and visibility across environments.
Right-size clusters, storage, and compute resources to balance performance and cost.
Continuously optimize resource usage for cloud and hybrid deployments.
CI/CD & Release Engineering
Build and maintain CI/CD pipelines for platform and infrastructure components.
Enable safe deployment strategies such as rolling, blue-green, or canary deployments.
Support Git-based workflows and infrastructure promotion across environments.
Documentation, Enablement & Collaboration
Create and maintain operational documentation, runbooks, and architectural diagrams.
Enable self-service capabilities for engineering teams wherever possible.
Work closely with data engineers, backend engineers, and architects to support platform needs.
Reduce operational friction through automation, standardization, and tooling improvements.
# Required skills and qualifications
5+ years of hands-on experience in DevOps, Platform Engineering, or Site Reliability Engineering.
Strong experience operating distributed, open-source systems in production.
Proven expertise with:
+ Docker and Kubernetes
+ Terraform and Ansible
+ Linux systems and networking fundamentals
Hands-on experience with Kafka, Spark, Airflow, NiFi, PostgreSQL, and messaging systems (including MQTT).
Experience supporting business-critical platforms with uptime and reliability requirements.
Strong scripting skills (Bash, Python, or equivalent).
Excellent troubleshooting and systems-level problem-solving skills.
# Preferred skills and qualifications
Experience with GitOps tools such as ArgoCD or Flux.
Experience with observability stacks (Prometheus, Grafana, ELK/OpenSearch).
Familiarity with service meshes, ingress controllers, and API gateways.
Experience operating data-intensive or streaming platforms at scale.
Prior experience in hybrid or on-prem-first environments.
# About us
Cuculus is the key to providing utilities to all, while protecting the world's precious resources. Jointly with our international partner network, we provide cutting-edge software and technology solutions to address utility challenges now and in the future. Cuculus will never tire of creating innovative technology and services that enable utilities and organisations to successfully transition to a new era of providing and managing electricity, water, and gas. The work we do is important for individuals, cities, and entire nations. Our work is serious, but we have fun with it, too.