XenonStack is the fastest-growing data and AI foundry for agentic systems, enabling people and organizations to gain real-time and intelligent business insights.
Agentic Systems for AI Agents: akira.ai
Vision AI Platform: xenonstack.ai
Inference AI Infrastructure for Agentic Systems: nexastack.ai
THE OPPORTUNITY
-------------------
We are seeking an Agentic Infrastructure Observability Engineer to design, implement, and maintain visibility, monitoring, and assurance systems for large-scale AI agent deployments. This role focuses on observability, telemetry, and evaluation pipelines across multi-agent and multi-context workflows, ensuring AI systems are measurable, trustworthy, and compliant in enterprise and regulated environments.
If you're passionate about SRE principles for AI, LLM evaluation, and agentic system transparency, this role offers the chance to shape observability for the next generation of intelligent automation.
RESPONSIBILITIES
--------------------
Design and Implement Telemetry Pipelines
Build observability infrastructure to capture logs, metrics, traces, and behavioral data from AI agents, orchestration layers, and integrated tools.
Develop Evaluation Dashboards & KPIs
Track accuracy, latency, reliability, cost, token usage, and success rates for agentic workflows.
Enable Full-Stack Tracing
Build execution flow tracing for multi-agent, multi-tool pipelines, with attribution for each decision, prompt, and retrieval step.
Monitor Behavioral Reliability
Detect and flag hallucinations, decision drift, prompt degradation, or tool misuse in real time.
Integrate with Evaluation Frameworks
Work with LLM eval tools like TruLens, Ragas, Arize AI, and custom scoring systems for continuous quality monitoring.
Ensure Compliance & Auditability
Implement observability features for regulatory audits (e.g., PCI-DSS, GDPR), including secure logging of prompts, retrieved context, and decisions.
Cost & Resource Observability
Track model/API usage, compute cost, and token consumption to enable optimization decisions.
Collaborate Across Teams
Partner with AgentOps Engineers, AI Interaction Engineers, and Model Reliability teams to turn observability insights into operational improvements.
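To make the telemetry and tracing responsibilities above concrete, here is a minimal stdlib-only sketch of capturing one traced agent step with latency and token attribution. All names (`AgentSpan`, `record_step`, the attribute keys) are illustrative assumptions, not an existing API; a production pipeline would emit OpenTelemetry spans to a collector rather than printing JSON lines.

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class AgentSpan:
    """One traced step (prompt, tool call, retrieval) in an agent workflow."""
    name: str
    trace_id: str                     # shared across all steps of one workflow run
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    attributes: dict = field(default_factory=dict)
    start: float = field(default_factory=time.time)
    duration_ms: float = 0.0

def record_step(trace_id, name, fn, **attrs):
    """Run one agent step and emit a span with latency and custom attributes."""
    span = AgentSpan(name=name, trace_id=trace_id, attributes=attrs)
    result = fn()
    span.duration_ms = (time.time() - span.start) * 1000
    print(json.dumps(asdict(span)))   # stand-in for shipping to a log/trace backend
    return result, span

# Example: trace a hypothetical retrieval step with token attribution.
trace_id = uuid.uuid4().hex
result, span = record_step(
    trace_id, "retrieval",
    lambda: ["doc-1", "doc-2"],
    agent="planner", tokens_in=812, tokens_out=64,
)
```

Keeping a single `trace_id` across every prompt, tool call, and retrieval step is what makes per-decision attribution possible when reconstructing a multi-agent execution flow.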
SKILLS & QUALIFICATIONS
----------------------------
Must-Have:
3-5 years in SRE, DevOps, AI infrastructure, or ML systems engineering.
Proficiency in Python and observability stacks (Prometheus, OpenTelemetry, Grafana, ELK, etc.).
Familiarity with LLM architectures, multi-agent orchestration frameworks (LangGraph, LangChain, AgentBridge), and context pipelines.
Experience with logging, tracing, and performance profiling for distributed systems.
Understanding of LLM evaluation metrics (factuality, coherence, toxicity, cost efficiency).
Knowledge of privacy and compliance standards for AI systems.
Good-to-Have:
Hands-on experience with LLM eval tools (TruLens, Ragas, Arize AI, Weights & Biases).
Familiarity with RAG, vector databases, and knowledge graph-based retrieval.
Experience in regulated industries (BFSI, healthcare, GRC).
Background in anomaly detection or behavioral monitoring for ML systems.
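As a small illustration of the cost-efficiency side of the metrics listed above, the sketch below derives dollar cost from token counts and turns it into a completions-per-dollar figure. The model name and per-1K-token prices are hypothetical placeholders; real prices vary by provider and model.

```python
# Hypothetical per-1K-token prices (USD); real provider pricing differs.
PRICES = {"gpt-large": {"in": 0.01, "out": 0.03}}

def call_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of one LLM call, computed from token counts."""
    p = PRICES[model]
    return tokens_in / 1000 * p["in"] + tokens_out / 1000 * p["out"]

def cost_efficiency(successes: int, calls: list[tuple]) -> float:
    """Successful task completions per dollar spent across a workflow."""
    total = sum(call_cost(*c) for c in calls)
    return successes / total if total else float("inf")

# Two calls from one workflow run that completed 2 tasks.
calls = [("gpt-large", 1200, 300), ("gpt-large", 800, 150)]
eff = cost_efficiency(successes=2, calls=calls)
```

Tracking this ratio per workflow (rather than raw spend alone) is one way to surface which agent configurations are worth their cost.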
CAREER GROWTH & BENEFITS
-----------------------------
Continuous Learning & Growth
Training and certifications in AI observability, LLM evaluation, and Responsible AI.
Hands-on exposure to enterprise-scale agentic infrastructure.
Recognition & Rewards
Incentives for innovations in AI observability and monitoring.
Fast-track opportunities into AI Reliability Architecture or Model Ops Leadership roles.
Work Benefits & Well-Being
Comprehensive medical insurance and project-based allowances.
Cab facilities for women employees and special project perks.