AI Evaluation Specialist (QA)

TS, IN, India

Job Description

Kore.ai is a pioneering force in enterprise AI transformation, empowering organizations through our comprehensive agentic AI platform. With innovative offerings across "AI for Service," "AI for Work," and "AI for Process," we're enabling more than 400 Global 2000 companies to fundamentally reimagine their operations, customer experiences, and employee productivity.


Our end-to-end platform enables enterprises to build, deploy, manage, monitor, and continuously improve agentic applications at scale. We've automated over 1 billion interactions every year with voice and digital AI in customer service, and transformed employee experiences for tens of thousands of employees through productivity and AI-driven workflow automation.


Recognized as a leader by Gartner, Forrester, IDC, ISG, and Everest, Kore.ai has secured Series D funding of $150M, including strategic investment from NVIDIA to drive Enterprise AI innovation. Founded in 2014 and headquartered in Florida, we maintain a global presence with offices in India, UK, Germany, Korea, and Japan.


You can find full press coverage at https://kore.ai/press/.

POSITION: Senior AI Evaluation Specialist



POSITION SUMMARY:

We are seeking a Senior AI Evaluation Specialist to design and execute robust evaluation methodologies for Generative and Agentic AI systems. This role bridges AI product quality, evaluation science, and responsible AI governance, ensuring every AI feature, agent, and model release is measured, benchmarked, and validated using standardized frameworks.

The ideal candidate combines a QA mindset, ML evaluation rigor, and hands-on coding expertise to benchmark LLMs, multi-agent workflows, and GenAI APIs, driving consistent, measurable, and safe AI product performance.

LOCATION: Hyderabad (Work from Office)



RESPONSIBILITIES:



1. AI Evaluation & Benchmarking

  • Build and maintain end-to-end evaluation pipelines for Generative and Agentic AI features (e.g., chat, reasoning agents, RAG workflows, summarization, classification).
  • Implement standardized evaluation frameworks such as RAGAS, G-Eval, HELM, PromptBench, MT-Bench, or custom evaluation harnesses.
  • Define and measure core AI quality metrics: accuracy, groundedness, coherence, contextual recall, hallucination rate, and response time.
  • Create reproducible benchmarks, leaderboards, and regression tracking for models and agents across multiple releases or providers (OpenAI, Anthropic, Mistral, etc.).
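As a minimal sketch of the kind of custom evaluation harness this work involves: the code below scores model responses for exact-match accuracy and a crude token-overlap groundedness proxy. The dataclass fields and scoring logic are illustrative assumptions, not a Kore.ai implementation; production pipelines would use frameworks such as RAGAS and embedding-based metrics.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str
    context: str    # retrieved source text (used for groundedness)
    reference: str  # expected answer
    response: str   # model output under test

def token_overlap(a: str, b: str) -> float:
    """Fraction of tokens in `a` that also appear in `b` (crude groundedness proxy)."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a) if tokens_a else 0.0

def evaluate(cases: list[EvalCase]) -> dict[str, float]:
    """Aggregate per-case scores into release-level metrics for regression tracking."""
    accuracy = sum(c.response.strip().lower() == c.reference.strip().lower()
                   for c in cases) / len(cases)
    groundedness = sum(token_overlap(c.response, c.context) for c in cases) / len(cases)
    return {"accuracy": round(accuracy, 3), "groundedness": round(groundedness, 3)}
```

Running the same harness against each release or provider yields directly comparable numbers, which is what makes leaderboards and regression tracking possible.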

2. Agentic AI Evaluation

  • Evaluate multi-agent systems and autonomous AI workflows, measuring task success rates, reasoning trace quality, and tool-use efficiency.
  • Assess Agentic AI behaviors such as planning accuracy, goal completion rate, context handoff success, and inter-agent communication reliability.
  • Validate decision-making transparency and error recovery mechanisms in autonomous agent frameworks (LangGraph, AutoGen, CrewAI, etc.).
  • Design agent-specific evaluation scenarios: simulated environments, user-in-the-loop testing, and "mission-based" performance scoring.

3. Experimentation & Automation

  • Develop Python-based evaluation scripts to automate testing using OpenAI, Anthropic, and Hugging Face APIs.
  • Conduct large-scale comparative studies across prompts, models, and fine-tuned variants, analyzing quantitative and qualitative differences.
  • Integrate evaluations into CI/CD pipelines to enable continuous AI quality monitoring.
  • Visualize results using dashboards (Plotly, Streamlit, Dash, or Grafana).

4. Quality Governance & Reporting

  • Define and enforce AI acceptance thresholds before deployment.
  • Collaborate with Responsible AI teams to evaluate bias, fairness, safety, and privacy implications.
  • Produce detailed evaluation reports and audit logs for model releases and governance boards.
  • Present findings to Product, Data Science, and Executive stakeholders, transforming metrics into actionable insights.
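Acceptance-threshold gating of this kind is straightforward to express in code. The sketch below is a hypothetical example; the metric names, threshold values, and pass/fail rules are placeholders, not actual Kore.ai governance policy.

```python
# Hypothetical acceptance thresholds; real values would come from governance policy.
THRESHOLDS = {"accuracy": 0.90, "groundedness": 0.85, "hallucination_rate": 0.05}

def gate_release(metrics: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (passes, violations). Metrics where lower is better (e.g.
    hallucination_rate) are compared with <=; all others with >=."""
    lower_is_better = {"hallucination_rate"}
    failures = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif name in lower_is_better and value > limit:
            failures.append(f"{name}: {value} > {limit}")
        elif name not in lower_is_better and value < limit:
            failures.append(f"{name}: {value} < {limit}")
    return (not failures, failures)
```

Wired into a CI/CD pipeline, a check like this blocks deployment automatically when any metric regresses, and the violation list doubles as an audit-log entry.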

5. Collaboration & Continuous Improvement

  • Work closely with Prompt Engineers, ML Scientists, and QA Engineers to close the loop between testing and improvement.
  • Support Product teams in defining evaluation-driven release criteria.
  • Mentor junior evaluators in AI testing methodologies, benchmarking, and analysis.
  • Keep abreast of advances in LLM evaluation research, Agentic AI frameworks, and tool-calling reliability testing.

QUALIFICATIONS / SKILLS REQUIRED:



  • Programming: Python (Pandas, NumPy, LangChain, LangGraph, OpenAI/Anthropic SDKs)
  • Evaluation Frameworks: RAGAS, HELM, G-Eval, MT-Bench, PromptBench, custom scoring pipelines
  • GenAI APIs: OpenAI GPT-4/5, Claude, Gemini, Mistral, Azure OpenAI
  • Agentic AI: Understanding of multi-agent orchestration, tool use, reasoning traces, and planning frameworks (AutoGen, CrewAI, LangGraph)
  • Metrics Knowledge: BLEU, ROUGE, cosine similarity, factuality, coherence, bias, toxicity, reasoning success rate
  • Data & Analytics: JSON parsing, prompt dataset curation, result visualization
  • Tooling: Git, Jupyter/Colab, Jira, Confluence, evaluation dashboards
  • Soft Skills: Analytical communication, documentation excellence, cross-team collaboration
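As one concrete illustration of the metrics knowledge listed above, cosine similarity can be sketched over bag-of-words term counts. This is a dependency-free simplification for illustration only; real evaluation pipelines typically compute cosine similarity over embedding vectors from a model.

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity over bag-of-words term counts. Embedding models would
    normally supply the vectors; word counts keep this sketch self-contained."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0
```

Identical texts score 1.0 and texts with no shared tokens score 0.0, which makes the metric useful as a quick semantic-overlap signal between a model response and a reference answer.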

EDUCATION QUALIFICATION:



Bachelor's or Master's degree in Computer Science, AI, Data Science, or a related discipline. 5 to 10 years of total experience, with at least 3 years in AI evaluation, GenAI QA, or LLM quality analysis. Strong understanding of the AI/ML model lifecycle, prompt engineering, and RAG or agentic architectures. Experience contributing to AI safety, reliability, or responsible AI initiatives.



Job Detail

  • Job Id: JD5039904
  • Industry: Not mentioned
  • Total Positions: 1
  • Job Type: Full Time
  • Salary: Not mentioned
  • Employment Status: Permanent
  • Job Location: TS, IN, India
  • Education: Not mentioned
  • Experience: Year