AI Quality Analyst
Software Engineering, IT, Data Science, Quality Assurance
Spain
What You Will Be Doing
Architect Automated Evaluation Frameworks: Design, implement, and maintain scalable evaluation pipelines (Evals) for LLMs and agent graphs using modern tooling like LangSmith, DeepEval, Ragas, or Opik.
Curate Ground-Truth Benchmarks: Collaborate with domain experts to build, version, and sanitize robust gold-standard datasets, synthetic evaluation profiles, and edge-case testing matrices reflecting real-world business scenarios.
Own Non-Deterministic Quality Tracking: Define, monitor, and enforce quality KPIs across multi-agent workflows, focusing specifically on tool-calling accuracy, intent-recognition safety, structured output formatting, and context-retrieval (RAG) precision.
Mitigate and Quantify Systemic Risk: Lead rigorous failure and hallucination analyses on production outputs. Implement structured LLM-as-Judge patterns, validation metrics, and guardrail heuristics, and verify that the judge profiles themselves remain free of evaluation bias.
Enforce CI/CD Evaluation Gates: Partner directly with MLOps and Backend Engineering teams to integrate automated testing gates into our deployment pipelines, preventing regressions and behavioral drift from reaching production.
Drive Optimization for Latency & Cost: Regularly analyze the efficiency of prompt templates, few-shot structures, and model selections (e.g., GPT, Claude, LLaMA) to balance throughput, sub-second latency, and compute cost.
Who You Are
A Data-Savvy Automation Advocate: You have strong software engineering fundamentals and hands-on Python experience, so you can script custom evaluation routines and query multi-tenant databases.
An Analytical Thinker with an AI Lens: You understand that testing non-deterministic LLMs requires a different mindset from traditional QA. You have a deep intuition for token behavior, retrieval dynamics, prompt engineering nuances, and failure states.
Radically Autonomous & Collaborative: You do not wait for static technical specifications. You independently coordinate syncs with AI leads, backend engineers, and product stakeholders to identify and patch system vulnerabilities.
Rigorously Quality-Oriented: You have a low ego and high standards for system stability. You are passionate about separating market hype from practical, measurable production metrics.