Notes 12: LLM Evaluation and Testing

Introduction to evaluation

Why test LLMs?

Problems without testing:
❌ Hallucinations (the model makes things up)
❌ Incorrect answers
❌ Bias
❌ Safety issues (profanity, PII leaks)
❌ Degradation in production

Solution: systematic testing
✅ Validate quality
✅ Track regressions
✅ Monitor safety
✅ Optimize cost

Testing lifecycle

Develop
    ↓
Unit Testing (individual components)
    ↓
Integration Testing (components together)
    ↓
Staging Testing (before production)
    ↓
Production Monitoring
    ↓
Alert → Rollback if needed
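
The last two stages (monitoring, then alert and rollback) can be sketched as a rolling success-rate check. This is a minimal illustration, not a production design: the class name `RollingSuccessMonitor`, the window size, and the 0.8 threshold are all assumptions chosen for the example.

```python
from collections import deque


class RollingSuccessMonitor:
    """Tracks the success rate over the last `window` requests (hypothetical helper)."""

    def __init__(self, window: int = 100, min_rate: float = 0.8):
        self.outcomes = deque(maxlen=window)  # oldest outcomes drop off automatically
        self.min_rate = min_rate

    def record(self, success: bool) -> None:
        self.outcomes.append(success)

    @property
    def success_rate(self) -> float:
        # With no data yet, report 1.0 so an empty window never triggers an alert.
        if not self.outcomes:
            return 1.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_rollback(self) -> bool:
        # Only alert once the window is full, to avoid reacting to a handful of requests.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.success_rate < self.min_rate


monitor = RollingSuccessMonitor(window=10, min_rate=0.8)
for ok in [True] * 7 + [False] * 3:
    monitor.record(ok)

print(monitor.success_rate, monitor.should_rollback())
```

In practice "success" has to be defined per product (thumbs-up, task completed, no fallback triggered), and the alert would feed a pager or deployment tool rather than a print statement.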

Types of testing

1. Functional: Does it work as expected?
2. Quality: Are the answers good?
3. Safety: Is it safe?
4. Fairness: Is it fair? (no bias)
5. Performance: Is it fast? Is it cheap?

Types of metrics

Classification of metrics

Metrics:
├─ Automated (fast, cheap)
│  ├─ Lexical (BLEU, ROUGE)
│  ├─ Semantic (Similarity, Embedding-based)
│  └─ Model-based (LLM-as-a-Judge)
│
├─ Human (accurate but slow)
│  ├─ Binary (good/bad)
│  ├─ Rating scale (1-5)
│  └─ Comparative (A vs B)
│
└─ Hybrid (combination)
   └─ Human verification of automated results

Metrics by task

Task	Metrics	Example question
QA	Exact Match, F1, BLEU	Was the question answered correctly?
Summarization	ROUGE-L, BERTScore	Does the summary capture the key ideas?
Classification	Accuracy, Precision, Recall, F1	Is the category correct?
Generation	BLEU, METEOR, ROUGE	Is it close to the reference?
Retrieval	nDCG, MRR, MAP	Are the results relevant?
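
The two QA metrics from the table, Exact Match and token-level F1, are simple enough to write by hand. The sketch below assumes SQuAD-style normalization (lowercasing, stripping punctuation and English articles); the function names are illustrative and other benchmarks normalize differently.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match, one empty as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: a token counts as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(round(token_f1("Paris, France", "Paris"), 3))
```

Exact Match is strict and suits short factual answers; token F1 gives partial credit when the prediction contains the reference plus extra words.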

Automated metrics

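Lexical Metrics

BLEU (Bilingual Evaluation Understudy)

Used for: Machine Translation, Text Generation

How it is computed:

Compares n-grams (sequences of consecutive words)
between the generated text and the reference; the clipped n-gram precisions (usually n = 1..4) are combined by a geometric mean and multiplied by a brevity penalty for short outputs

Example:
Reference: "The cat is on the mat"
Generated: "The cat is on the mat"
BLEU = 1.0 (perfect match)

Generated: "The cat on mat"
BLEU is far below 1.0: two words are missing, only one of three bigrams matches, and no trigram or 4-gram matches, so unsmoothed BLEU-4 is 0

Code: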

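from nltk.translate.bleu_score import corpus_bleu

# One list of tokenized references per hypothesis (a hypothesis may have several references)
references = [[["the", "cat", "is", "on", "the", "mat"]]]
# One tokenized hypothesis per entry
hypothesis = [["the", "cat", "is", "on", "the", "mat"]]

score = corpus_bleu(references, hypothesis)
print(f"BLEU: {score:.3f}")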
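
To make the n-gram comparison concrete, here is a from-scratch sketch of single-reference, unsmoothed BLEU applied to the shortened candidate "the cat on mat". It is meant for intuition only: the helper names are made up for this example, and library implementations add smoothing and multi-reference handling.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped precision: a candidate n-gram counts at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def brevity_penalty(candidate, reference):
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, reference, max_n=4):
    ps = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(ps) == 0:
        return 0.0  # unsmoothed BLEU is zero when any n-gram order has no match
    geo_mean = math.exp(sum(math.log(p) for p in ps) / max_n)
    return brevity_penalty(candidate, reference) * geo_mean


reference = "the cat is on the mat".split()
candidate = "the cat on mat".split()

for n in range(1, 5):
    print(n, modified_precision(candidate, reference, n))
print("BLEU-4:", bleu(candidate, reference))
print("BLEU-2:", round(bleu(candidate, reference, max_n=2), 3))
```

All four unigrams match, but only one of three bigrams does and no trigram or 4-gram matches, so BLEU-4 collapses to zero. This is why sentence-level BLEU is normally used with smoothing, or reported at corpus level.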

Interpretation (a rough rule of thumb; values depend on the task, tokenization, and number of references):

< 0.10: Poor
0.10-0.20: Low
0.20-0.40: Medium
0.40-0.60: Good
> 0.60: Excellent

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

Used for: Summarization

from rouge import Rouge

rouge = Rouge()

reference = "The cat sat on the mat"
summary = "The cat is on the mat"

# get_scores(hypothesis, reference) returns a list with one dict per pair
scores = rouge.get_scores(summary, reference)
print(scores)  # [{'rouge-1': {'r': ..., 'p': ..., 'f': ...}, 'rouge-2': {...}, 'rouge-l': {...}}]

Variants:
- ROUGE-N: N-gram overlap
- ROUGE-L: Longest common subsequence
- ROUGE-W: Weighted longest common subsequence
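
ROUGE-L is built on the longest common subsequence (LCS) between candidate and reference. The sketch below computes it from scratch with whitespace tokenization and a plain F1; these are simplifying assumptions, and library implementations may tokenize differently or weight recall more heavily.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> dict:
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return {"p": 0.0, "r": 0.0, "f": 0.0}
    p, r = lcs / len(cand), lcs / len(ref)
    return {"p": p, "r": r, "f": 2 * p * r / (p + r)}


print(rouge_l("The cat is on the mat", "The cat sat on the mat"))
```

Here the LCS is "the cat on the mat" (5 of 6 tokens on each side). Unlike ROUGE-N, the matched words need not be adjacent, only in the same order.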

Semantic Metrics

BERTScore

Idea: compares meaning, not just words

from bert_score import score

refs = ["The cat is on the mat"]
cands = ["There is a cat on the mat"]

# score() returns tensors with one value per candidate/reference pair
P, R, F1 = score(cands, refs, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")

Advantages:
- Accounts for synonyms
- Better than BLEU for paraphrases
- Correlates better with human judgment than lexical metrics

Cosine Similarity (on embeddings)

from sklearn.metrics.pairwise import cosine_similarity

# Embed the texts. `get_embedding` is a placeholder for your embedding model;
# it is assumed to return a 1-D vector for a string.
ref_embedding = get_embedding("The cat is on the mat")
cand_embedding = get_embedding("There is a cat on the mat")

# cosine_similarity takes 2-D inputs and returns a matrix, hence the wrapping and [0][0]
similarity = cosine_similarity(
    [ref_embedding],
    [cand_embedding]
)[0][0]

print(f"Similarity: {similarity:.3f}")

Applications:
- Semantic similarity search
- RAG relevance scoring
- Duplicate detection
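
Since `get_embedding` is not defined in these notes, here is a self-contained sketch of the same idea applied to relevance ranking. It swaps the real embedding model for a toy bag-of-words vector (`toy_embedding`, a stand-in invented for this example), so it captures word overlap only, not meaning; the cosine formula itself is the standard one.

```python
import math
from collections import Counter


def toy_embedding(text: str) -> Counter:
    """Hypothetical stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    # Dot product over shared keys, divided by the product of the vector norms.
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


documents = [
    "the cat is on the mat",
    "stock prices fell sharply today",
    "a cat sat on a mat",
]
query = toy_embedding("cat on the mat")

# Rank documents by similarity to the query, most similar first.
ranked = sorted(documents, key=lambda d: cosine(query, toy_embedding(d)), reverse=True)
for doc in ranked:
    print(round(cosine(query, toy_embedding(doc)), 3), doc)
```

With a real embedding model the ranking code stays the same; only `toy_embedding` is replaced.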

Model-based Metrics (LLM-as-a-Judge)

Idea: use an LLM to grade the outputs

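import re

def llm_judge(question: str, answer: str, reference: str) -> int:
    """Rate the candidate answer from 1 to 5 using an LLM judge."""
    prompt = f"""
    Question: {question}
    Reference answer: {reference}
    Candidate answer: {answer}

    Rate the candidate answer from 1 to 5:
    1 = Wrong
    3 = Partially correct
    5 = Fully correct

    Reply with a single digit.
    Rating:
    """

    # `llm` is a placeholder for your LLM client; `predict` is assumed to return a string.
    response = llm.predict(prompt)

    # Take the first digit 1-5 in the reply rather than trusting the last token.
    match = re.search(r"[1-5]", response)
    if match is None:
        raise ValueError(f"Could not parse a rating from: {response!r}")
    return int(match.group())

score = llm_judge(
    question="What is 2+2?",
    answer="Four",
    reference="The answer is 4",
)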

Advantages:
- Close to human judgment
- Flexible (the criteria are easy to change)
- Faster than human evaluation

Disadvantages:
- Expensive (API calls)
- Can be biased
- Hard to debug
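
One commonly reported form of judge bias is position bias: in pairwise comparisons the judge may favour whichever answer is shown first. A simple mitigation is to ask twice with the order swapped and accept a winner only when both orders agree. The sketch below assumes a hypothetical judge callable returning "first", "second", or "tie"; the two toy judges exist only to demonstrate the check.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def debiased_compare(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge twice with the answers swapped; accept a winner only if both orders agree."""
    verdict_ab = judge(question, answer_a, answer_b)
    verdict_ba = judge(question, answer_b, answer_a)

    if verdict_ab == "first" and verdict_ba == "second":
        return "A"
    if verdict_ab == "second" and verdict_ba == "first":
        return "B"
    return "tie"  # disagreement between orders is treated as no clear preference


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with extreme position bias."""
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    """Toy judge that is order-independent (it simply prefers the longer answer)."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


print(debiased_compare(always_first, "q", "x", "y"))
print(debiased_compare(prefers_longer, "q", "short", "a longer answer"))
```

The cost is two judge calls per comparison, and the check only addresses position bias, not other biases such as a preference for longer answers.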

Human evaluation

Types of human evaluation

Binary Rating (Good/Bad)

The annotator looks at the answer and picks:
☑ Good (1)
☐ Bad (0)

Fast, simple, cheap
But loses detail

Likert Scale (1-5)

How good is this answer?
1 - Very bad
2 - Bad
3 - Neutral
4 - Good
5 - Excellent

More informative
But ratings are more subjective, so annotators agree less often

Comparative (A vs B)

Which answer is better?
A vs B
☑ A
☐ B
☐ Same

Often used to collect preference data for RLHF
Less subjective than absolute ratings
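
Comparative votes are usually aggregated into win rates. A minimal sketch, assuming each prompt was judged by several annotators and that a prompt counts as a win only when one label gets a strict majority; the data and function names are invented for illustration.

```python
from collections import Counter

# Each inner list holds the votes of several annotators for one prompt (toy data).
votes_per_prompt = [
    ["A", "A", "B"],
    ["B", "B", "B"],
    ["A", "tie", "A"],
    ["A", "B", "tie"],
]


def majority_label(votes):
    """Return the most common vote, or "tie" when no label has a strict majority."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else "tie"


def win_rates(votes_per_prompt):
    outcomes = Counter(majority_label(v) for v in votes_per_prompt)
    total = len(votes_per_prompt)
    return {label: outcomes[label] / total for label in ("A", "B", "tie")}


print(win_rates(votes_per_prompt))
```

With only a few prompts the win rate is noisy, so it is worth reporting the number of comparisons alongside it.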

Human evaluation process

1. Define the criteria
   - Correctness
   - Completeness
   - Clarity
   - Safety

2. Write the guidelines
   - Examples
   - Edge cases
   - What counts as good/bad

3. Recruit annotators
   - At least 2 (to measure agreement)
   - Preferably 3-5

4. Annotate
   - 100-500 examples are usually enough

5. Compute metrics
   - Inter-rater agreement (Cohen's kappa for two annotators, Fleiss' kappa for more)
   - Distribution of ratings

6. Analyze
   - Where the model fails
   - Which improvements would help
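
The inter-rater agreement from step 5 can be computed by hand for two annotators. Cohen's kappa corrects raw agreement for the agreement expected by chance: kappa = (observed - expected) / (1 - expected). The ratings below are made-up binary labels used only to illustrate the calculation.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement if each rater labelled at random with their own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label everywhere
    return (observed - expected) / (1 - expected)


rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_b = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]

print(round(cohens_kappa(rater_a, rater_b), 3))
```

Here raw agreement is 80%, but because both raters say "good" 70% of the time, chance agreement is already 58%, which leaves a kappa of about 0.52. Low kappa usually means the guidelines from step 2 need tightening before more data is annotated.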

Tools: Scale AI, Argilla

Scale AI:
- Fully managed service
- About $0.50-2 per annotation
- Fast and high quality

Argilla:
- Open source
- Can be run locally
- Free, but you supply the annotators

Testing in production

A/B Testing

Compare two versions:

User 1 → Model A (Control)
User 2 → Model B (Treatment)
User 3 → Model A
...

After a week:
Model A: 85% success rate
Model B: 88% success rate

Statistical test → p-value = 0.02 (significant at the 0.05 level; a 3-point gap only reaches this with a large enough sample per variant)
Deploy Model B

Implementation:

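import hashlib
from scipy import stats

# 1. Deterministic assignment: the same user always gets the same variant.
#    The built-in hash() is salted per process for strings, so use a stable hash.
def assign_variant(user_id):
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    if int(digest, 16) % 2 == 0:
        return "control"
    else:
        return "treatment"

# 2. Collect metrics (illustrative numbers)
results = {
    "control": {"success": 85, "total": 100},
    "treatment": {"success": 88, "total": 100}
}

# 3. Statistical test: chi-squared test on the 2x2 table of successes and failures
contingency_table = [
    [results["control"]["success"], results["control"]["total"] - results["control"]["success"]],
    [results["treatment"]["success"], results["treatment"]["total"] - results["treatment"]["success"]]
]

chi2, p_value, _, _ = stats.chi2_contingency(contingency_table)

# With only 100 users per variant, 85% vs 88% is not significant;
# the same rates need a much larger sample to reach p < 0.05.
if p_value < 0.05:
    print("Statistically significant!")
else:
    print("Not significant")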
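
Whether a gap such as 85% vs 88% can be detected at all depends on the sample size. The sketch below uses the standard normal-approximation formula for comparing two proportions to estimate the users needed per variant; the significance level (0.05, two-sided) and power (0.8) are conventional defaults assumed here, not values from the notes.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(p_control: float, p_treatment: float,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per variant for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # critical value for the desired power
    p_bar = (p_control + p_treatment) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_control * (1 - p_control) + p_treatment * (1 - p_treatment))
    ) ** 2
    return math.ceil(numerator / (p_control - p_treatment) ** 2)


n = sample_size_per_variant(0.85, 0.88)
print(f"Users needed per variant: {n}")
```

For a 3-point difference at these success rates the answer is on the order of two thousand users per variant, far more than the 100 used in the illustrative numbers above. Larger effects need far fewer users.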

Regression Testing

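Make sure new changes have not broken existing behavior:

# Test suite
test_cases = [
    {"input": "What is 2+2?", "expected": ["4", "four"]},
    {"input": "What is AI?", "expected_keywords": ["artificial", "intelligence"]},
    {"input": "Hello", "should_not_contain": ["error", "failed"]},
]

def test_regression(model, test_cases):
    failures = []
    for test in test_cases:
        # `model.predict` is a placeholder for your model call; compare case-insensitively
        output = model.predict(test["input"]).lower()

        # Each check applies only when its key is present in the test case
        if "expected" in test and not any(exp.lower() in output for exp in test["expected"]):
            failures.append(f"{test['input']}: none of {test['expected']} found")

        missing = [kw for kw in test.get("expected_keywords", []) if kw.lower() not in output]
        if missing:
            failures.append(f"{test['input']}: missing keywords {missing}")

        forbidden = [s for s in test.get("should_not_contain", []) if s.lower() in output]
        if forbidden:
            failures.append(f"{test['input']}: contains forbidden {forbidden}")

    return len(failures) == 0, failures

passed, failures = test_regression(new_model, test_cases)
if not passed:
    print(f"Regression detected: {failures}")
    # Rollback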

Hallucination detection

What is a hallucination?

Definition: the model fabricates information instead of admitting that it does not know

Examples:

Q: Who created n8n?
Hallucination: "Ivan Petrov in 2010" (false!)
Correct answer: "Jan Oberhauser in 2019"

Q: What is the capital of Australia?
Hallucination: "Sydney" (false!)
Correct answer: "Canberra"

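Detection techniques

1. Fact Checking

from typing import List
from sklearn.metrics.pairwise import cosine_similarity

def check_facts(claim: str, knowledge_base: List[str], threshold: float = 0.8) -> bool:
    """Return True if some knowledge-base entry is semantically close to the claim."""

    # Semantic search over the knowledge base.
    # `get_embedding` is a placeholder that returns a 1-D vector for a text.
    claim_embedding = get_embedding(claim)

    for fact in knowledge_base:
        fact_embedding = get_embedding(fact)
        # cosine_similarity expects 2-D inputs and returns a matrix
        similarity = cosine_similarity([claim_embedding], [fact_embedding])[0][0]

        if similarity > threshold:
            return True  # Found a similar (likely supporting) fact

    return False  # No supporting fact found

# Use with RAG (`retrieve_relevant_documents`, `llm.generate` and `extract_facts` are placeholders)
context = retrieve_relevant_documents(question)
answer = llm.generate(question, context)

# Check the facts in the answer.
# High similarity does not prove support: a contradicting sentence can also score high.
facts_in_answer = extract_facts(answer)
hallucinations = [f for f in facts_in_answer if not check_facts(f, context)]

if hallucinations:
    print(f"Hallucinations detected: {hallucinations}")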

2. Confidence Scoring

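from collections import Counter

def generate_with_confidence(question: str, n_samples: int = 5) -> tuple[str, float]:
    """Generate an answer and estimate confidence from agreement between samples."""

    # Sample several answers at a non-zero temperature
    # (`llm.generate` is a placeholder for your model call)
    answers = [llm.generate(question, temperature=0.8) for _ in range(n_samples)]

    # If the samples agree → high confidence
    # If they diverge → low confidence (possible hallucination)
    # Exact matching only suits short answers; compare long answers semantically instead.
    normalized = [a.strip().lower() for a in answers]
    top_answer, top_count = Counter(normalized).most_common(1)[0]

    # Share of samples that agree with the most common answer: 1.0 when all agree
    confidence = top_count / n_samples

    # Return the majority answer in its original form
    final_answer = answers[normalized.index(top_answer)]

    return final_answer, confidence

answer, confidence = generate_with_confidence("What is n8n?")
if confidence < 0.5:
    print("Low confidence - possible hallucination")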

3. Uncertainty Estimation

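import re

def estimate_uncertainty(question: str) -> float:
    """
    Ask the LLM to estimate its own confidence
    (self-reported confidence is often poorly calibrated)
    """

    prompt = f"""
    Question: {question}

    How confident are you in answering this question?
    Scale: 0 (very uncertain) to 1 (very certain)
    Reply with a single number.

    Confidence:
    """

    # `llm` is a placeholder for your LLM client
    response = llm.predict(prompt)

    # Pull the first number out of the reply and clamp it to [0, 1]
    match = re.search(r"\d*\.?\d+", response)
    if match is None:
        raise ValueError(f"Could not parse a confidence from: {response!r}")
    confidence = min(1.0, max(0.0, float(match.group())))

    return confidence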

Bias and fairness testing

Types of bias

Gender bias:
- "A developer is known for his work" vs "A developer is known for her work"
- The model may respond better to the first

Racial bias:
- Names associated with different racial or ethnic groups → different answers

Language bias:
- May work better in English than in Russian

Age bias:
- May be biased toward or against younger or older people

Fairness Testing

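from sklearn.metrics.pairwise import cosine_similarity

def test_gender_fairness():
    """Check whether the embedding model treats gender-swapped sentences alike."""

    test_cases = [
        ("A doctor went to his office", "A doctor went to her office"),
        ("The engineer fixed the car himself", "The engineer fixed the car herself"),
    ]

    for text1, text2 in test_cases:
        # `get_embedding` is a placeholder for your embedding model (returns a 1-D vector)
        embedding1 = get_embedding(text1)
        embedding2 = get_embedding(text2)

        # cosine_similarity expects 2-D inputs and returns a matrix
        similarity = cosine_similarity([embedding1], [embedding2])[0][0]

        # Without bias the two should be nearly identical (0.9 is an arbitrary threshold)
        if similarity < 0.9:
            print("Potential gender bias detected")
            print(f"  '{text1}' vs '{text2}'")
            print(f"  Similarity: {similarity:.3f}")

test_gender_fairness()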
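
The embedding check above looks at how inputs are represented, not at what the model outputs. A complementary approach is counterfactual testing: send prompts that differ only in a name or pronoun and compare the scores the model returns. In this sketch `score_fn`, the templates, and the name groups are all hypothetical; the two stub scorers exist only to show the check firing and passing.

```python
from typing import Callable, Dict, List

TEMPLATES = [
    "{name} applied for the senior engineer role. Rate the candidate from 1 to 10.",
    "{name} asked for a loan. Rate the creditworthiness from 1 to 10.",
]
# Hypothetical name groups; a real test needs many names per group.
GROUPS = {"group_a": ["John", "Michael"], "group_b": ["Maria", "Anna"]}


def mean_score_gap(score_fn: Callable[[str], float],
                   templates: List[str],
                   groups: Dict[str, List[str]]) -> float:
    """Largest difference between per-group mean scores across counterfactual prompts."""
    means = {}
    for group, names in groups.items():
        scores = [score_fn(t.format(name=n)) for t in templates for n in names]
        means[group] = sum(scores) / len(scores)
    return max(means.values()) - min(means.values())


def biased_stub(prompt: str) -> float:
    """Toy scorer that deliberately favours one name."""
    return 8.0 if "John" in prompt else 6.0


def fair_stub(prompt: str) -> float:
    """Toy scorer that ignores the name entirely."""
    return 7.0


print(mean_score_gap(biased_stub, TEMPLATES, GROUPS))
print(mean_score_gap(fair_stub, TEMPLATES, GROUPS))
```

In a real test `score_fn` would call the model and parse its rating, and the gap would be compared against a tolerance that accounts for sampling noise.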

Fairness tools

AI Fairness 360 (IBM):
- Comprehensive toolkit
- Supports many algorithms
- Open source

FairTest:
- Automatic bias detection
- Can find hidden biases

Tools and frameworks

DeepEval

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4"
)

test_case = LLMTestCase(
    input="What is AI?",
    actual_output="AI is artificial intelligence",
    expected_output="AI stands for artificial intelligence"
)

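# measure() computes the score; is_successful() compares it with the threshold
# (API as in recent DeepEval releases; check the docs for your version)
metric.measure(test_case)
assert metric.is_successful(), metric.reason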

Ragas (RAG Assessment)

from ragas import evaluate
from ragas.metrics import context_precision, faithfulness

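# `ds` is your evaluation dataset; `llm` and `embeddings` are the judge model
# and embedding model (all placeholders defined elsewhere)
results = evaluate(
    dataset=ds,
    metrics=[context_precision, faithfulness],
    llm=llm,
    embeddings=embeddings
)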

print(results)

MLflow

import mlflow

mlflow.set_experiment("LLM Evaluation")

with mlflow.start_run():
    # Log metrics
    mlflow.log_metric("accuracy", 0.92)
    mlflow.log_metric("f1_score", 0.88)

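    # Log model: logging is flavor-specific (mlflow.sklearn, mlflow.pyfunc, ...).
    # Here `model` is assumed to be an mlflow.pyfunc.PythonModel instance.
    mlflow.pyfunc.log_model(artifact_path="model", python_model=model)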

    # Log params
    mlflow.log_params({"temperature": 0.7})

Practical examples

Example 1: QA System Evaluation

import json

from bert_score import score
from rouge import Rouge

def evaluate_qa_system(test_data, model):
    """
    Evaluate a Q&A system on a test set
    """

    rouge = Rouge()
    exact_matches = 0
    rouge_l_f = []
    predictions, references = [], []

    for example in test_data:
        question = example["question"]
        reference = example["answer"]

        # Generate answer (`model.answer` is a placeholder for your system)
        predicted = model.answer(question)
        predictions.append(predicted)
        references.append(reference)

        # Exact match (case- and whitespace-insensitive)
        if predicted.strip().lower() == reference.strip().lower():
            exact_matches += 1

        # ROUGE-L F-score
        scores = rouge.get_scores(predicted, reference)[0]
        rouge_l_f.append(scores["rouge-l"]["f"])

    # BERTScore once over the whole batch: calling score() per example reloads the model each time
    _, _, F1 = score(predictions, references, lang="en")

    # Aggregate
    results = {
        "exact_match": exact_matches / len(test_data),
        "rouge_l_f": sum(rouge_l_f) / len(rouge_l_f),
        "bert_score_f": F1.mean().item()
    }

    return results

# Usage (`test_data` is a list of {"question": ..., "answer": ...} dicts)
results = evaluate_qa_system(test_data, my_model)
print(json.dumps(results, indent=2))

Example 2: Hallucination Detection

from typing import List
from sklearn.metrics.pairwise import cosine_similarity

def detect_hallucinations(answer: str, context: List[str], threshold: float = 0.7):
    """
    Flag facts in the answer that no context passage is similar to
    """

    # Extract facts from the answer (`extract_facts` and `get_embedding` are placeholders)
    facts = extract_facts(answer)

    # Embed each context passage once instead of once per fact
    context_embeddings = [get_embedding(ctx) for ctx in context]

    hallucinations = []
    for fact in facts:
        # Check whether the fact is supported by the context
        fact_embedding = get_embedding(fact)

        max_similarity = 0.0
        for ctx_embedding in context_embeddings:
            # cosine_similarity expects 2-D inputs and returns a matrix
            sim = cosine_similarity([fact_embedding], [ctx_embedding])[0][0]
            max_similarity = max(max_similarity, float(sim))

        if max_similarity < threshold:
            hallucinations.append({
                "fact": fact,
                "max_similarity": max_similarity
            })

    return hallucinations

# Usage
context = ["n8n is an automation platform", "It was created in 2019"]
answer = "n8n was created by Jan Oberhauser in 2019"

hallucinations = detect_hallucinations(answer, context)
print(f"Detected {len(hallucinations)} hallucinations")

Best practices

Do's ✅

1. Use multiple metrics
   - Automated + human evaluation together
2. Test on a diverse dataset
   - Include edge cases, different languages, etc.
3. Monitor in production
   - Track metrics continuously
   - Alert on degradation
4. Version your test sets
   tests/
   ├─ v1.0.json
   ├─ v1.1.json (edge cases added)
   └─ v2.0.json (new domain)
5. Document the criteria
   - What makes an answer good?
   - Examples of good and bad answers

Don'ts ❌

1. Don't rely on a single metric
   - BLEU on its own is not enough
2. Don't forget statistical significance
   - The usual threshold is p-value < 0.05
3. Don't ignore human feedback
   - Automated metrics are imperfect
4. Don't keep reusing an old test set
   - The model may overfit to it

Evaluation checklist

Preparation:
☑️ Test set created (100+ examples)
☑️ Evaluation metrics chosen
☑️ Baseline established
☑️ Annotator instructions written

Evaluation:
☑️ Automated metrics computed
☑️ Human evaluation completed
☑️ Inter-rater agreement checked
☑️ Results analyzed

Monitoring:
☑️ Metrics logged
☑️ Alerts configured for regressions
☑️ Dashboard created
☑️ Regular reviews scheduled

Created: December 2025
Version: 1.0
Author: Pavel
Applies to: LLM Testing, Quality Assurance, Production Monitoring