Study Notes 14: Reasoning Models - o1, o3, DeepSeek-R1

Introduction to Reasoning Models

What Are Reasoning Models?

Definition: Models that work through a problem step by step, solving complex tasks through an explicit reasoning process before giving the final answer.

How they differ from conventional LLMs:

Conventional LLM (GPT-4):
Question → Direct Answer
(fast, but may get hard problems wrong)

Reasoning Model (o1):
Question → Think Step-by-Step → Answer
(slower, but more accurate on hard problems)

Why do we need reasoning models?

Tasks where reasoning matters:
✅ Math (proofs, calculations)
✅ Logic (deduction, analysis)
✅ Code (complex architecture)
✅ Science (experimental design)
✅ Strategy (planning)

Tasks where a conventional LLM is enough:
❌ Simple Q&A
❌ Copywriting
❌ Classification
❌ Text generation
❌ Quick answers

Timeline

2022: GPT-3.5 (no explicit reasoning)
    ↓
2023: GPT-4 (better reasoning, but still answers directly)
    ↓
2024-Q3: o1-preview (explicit reasoning, built-in chain-of-thought)
    ↓
2024-Q4: o1 (full release, improved version); o3 announced
    ↓
2025: DeepSeek-R1 (open-weight reasoning, January)
         o3-mini (cheaper reasoning)
         o3 (highest reasoning level)

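History

Chain-of-Thought (CoT) - 2022

Idea: ask the model to think step by step

❌ Without CoT:
Q: "Masha has 3 apples, buys 5 more, then gives away 2. How many does she have?"
A: "6"
(direct answer, no intermediate steps to check)

✅ With CoT:
Q: "Masha has 3 apples, buys 5 more, then gives away 2. How many does she have? Think step by step."
A: "Masha starts with 3 apples. She buys 5 more: 3+5=8.
    She gives away 2: 8-2=6. Answer: 6"

Result: more accurate answers, even on conventional models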
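
A minimal sketch of the difference between the two prompts above. The function name and the exact wording of the instruction are assumptions made for illustration; the notes only say that the model is asked to think step by step.

```python
def build_prompt(question: str, use_cot: bool = False) -> str:
    """Return the question as-is, or with a step-by-step instruction appended."""
    if not use_cot:
        return question
    # The exact wording is an assumption; any clear "reason first" instruction should work similarly
    return f"{question}\nThink step by step, then give the final answer on its own line."


question = "Masha has 3 apples, buys 5 more, then gives away 2. How many does she have?"
print(build_prompt(question))
print(build_prompt(question, use_cot=True))
```

The same question is sent in both cases; only the added instruction changes how much intermediate work the model writes out.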

Scaling Laws and Reasoning - 2023

Finding: more computation time at inference = better reasoning

Model
    ↓
More tokens for thinking
    ↓
Better final result
    ↓
A smaller model with more thinking can even match a larger one

OpenAI o1-preview - September 2024

Breakthrough: an explicit reasoning process inside the model

o1 capabilities:
- Thinks before it answers
- Can correct its own mistakes
- Better at hard math
- Better at hard code

OpenAI o1

o1 Characteristics

Model: o1-preview (publicly available)

Parameters:
- Reasoning: explicit thinking process
- Context: 128K tokens
- Output: up to 32,768 tokens, shared between hidden reasoning and the visible answer
- Speed: slower (10-100 s)
- Price: higher ($15/1M input, $60/1M output)
- Accuracy: very high on hard problems

When o1 Pays Off

STEM tasks (rough estimates):

Math: about +30% better than GPT-4
- Calculus
- Linear algebra
- Probability
- Physics problems

Coding: about +20% better than GPT-4
- Algorithm design
- Complex debugging
- Architecture decisions

Logic tasks (rough estimates):

Logic puzzles: about +50% better
Reasoning tasks: about +40% better

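API Usage

import openai

# The key is a placeholder; in practice, prefer the OPENAI_API_KEY environment variable
client = openai.OpenAI(api_key="sk-...")

# o1 reasons internally before answering
response = client.chat.completions.create(
    model="o1-preview",
    messages=[
        {
            "role": "user",
            "content": "Prove that there are infinitely many prime numbers"
        }
    ]
)

# The raw chain of thought is not returned by the API; only the final answer is.
# The number of hidden reasoning tokens is reported in the usage details.
answer = response.choices[0].message.content
details = response.usage.completion_tokens_details
reasoning_tokens = details.reasoning_tokens if details else None

print(f"Reasoning tokens: {reasoning_tokens}")
print(f"Answer: {answer}")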

o1 Limitations

❌ Not supported (o1-preview at launch):
- System prompts
- Functions/tools
- Vision (cannot read images)
- Temperature (fixed, cannot be changed)

❌ Poor fit for:
- Simple Q&A (why think so long?)
- Creative tasks (reasoning not needed)
- Quick answers (too slow)
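
One way to live with these restrictions is to build the request differently per model. The sketch below is an assumption-laden illustration: `build_request` and `RESTRICTED_MODELS` are hypothetical names, and the set of restricted models simply encodes the o1-preview limits listed above. It only prepares keyword arguments and makes no API call.

```python
from typing import Optional

# Assumption: models listed here reject system messages and sampling parameters,
# as o1-preview did at launch according to the notes above.
RESTRICTED_MODELS = {"o1-preview"}


def build_request(model: str, user_prompt: str,
                  system_prompt: Optional[str] = None,
                  temperature: Optional[float] = None) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    if model in RESTRICTED_MODELS:
        # Fold the system instructions into the user message and drop temperature
        content = f"{system_prompt}\n\n{user_prompt}" if system_prompt else user_prompt
        return {"model": model, "messages": [{"role": "user", "content": content}]}

    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    request = {"model": model, "messages": messages}
    if temperature is not None:
        request["temperature"] = temperature
    return request


print(build_request("o1-preview", "Prove that 17 is prime", system_prompt="Be concise"))
```

The result can be passed as `client.chat.completions.create(**request)`.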

OpenAI o3

What's Improved in o3?

Status: announced December 2024, released April 2025

Improvements:
✅ Even better at math (96.7% on AIME 2024 in OpenAI's announcement)
✅ Even better at coding
✅ Vision support (images)
✅ Tools/functions support
✅ More flexible (the reasoning level can be tuned)
✅ Faster than o1
✅ Cheaper than o1 (expected)

Reasoning Levels

Key feature: adjustable reasoning effort (the API parameter is reasoning_effort):

response = client.chat.completions.create(
    model="o3",
    messages=[...],
    reasoning_effort="low"   # or "medium", "high"
)

Low (fast, cheap):
- Basic reasoning
- For simple tasks
- ~10 s

Medium (balanced):
- Moderate reasoning
- For most tasks
- ~30 s

High (best quality):
- Deep reasoning
- For very hard tasks
- ~2 minutes
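
A small sketch of how the level could be chosen in code. The latencies are the approximate figures from the list above, not measurements. The 1-10 difficulty scale, its thresholds, and the function name are assumptions made for illustration.

```python
# Approximate latencies from the notes above, in seconds (not measured values)
TYPICAL_LATENCY_S = {"low": 10, "medium": 30, "high": 120}


def choose_effort(difficulty: int, max_wait_s: float) -> str:
    """Pick a reasoning level for a task rated 1-10, within a latency budget."""
    if difficulty <= 3:
        wanted = "low"
    elif difficulty <= 7:
        wanted = "medium"
    else:
        wanted = "high"

    # Step down until the typical latency fits the budget; "low" is the floor
    order = ["high", "medium", "low"]
    for level in order[order.index(wanted):]:
        if TYPICAL_LATENCY_S[level] <= max_wait_s:
            return level
    return "low"


print(choose_effort(difficulty=9, max_wait_s=60))
```

A hard task with a tight latency budget is downgraded instead of making the user wait two minutes.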

Expected Pricing o3

Rough price estimates (unofficial):

o3 Low: $10/1M input, $40/1M output
o3 Medium: $20/1M input, $80/1M output
o3 High: $50/1M input, $200/1M output

Comparison (per 1M tokens):
GPT-4 Turbo: $10 input / $30 output
o1: $15 input / $60 output
o3 High: $50 input / $200 output (estimate)

Reasoning tokens are billed as output tokens, so a higher level also costs more through the extra tokens it generates.

But quality: o3 High > o1 > GPT-4

DeepSeek-R1

Open Reasoning Model

Status: released January 2025, open weights

Characteristics:
- Sizes: full model 671B (MoE); distilled versions 1.5B, 7B, 8B, 14B, 32B, 70B
- License: MIT (free to use)
- Reasoning: similar to o1 but with less compute
- Local deployment: yes
- Fine-tuning: possible

Performance

Benchmarks (vs other models, approximate scores):

             Math    Code    Logic
GPT-4        80%     75%     80%
o1           95%     90%     95%
DeepSeek-R1  92%     88%     90%
Llama 3.1    70%     72%     65%

DeepSeek-R1: roughly 95-98% of o1's scores in the table above
and in addition:
- Open weights (free)
- Can be deployed locally
- Can be fine-tuned

Using DeepSeek-R1

Locally:

# Download from Hugging Face
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-14B

# Or pull via Ollama
ollama pull deepseek-r1

# Run
ollama run deepseek-r1

Code:

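from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# torch_dtype="auto" keeps the checkpoint's dtype; device_map="auto" needs the accelerate package
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

# The distilled checkpoints are chat models, so wrap the prompt in the chat template
messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)

# max_new_tokens bounds reasoning + answer (max_length would also count the prompt)
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))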
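
DeepSeek-R1 writes its reasoning between <think> and </think> tags before the final answer. A minimal sketch for separating the two parts of the decoded text follows; the function name is an assumption, and the sample string is made up for illustration.

```python
from typing import Tuple


def split_reasoning(text: str) -> Tuple[str, str]:
    """Split R1-style output into (reasoning, answer).

    Handles output with or without the opening <think> tag, since some chat
    templates already place it in the prompt.
    """
    head, sep, tail = text.partition("</think>")
    if not sep:
        # No closing tag: treat the whole text as the answer
        return "", text.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


sample = "<think>Assume sqrt(2) = p/q in lowest terms...</think>sqrt(2) is irrational."
reasoning, answer = split_reasoning(sample)
print(answer)
```

Keeping the reasoning separate makes it easy to log it for review while showing users only the answer.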

Sizes and Requirements (approximate)

DeepSeek-R1-Distill-Qwen-1.5B:
- VRAM: 4GB
- Speed: Fast
- Quality: Good (~85% o1)

DeepSeek-R1-Distill-Qwen-7B:
- VRAM: 16GB
- Speed: Medium
- Quality: Very good (~90% o1)

DeepSeek-R1-Distill-Qwen-14B:
- VRAM: 24-32GB (with quantization: 12GB)
- Speed: Slow
- Quality: Excellent (~92% o1)

DeepSeek-R1-Distill-Qwen-32B:
- VRAM: 80GB+ (or 24GB with 4-bit)
- Speed: Very slow
- Quality: Near-o1 (~94%)
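
The VRAM figures above roughly follow from parameter count times bytes per parameter. The sketch below is a rule of thumb only: the 1.2 overhead factor is an assumption, and real usage depends on context length, batch size, and the runtime.

```python
def estimate_weight_vram_gb(params_billion: float, bits_per_param: int = 16,
                            overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight size times a fudge factor.

    overhead=1.2 is an assumed allowance for activations and KV cache.
    """
    bytes_per_param = bits_per_param / 8
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes per GB
    return round(weights_gb * overhead, 1)


for size in (1.5, 7, 14, 32):
    fp16 = estimate_weight_vram_gb(size)
    int4 = estimate_weight_vram_gb(size, bits_per_param=4)
    print(f"{size}B: ~{fp16} GB at 16-bit, ~{int4} GB at 4-bit")
```

Under these assumptions the 7B model lands near 16 GB at 16-bit and the 32B model near 20 GB at 4-bit, in line with the list above.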

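How Reasoning Models Work

Architecture

Input Question
    ↓
Tokenize + Embed
    ↓
Reasoning tokens, phase 1 (think about approach)
    ↓
Reasoning tokens, phase 2 (work through problem)
    ↓
Reasoning tokens, phase 3 (verify solution)
    ↓
Token generation (output answer)

Note: the phases are not separate network layers. The same model generates the reasoning as ordinary tokens before it generates the answer.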

Reasoning Token Budget

Idea: allocate a large "thinking budget"

Conventional LLM:
Input: 100 tokens
Output: 100 tokens
Total: 200 tokens

Reasoning Model:
Input: 100 tokens
Reasoning: 5000-20000 tokens (thinking!)
Output: 100 tokens
Total: 5200-20200 tokens

More computation ≈ better result
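
To see what the thinking budget does to the bill, here is a hedged sketch of the arithmetic. It assumes that reasoning tokens are billed at the output rate and uses the per-1M prices quoted elsewhere in these notes. The function name is hypothetical.

```python
def request_cost_usd(input_tokens: int, reasoning_tokens: int, answer_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request; reasoning tokens are billed at the output rate."""
    billed_output = reasoning_tokens + answer_tokens
    return (input_tokens * input_price_per_m + billed_output * output_price_per_m) / 1_000_000


# Prices per 1M tokens are the ones quoted in these notes
plain = request_cost_usd(100, 0, 100, input_price_per_m=10, output_price_per_m=30)
reasoning = request_cost_usd(100, 5000, 100, input_price_per_m=15, output_price_per_m=60)
print(f"GPT-4 Turbo: ${plain:.4f}, o1: ${reasoning:.4f}")
```

With 5000 reasoning tokens, almost all of the cost of the reasoning request comes from tokens the user never sees.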

Scaling Laws for Reasoning

Empirical findings (rough, task-dependent figures):
- Doubling reasoning compute → roughly +5-10% accuracy
- More effective than simply adding parameters
- Sweet spot for o1: ~10K-50K reasoning tokens
- Sweet spot for o3: depends on reasoning_effort
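
The doubling rule can be written as a toy log-linear model. This is only an illustration of the arithmetic under the notes' rough figure, not a fitted scaling law. The function name and the default gain of 0.05 per doubling are assumptions.

```python
import math


def projected_accuracy(base_accuracy: float, compute_multiplier: float,
                       gain_per_doubling: float = 0.05) -> float:
    """Toy log-linear model: each doubling of reasoning compute adds a fixed gain."""
    if compute_multiplier <= 0:
        raise ValueError("compute_multiplier must be positive")
    doublings = math.log2(compute_multiplier)
    # Accuracy cannot exceed 100%, so cap the projection
    return min(1.0, base_accuracy + gain_per_doubling * doublings)


for multiplier in (1, 2, 4, 8):
    print(multiplier, round(projected_accuracy(0.60, multiplier), 2))
```

The cap matters in practice: gains flatten as accuracy approaches 100%, so extra thinking tokens eventually stop paying for themselves.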

When to Use Reasoning Models

Decision Matrix

                    Complexity
                        ↑
                        |
    o3 High ●           |
                        |
    o1 ●                |
                        |  
GPT-4 ●───────────────┼─────────→ Importance
                        |
        Low   Medium   High

Comparison Table

Task                 Model        Reason
Simple Q&A           GPT-4        Heavy reasoning is wasted
Classification       GPT-4        No reasoning needed
Copywriting          GPT-4        Creativity, not logic
Simple math          GPT-4        Good enough
Hard math            o1/o3        Needs reasoning
Algorithm design     o1/o3        Needs deduction
Bug hunting          o1/o3        Needs analysis
Proofs               o3 High      Maximum reasoning
Scientific research  o3 High      Complex hypotheses
Local deployment     DeepSeek-R1  Open source

Cost-Benefit Analysis

Example: coding task

GPT-4:
- Time: 5 sec
- Cost: $0.001
- Quality: 70%
- Verdict: fast, cheap, mediocre

o1:
- Time: 30 sec
- Cost: $0.01
- Quality: 95%
- Verdict: slow, expensive, excellent

DeepSeek-R1 (local):
- Time: 120 sec
- Cost: $0 in API fees (hardware and electricity only)
- Quality: 92%
- Verdict: slow, free, excellent

Cost vs Quality

Price comparison (per 1000 requests)

Task: Math problem solving (5 queries, 200 tokens each)

GPT-4:
- Cost: $0.001 × 1000 = $1.00
- Quality: 70%
- Total: $1.00 / 0.70 = $1.43 per 1000 successful answers

o1:
- Cost: $0.05 × 1000 = $50
- Quality: 95%
- Total: $50 / 0.95 = $52.63 per 1000 successful answers

DeepSeek-R1 (local RTX 4090):
- Cost: ~$0.50 (electricity)
- Quality: 92%
- Total: $0.50 / 0.92 = $0.54 per 1000 successful answers

Conclusion:
- For production: o1 costs more but is more reliable
- For cost-sensitive work: DeepSeek-R1 is the better choice
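
The Total lines above divide the cost of 1000 requests by the success rate. A short sketch of that calculation, using the figures from the comparison (the function name is an assumption):

```python
def cost_per_1000_correct(cost_per_1000_requests: float, success_rate: float) -> float:
    """Spend needed to obtain 1000 correct answers at the given success rate."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_1000_requests / success_rate


# Costs and quality figures are the ones from the comparison above
models = [("GPT-4", 1.00, 0.70), ("o1", 50.00, 0.95), ("DeepSeek-R1", 0.50, 0.92)]
for name, cost, quality in models:
    print(f"{name}: ${cost_per_1000_correct(cost, quality):.2f}")
```

This metric counts only direct spend. It ignores the downstream cost of the wrong answers, which the next section addresses.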

When to Pay for o1/o3?

ROI calculation:

Cost of error:
- Bad code suggestion: Wasted developer time ($100+)
- Wrong math: Lost calculation ($1000+)
- Scientific error: Retracted paper (reputation)

Value of correctness:
- Math homework: High (need right answer)
- Code review: Very high (bugs are expensive)
- Research: Critical (credibility matters)

If error_cost > reasoning_cost × 10:
    → Use o1/o3
else:
    → Use GPT-4 or DeepSeek locally
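
The rule of thumb above translates directly into a small function. The factor of 10 comes from the notes; the function name and the dollar amounts in the usage lines are illustrative assumptions.

```python
def should_use_reasoning_model(error_cost_usd: float, reasoning_cost_usd: float,
                               factor: float = 10.0) -> bool:
    """Apply the rule of thumb: pay for o1/o3 when an error costs far more than the call."""
    return error_cost_usd > reasoning_cost_usd * factor


# Illustrative inputs: $100 of wasted developer time vs a $0.05 reasoning call
print(should_use_reasoning_model(error_cost_usd=100, reasoning_cost_usd=0.05))
# A mistake assumed to cost only $0.10 does not justify the same call
print(should_use_reasoning_model(error_cost_usd=0.10, reasoning_cost_usd=0.05))
```

Keeping the factor as a parameter lets a team tune how conservative the routing should be.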

Practical Examples

Example 1: Math Problem Solver

import openai

class MathSolver:
    def __init__(self):
        self.client = openai.OpenAI()

    def solve_math_problem(self, problem: str, complexity: str = "medium") -> dict:
        """
        Solve math problem with reasoning

        Complexity: low (GPT-4), medium (o1), high (o3-high)
        """

        if complexity == "low":
            model = "gpt-4-turbo"
            use_reasoning = False
        elif complexity == "medium":
            model = "o1-preview"
            use_reasoning = True
        else:
            model = "o3"  # When available
            use_reasoning = True

        response = self.client.chat.completions.create(
            model=model,
            messages=[
                {
                    "role": "user",
                    "content": f"Solve this step-by-step: {problem}"
                }
            ]
        )

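        return {
            "model": model,
            "reasoning_used": use_reasoning,
            "answer": response.choices[0].message.content,
            # Chat Completions does not return the raw chain of thought for o-series
            # models, so this is normally None; kept in case the field is present
            "thinking": getattr(response.choices[0].message, 'reasoning', None)
        }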

# Usage
solver = MathSolver()

# Simple problem → GPT-4
result = solver.solve_math_problem("What is 2+2?", "low")

# Complex problem → o3 ("high" routes to o3; pass "medium" for o1-preview)
result = solver.solve_math_problem(
    "Prove that there are infinitely many primes",
    "high"
)

print(result)

Example 2: Code Review with Reasoning

import openai

class CodeReviewer:
    def __init__(self):
        self.client = openai.OpenAI()

    def review_code(self, code: str, focus: str = "bugs") -> dict:
        """
        Review code with explicit reasoning

        Focus: bugs, performance, security, design
        """

        prompt = f"""
        Review this code for {focus}:

        ```python
        {code}
        ```

        Provide:
        1. Step-by-step analysis
        2. Issues found
        3. Recommendations
        """

        response = self.client.chat.completions.create(
            model="o1-preview",  # Use reasoning for complex review
            messages=[{"role": "user", "content": prompt}]
        )

        return {
            # Normally None: the API does not return the raw chain of thought
            "reasoning_process": getattr(
                response.choices[0].message, 'reasoning', None
            ),
            "review": response.choices[0].message.content,
            "focus": focus
        }

# Usage
reviewer = CodeReviewer()

code = '''
def find_max(arr):
    max_val = arr[0]
    for i in range(len(arr)):
        if arr[i] > max_val:
            max_val = arr[i]
    return max_val
'''

review = reviewer.review_code(code, "performance")
print(review["review"])

Example 3: Hybrid Approach (Cost Optimization)

import re

import openai

class IntelligentSolver:
    """
    Route to appropriate model based on complexity
    """

    def __init__(self):
        self.client = openai.OpenAI()

    def estimate_complexity(self, problem: str) -> str:
        """
        Quick check: is this simple or complex?
        """

        response = self.client.chat.completions.create(
            model="gpt-4-turbo",  # Fast classifier
            messages=[{
                "role": "user",
                "content": (
                    "Rate the complexity of this problem from 1 to 10. "
                    f"Reply with a single integer only: {problem}"
                )
            }]
        )

        # The model may add words around the number, so extract the first integer;
        # fall back to a mid-range rating if none is found
        match = re.search(r"\d+", response.choices[0].message.content or "")
        rating = int(match.group()) if match else 5

        if rating <= 3:
            return "simple"
        elif rating <= 7:
            return "medium"
        else:
            return "complex"

    def solve(self, problem: str) -> dict:
        """
        Smart routing to optimize cost/quality
        """

        complexity = self.estimate_complexity(problem)

        if complexity == "simple":
            model = "gpt-4-turbo"
            cost = 0.001
        elif complexity == "medium":
            model = "o1-preview"
            cost = 0.01
        else:
            model = "o3"  # When available
            cost = 0.05

        response = self.client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": problem}]
        )

        return {
            "complexity": complexity,
            "model": model,
            "estimated_cost": cost,
            "answer": response.choices[0].message.content
        }

# Usage
solver = IntelligentSolver()
result = solver.solve("What is 2+2?")
# → Use GPT-4 (simple), cheap

result = solver.solve("Prove Fermat's Last Theorem")
# → Use o3 (complex), expensive but necessary

Best Practices

Do's ✅

- Use reasoning models for hard problems: math, logic, code architecture
- Use a hybrid approach to save money: easy → GPT-4, hard → o1/o3
- Test locally with DeepSeek-R1 before deploying to production
- Use reasoning_effort with o3: low is fast and cheap; high is slow, expensive, and better
- Monitor the quality gains: is the extra cost worth the quality gain?

Don'ts ❌

- Don't use o1/o3 for simple tasks: it is overengineering and a waste of money
- Don't ignore the reasoning output where it is exposed (DeepSeek-R1 shows it in full): the thinking process is often more valuable than the answer
- Don't rely on reasoning alone: even o3 can be wrong, so verify the results
- Don't forget about timeouts: o1/o3 can be slow, so set a timeout to protect the UX
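
A minimal, standard-library sketch of the timeout advice. `call_with_timeout` and `slow_model_call` are hypothetical names, and the slow call is simulated with a sleep in place of a real API request. The OpenAI Python client also accepts a `timeout` argument, which is usually the simpler option when calling the API directly.

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout_s: float, fallback=None):
    """Run fn() in a worker thread; return fallback if it takes longer than timeout_s."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback
    finally:
        # Do not block on the worker; a timed-out call keeps running in the background
        executor.shutdown(wait=False)


def slow_model_call():
    # Stand-in for a slow reasoning request (hypothetical; no API call is made)
    time.sleep(0.5)
    return "answer"


print(call_with_timeout(slow_model_call, timeout_s=0.1, fallback="timed out"))
print(call_with_timeout(slow_model_call, timeout_s=2.0))
```

The fallback can be a cached answer or a response from a faster model, so the user is not left waiting.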

Usage Checklist

Model selection:
☑️ Task complexity determined
☑️ Cost vs quality assessed
☑️ Right model chosen
☑️ API keys configured

Implementation:
☑️ Input validated
☑️ Error handling in place
☑️ Timeout set
☑️ Logging enabled

Monitoring:
☑️ Quality tracked
☑️ Cost tracked
☑️ Alerts on errors
☑️ Results reviewed regularly

Created: December 2025
Version: 1.0
Author: Pavel
Use cases: Complex Problem Solving, Math, Code Analysis, Research