ThreatHunt/PHASE5_LLM_ARCHITECTURE.md

# Phase 5: Distributed LLM Routing Architecture

## Overview

Phase 5 introduces a sophisticated distributed Large Language Model (LLM) routing system that intelligently classifies tasks and routes them to specialized models across multiple GPU nodes (GB10 devices). This architecture enables efficient utilization of computational resources and optimal model selection based on task requirements.

## Architecture Components

The system consists of four containerized components that work together to provide intelligent, scalable LLM processing:

### 1. Router Agent (LLM Classifier + Policy Engine)

**Module**: `app/core/llm_router.py`

The Router Agent is responsible for:
- **Request Classification**: Analyzes incoming requests to determine the task type
- **Model Selection**: Routes requests to the most appropriate specialized model
- **Policy Enforcement**: Applies routing rules based on configured policies

**Task Types & Model Routing:**

| Task Type | Model | Use Case |
|-----------|-------|----------|
| `general_reasoning` | DeepSeek | Complex analysis and reasoning |
| `multilingual` | Qwen72 / Aya | Translation and multilingual tasks |
| `structured_parsing` | Phi-4 | Structured data extraction |
| `rule_generation` | Qwen-Coder | Code and rule generation |
| `adversarial_reasoning` | LLaMA 3.1 | Threat and adversarial analysis |
| `classification` | Granite Guardian | Pure classification tasks |

**Classification Logic:**
```python
from app.core.llm_router import get_llm_router

router = get_llm_router()
routing_decision = router.route_request({
    "prompt": "Analyze this threat...",
    "task_hints": ["threat", "adversary"]
})
# Routes to LLaMA 3.1 for adversarial reasoning
```

### 2. Job Scheduler (GPU Load Balancer)

**Module**: `app/core/job_scheduler.py`

The Job Scheduler manages:
- **Node Selection**: Determines which GB10 device is available
- **Resource Monitoring**: Tracks GPU VRAM and compute utilization
- **Parallelization Decisions**: Determines if jobs should be distributed
- **Serial Chaining**: Handles multi-step reasoning workflows

**GPU Node Configuration:**

**GB10 Node 1** (`gb10-node-1:8001`)
- **Total VRAM**: 80 GB
- **Models Loaded**: DeepSeek, Qwen72
- **Primary Use**: General reasoning and multilingual tasks

**GB10 Node 2** (`gb10-node-2:8001`)
- **Total VRAM**: 80 GB
- **Models Loaded**: Phi-4, Qwen-Coder, LLaMA 3.1, Granite Guardian
- **Primary Use**: Specialized tasks (parsing, coding, classification, threat analysis)

**Scheduling Strategies:**

1. **Single Node Execution**
   - Default for simple requests
   - Selected based on lowest compute utilization
   - Requires sufficient VRAM for model

2. **Parallel Execution**
   - Distributes work across multiple nodes
   - Used for batch processing or high-priority jobs
   - Automatic load balancing

3. **Serial Chaining**
   - Multi-step dependent operations
   - Sequential execution with context passing
   - Used for complex reasoning workflows

4. **Queued Execution**
   - When all nodes are at capacity
   - Priority-based queue management
   - Automatic dispatch when resources available

**Example Usage:**
```python
from app.core.job_scheduler import get_job_scheduler, Job

scheduler = get_job_scheduler()
job = Job(
    job_id="threat_analysis_001",
    model="llama31",
    priority=1,
    estimated_vram_gb=10,
    requires_parallel=False,
    requires_chaining=False,
    payload={"prompt": "..."}
)

scheduling_decision = await scheduler.schedule_job(job)
# Returns node assignment and execution mode
```

### 3. LLM Pool (OpenAI-Compatible Endpoints)

**Module**: `app/core/llm_pool.py`

The LLM Pool provides:
- **Unified Interface**: OpenAI-compatible API for all models
- **Endpoint Management**: Tracks availability and health
- **Parallel Execution**: Simultaneous multi-model requests
- **Error Handling**: Graceful fallback on failures

**Available Endpoints:**

| Model | Endpoint | Node | Specialization |
|-------|----------|------|----------------|
| DeepSeek | `http://gb10-node-1:8001/deepseek` | Node 1 | General reasoning |
| Qwen72 | `http://gb10-node-1:8001/qwen72` | Node 1 | Multilingual |
| Phi-4 | `http://gb10-node-2:8001/phi4` | Node 2 | Structured parsing |
| Qwen-Coder | `http://gb10-node-2:8001/qwen-coder` | Node 2 | Code generation |
| LLaMA 3.1 | `http://gb10-node-2:8001/llama31` | Node 2 | Adversarial reasoning |
| Granite Guardian | `http://gb10-node-2:8001/granite-guardian` | Node 2 | Classification |

**Example Usage:**
```python
from app.core.llm_pool import get_llm_pool

pool = get_llm_pool()

# Single model call
result = await pool.call_model(
    model_name="llama31",
    prompt="Analyze this threat pattern...",
    parameters={"temperature": 0.7, "max_tokens": 2048}
)

# Multiple models in parallel
results = await pool.call_multiple_models(
    model_names=["llama31", "deepseek"],
    prompt="Complex threat analysis...",
    parameters={"temperature": 0.7}
)
```

### 4. Merger Agent (Result Synthesizer)

**Module**: `app/core/merger_agent.py`

The Merger Agent provides:
- **Result Combination**: Intelligently merges outputs from multiple models
- **Strategy Selection**: Multiple merging strategies for different use cases
- **Quality Assessment**: Evaluates and ranks responses
- **Consensus Building**: Determines agreement across models

**Merging Strategies:**

1. **Consensus** (`MergeStrategy.CONSENSUS`)
   - Takes majority vote for classifications
   - Selects most common response
   - Best for: Classification tasks, binary decisions

2. **Weighted** (`MergeStrategy.WEIGHTED`)
   - Weights results by confidence scores
   - Selects highest confidence response
   - Best for: When models provide confidence scores

3. **Concatenate** (`MergeStrategy.CONCATENATE`)
   - Combines all responses sequentially
   - Preserves all information
   - Best for: Comprehensive analysis requiring multiple perspectives

4. **Best Quality** (`MergeStrategy.BEST_QUALITY`)
   - Selects highest quality response based on metrics
   - Considers length, completeness, formatting
   - Best for: Text generation, detailed explanations

5. **Ensemble** (`MergeStrategy.ENSEMBLE`)
   - Synthesizes insights from all models
   - Creates comprehensive summary
   - Best for: Complex analysis requiring synthesis

**Example Usage:**
```python
from app.core.merger_agent import get_merger_agent, MergeStrategy

merger = get_merger_agent()

# Multiple model results
results = [
    {"model": "llama31", "response": "...", "confidence": 0.9},
    {"model": "deepseek", "response": "...", "confidence": 0.85}
]

# Merge with consensus strategy
merged = merger.merge_results(results, strategy=MergeStrategy.CONSENSUS)
```

## API Endpoints

### Process LLM Request
```http
POST /api/llm/process
```

Processes a request through the complete routing system.

**Request Body:**
```json
{
  "prompt": "Analyze this threat pattern for indicators of compromise",
  "task_hints": ["threat", "adversary"],
  "requires_parallel": false,
  "requires_chaining": false,
  "parameters": {
    "temperature": 0.7,
    "max_tokens": 2048
  }
}
```

**Response:**
```json
{
  "job_id": "job_123_4567",
  "status": "completed",
  "routing": {
    "task_type": "adversarial_reasoning",
    "model": "llama31",
    "endpoint": "llama31",
    "priority": 1
  },
  "scheduling": {
    "job_id": "job_123_4567",
    "execution_mode": "single",
    "node": {
      "node_id": "gb10-node-2",
      "endpoint": "http://gb10-node-2:8001/llama31"
    }
  },
  "result": {
    "choices": [...]
  },
  "execution_mode": "single"
}
```

### List Available Models
```http
GET /api/llm/models
```

Returns all available LLM models in the pool.

**Response:**
```json
{
  "models": [
    {
      "model_name": "deepseek",
      "node_id": "gb10-node-1",
      "endpoint_url": "http://gb10-node-1:8001/deepseek",
      "is_available": true
    },
    ...
  ],
  "total": 6
}
```

### List GPU Nodes
```http
GET /api/llm/nodes
```

Returns status of all GPU nodes.

**Response:**
```json
{
  "nodes": [
    {
      "node_id": "gb10-node-1",
      "hostname": "gb10-node-1",
      "vram_total_gb": 80,
      "vram_used_gb": 25,
      "vram_available_gb": 55,
      "compute_utilization": 0.35,
      "status": "available",
      "models_loaded": ["deepseek", "qwen72"]
    },
    ...
  ],
  "available_count": 2
}
```

### Update Node Status (Admin Only)
```http
POST /api/llm/nodes/status
```

Updates GPU node status metrics.

**Request Body:**
```json
{
  "node_id": "gb10-node-1",
  "vram_used_gb": 30,
  "compute_utilization": 0.45,
  "status": "available"
}
```

### Get Routing Rules
```http
GET /api/llm/routing/rules
```

Returns current routing rules for task classification.

### Test Classification
```http
POST /api/llm/test-classification
```

Tests task classification without executing the request.

## Usage Examples

### Example 1: Threat Analysis with Adversarial Reasoning

```python
import httpx

async def analyze_threat():
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/api/llm/process",
            headers={"Authorization": f"Bearer {token}"},
            json={
                "prompt": "Analyze this suspicious PowerShell script for malicious intent...",
                "task_hints": ["threat", "adversary", "malicious"],
                "parameters": {"temperature": 0.3}  # Lower temp for analysis
            }
        )
        result = response.json()
        print(f"Model used: {result['routing']['model']}")
        print(f"Analysis: {result['result']}")
```

### Example 2: Code Generation for YARA Rules

```python
async def generate_yara_rule():
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/api/llm/process",
            headers={"Authorization": f"Bearer {token}"},
            json={
                "prompt": "Generate a YARA rule to detect this malware family...",
                "task_hints": ["code", "rule", "generate"],
                "parameters": {"temperature": 0.5}
            }
        )
        result = response.json()
        # Routes to Qwen-Coder automatically
        print(f"Generated rule: {result['result']}")
```

### Example 3: Parallel Processing for Batch Analysis

```python
async def batch_analysis():
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/api/llm/process",
            headers={"Authorization": f"Bearer {token}"},
            json={
                "prompt": "Analyze these 50 log entries for anomalies...",
                "task_hints": ["classify", "anomaly"],
                "requires_parallel": True,
                "batch_size": 50
            }
        )
        result = response.json()
        # Automatically parallelized across both nodes
        print(f"Execution mode: {result['execution_mode']}")
```

### Example 4: Serial Chaining for Multi-Step Analysis

```python
async def chained_analysis():
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/api/llm/process",
            headers={"Authorization": f"Bearer {token}"},
            json={
                "prompt": "First extract IOCs, then classify threats, finally generate response plan",
                "task_hints": ["parse", "classify", "generate"],
                "requires_chaining": True,
                "operations": ["extract", "classify", "generate"]
            }
        )
        result = response.json()
        # Executed serially with context passing
        print(f"Chain result: {result['result']}")
```

## Integration with Existing Features

### Integration with Threat Intelligence (Phase 4)

The distributed LLM system enhances threat intelligence analysis:

```python
from app.core.threat_intel import get_threat_analyzer
from app.core.llm_pool import get_llm_pool

async def enhanced_threat_analysis(host_id):
    # Step 1: Traditional ML analysis
    analyzer = get_threat_analyzer()
    ml_result = analyzer.analyze_host(host_data)

    # Step 2: LLM-based deep analysis if score is concerning
    if ml_result["score"] > 0.6:
        pool = get_llm_pool()
        llm_result = await pool.call_model(
            "llama31",
            f"Deep analysis of threat with score {ml_result['score']}: {host_data}",
            {"temperature": 0.3}
        )

        return {
            "ml_analysis": ml_result,
            "llm_analysis": llm_result,
            "recommendation": "quarantine" if ml_result["score"] > 0.8 else "investigate"
        }
```

### Integration with Automated Playbooks (Phase 4)

LLM routing can trigger automated responses:

```python
from app.core.playbook_engine import get_playbook_engine

async def llm_triggered_playbook(threat_analysis):
    if threat_analysis["result"]["severity"] == "critical":
        engine = get_playbook_engine()
        await engine.execute_playbook(
            playbook={
                "actions": [
                    {"type": "isolate_host", "params": {"host_id": host_id}},
                    {"type": "send_notification", "params": {"message": "Critical threat detected"}},
                    {"type": "create_case", "params": {"title": "Auto-generated from LLM analysis"}}
                ]
            },
            context=threat_analysis
        )
```

## Deployment

### Docker Compose Configuration

Add LLM node services to `docker-compose.yml`:

```yaml
services:
  # Existing services...

  llm-node-1:
    image: vllm/vllm-openai:latest
    ports:
      - "8001:8001"
    environment:
      - NVIDIA_VISIBLE_DEVICES=0,1
    volumes:
      - ./models:/models
    command: >
      --model /models/deepseek
      --host 0.0.0.0
      --port 8001
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]

  llm-node-2:
    image: vllm/vllm-openai:latest
    ports:
      - "8002:8001"
    environment:
      - NVIDIA_VISIBLE_DEVICES=2,3
    volumes:
      - ./models:/models
    command: >
      --model /models/phi4
      --host 0.0.0.0
      --port 8001
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
```

### Environment Variables

Add to `.env`:

```bash
# Phase 5: LLM Configuration
LLM_NODE_1_URL=http://gb10-node-1:8001
LLM_NODE_2_URL=http://gb10-node-2:8001
LLM_ENABLE_PARALLEL=true
LLM_MAX_PARALLEL_JOBS=4
LLM_DEFAULT_TIMEOUT=60
```

## Performance Considerations

### Resource Allocation

- **DeepSeek**: ~40GB VRAM (high priority)
- **Qwen72**: ~35GB VRAM (medium priority)
- **Phi-4**: ~15GB VRAM (fast inference)
- **Qwen-Coder**: ~20GB VRAM
- **LLaMA 3.1**: ~25GB VRAM
- **Granite Guardian**: ~10GB VRAM (classification only)

### Load Balancing

The scheduler automatically:
- Monitors VRAM usage on each node
- Tracks compute utilization (0.0-1.0)
- Routes requests to less loaded nodes
- Queues jobs when capacity is reached

### Optimization Tips

1. **Use task_hints**: Helps router select optimal model faster
2. **Enable parallelization**: For batch jobs over 10 items
3. **Monitor node status**: Use `/api/llm/nodes` endpoint
4. **Set appropriate temperatures**: Lower (0.3) for analysis, higher (0.7) for generation
5. **Leverage caching**: Repeated prompts hit cache layer

## Security

- All LLM endpoints require authentication
- Admin-only node status updates
- Tenant isolation maintained
- Audit logging for all LLM requests
- Rate limiting per user/tenant

## Future Enhancements

- [ ] Model fine-tuning pipeline
- [ ] Custom model deployment
- [ ] Advanced caching layer
- [ ] Multi-region deployment
- [ ] Real-time model swapping
- [ ] Automated model selection via meta-learning
- [ ] Integration with external model APIs (OpenAI, Anthropic)
- [ ] Cost tracking and optimization

## Conclusion

Phase 5 provides a production-ready distributed LLM routing architecture that intelligently manages computational resources while optimizing for task-specific model selection. The system integrates seamlessly with existing threat hunting capabilities to provide enhanced analysis and automated decision-making.