# Phase 5: Distributed LLM Routing Architecture

## Overview

Phase 5 introduces a distributed Large Language Model (LLM) routing system that classifies incoming tasks and routes them to specialized models across multiple GPU nodes (GB10 devices). This architecture enables efficient use of computational resources and optimal model selection based on task requirements.

## Architecture Components

The system consists of four containerized components that work together to provide intelligent, scalable LLM processing:

### 1. Router Agent (LLM Classifier + Policy Engine)

**Module**: `app/core/llm_router.py`

The Router Agent is responsible for:

- **Request Classification**: Analyzes incoming requests to determine the task type
- **Model Selection**: Routes each request to the most appropriate specialized model
- **Policy Enforcement**: Applies routing rules based on configured policies

**Task Types & Model Routing:**

| Task Type | Model | Use Case |
|-----------|-------|----------|
| `general_reasoning` | DeepSeek | Complex analysis and reasoning |
| `multilingual` | Qwen72 / Aya | Translation and multilingual tasks |
| `structured_parsing` | Phi-4 | Structured data extraction |
| `rule_generation` | Qwen-Coder | Code and rule generation |
| `adversarial_reasoning` | LLaMA 3.1 | Threat and adversarial analysis |
| `classification` | Granite Guardian | Pure classification tasks |

**Classification Logic:**

```python
from app.core.llm_router import get_llm_router

router = get_llm_router()
routing_decision = router.route_request({
    "prompt": "Analyze this threat...",
    "task_hints": ["threat", "adversary"]
})
# Routes to LLaMA 3.1 for adversarial reasoning
```
### 2. Job Scheduler (GPU Load Balancer)

**Module**: `app/core/job_scheduler.py`

The Job Scheduler manages:

- **Node Selection**: Selects an available GB10 device for each job
- **Resource Monitoring**: Tracks GPU VRAM and compute utilization
- **Parallelization Decisions**: Determines whether jobs should be distributed
- **Serial Chaining**: Handles multi-step reasoning workflows

**GPU Node Configuration:**

**GB10 Node 1** (`gb10-node-1:8001`)
- **Total VRAM**: 80 GB
- **Models Loaded**: DeepSeek, Qwen72
- **Primary Use**: General reasoning and multilingual tasks

**GB10 Node 2** (`gb10-node-2:8001`)
- **Total VRAM**: 80 GB
- **Models Loaded**: Phi-4, Qwen-Coder, LLaMA 3.1, Granite Guardian
- **Primary Use**: Specialized tasks (parsing, coding, classification, threat analysis)

**Scheduling Strategies:**

1. **Single Node Execution**
   - Default for simple requests
   - Node selected based on lowest compute utilization
   - Requires sufficient free VRAM for the model

2. **Parallel Execution**
   - Distributes work across multiple nodes
   - Used for batch processing or high-priority jobs
   - Automatic load balancing

3. **Serial Chaining**
   - Multi-step dependent operations
   - Sequential execution with context passing
   - Used for complex reasoning workflows

4. **Queued Execution**
   - Used when all nodes are at capacity
   - Priority-based queue management
   - Automatic dispatch when resources become available
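The single-node selection rule described above (lowest compute utilization among nodes with enough free VRAM, queue otherwise) can be sketched as follows. This is an illustrative sketch only; `NodeStatus` and `select_node` are assumed names, not the actual `app/core/job_scheduler.py` API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NodeStatus:
    node_id: str
    vram_total_gb: float
    vram_used_gb: float
    compute_utilization: float  # 0.0-1.0, as reported by the node

    @property
    def vram_available_gb(self) -> float:
        return self.vram_total_gb - self.vram_used_gb

def select_node(nodes: list[NodeStatus], required_vram_gb: float) -> Optional[NodeStatus]:
    """Pick the least-loaded node with enough free VRAM; None means queue the job."""
    eligible = [n for n in nodes if n.vram_available_gb >= required_vram_gb]
    if not eligible:
        return None  # all nodes at capacity -> fall back to queued execution
    return min(eligible, key=lambda n: n.compute_utilization)
```

A node that is less busy but lacks VRAM for the model is skipped, which is why both metrics are tracked.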
**Example Usage:**

```python
from app.core.job_scheduler import get_job_scheduler, Job

scheduler = get_job_scheduler()
job = Job(
    job_id="threat_analysis_001",
    model="llama31",
    priority=1,
    estimated_vram_gb=10,
    requires_parallel=False,
    requires_chaining=False,
    payload={"prompt": "..."}
)

scheduling_decision = await scheduler.schedule_job(job)
# Returns node assignment and execution mode
```
### 3. LLM Pool (OpenAI-Compatible Endpoints)

**Module**: `app/core/llm_pool.py`

The LLM Pool provides:

- **Unified Interface**: OpenAI-compatible API for all models
- **Endpoint Management**: Tracks availability and health
- **Parallel Execution**: Simultaneous multi-model requests
- **Error Handling**: Graceful fallback on failures

**Available Endpoints:**

| Model | Endpoint | Node | Specialization |
|-------|----------|------|----------------|
| DeepSeek | `http://gb10-node-1:8001/deepseek` | Node 1 | General reasoning |
| Qwen72 | `http://gb10-node-1:8001/qwen72` | Node 1 | Multilingual |
| Phi-4 | `http://gb10-node-2:8001/phi4` | Node 2 | Structured parsing |
| Qwen-Coder | `http://gb10-node-2:8001/qwen-coder` | Node 2 | Code generation |
| LLaMA 3.1 | `http://gb10-node-2:8001/llama31` | Node 2 | Adversarial reasoning |
| Granite Guardian | `http://gb10-node-2:8001/granite-guardian` | Node 2 | Classification |

**Example Usage:**

```python
from app.core.llm_pool import get_llm_pool

pool = get_llm_pool()

# Single model call
result = await pool.call_model(
    model_name="llama31",
    prompt="Analyze this threat pattern...",
    parameters={"temperature": 0.7, "max_tokens": 2048}
)

# Multiple models in parallel
results = await pool.call_multiple_models(
    model_names=["llama31", "deepseek"],
    prompt="Complex threat analysis...",
    parameters={"temperature": 0.7}
)
```
### 4. Merger Agent (Result Synthesizer)

**Module**: `app/core/merger_agent.py`

The Merger Agent provides:

- **Result Combination**: Intelligently merges outputs from multiple models
- **Strategy Selection**: Multiple merging strategies for different use cases
- **Quality Assessment**: Evaluates and ranks responses
- **Consensus Building**: Determines agreement across models

**Merging Strategies:**

1. **Consensus** (`MergeStrategy.CONSENSUS`)
   - Takes a majority vote for classifications
   - Selects the most common response
   - Best for: Classification tasks, binary decisions

2. **Weighted** (`MergeStrategy.WEIGHTED`)
   - Weights results by confidence scores
   - Selects the highest-confidence response
   - Best for: Models that provide confidence scores

3. **Concatenate** (`MergeStrategy.CONCATENATE`)
   - Combines all responses sequentially
   - Preserves all information
   - Best for: Comprehensive analysis requiring multiple perspectives

4. **Best Quality** (`MergeStrategy.BEST_QUALITY`)
   - Selects the highest-quality response based on metrics
   - Considers length, completeness, and formatting
   - Best for: Text generation, detailed explanations

5. **Ensemble** (`MergeStrategy.ENSEMBLE`)
   - Synthesizes insights from all models
   - Creates a comprehensive summary
   - Best for: Complex analysis requiring synthesis
**Example Usage:**

```python
from app.core.merger_agent import get_merger_agent, MergeStrategy

merger = get_merger_agent()

# Multiple model results
results = [
    {"model": "llama31", "response": "...", "confidence": 0.9},
    {"model": "deepseek", "response": "...", "confidence": 0.85}
]

# Merge with consensus strategy
merged = merger.merge_results(results, strategy=MergeStrategy.CONSENSUS)
```
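To make the consensus strategy concrete, here is a minimal sketch of a majority-vote merge with confidence as the tiebreaker. It is illustrative only; the actual `MergeStrategy.CONSENSUS` logic lives in `app/core/merger_agent.py` and the `merge_consensus` name and output fields are assumptions:

```python
from collections import Counter

def merge_consensus(results: list[dict]) -> dict:
    """Majority vote over model responses; ties broken by highest confidence."""
    votes = Counter(r["response"] for r in results)
    top_count = max(votes.values())
    tied = {resp for resp, count in votes.items() if count == top_count}
    # Among tied responses, prefer the one backed by the most confident model
    winner = max(
        (r for r in results if r["response"] in tied),
        key=lambda r: r.get("confidence", 0.0),
    )
    return {
        "response": winner["response"],
        "agreement": top_count / len(results),
        "models": [r["model"] for r in results if r["response"] == winner["response"]],
    }
```

An `agreement` ratio like this is one way the Merger Agent's "Consensus Building" output could quantify how strongly the models agree.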
## API Endpoints

### Process LLM Request

```http
POST /api/llm/process
```

Processes a request through the complete routing system.

**Request Body:**

```json
{
  "prompt": "Analyze this threat pattern for indicators of compromise",
  "task_hints": ["threat", "adversary"],
  "requires_parallel": false,
  "requires_chaining": false,
  "parameters": {
    "temperature": 0.7,
    "max_tokens": 2048
  }
}
```

**Response:**

```json
{
  "job_id": "job_123_4567",
  "status": "completed",
  "routing": {
    "task_type": "adversarial_reasoning",
    "model": "llama31",
    "endpoint": "llama31",
    "priority": 1
  },
  "scheduling": {
    "job_id": "job_123_4567",
    "execution_mode": "single",
    "node": {
      "node_id": "gb10-node-2",
      "endpoint": "http://gb10-node-2:8001/llama31"
    }
  },
  "result": {
    "choices": [...]
  },
  "execution_mode": "single"
}
```
### List Available Models

```http
GET /api/llm/models
```

Returns all available LLM models in the pool.

**Response:**

```json
{
  "models": [
    {
      "model_name": "deepseek",
      "node_id": "gb10-node-1",
      "endpoint_url": "http://gb10-node-1:8001/deepseek",
      "is_available": true
    },
    ...
  ],
  "total": 6
}
```

### List GPU Nodes

```http
GET /api/llm/nodes
```

Returns the status of all GPU nodes.

**Response:**

```json
{
  "nodes": [
    {
      "node_id": "gb10-node-1",
      "hostname": "gb10-node-1",
      "vram_total_gb": 80,
      "vram_used_gb": 25,
      "vram_available_gb": 55,
      "compute_utilization": 0.35,
      "status": "available",
      "models_loaded": ["deepseek", "qwen72"]
    },
    ...
  ],
  "available_count": 2
}
```
### Update Node Status (Admin Only)

```http
POST /api/llm/nodes/status
```

Updates GPU node status metrics.

**Request Body:**

```json
{
  "node_id": "gb10-node-1",
  "vram_used_gb": 30,
  "compute_utilization": 0.45,
  "status": "available"
}
```

### Get Routing Rules

```http
GET /api/llm/routing/rules
```

Returns the current routing rules used for task classification.
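A response might look like the following. This shape is an illustration based on the routing table above; the exact field names of the rules schema are assumptions:

```json
{
  "rules": [
    {
      "task_type": "adversarial_reasoning",
      "model": "llama31",
      "keywords": ["threat", "adversary", "malicious"]
    },
    {
      "task_type": "rule_generation",
      "model": "qwen-coder",
      "keywords": ["code", "rule", "generate"]
    }
  ]
}
```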
### Test Classification

```http
POST /api/llm/test-classification
```

Tests task classification without executing the request.
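For example, posting a prompt with its hints (request shape follows `/api/llm/process`; the exact response fields for this endpoint are an assumption based on the `routing` object shown earlier):

```json
{
  "prompt": "Generate a YARA rule to detect this malware family...",
  "task_hints": ["code", "rule", "generate"]
}
```

would return the routing decision only, without dispatching a job:

```json
{
  "task_type": "rule_generation",
  "model": "qwen-coder",
  "endpoint": "qwen-coder"
}
```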
## Usage Examples

### Example 1: Threat Analysis with Adversarial Reasoning

```python
import httpx

# `token` is assumed to hold a valid API bearer token
async def analyze_threat():
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/api/llm/process",
            headers={"Authorization": f"Bearer {token}"},
            json={
                "prompt": "Analyze this suspicious PowerShell script for malicious intent...",
                "task_hints": ["threat", "adversary", "malicious"],
                "parameters": {"temperature": 0.3}  # Lower temperature for analysis
            }
        )
        result = response.json()
        print(f"Model used: {result['routing']['model']}")
        print(f"Analysis: {result['result']}")
```

### Example 2: Code Generation for YARA Rules

```python
async def generate_yara_rule():
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/api/llm/process",
            headers={"Authorization": f"Bearer {token}"},
            json={
                "prompt": "Generate a YARA rule to detect this malware family...",
                "task_hints": ["code", "rule", "generate"],
                "parameters": {"temperature": 0.5}
            }
        )
        result = response.json()
        # Routes to Qwen-Coder automatically
        print(f"Generated rule: {result['result']}")
```

### Example 3: Parallel Processing for Batch Analysis

```python
async def batch_analysis():
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/api/llm/process",
            headers={"Authorization": f"Bearer {token}"},
            json={
                "prompt": "Analyze these 50 log entries for anomalies...",
                "task_hints": ["classify", "anomaly"],
                "requires_parallel": True,
                "batch_size": 50
            }
        )
        result = response.json()
        # Automatically parallelized across both nodes
        print(f"Execution mode: {result['execution_mode']}")
```

### Example 4: Serial Chaining for Multi-Step Analysis

```python
async def chained_analysis():
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/api/llm/process",
            headers={"Authorization": f"Bearer {token}"},
            json={
                "prompt": "First extract IOCs, then classify threats, finally generate a response plan",
                "task_hints": ["parse", "classify", "generate"],
                "requires_chaining": True,
                "operations": ["extract", "classify", "generate"]
            }
        )
        result = response.json()
        # Executed serially with context passing
        print(f"Chain result: {result['result']}")
```
## Integration with Existing Features

### Integration with Threat Intelligence (Phase 4)

The distributed LLM system enhances threat intelligence analysis:

```python
from app.core.threat_intel import get_threat_analyzer
from app.core.llm_pool import get_llm_pool

async def enhanced_threat_analysis(host_data):
    # Step 1: Traditional ML analysis
    analyzer = get_threat_analyzer()
    ml_result = analyzer.analyze_host(host_data)

    # Step 2: LLM-based deep analysis if the score is concerning
    llm_result = None
    if ml_result["score"] > 0.6:
        pool = get_llm_pool()
        llm_result = await pool.call_model(
            "llama31",
            f"Deep analysis of threat with score {ml_result['score']}: {host_data}",
            {"temperature": 0.3}
        )

    return {
        "ml_analysis": ml_result,
        "llm_analysis": llm_result,
        "recommendation": "quarantine" if ml_result["score"] > 0.8 else "investigate"
    }
```

### Integration with Automated Playbooks (Phase 4)

LLM routing can trigger automated responses:

```python
from app.core.playbook_engine import get_playbook_engine

async def llm_triggered_playbook(threat_analysis):
    if threat_analysis["result"]["severity"] == "critical":
        host_id = threat_analysis["host_id"]
        engine = get_playbook_engine()
        await engine.execute_playbook(
            playbook={
                "actions": [
                    {"type": "isolate_host", "params": {"host_id": host_id}},
                    {"type": "send_notification", "params": {"message": "Critical threat detected"}},
                    {"type": "create_case", "params": {"title": "Auto-generated from LLM analysis"}}
                ]
            },
            context=threat_analysis
        )
```
## Deployment

### Docker Compose Configuration

Add the LLM node services to `docker-compose.yml`:

```yaml
services:
  # Existing services...

  llm-node-1:
    image: vllm/vllm-openai:latest
    ports:
      - "8001:8001"
    environment:
      - NVIDIA_VISIBLE_DEVICES=0,1
    volumes:
      - ./models:/models
    command: >
      --model /models/deepseek
      --host 0.0.0.0
      --port 8001
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]

  llm-node-2:
    image: vllm/vllm-openai:latest
    ports:
      - "8002:8001"
    environment:
      - NVIDIA_VISIBLE_DEVICES=2,3
    volumes:
      - ./models:/models
    command: >
      --model /models/phi4
      --host 0.0.0.0
      --port 8001
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
```

### Environment Variables

Add to `.env`:

```bash
# Phase 5: LLM Configuration
LLM_NODE_1_URL=http://gb10-node-1:8001
LLM_NODE_2_URL=http://gb10-node-2:8001
LLM_ENABLE_PARALLEL=true
LLM_MAX_PARALLEL_JOBS=4
LLM_DEFAULT_TIMEOUT=60
```
## Performance Considerations

### Resource Allocation

- **DeepSeek**: ~40 GB VRAM (high priority)
- **Qwen72**: ~35 GB VRAM (medium priority)
- **Phi-4**: ~15 GB VRAM (fast inference)
- **Qwen-Coder**: ~20 GB VRAM
- **LLaMA 3.1**: ~25 GB VRAM
- **Granite Guardian**: ~10 GB VRAM (classification only)

Together, the Node 1 models consume roughly 75 GB of that node's 80 GB and the Node 2 models roughly 70 GB, so VRAM headroom is limited on both nodes.

### Load Balancing

The scheduler automatically:

- Monitors VRAM usage on each node
- Tracks compute utilization (0.0-1.0)
- Routes requests to less loaded nodes
- Queues jobs when capacity is reached
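The "queues jobs when capacity is reached" behavior pairs naturally with the priority-based queue mentioned under Queued Execution. A minimal sketch of such a queue, using only the standard library (`JobQueue` is an assumed name, not the actual `job_scheduler` internals):

```python
import heapq
import itertools

class JobQueue:
    """Min-heap keyed on (priority, arrival order): priority 1 dispatches first,
    and equal-priority jobs dispatch in FIFO order."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # monotonic tiebreaker

    def push(self, priority: int, job_id: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._order), job_id))

    def pop(self) -> str:
        """Called when a node frees capacity; returns the next job to dispatch."""
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

When a node reports freed VRAM (e.g. via `POST /api/llm/nodes/status`), the scheduler would `pop` and dispatch until capacity is exhausted again.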
### Optimization Tips

1. **Use `task_hints`**: Helps the router select the optimal model faster
2. **Enable parallelization**: For batch jobs of more than 10 items
3. **Monitor node status**: Use the `/api/llm/nodes` endpoint
4. **Set appropriate temperatures**: Lower (0.3) for analysis, higher (0.7) for generation
5. **Leverage caching**: Repeated prompts hit the cache layer
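The cache layer in tip 5 could be as simple as keying completed results on a hash of the model, prompt, and parameters. This is a sketch under that assumption; the document does not show the actual cache implementation:

```python
import hashlib
import json

class PromptCache:
    """Memoizes completed LLM results by (model, prompt, parameters)."""

    def __init__(self):
        self._store = {}

    def _key(self, model: str, prompt: str, parameters: dict) -> str:
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} hash identically
        blob = json.dumps({"m": model, "p": prompt, "a": parameters}, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def get(self, model: str, prompt: str, parameters: dict):
        return self._store.get(self._key(model, prompt, parameters))

    def put(self, model: str, prompt: str, parameters: dict, result) -> None:
        self._store[self._key(model, prompt, parameters)] = result
```

Note that parameters are part of the key: the same prompt at temperature 0.3 and 0.7 caches separately, which matters given tip 4.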
## Security

- All LLM endpoints require authentication
- Node status updates are admin-only
- Tenant isolation is maintained
- Audit logging for all LLM requests
- Rate limiting per user/tenant

## Future Enhancements

- [ ] Model fine-tuning pipeline
- [ ] Custom model deployment
- [ ] Advanced caching layer
- [ ] Multi-region deployment
- [ ] Real-time model swapping
- [ ] Automated model selection via meta-learning
- [ ] Integration with external model APIs (OpenAI, Anthropic)
- [ ] Cost tracking and optimization

## Conclusion

Phase 5 provides a production-ready distributed LLM routing architecture that intelligently manages computational resources while optimizing for task-specific model selection. The system integrates seamlessly with existing threat hunting capabilities to provide enhanced analysis and automated decision-making.