"Can we use ChatGPT?" is the question. "No, but we can deploy our own LLM" is the answer that satisfies both innovation and compliance teams.
Banking, healthcare, defense, and government organizations face a fundamental constraint: sensitive data cannot leave the network perimeter. This does not mean they cannot use AI; it means they need on-premises LLM infrastructure.
Why On-Premises
| Requirement | Cloud API | On-Premises |
|---|---|---|
| Data residency | ❌ Data leaves your network | ✅ Data stays on your hardware |
| Air-gapped deployment | ❌ Requires internet | ✅ Fully isolated |
| Regulatory compliance | ⚠️ Depends on provider | ✅ Full control |
| Latency | ⚠️ Variable (50-500 ms) | ✅ Predictable (under 50 ms) |
| Cost at scale | ⚠️ Token-based, scales linearly | ✅ Fixed GPU cost, scales sub-linearly |
| Model customization | ⚠️ Limited fine-tuning | ✅ Full fine-tuning, custom models |
Architecture Pattern: The Enterprise LLM Stack
┌─────────────────────────────────────────────────┐
│                Application Layer                │
│  Chatbot │ Doc Search │ Code Assistant │ Agent  │
├─────────────────────────────────────────────────┤
│                   API Gateway                   │
│ Rate limiting │ Auth │ Audit logging │ Routing  │
├─────────────────────────────────────────────────┤
│                 Inference Layer                 │
│                vLLM / NIM / TGI                 │
│         Model: Llama 3.1 70B (8× H100)          │
│       Autoscaling: KEDA + custom metrics        │
├─────────────────────────────────────────────────┤
│                 Platform Layer                  │
│  Kubernetes │ GPU Operator │ Network Operator   │
├─────────────────────────────────────────────────┤
│                 Infrastructure                  │
│   Bare metal │ 8× H100 per node │ InfiniBand    │
│     Shared storage │ Redundant power │ HSM      │
└─────────────────────────────────────────────────┘
Model Selection for Enterprises
Not every model works on-premises. The key criteria are license terms, parameter count, and minimum GPU footprint:
| Model | License | Parameters | Min GPUs | Quality |
|---|---|---|---|---|
| Llama 3.1 70B | Llama 3.1 Community License | 70B | 2× A100 80GB | Excellent |
| Llama 3.1 8B | Llama 3.1 Community License | 8B | 1× A100 | Good for simple tasks |
| Mistral Large | Mistral Research License (commercial use requires a separate license) | 123B | 4× A100 80GB | Excellent |
| Qwen 2.5 72B | Qwen License | 72B | 2× A100 80GB | Excellent |
| DeepSeek-R1 distill 70B | MIT (Llama-derived, so the Llama license also applies) | 70B | 2× A100 80GB | Strong reasoning |
| Granite 3.1 | Apache 2.0 | 8B | 1× A100 | IBM enterprise focus |
Recommendation: Start with Llama 3.1 70B. It offers the best quality-to-cost ratio and broad community support, and Meta's community license is acceptable for most enterprise use.
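The "Min GPUs" column can be sanity-checked with simple arithmetic: FP16 weights take about 2 bytes per parameter, so a 70B model needs roughly 140 GB before any KV cache or activations. The sketch below is only a lower bound under that assumption; the function name and the 80 GB default are illustrative, and real deployments need extra headroom for context length and batching.

```python
import math


def min_gpus_for_weights(params_billions: float,
                         bytes_per_param: int = 2,
                         gpu_memory_gb: int = 80) -> int:
    """Lower bound on GPU count: weights only, no KV cache or activations."""
    # 1e9 parameters * bytes per parameter = decimal gigabytes
    weights_gb = params_billions * bytes_per_param
    return max(1, math.ceil(weights_gb / gpu_memory_gb))


# Parameter counts (in billions) taken from the table above.
MODELS = {
    "Llama 3.1 70B": 70,
    "Llama 3.1 8B": 8,
    "Mistral Large": 123,
    "Qwen 2.5 72B": 72,
}

for name, size in MODELS.items():
    gpus = min_gpus_for_weights(size)
    print(f"{name}: ~{size * 2} GB of FP16 weights, at least {gpus} x 80 GB GPU(s)")
```

Passing `bytes_per_param=1` approximates 8-bit quantized weights, which is one reason quantization is often considered when GPUs are scarce.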
Air-Gapped Deployment
True air-gapped environments have no internet access, so every dependency, including model weights and container images, must be pre-staged.
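Because pre-staged artifacts cross a trust boundary, it can help to record checksums on the connected side and verify them inside the air gap. The following is a minimal sketch using only the Python standard library; the function names and the manifest layout are assumptions for illustration, not part of any tool mentioned here.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte weight shards fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return relative paths that are missing or whose digest differs."""
    mismatched = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_file(target) != expected:
            mismatched.append(rel_path)
    return mismatched
```

In this sketch, `build_manifest` would run before the transfer and its result would travel with the artifacts (for example as a JSON file); `verify_manifest` would run on the air-gapped side and should return an empty list.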
Pre-Stage Model Weights
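# On an internet-connected machine (the Llama repositories are gated, so log in first)
huggingface-cli download meta-llama/Llama-3.1-70B-Instruct \
  --local-dir ./llama-70b-instruct
# Transfer to the air-gapped environment over your approved transfer path
rsync -av ./llama-70b-instruct/ airgap-server:/models/llama-70b-instruct/
Pre-Stage Container Images
# Pull and save images
docker pull vllm/vllm-openai:latest
docker save vllm/vllm-openai:latest | gzip > vllm-openai.tar.gz
# Transfer, then load on the air-gapped side and push to the local registry
docker load < vllm-openai.tar.gz
docker tag vllm/vllm-openai:latest registry.internal:5000/vllm-openai:latest
docker push registry.internal:5000/vllm-openai:latest
Deploy with Local Registry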
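apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 1
  selector:                      # required by apps/v1; must match the pod labels
    matchLabels:
      app: llm-inference         # label value is illustrative
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: vllm
          image: registry.internal:5000/vllm-openai:latest  # Local registry
          args:
            - "--model=/models/llama-70b-instruct"
            - "--tensor-parallel-size=8"
            - "--max-model-len=8192"
            - "--gpu-memory-utilization=0.9"
          resources:
            limits:
              nvidia.com/gpu: 8
          volumeMounts:
            - name: models
              mountPath: /models
              readOnly: true
            - name: shm            # shared memory for tensor-parallel workers
              mountPath: /dev/shm
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: model-storage
        - name: shm
          emptyDir:
            medium: Memory
Security Architecture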
Network Segmentation
Internet ──X── (air gap) ──X── LLM Cluster
                                │
                                ├── API Gateway (mTLS)
                                │   ├── Rate limiting
                                │   ├── Audit logging
                                │   └── Token auth
                                │
                                ├── Inference pods (isolated VLAN)
                                │
                                └── Model storage (encrypted, HSM keys)
Audit Trail
Every request to the LLM must be logged:
{
  "timestamp": "2026-04-08T14:30:00Z",
  "user_id": "jsmith@corp.com",
  "model": "llama-3.1-70b-instruct",
  "prompt_tokens": 150,
  "completion_tokens": 89,
  "latency_ms": 1200,
  "classification": "internal",
  "department": "legal",
  "pii_detected": false
}
Do not log prompt content by default, because that creates a secondary data liability. Log metadata for operational monitoring, and content only when explicitly required by compliance policy.
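A metadata-only audit entry like the one above could be produced by a small helper at the gateway. This is a sketch rather than a reference implementation: `build_audit_record` and its `log_content` switch are hypothetical names, and the field names simply mirror the example record.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_audit_record(
    user_id: str,
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
    latency_ms: int,
    classification: str,
    department: str,
    pii_detected: bool,
    prompt: Optional[str] = None,
    log_content: bool = False,  # hypothetical policy switch, off by default
) -> dict:
    """Build one audit entry; prompt text is included only on explicit opt-in."""
    record = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "user_id": user_id,
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "latency_ms": latency_ms,
        "classification": classification,
        "department": department,
        "pii_detected": pii_detected,
    }
    if log_content and prompt is not None:
        record["prompt"] = prompt
    return record


entry = build_audit_record(
    user_id="jsmith@corp.com",
    model="llama-3.1-70b-instruct",
    prompt_tokens=150,
    completion_tokens=89,
    latency_ms=1200,
    classification="internal",
    department="legal",
    pii_detected=False,
    prompt="Summarize the attached contract.",
)
print(json.dumps(entry))  # one JSON object per line, without a "prompt" key
```

Keeping content logging behind an explicit flag means a policy change is a reviewed configuration decision, not something that happens when a developer adds a field.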
Cost Model
On-premises LLM infrastructure is a capital expenditure:
| Component | Cost | Amortization |
|---|---|---|
| 8× H100 80GB server | $250,000-350,000 | 3-5 years |
| InfiniBand networking | $30,000-50,000 | 5 years |
| Rack, power, cooling | $20,000-40,000/year | Ongoing |
| Storage (NVMe, 10TB) | $10,000-20,000 | 3 years |
| Total Year 1 | $310,000-460,000 | |
| Ongoing/year | $40,000-70,000 | |
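Compare to API costs: at an assumed $15 per million tokens (GPT-4o list pricing varies by model version and by input versus output tokens), an enterprise processing 10M tokens/day spends $150/day, or $54,750/year. On-premises breaks even at roughly 20M tokens/day, which is $109,500/year at the same rate and roughly in line with the lower end of the annualized on-premises cost once the hardware is amortized over its 3-5 year life.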
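The break-even arithmetic can be reproduced in a few lines. The sketch below assumes straight-line amortization of the table's purchase prices and uses the "Ongoing/year" row for recurring costs; the function names are illustrative, and pairing the cheapest hardware with the longest amortization period (and the reverse) is a simplification that brackets the range.

```python
def annual_api_cost(tokens_per_day: float, usd_per_million_tokens: float = 15.0) -> float:
    """Yearly API spend for a constant daily token volume."""
    return tokens_per_day / 1_000_000 * usd_per_million_tokens * 365


def amortized_annual_cost(components, ongoing_per_year: float) -> float:
    """components: iterable of (purchase_usd, amortization_years) pairs."""
    return sum(price / years for price, years in components) + ongoing_per_year


def breakeven_tokens_per_day(annual_on_prem_usd: float,
                             usd_per_million_tokens: float = 15.0) -> float:
    """Daily token volume at which API spend equals the on-premises cost."""
    return annual_on_prem_usd / (usd_per_million_tokens * 365) * 1_000_000


# Low end: cheapest hardware, longest amortization, lowest ongoing cost.
low = amortized_annual_cost([(250_000, 5), (30_000, 5), (10_000, 3)], 40_000)
# High end: most expensive hardware, shortest amortization, highest ongoing cost.
high = amortized_annual_cost([(350_000, 3), (50_000, 5), (20_000, 3)], 70_000)

print(f"API cost at 10M tokens/day: ${annual_api_cost(10_000_000):,.0f}/year")
print(f"On-prem annualized: ${low:,.0f} to ${high:,.0f}")
print(f"Break-even: {breakeven_tokens_per_day(low) / 1e6:.1f}M to "
      f"{breakeven_tokens_per_day(high) / 1e6:.1f}M tokens/day")
```

Under these assumptions the break-even point lands between roughly 18M and 37M tokens per day, so the ~20M figure corresponds to the favorable end of the hardware and amortization range.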
For more detailed modeling, use the GPU Cost Calculator.
About the Author
I am Luca Berton, AI and Cloud Advisor. I design on-premises AI infrastructure for regulated enterprises.