Enterprise LLM Deployment: On-Premises

“Can we use ChatGPT?” is the question. “No, but we can deploy our own LLM” is the answer that satisfies both innovation and compliance teams.

Banking, healthcare, defense, and government organizations face a fundamental constraint: sensitive data cannot leave the network perimeter. This does not mean they cannot use AI — it means they need on-premises LLM infrastructure.

Why On-Premises

Requirement	Cloud API	On-Premises
Data residency	❌ Data leaves your network	✅ Data stays on your hardware
Air-gapped deployment	❌ Requires internet	✅ Fully isolated
Regulatory compliance	⚠️ Depends on provider	✅ Full control
Latency	⚠️ Variable (50-500ms)	✅ Predictable (under 50ms)
Cost at scale	⚠️ Token-based, scales linearly	✅ Fixed GPU cost, scales sub-linearly
Model customization	⚠️ Limited fine-tuning	✅ Full fine-tuning, custom models

Architecture Pattern: The Enterprise LLM Stack

┌─────────────────────────────────────────────────┐
│                  Application Layer                │
│  Chatbot │ Doc Search │ Code Assistant │ Agent    │
├─────────────────────────────────────────────────┤
│                  API Gateway                      │
│  Rate limiting │ Auth │ Audit logging │ Routing   │
├─────────────────────────────────────────────────┤
│                  Inference Layer                   │
│  vLLM / NIM / TGI                                │
│  Model: Llama 3.1 70B (8× H100)                 │
│  Autoscaling: KEDA + custom metrics              │
├─────────────────────────────────────────────────┤
│                  Platform Layer                    │
│  Kubernetes │ GPU Operator │ Network Operator     │
├─────────────────────────────────────────────────┤
│                  Infrastructure                    │
│  Bare metal │ 8× H100 per node │ InfiniBand      │
│  Shared storage │ Redundant power │ HSM           │
└─────────────────────────────────────────────────┘

Model Selection for Enterprises

Not every model works on-premises. Key criteria:

Model	License	Parameters	Min GPUs	Quality
Llama 3.1 70B	Meta Community	70B	2× A100 80GB	Excellent
Llama 3.1 8B	Meta Community	8B	1× A100	Good for simple tasks
Mistral Large	Apache 2.0	123B	4× A100 80GB	Excellent
Qwen 2.5 72B	Apache 2.0	72B	2× A100 80GB	Excellent
DeepSeek-R1 distill 70B	MIT	70B	2× A100 80GB	Strong reasoning
Granite 3.1	Apache 2.0	8B	1× A100	IBM enterprise focus

Recommendation: Start with Llama 3.1 70B. Best quality-to-cost ratio, broad community support, and Meta’s community license is acceptable for most enterprise use.

Air-Gapped Deployment

True air-gapped environments have no internet access. Every dependency must be pre-staged:

Pre-Stage Model Weights

# On an internet-connected machine
huggingface-cli download meta-llama/Llama-3.1-70B-Instruct \
  --local-dir ./llama-70b-instruct

# Transfer to air-gapped environment
rsync -av ./llama-70b-instruct/ airgap-server:/models/llama-70b-instruct/

Pre-Stage Container Images

# Pull and save images
docker pull vllm/vllm-openai:latest
docker save vllm/vllm-openai:latest | gzip > vllm-openai.tar.gz

# Transfer and load on air-gapped nodes
docker load < vllm-openai.tar.gz

Deploy with Local Registry

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: vllm
          image: registry.internal:5000/vllm-openai:latest  # Local registry
          args:
            - "--model=/models/llama-70b-instruct"
            - "--tensor-parallel-size=8"
            - "--max-model-len=8192"
            - "--gpu-memory-utilization=0.9"
          resources:
            limits:
              nvidia.com/gpu: 8
          volumeMounts:
            - name: models
              mountPath: /models
              readOnly: true
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: model-storage

Security Architecture

Network Segmentation

Internet ──X── (air gap) ──X── LLM Cluster
                                    │
                                    ├── API Gateway (mTLS)
                                    │     └── Rate limiting
                                    │     └── Audit logging
                                    │     └── Token auth
                                    │
                                    ├── Inference pods (isolated VLAN)
                                    │
                                    └── Model storage (encrypted, HSM keys)

Audit Trail

Every request to the LLM must be logged:

{
  "timestamp": "2026-04-08T14:30:00Z",
  "user_id": "jsmith@corp.com",
  "model": "llama-3.1-70b-instruct",
  "prompt_tokens": 150,
  "completion_tokens": 89,
  "latency_ms": 1200,
  "classification": "internal",
  "department": "legal",
  "pii_detected": false
}

Do not log prompt content by default — that creates a secondary data liability. Log metadata for operational monitoring, and content only when explicitly required by compliance policy.

Cost Model

On-premises LLM infrastructure is a capital expenditure:

Component	Cost	Amortization
8× H100 80GB server	$250,000-350,000	3-5 years
InfiniBand networking	$30,000-50,000	5 years
Rack, power, cooling	$20,000-40,000/year	Ongoing
Storage (NVMe, 10TB)	$10,000-20,000	3 years
Total Year 1	$310,000-460,000
Ongoing/year	$40,000-70,000

Compare to API costs: at $15/million tokens (GPT-4o), an enterprise processing 10M tokens/day spends $150/day = $54,750/year. On-premises breaks even at ~20M tokens/day.

For more detailed modeling, use the GPU Cost Calculator.

About the Author

I am Luca Berton, AI and Cloud Advisor. I design on-premises AI infrastructure for regulated enterprises. Book a consultation.

Enterprise LLM Deployment: On-Premises

Why On-Premises

Architecture Pattern: The Enterprise LLM Stack

Model Selection for Enterprises

Air-Gapped Deployment

Pre-Stage Model Weights

Pre-Stage Container Images

Deploy with Local Registry

Security Architecture

Network Segmentation

Audit Trail

Cost Model

About the Author

Related Articles

Embodied AI Infrastructure for the Physical World

Is Your Website Ready for AI Agents?

AI Governance in Practice: Findings Remediation and Agent Identity

What Delivering Enterprise Copilot Assessments Actually Looks Like

Why On-Premises

Architecture Pattern: The Enterprise LLM Stack

Model Selection for Enterprises

Air-Gapped Deployment

Pre-Stage Model Weights

Pre-Stage Container Images

Deploy with Local Registry

Security Architecture

Network Segmentation

Audit Trail

Cost Model

Related Resources

About the Author

Related Articles

Embodied AI Infrastructure for the Physical World

Is Your Website Ready for AI Agents?

AI Governance in Practice: Findings Remediation and Agent Identity

What Delivering Enterprise Copilot Assessments Actually Looks Like