Luca Berton
AI

SPDX Model Lineage Tracking in RHEL AI

Luca Berton
#rhel-ai #spdx #model-lineage #ai-governance #compliance #provenance #sbom #regulatory

📘 Book Reference: This article is based on Chapters 4 and 8 of Practical RHEL AI, covering SPDX lineage tracking for enterprise AI governance and compliance.

Introduction

As AI systems become subject to increasing regulatory scrutiny, organizations need robust methods to track model provenance. SPDX (Software Package Data Exchange) provides a standardized format for documenting the lineage of AI models, from training data to deployed artifacts.

Practical RHEL AI introduces SPDX lineage tracking as a key governance feature, ensuring enterprises can:

What is SPDX?

SPDX is an open standard for communicating software bill of materials (SBOM) information. For AI, it extends to:

| Component | SPDX Documentation |
| --- | --- |
| Base Model | Origin, license, version |
| Training Data | Sources, licenses, transformations |
| Fine-tuning | Parameters, taxonomy, timestamps |
| Deployed Model | Checksums, environment, dependencies |

Creating Model SPDX Documents

Basic Structure

{
  "spdxVersion": "SPDX-2.3",
  "dataLicense": "CC0-1.0",
  "SPDXID": "SPDXRef-DOCUMENT",
  "name": "granite-enterprise-v1-model-sbom",
  "documentNamespace": "https://example.com/models/granite-enterprise-v1",
  "creationInfo": {
    "created": "2024-01-15T10:30:00Z",
    "creators": [
      "Organization: Acme Corp",
      "Tool: RHEL AI InstructLab-1.0"
    ]
  },
  "packages": []
}
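The skeleton above can also be produced programmatically so that the timestamp and namespace are never hand-edited. The following is a minimal sketch using only the Python standard library; the helper name `new_spdx_document` and the UUID suffix on the namespace are illustrative assumptions, not part of RHEL AI or InstructLab.

```python
import json
import uuid
from datetime import datetime, timezone


def new_spdx_document(name, namespace_base, organization, tool):
    """Return a minimal SPDX 2.3 JSON document skeleton as a dict (hypothetical helper)."""
    created = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "spdxVersion": "SPDX-2.3",
        "dataLicense": "CC0-1.0",
        "SPDXID": "SPDXRef-DOCUMENT",
        "name": name,
        # Assumption: a UUID suffix keeps the namespace unique per generated document.
        "documentNamespace": f"{namespace_base.rstrip('/')}/{name}-{uuid.uuid4()}",
        "creationInfo": {
            "created": created,
            "creators": [f"Organization: {organization}", f"Tool: {tool}"],
        },
        "packages": [],
        "relationships": [],
    }


doc = new_spdx_document(
    name="granite-enterprise-v1-model-sbom",
    namespace_base="https://example.com/models",
    organization="Acme Corp",
    tool="RHEL AI InstructLab-1.0",
)
print(json.dumps(doc, indent=2))
```

Package entries such as the ones in the following sections would then be appended to `doc["packages"]`.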

Documenting Base Model

{
  "SPDXID": "SPDXRef-BaseModel",
  "name": "granite-3b-instruct",
  "versionInfo": "2024.1",
  "supplier": "Organization: IBM",
  "downloadLocation": "https://huggingface.co/ibm/granite-3b-instruct",
  "filesAnalyzed": false,
  "licenseConcluded": "Apache-2.0",
  "licenseDeclared": "Apache-2.0",
  "copyrightText": "Copyright 2024 IBM",
  "checksums": [
    {
      "algorithm": "SHA256",
      "checksumValue": "a1b2c3d4e5f6..."
    }
  ]
}
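The `checksums` entry is only useful if it is computed from the actual weights file. A minimal sketch of how that entry could be produced, assuming the model is a single file on disk; the function name `sha256_checksum_entry` is hypothetical. Reading in chunks avoids loading multi-gigabyte weights into memory.

```python
import hashlib


def sha256_checksum_entry(path, chunk_size=1024 * 1024):
    """Hash a file in chunks and return an SPDX 'checksums' list entry (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {"algorithm": "SHA256", "checksumValue": digest.hexdigest()}
```

The returned dictionary can be placed directly in a package's `checksums` list.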

Documenting Training Data

{
  "SPDXID": "SPDXRef-TrainingData",
  "name": "enterprise-taxonomy-v1",
  "versionInfo": "1.0.0",
  "supplier": "Organization: Acme Corp",
  "downloadLocation": "NOASSERTION",
  "filesAnalyzed": true,
  "licenseConcluded": "LicenseRef-Proprietary",
  "copyrightText": "Copyright 2024 Acme Corp",
  "comment": "Internal taxonomy with 500 skill definitions",
  "annotations": [
    {
      "annotationType": "OTHER",
      "annotator": "Tool: InstructLab",
      "annotationDate": "2024-01-10T14:00:00Z",
      "comment": "Synthetic data generated: 50,000 examples"
    }
  ]
}

Documenting Fine-tuned Model

{
  "SPDXID": "SPDXRef-FineTunedModel",
  "name": "granite-enterprise-v1",
  "versionInfo": "1.0.0",
  "supplier": "Organization: Acme Corp",
  "downloadLocation": "registry.acme.com/models/granite-enterprise-v1",
  "licenseConcluded": "LicenseRef-Proprietary",
  "checksums": [
    {
      "algorithm": "SHA256",
      "checksumValue": "f6e5d4c3b2a1..."
    }
  ],
  "externalRefs": [
    {
      "referenceCategory": "OTHER",
      "referenceType": "mlflow-run-id",
      "referenceLocator": "runs:/abc123/artifacts/model"
    }
  ]
}

Relationship Mapping

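Document how components relate. Each entry reads as "spdxElementId has relationshipType to relatedSpdxElement". SPDX 2.3 defines no DERIVED_FROM relationship type, so DESCENDANT_OF records the link from the fine-tuned model to its base model, and the build tool is the source element of BUILD_TOOL_OF:

{
  "relationships": [
    {
      "spdxElementId": "SPDXRef-FineTunedModel",
      "relatedSpdxElement": "SPDXRef-BaseModel",
      "relationshipType": "DESCENDANT_OF"
    },
    {
      "spdxElementId": "SPDXRef-FineTunedModel",
      "relatedSpdxElement": "SPDXRef-TrainingData",
      "relationshipType": "GENERATED_FROM"
    },
    {
      "spdxElementId": "SPDXRef-DeepSpeedConfig",
      "relatedSpdxElement": "SPDXRef-FineTunedModel",
      "relationshipType": "BUILD_TOOL_OF"
    }
  ]
}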
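The relationships above reference `SPDXRef-DeepSpeedConfig`, which none of the package examples in this article define. A short check for such dangling references might look like the sketch below. It works on the plain JSON dictionary, only considers packages (not files or snippets), and the function name `dangling_references` is an assumption made for illustration.

```python
def dangling_references(document):
    """Return SPDX IDs used in relationships that neither the document nor a package defines."""
    defined = {document.get("SPDXID", "SPDXRef-DOCUMENT")}
    defined.update(pkg["SPDXID"] for pkg in document.get("packages", []))

    missing = set()
    for rel in document.get("relationships", []):
        for key in ("spdxElementId", "relatedSpdxElement"):
            ref = rel[key]
            # NONE and NOASSERTION are valid placeholders, not element IDs.
            if ref not in defined and ref not in ("NONE", "NOASSERTION"):
                missing.add(ref)
    return sorted(missing)


document = {
    "SPDXID": "SPDXRef-DOCUMENT",
    "packages": [
        {"SPDXID": "SPDXRef-BaseModel"},
        {"SPDXID": "SPDXRef-TrainingData"},
        {"SPDXID": "SPDXRef-FineTunedModel"},
    ],
    "relationships": [
        {
            "spdxElementId": "SPDXRef-FineTunedModel",
            "relatedSpdxElement": "SPDXRef-TrainingData",
            "relationshipType": "GENERATED_FROM",
        },
        {
            "spdxElementId": "SPDXRef-DeepSpeedConfig",
            "relatedSpdxElement": "SPDXRef-FineTunedModel",
            "relationshipType": "BUILD_TOOL_OF",
        },
    ],
}

print(dangling_references(document))  # ['SPDXRef-DeepSpeedConfig']
```

Adding a package entry for the DeepSpeed configuration makes the list come back empty.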

Relationship Visualization

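flowchart TB
    %% Arrows point from the derived artifact to its source, matching the SPDX reading order.
    %% SPDX 2.3 has no DERIVED_FROM type; DESCENDANT_OF is the closest match.
    subgraph Lineage["Model Lineage Graph"]
        Base["Base Model<br/>Granite"]
        Data["Training Data<br/>Taxonomy"]
        FineTuned["Fine-tuned<br/>Enterprise Model"]

        FineTuned -->|DESCENDANT_OF| Base
        FineTuned -->|GENERATED_FROM| Data
    end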

Automated SPDX Generation

InstructLab Integration

Generate SPDX documents during training:

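# SPDXGenerator as used in this article; confirm that your InstructLab release provides instructlab.spdx
from instructlab.spdx import SPDXGenerator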

# Initialize generator
spdx_gen = SPDXGenerator(
    organization="Acme Corp",
    namespace_base="https://acme.com/models"
)

# After training completes
spdx_document = spdx_gen.generate(
    model_name="granite-enterprise-v1",
    base_model="granite-3b-instruct",
    training_data_path="./taxonomy",
    output_model_path="./output/model",
    training_config="./ds_config.json"
)

# Save document
spdx_gen.save(spdx_document, "./sbom/model-sbom.spdx.json")

CI/CD Integration

# .gitlab-ci.yml
stages:
  - train
  - document
  - validate
  - deploy

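generate_sbom:
  stage: document
  script:
    - pip install spdx-tools
    - python generate_spdx.py
    # pyspdxtools (installed by the spdx-tools Python package) validates the input file by default
    - pyspdxtools -i sbom/model-sbom.spdx.json
  artifacts:
    paths:
      - sbom/model-sbom.spdx.json

validate_lineage:
  stage: validate
  script:
    # Print a clear message and fail the job if validation exits non-zero
    - python validate_lineage.py sbom/model-sbom.spdx.json || { echo "Lineage validation failed"; exit 1; }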
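The pipeline calls `validate_lineage.py` without showing its contents. The sketch below is one possible shape for such a script, not the book's implementation. It assumes two rules: the SBOM must record at least one lineage relationship, and every artifact that is derived from something else must carry a SHA256 checksum. The rule set, the `LINEAGE_TYPES` constant, and the function names are all illustrative assumptions.

```python
import json
import sys

# Assumption: adjust this set to the relationship types your SBOMs actually use.
LINEAGE_TYPES = {"DESCENDANT_OF", "GENERATED_FROM"}


def validate_lineage(document):
    """Return a list of human-readable lineage problems (an empty list means valid)."""
    packages = {pkg["SPDXID"]: pkg for pkg in document.get("packages", [])}
    lineage = [
        rel for rel in document.get("relationships", [])
        if rel.get("relationshipType") in LINEAGE_TYPES
    ]

    problems = []
    if not lineage:
        problems.append("no lineage relationships recorded")

    for rel in lineage:
        source_id = rel["spdxElementId"]
        source = packages.get(source_id)
        if source is None:
            problems.append(f"{source_id}: not defined as a package")
            continue
        algorithms = {c.get("algorithm") for c in source.get("checksums", [])}
        if "SHA256" not in algorithms:
            name = source.get("name", source_id)
            problems.append(f"{name}: derived artifact has no SHA256 checksum")
    return problems


def main(argv):
    """CLI entry point: returns 0 when valid, 1 on lineage problems, 2 on bad usage."""
    if len(argv) != 2:
        print("usage: validate_lineage.py <sbom.spdx.json>", file=sys.stderr)
        return 2
    with open(argv[1], encoding="utf-8") as handle:
        problems = validate_lineage(json.load(handle))
    for problem in problems:
        print(f"ERROR: {problem}", file=sys.stderr)
    return 1 if problems else 0
```

To use it as the CI step, the script would end with `sys.exit(main(sys.argv))` under the usual `if __name__ == "__main__":` guard, so a failed check produces the non-zero exit code the job relies on.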

Compliance Requirements

EU AI Act Alignment

The EU AI Act requires documentation of:

SPDX addresses these through:

{
  "SPDXID": "SPDXRef-RiskAssessment",
  "name": "risk-assessment-v1",
  "downloadLocation": "NOASSERTION",
  "annotations": [
    {
      "annotationType": "REVIEW",
      "annotator": "Person: Jane Smith (Risk Officer)",
      "annotationDate": "2024-01-20T09:00:00Z",
      "comment": "Risk level: LIMITED. Approved for production use in customer service scenarios."
    }
  ]
}

Industry Standards Mapping

| Standard | SPDX Support |
| --- | --- |
| EU AI Act | Full |
| NIST AI RMF | Partial |
| ISO/IEC 42001 | Full |
| SOC 2 | Partial |

Querying Lineage

Python Query Library

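from spdx_tools.spdx.model import RelationshipType
from spdx_tools.spdx.parser.parse_anything import parse_file


def find_package(document, spdx_id):
    """Return the package with the given SPDX ID, or None."""
    return next((p for p in document.packages if p.spdx_id == spdx_id), None)


def find_relationship(document, spdx_id, relationship_type):
    """Return the ID of the first element related to spdx_id by relationship_type, or None."""
    for rel in document.relationships:
        if rel.spdx_element_id == spdx_id and rel.relationship_type == relationship_type:
            related = rel.related_spdx_element_id
            # NONE / NOASSERTION targets are not plain strings; treat them as "no parent".
            return related if isinstance(related, str) else None
    return None


def get_model_ancestry(spdx_file, model_id):
    """Trace a model back to its origins."""
    document = parse_file(spdx_file)

    ancestry = []
    seen = set()  # guards against cycles in the relationship graph
    current = model_id

    while current and current not in seen:
        seen.add(current)
        package = find_package(document, current)
        if package is None:
            break
        ancestry.append(package)

        # SPDX 2.3 has no DERIVED_FROM type; DESCENDANT_OF is the closest match.
        current = find_relationship(document, current, RelationshipType.DESCENDANT_OF)

    return ancestry


# Example usage
ancestry = get_model_ancestry(
    "sbom/model-sbom.spdx.json",
    "SPDXRef-FineTunedModel"
)

print("Model Ancestry:")
for model in ancestry:
    print(f"- {model.name} v{model.version}")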

Output

Model Ancestry:
- granite-enterprise-v1 v1.0.0
- granite-3b-instruct v2024.1
- granite-base-3b v1.0.0

Best Practices

From Chapter 8

  1. Generate SPDX at every training run - Automate in CI/CD
  2. Include checksums - Enable verification of model weights
  3. Document synthetic data - Track generated training examples
  4. Link to MLflow/W&B runs - Connect to experiment tracking
  5. Review relationships - Ensure lineage chains back to the base model are complete
  6. Store SPDX with models - Co-locate documentation and artifacts

Storage Pattern

models/
├── granite-enterprise-v1/
│   ├── model.safetensors
│   ├── config.json
│   ├── tokenizer.json
│   └── sbom/
│       ├── model-sbom.spdx.json
│       ├── training-data-sbom.spdx.json
│       └── dependencies-sbom.spdx.json
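Co-locating the SBOM with the weights makes it possible to verify a model directory before loading it. The sketch below assumes the layout shown above, that the package describing the stored model is named after its directory, and that the recorded SHA256 covers the single `model.safetensors` file. The function name `verify_model_dir` is hypothetical.

```python
import hashlib
import json
from pathlib import Path


def verify_model_dir(model_dir, weights_name="model.safetensors"):
    """Compare the weights file against the SHA256 recorded in sbom/model-sbom.spdx.json."""
    model_dir = Path(model_dir)
    sbom_path = model_dir / "sbom" / "model-sbom.spdx.json"
    sbom = json.loads(sbom_path.read_text(encoding="utf-8"))

    # Assumption: the package describing the stored model is named after its directory.
    package = next(
        (p for p in sbom.get("packages", []) if p.get("name") == model_dir.name),
        None,
    )
    if package is None:
        raise LookupError(f"no package named {model_dir.name!r} in the SBOM")

    expected = next(
        (c["checksumValue"] for c in package.get("checksums", [])
         if c.get("algorithm") == "SHA256"),
        None,
    )
    if expected is None:
        raise LookupError(f"package {model_dir.name!r} records no SHA256 checksum")

    digest = hashlib.sha256()
    with open(model_dir / weights_name, "rb") as handle:
        # Read in 1 MiB chunks so large weight files are never held in memory at once.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.lower()
```

A deployment step could call `verify_model_dir("models/granite-enterprise-v1")` and refuse to serve the model when it returns `False`.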
