Takeaways
- The shift to large-scale ML models necessitates reliance on third-party training services, creating a trust problem.
- Existing verification methods (cryptographic proofs, statistical tests) are impractical or insecure for large-scale ML.
- Hardware non-determinism, particularly from GPUs and floating-point arithmetic, causes different training runs to produce non-identical models even with identical inputs.
- The proposed 'optimistic verifiable training' method controls GPU non-determinism by recording and sharing intermediate rounding decisions.
- This allows an auditor to replicate the trainer's exact model, enabling verification via Merkle-tree hash comparisons.
- The method is significantly faster and more storage-efficient than prior cryptographic or statistical approaches.
Insights
1. Trust Deficit in Third-Party ML Training
The growing computational demands of large-scale ML models force users to rely on third-party services (e.g., OpenAI API, Amazon SageMaker) for training. This abstraction means users have limited visibility into the training process, creating a trust problem where malicious trainers could introduce data poisoning or backdoors without detection.
The speaker highlights the shift from users training models on their own GPUs to delegating training to managed services, showing how simple API calls abstract away the entire training process.
2. Limitations of Prior Verification Approaches
Cryptographic proof-based systems are too slow and memory-intensive for large-scale ML, often requiring fixed-point arithmetic that harms model performance. Statistical tests, while less resource-intensive, require frequent saving of full model weights (massive storage) and are still vulnerable to sophisticated attacks like backdoor injection with minimal data alteration.
Examples show cryptographic proofs taking 72 seconds and 350 MB per epoch for logistic regression, and 22 minutes per batch for a small VGG11. Statistical tests require frequent saves of the full model weights and are easily spoofed.
3. Hardware Non-Determinism as a Core Challenge
Even with identical data, hyperparameters, and deterministic algorithms, training the same model on different GPUs (or even different runs on the same GPU without specific flags) results in fundamentally different models. This is due to variations in GPU architecture (computation units, parallel operation partitioning, memory caches) leading to different operation orderings, which, combined with floating-point non-associativity, cause errors to accumulate.
Experiments show that identical training setups on different NVIDIA GPUs (A40, Titan XP, RTX 2080 Ti) yield models with similar aggregate metrics (accuracy, perplexity) but drastically different individual predictions and token likelihoods.
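This divergence can be reproduced in miniature: a sequential (left-to-right) sum and a pairwise tree reduction, the pattern GPUs typically use for parallel reductions, visit the same numbers in different orders and can return different floating-point results. A minimal Python sketch:

```python
def tree_sum(xs):
    """Pairwise (tree) reduction, mimicking the order a GPU parallel reduction might use."""
    xs = list(xs)
    while len(xs) > 1:
        xs = [xs[i] + xs[i + 1] if i + 1 < len(xs) else xs[i]
              for i in range(0, len(xs), 2)]
    return xs[0]

vals = [0.1] * 10
sequential = sum(vals)      # strict left-to-right accumulation
parallel = tree_sum(vals)   # pairwise accumulation

print(sequential == parallel)  # False: same inputs, different operation order
```

The gap is one unit in the last place here, but in training such errors feed into the next gradient step and compound over millions of updates.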
4. Optimistic Verifiable Training via Rounding Logs
The proposed solution leverages a 'verification game' where a client, suspicious of a trainer's model, asks an auditor to re-train. To ensure identical results despite hardware non-determinism, the trainer stores 'rounding decisions' (up, down, or none) made during intermediate computations (e.g., convolution, linear layers). The auditor copies these decisions when their own computation results fall within a 'logging region' near a rounding boundary, effectively forcing deterministic outcomes. This enables the use of Merkle trees of model checkpoint hashes for efficient, step-by-step discrepancy detection.
The speaker details the verification game, the role of Merkle trees, and the mechanism for storing and copying rounding decisions during intermediate computations.
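A toy sketch of the rounding-log idea (function names and the logging-region width are illustrative, not from the talk; the actual method applies this to low-precision rounding inside GPU kernels): the trainer logs the rounding direction whenever a value lands near a rounding boundary, and the auditor, whose value may differ slightly due to hardware noise, copies the logged decision instead of rounding on its own.

```python
import math

REGION = 0.05  # width of the "logging region" around the rounding boundary (illustrative)

def trainer_round(x, log, decimals=2):
    """Round x to `decimals` places; log the direction if x is near a boundary."""
    scaled = x * 10**decimals
    frac = scaled - math.floor(scaled)
    direction = 'up' if frac >= 0.5 else 'down'
    if abs(frac - 0.5) < REGION:       # near a rounding boundary: record the decision
        log.append(direction)
    return (math.floor(scaled) + (direction == 'up')) / 10**decimals

def auditor_round(x, log, decimals=2):
    """Round x, but copy the trainer's logged decision inside the logging region."""
    scaled = x * 10**decimals
    frac = scaled - math.floor(scaled)
    if abs(frac - 0.5) < REGION:       # inside the logging region: defer to the log
        direction = log.pop(0)
    else:
        direction = 'up' if frac >= 0.5 else 'down'
    return (math.floor(scaled) + (direction == 'up')) / 10**decimals

# Trainer computes 1.2349; the auditor's hardware produces 1.2351 for the same step.
log = []
t = trainer_round(1.2349, log)   # frac is near 0.5, so 'down' is logged
a = auditor_round(1.2351, log)   # auditor would round up on its own, but copies 'down'
print(t == a)                    # True: both sides produce the same rounded value
```

Without the log, the two sides would round in opposite directions and their models would diverge from this step onward; with it, the auditor is forced onto the trainer's exact trajectory.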
5. Efficiency and Security of the Approach
Empirical results demonstrate that this method successfully eliminates GPU-induced non-determinism, allowing for identical training runs. It trains significantly faster than cryptographic proof-based systems and offers a more than 100x reduction in storage cost compared to statistical test methods by storing only rounding decisions rather than full model weights. Security rests on the fact that an adversarial trainer can only manipulate rounding logs when the auditor's values are also close to a rounding boundary, which occurs very infrequently.
The method eliminated all GPU non-determinism across three NVIDIA GPUs for ResNet-50 and GPT-2 training. Speed comparisons show significant gains, and storage is reduced by logging only rounding decisions, which are needed for a tiny fraction of intermediate computations.
Bottom Line
Only a minuscule percentage of intermediate computations (e.g., 2e-6% for ResNet-50) actually require rounding decision correction to ensure identical training runs across different GPUs.
This implies that the storage cost for rounding logs could be dramatically reduced if there were a more precise way to predict when these critical divergences occur, rather than logging all potential instances.
Develop predictive models or analytical bounds for identifying floating-point divergence points in ML computations, enabling highly optimized, sparse logging of rounding decisions and further reducing storage overhead for verifiable training.
The problem of non-determinism extends beyond model training to LLM inference, where a single divergent token generation can cascade into entirely different outputs.
This makes verifiable inference a significant challenge for companies building trusted, decentralized LLM inference services, as ensuring consistent output across distributed hardware is critical.
Adapt and apply the 'rounding log' approach to LLM inference pipelines to ensure deterministic token generation, enabling trusted and verifiable decentralized inference for large language models.
Opportunities
Premium Verifiable ML Training Service
Offer a specialized ML model training service that, for a premium, provides clients with a 'rounding log' alongside their trained model. This log enables clients to audit the training process independently using a third-party auditor, ensuring the model's integrity and adherence to specifications.
Independent ML Model Auditing Service
Establish a third-party service specializing in auditing machine learning models. This service would receive a client's data, training specifications, the trained model, and the trainer's rounding log, then re-run the training to verify its integrity and pinpoint any discrepancies using the Merkle tree-based verification game.
Deterministic LLM Inference Platform
Develop a platform specifically designed for trusted, decentralized LLM inference that guarantees deterministic token generation. This would address the challenge of non-determinism in sequential token generation, crucial for applications requiring high consistency and verifiability in AI outputs.
Key Concepts
Merkle Tree for Verification
A data structure where each leaf node is a hash of a data block (e.g., model checkpoint), and each non-leaf node is a hash of its children. This allows for efficient verification of data integrity and pinpointing discrepancies through a binary search, without needing to store the full data at each step.
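A minimal sketch of this structure in Python (simplified: SHA-256 throughout, checkpoint bytes as leaves, and the common convention of duplicating the last node on odd-sized levels; the real protocol exchanges subtree roots interactively, whereas this sketch recomputes prefix roots for brevity):

```python
import hashlib

def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Compute the Merkle root of a list of data blocks (e.g., serialized checkpoints)."""
    level = [sha(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:               # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [sha(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def first_divergence(leaves_a, leaves_b):
    """Binary-search for the first checkpoint where two training runs disagree,
    comparing only prefix roots rather than the full checkpoints."""
    lo, hi = 0, len(leaves_a)
    while lo < hi:
        mid = (lo + hi) // 2
        if merkle_root(leaves_a[:mid + 1]) == merkle_root(leaves_b[:mid + 1]):
            lo = mid + 1                  # prefixes agree: divergence is later
        else:
            hi = mid
    return lo if lo < len(leaves_a) else None

run_a = [b'ckpt0', b'ckpt1', b'ckpt2', b'ckpt3']
run_b = [b'ckpt0', b'ckpt1', b'TAMPERED', b'ckpt3']
print(first_divergence(run_a, run_b))  # 2: the first training step whose checkpoints differ
```

Because only hashes are compared, the client never needs the full weights at each step; a single mismatched root is enough to trigger the binary search down to the offending step.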
Floating-Point Non-Associativity
Floating-point arithmetic operations (like addition) are not associative due to precision limitations. The order in which numbers are summed can lead to slightly different results, which accumulate over many operations (e.g., during ML training), causing divergence in model weights across different hardware or execution paths.
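A three-line demonstration in IEEE-754 double precision:

```python
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c    # 0.0 + 1.0 -> 1.0
right = a + (b + c)   # the 1.0 is absorbed: b + c rounds back to -1e16, so -> 0.0

print(left, right)    # 1.0 0.0: same operands, different grouping, different results
```

At the magnitude of 1e16, adjacent doubles are 2 apart, so adding 1.0 to -1e16 cannot change the value; the grouping alone determines whether the 1.0 survives.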
Lessons
- ML service providers should investigate implementing mechanisms to control hardware non-determinism, such as logging and sharing rounding decisions, to offer verifiable training options to clients.
- Clients delegating large-scale ML training should inquire about the verifiability of the training process and consider engaging independent auditors or using services that provide verifiable logs.
- Researchers should focus on optimizing the storage cost of verifiable training by developing methods to predict when critical floating-point divergences occur, rather than logging all intermediate computations.
Quotes
"If it was possible for two different parties to be able to train a model and get identical results then we wouldn't need a statistical test and we wouldn't be vulnerable to attacks."
"Everything else is identical like the initialization, the data being trained on, the data ordering, all that has changed is the GPU type that model training was carried on over and still that led to pretty fundamentally different models."