First Principles Series · Part 2
A fair comparison of World Models and JEPA (since I couldn't find one)
The math, the costs, the production case. The article I would have wanted to read a month ago.
TL;DR for the busy reader
World Models predict the future in pixel space. They are expensive to train and waste capacity on visual detail, but they produce simulators you can plan with. JEPA predicts the future in latent space. It is cheaper, more sample-efficient, harder to interpret, and dominant for perceptual tasks at the edge.
The most concrete public proof is Nexar's BADAS 2.0 for collision anticipation: built on V-JEPA 2 (300M params), it reaches 99.4% Average Precision and 91.3% Early Warning Recall on Nexar's long-tail benchmark, against 94.0% / 48.3% for a fine-tuned NVIDIA Cosmos-Reason2-2B on the same data. The smallest BADAS variant (22M params) still beats Cosmos with 91× fewer parameters.
The cost side is just as informative: serious JEPA pre-training runs around $86k, fine-tuning a public checkpoint around $2.3k, an RL world model on a single GPU around $150–700. The compute moat is mostly in pre-training, which Meta and NVIDIA are running for you. Most teams should fine-tune, not pre-train.
If you follow the AI debate, you have heard about World Models and JEPA. They are two of the most discussed architectures in robotics and autonomous systems. The discussion is often confused, though: the terms are used interchangeably, the differences are not made clear, and the technical details that would let you understand what these systems actually do are almost always missing.
This article is an attempt to clear that up. We will start from the problem both architectures try to solve, then see how they solve it in fundamentally different ways, with concrete numerical examples. After a public production case (BADAS 2.0) that grounds the discussion, we will spend a long section on the representation collapse problem, which is the actual technical heart of JEPA and which is usually dismissed in one line. We will close with the part that is most often left out of similar articles: the hardware, VRAM, and cost numbers you need to actually train these models, and what to do with all of this on Monday morning.
The problem: predicting what comes next
Think about how your brain works when you cross a street. You do not just react to what you see right now. You imagine where the car will be in two seconds. You predict what happens if you speed up. You simulate alternative scenarios in your head. This capacity for mental simulation is what World Models and JEPA try to give to machines.
Why is it important? Because a system that can only react to present stimuli is structurally limited. A robot that grasps an object needs to predict where its hand will be 100 milliseconds from now. A self-driving car needs to anticipate the behaviour of other vehicles. An AI agent that plans needs to reason about the consequences of its actions before acting.
The technical problem is: how do you teach a machine to predict the future when you do not have "ground-truth future" labels? The answer both architectures give is the same. Train by self-supervision. Show the model a sequence, hide part of it, ask the model to guess the hidden part. If it succeeds, it has learned something about the structure of the world.
The two architectures diverge on what to hide and what to ask as the prediction. World Models hide future frames and ask the model to reconstruct them pixel by pixel. JEPA hides patches of an image or video and asks the model to predict only the latent representation of those patches. The difference looks subtle and is actually deep.
The distinction that governs everything else
World Models: predict future pixels → optimise for visual fidelity.
JEPA: predict latent representations → optimise for semantic structure.
This looks like a minor design choice. It changes the compute profile, the data efficiency, the generalisation properties, and the tasks each architecture is suited for.
World Models: simulating the world internally
The World Model concept was formalised by Ha and Schmidhuber in 2018 in a paper of the same name [1]. The idea is to build an internal simulator of the world that lets an agent "imagine" what happens if it takes a given action, without having to actually take it.
The fundamental equation
A World Model is, in its simplest form, a function that takes the current state of the world and an action, and produces the next state: st+1 = fθ(st, at).
Notation used throughout this article
xt — raw observation at time t (e.g., a 256×256×3 image, so 196,608 numbers).
st — state, i.e., the compressed latent representation of the world at time t (e.g., a vector of 256 numbers, produced by the encoder).
at — action taken at time t.
rt — reward signal at time t (RL setting only).
eφ, fθ, dψ — encoder (params φ), dynamics (params θ), decoder (params ψ).
x̂, ŝ — predicted versions of x and s.
Two notational notes worth flagging up front. In Dreamer's RSSM, the state st is itself decomposed into two parts, written as the pair (ht, zt) following the paper's convention. The h is the deterministic recurrent state, the z is the stochastic categorical component. This Dreamer-specific z is therefore a sub-component of the latent state, not the full state st. In the JEPA section later on, we use z (with subscripts like zctx, ztgt) for JEPA latents, following the JEPA literature's convention. These are conceptually the analogue of s in World Models but the literature names them z. Both notations are kept deliberately, to make this article match what the reader will see in the original papers.
The observation xt is too large to manipulate directly. A 256×256 RGB image has 196,608 numbers. A 1-second video at 30 fps has nearly 6 million. World Models therefore work in a compressed latent space — the state st — produced by an encoder, with three distinct components.
The three components
World Model architecture
Encoder compresses, Dynamics evolves, Decoder reconstructs. Training is driven by pixel-wise MSE on the reconstructed frame.
Encoder eφ. Compresses the high-dimensional observation into a compact latent representation. Typically a CNN followed by an MLP for images, a transformer for video. A 256×256 RGB image (196,608 numbers) becomes a vector of 256 or 512 numbers. This compression forces the model to discard noise.
Dynamics Model fθ. The core of the World Model. It takes the latent state st and the action at and predicts st+1. Typically an RNN (GRU, LSTM) or a transformer. Dreamer uses a variant called RSSM (Recurrent State-Space Model) that is worth looking at in detail.
Decoder dψ. Reconstructs pixels from the predicted latent state. Its main role is during training: the reconstruction loss (pixel-wise MSE) is the signal that forces encoder and dynamics to preserve visual information.
RSSM: the Dreamer trick
The dynamics model in Dreamer does not produce a single monolithic state. It decomposes the state st into a deterministic part ht (a GRU hidden state that accumulates history) and a stochastic part zt (a sample from a categorical distribution). The motivation is that the world is intrinsically uncertain, and modelling that uncertainty explicitly improves long-horizon prediction.
Notation reminder. In Dreamer's RSSM we are now operating inside the latent state: st = (ht, zt). The symbol zt below refers specifically to the stochastic categorical component, following the DreamerV3 paper. It is one of the two components of the state, not a re-labelling of the encoder output we used earlier in the simple WM diagram.
The DreamerV3 [4] total loss is a weighted sum of three terms: reconstruction loss (pixel MSE), KL divergence between prior and posterior (forces the prior to be predictive), and reward loss (a separate MLP predicts the reward from st). This decomposition follows directly from the variational lower bound (ELBO) for sequential models. DreamerV3 keeps the KL balancing introduced in DreamerV2: the prior is allowed to move faster towards the posterior than the posterior towards the prior, with separate stop-gradients on each side. It is a small detail in the loss code and one of the reasons DreamerV3 trains stably across very different environments without per-task hyperparameter tuning.
Numerical example
Example: a robot observing an object
Input: a 256×256×3 video frame xt = 196,608 float32 values = 768 KB.
After the encoder: a latent state st of dimension 256 = 1 KB.
Compression: 196,608 / 256 = 768×. The encoder must decide what to keep.
Action: at = "move arm right" = a vector of 7 values, one per joint of a 7-DOF arm (a redundant manipulator like the Franka Panda, with one joint more than the 6 strictly needed to place and orient the hand in space). Note this is joint space, not the 6-DOF Cartesian pose of the end-effector.
Dynamics Model: concat(st, at) has dimension 263 → MLP → st+1 of dimension 256.
Decoder: st+1 → 196,608 predicted pixels for the next frame, x̂t+1.
Loss: pixel-wise MSE = (1/196,608) Σ (x̂i − xi)².
The 768× compression is the critical point. The encoder is free to choose which 256 numbers to use. If it chooses badly, the decoder will fail to reconstruct and the loss will be high. So the gradient flow forces the encoder to preserve whatever information matters for pixel reconstruction. That, importantly, includes reflections, textures, and shadows: details that are irrelevant for the decision.
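The arithmetic of the robot example can be reproduced in a few lines. This is a minimal sketch using only the numbers quoted above; the constant names and the `pixel_mse` helper are illustrative, not taken from any Dreamer codebase.

```python
# Sizes from the worked example: one RGB frame, a 256-d latent, a 7-d action.
HEIGHT, WIDTH, CHANNELS = 256, 256, 3
LATENT_DIM = 256          # size of the state s_t
ACTION_DIM = 7            # one value per joint of a 7-DOF arm
BYTES_PER_FLOAT32 = 4

pixels_per_frame = HEIGHT * WIDTH * CHANNELS
frame_kib = pixels_per_frame * BYTES_PER_FLOAT32 / 1024
latent_kib = LATENT_DIM * BYTES_PER_FLOAT32 / 1024
compression = pixels_per_frame / LATENT_DIM
dynamics_input_dim = LATENT_DIM + ACTION_DIM   # concat(s_t, a_t)


def pixel_mse(predicted, target):
    """Pixel-wise mean squared error between two equally sized flat frames."""
    if len(predicted) != len(target):
        raise ValueError("frames must have the same number of values")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


print(pixels_per_frame, frame_kib, latent_kib, compression, dynamics_input_dim)
```

Every pixel enters `pixel_mse` with the same weight, which is the point of the example: the loss has no notion of which values matter for the decision.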
The "dream to learn" loop
Once the world model is trained, Dreamer uses it to generate imagined trajectories. The policy is not trained on real data (slow, expensive) but on the world model's dreams. A real robot typically interacts with the environment at tens of steps per second; the world model can produce on the order of thousands of imagined transitions per second on a single GPU, depending on architecture. This is the source of Dreamer's data efficiency.
The Dreamer loop: act, imagine, improve
Real experience feeds the world model. The world model feeds the policy. The improved policy generates new real experience.
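The "imagine" step of the loop is, structurally, just a rollout of the learned dynamics under the current policy. The sketch below shows that skeleton with toy stand-ins: `toy_dynamics` and `toy_policy` are hypothetical one-dimensional functions, not Dreamer's RSSM or actor, and no learning happens here.

```python
def imagine_rollout(dynamics, policy, start_state, horizon):
    """Roll a dynamics model forward under a policy, without a real environment."""
    trajectory = [start_state]
    state = start_state
    for _ in range(horizon):
        action = policy(state)            # a_t chosen from the imagined state
        state = dynamics(state, action)   # s_{t+1} = f(s_t, a_t)
        trajectory.append(state)
    return trajectory


# Toy stand-ins (hypothetical): a 1-D state that the policy pushes towards zero.
def toy_dynamics(state, action):
    return state + action


def toy_policy(state):
    return -0.5 * state


dream = imagine_rollout(toy_dynamics, toy_policy, start_state=8.0, horizon=3)
print(dream)
```

In a real system the trajectory would be scored by a learned reward head and used to update the policy; only the rollout structure is shown here.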
The pixel reconstruction problem
World Models have a structural limitation: the decoder. To train the model you need a loss signal, and that signal is pixel-wise MSE between the predicted frame and the real frame. The gradient flowing back through encoder and dynamics is computed with respect to visual fidelity, not with respect to representation utility.
The practical consequence is that the encoder spends capacity preserving information the decoder can then reconstruct. Reflections, shadows, textures, background details: all of these enter the loss with the same weight as the things that actually matter for decisions.
Example: a glass on a table
For the decision (do not push it too hard) you need:
- position of the glass
- direction and velocity of your hand
- distance from the edge
What the World Model has to predict (pixel reconstruction):
- how light reflections on the glass surface change
- how shadows move on the table
- the exact wood texture
- the exact colour of every background pixel
A large fraction of the model's representational capacity is committed to reconstructing detail that the decision does not depend on. This is why models like Sora produce beautiful videos but are weak as planners.
This observation is what led Yann LeCun to propose JEPA. If the problem is that the decoder forces reconstruction of useless pixels, the solution is to remove the decoder.
JEPA: predicting without reconstructing
In 2022 LeCun published "A Path Towards Autonomous Machine Intelligence" [6], a 60-page position paper criticising the generative approach and proposing an alternative: JEPA, the Joint-Embedding Predictive Architecture.
The idea: instead of predicting future pixels (or masked patches), predict the latent representation of what is hidden. The loss is no longer pixel-wise MSE. It is a distance in embedding space.
The architecture
JEPA uses three components.
Context encoder eφ. Receives the context (visible patches of the image, or past frames of a video) and produces its representation zctx.
Target encoder eφ'. Receives the target (masked patches, or future frames) and produces ztgt, which acts as "ground truth" in latent space. Critical detail: the parameters φ' are not updated by the gradient. They are updated as an exponential moving average (EMA) of φ.
Predictor pψ. Receives zctx and tries to predict ẑtgt. Typically much lighter than the encoders (6 transformer layers against the 24 of the encoder for ViT-L).
JEPA architecture
Two asymmetric encoders, a predictor, no decoder. The loss lives in latent space.
The formal loss, expanded
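Let us write it out properly. Let x be an image, M(x) the context (visible patches), N(x) the target (masked patches). Let φ be the online encoder parameters, φ' the target encoder parameters, ψ the predictor parameters. The loss is the squared distance between the predicted and the target latents, with no gradient flowing into the target branch: L(φ, ψ) = ||pψ(eφ(M(x))) − sg(eφ'(N(x)))||², where sg(·) denotes stop-gradient.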
Numerical example: I-JEPA on a 224×224 image
How a Vision Transformer turns an image into tokens (ViT-L/16)
The image is cut into a grid of 16×16-pixel patches. Each patch becomes one token, the same way a word is a token in a sentence. A 224×224 image gives 14×14 = 196 tokens.
Example: I-JEPA, ViT-L/16
Input: a 224×224×3 image. With 16×16 patches, 224 ÷ 16 = 14 patches per side, so the grid is 14×14 = 196 patches. Each patch is one token for the transformer.
Masking: sample 4 target blocks, each covering 15–20% of the image. The context is what is left.
Context tokens: around 50 visible patches → encoder produces 50 vectors in ℝ^1024.
Predictor input: 50 context vectors + position embeddings for the masked patches → produces ẑ for each target patch.
Target encoder (EMA): around 146 masked patches → produces 146 "true" vectors in ℝ^1024, no grad.
Loss: MSE averaged over the 146 vectors of dim 1024 = (1/(146·1024)) Σ (ẑij − zij)².
What does not happen: no decoder, no pixel reconstruction, no image generation. The network learns without ever producing humanly interpretable output.
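The token bookkeeping of the I-JEPA example can be checked directly. This is a sketch of the counting only: the "around 50" context tokens are fixed at exactly 50 for the arithmetic, and `latent_mse` is an illustrative helper, not the I-JEPA training code.

```python
IMAGE_SIZE = 224
PATCH_SIZE = 16
EMBED_DIM = 1024        # ViT-L token width
CONTEXT_TOKENS = 50     # "around 50" in the example, fixed here for the arithmetic

grid_side = IMAGE_SIZE // PATCH_SIZE
total_tokens = grid_side * grid_side
target_tokens = total_tokens - CONTEXT_TOKENS
loss_terms = target_tokens * EMBED_DIM          # number of squared differences averaged
masked_fraction = target_tokens / total_tokens


def latent_mse(predicted, target):
    """MSE over a list of embedding vectors (each a list of floats)."""
    total, count = 0.0, 0
    for pred_vec, tgt_vec in zip(predicted, target):
        for p, t in zip(pred_vec, tgt_vec):
            total += (p - t) ** 2
            count += 1
    return total / count


print(grid_side, total_tokens, target_tokens, loss_terms, round(masked_fraction, 3))
```

The loss averages over roughly 150 thousand latent values per image, compared with the 150,528 pixel values (224 · 224 · 3) a reconstruction loss would touch, but each latent value is a learned feature rather than a raw intensity.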
BADAS 2.0: JEPA in production
Before we go deeper into the collapse problem, it is worth checking that the JEPA approach actually works at production scale. The most concrete and well-documented public case is BADAS 2.0 by Nexar [17] [18], a collision anticipation system that predicts road incidents before they happen. The Nexar site publishes full benchmarks and direct comparisons with a generative competitor, which is rare and lets us check JEPA's thesis on a safety-critical task.
BADAS 2.0 uses V-JEPA 2 as its backbone: a ViT-L Vision Transformer with 300M parameters, fine-tuned end-to-end on roughly 200,000 labeled videos (~2M windowed clips) collected from a 350,000-dashcam Nexar fleet. It is used by commercial fleets, insurance companies, and AV development programs.
The controlled comparison with Cosmos-Reason2
The most interesting experiment Nexar has published is a controlled comparison. They took NVIDIA Cosmos-Reason2-2B (a 2B-parameter vision-language model post-trained from Qwen3-VL-2B-Instruct, designed for embodied reasoning) and fine-tuned it on the same training data as BADAS 2.0. Same training set, same protocol, same task.
| Metric | BADAS 2.0 (V-JEPA 2) | COSMOS-BADAS |
|---|---|---|
| Average Precision | 99.4% | 94.0% |
| Early Warning Recall | 91.3% | 48.3% |
| Training data | 2M real-world clips | 2M real-world clips (same) |
| Main model (the 99.4% / 91.3% above) | 300M params | 2B params |
| Smallest viable variant | 22M params (98.4% AP) | none below 2B (cloud only) |
| Architecture | JEPA (non-generative) | Autoregressive VLM |
| Explainability | Native attention maps | Chain-of-thought text |
The most telling metric is early warning recall: 91.3% for BADAS against 48.3% for Cosmos. BADAS detects danger ahead of time nine times out of ten. Cosmos does so less than half the time. In the context of collision anticipation, "late" equals "wrong".
One clarification, because Nexar's own published table compresses it. The 99.4% AP and 91.3% recall above belong to the main BADAS 2.0 model, the 300M V-JEPA 2 ViT-L. BADAS also ships distilled variants, and the smallest of them, Flash Lite at 22M parameters, scores a slightly lower 98.4% AP. So the headline accuracy and the headline tiny model are technically two different models in the same family. Both comparisons against Cosmos hold, they are just not the same row.
That 22M variant is the one that matters most. A 22M-parameter model beating a 2B-parameter model on the same task, with 91× fewer parameters, is where the architectural difference stops being a technical detail and becomes a fundamental efficiency gap.
Why JEPA wins here
Two reasons, both worth separating. The first is the prior each model carries into fine-tuning. V-JEPA 2 was pre-trained on more than 1M hours of internet video [9], internalising at scale the regularities of physical motion — objects falling, pedestrians changing direction, vehicles braking. Cosmos-Reason2 was post-trained from a generalist vision-language model with no comparable exposure to long-form video physics. Same fine-tuning data on top, very different priors underneath.
The second reason is structural — what each architecture is even computing. To predict accidents you do not need to reconstruct what the scene will look like in the next frame. You need to capture whether that car is braking, whether that pedestrian is committing to crossing. V-JEPA 2 optimises directly for those semantic invariances, in latent space. Cosmos-Reason2, even after fine-tuning, is still organised around generating tokens of free-form reasoning, with the latency, verbosity, and overconfidence problems that follow.
A detail worth noting on generalisation: BADAS 2.0 also works on scenarios that have nothing to do with road driving, such as drones in flight, cleaning robots, and forklifts. The model learned general physics during V-JEPA 2 pre-training on more than a million hours of video, not just road rules. Nexar's own framing is "Beyond the Road": any machine with a camera that moves through the physical world can use BADAS, not because it was trained on every environment, but because it learned how the physical world works. Yann LeCun, who sits on the Nexar board, comments: "Models don't emerge from abstractions alone. They come from sustained exposure to reality."
That JEPA wins this comparison at all rests on one technical condition: during pre-training, the system did not collapse into a useless trivial solution. This is the topic of the next section, and it is the part of the story most articles skip in a sentence. It is worth slowing down on.
The collapse problem, in detail
This is the section worth reading in full. Representation collapse is the central problem of all non-contrastive self-supervised methods, and understanding it means understanding why JEPA architectures look the way they do and could not have been designed differently.
The problem, formulated
Consider the JEPA loss with both encoders sharing parameters (no EMA for a moment): L(φ, ψ) = ||pψ(eφ(M(x))) − eφ(N(x))||². What is the trivial minimum?
A neural network minimising this loss without constraints will happily converge to the constant solution. All representations collapse onto the same point, the loss goes to zero, and the model has learned nothing. This is complete collapse.
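To make the trivial minimum concrete, here is a toy sketch of the shared-encoder loss with a degenerate encoder plugged in. The string inputs and the function names are placeholders for illustration; nothing here is a trained network.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def collapsed_encoder(patches):
    """A degenerate encoder: ignores its input and always returns the same vector."""
    return [0.5, 0.5, 0.5, 0.5]


def identity_predictor(z):
    return z


def shared_encoder_loss(encoder, predictor, context, target):
    """L = || p(e(M(x))) - e(N(x)) ||^2 with one encoder on both branches."""
    return squared_distance(predictor(encoder(context)), encoder(target))


# Placeholder (context, target) pairs; their content never reaches the loss.
pairs = [("cat-left-half", "cat-right-half"), ("road", "sky")]
losses = [shared_encoder_loss(collapsed_encoder, identity_predictor, c, t) for c, t in pairs]
print(losses)
```

The loss is exactly zero for every input pair, and the encoder carries no information about any of them. Nothing in the objective itself distinguishes this solution from a useful one.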
There is also a more subtle phenomenon, dimensional collapse, characterised by Jing, Vincent, LeCun and Tian in 2022 [13]. Here the representations do not collapse to a single point, but they live in a subspace whose dimension is much smaller than the embedding space. Empirically you detect it by computing the covariance matrix of the representations across a batch and counting how many eigenvalues are significantly above zero. If the encoder outputs vectors in ℝ^1024 but only 30 eigenvalues are non-trivial, there is dimensional collapse.
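A dependency-free way to see the diagnostic at work is sketched below. It swaps in a simpler technique than the eigenvalue count described above: it computes the numerical rank of the mean-centered batch by Gaussian elimination, which equals the number of non-zero covariance eigenvalues but does not tell you how large they are. On real embeddings you would use an eigenvalue or singular-value routine and a threshold; the tolerance and the toy batches here are assumptions.

```python
def center_columns(rows):
    """Subtract the per-dimension mean across the batch."""
    n, dims = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    return [[row[j] - means[j] for j in range(dims)] for row in rows]


def numerical_rank(rows, tol=1e-9):
    """Rank via Gaussian elimination with partial pivoting."""
    m = [list(r) for r in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) <= tol:
            continue                      # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


def used_dimensions(embeddings, tol=1e-9):
    """Number of directions along which the batch actually varies."""
    return numerical_rank(center_columns(embeddings), tol)


# Toy batches in a 4-d embedding space.
healthy = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
direction = [1.0, 2.0, -1.0, 0.5]
collapsed = [[t * d for d in direction] for t in range(5)]   # all points on one line

print(used_dimensions(healthy), used_dimensions(collapsed))
```

The second batch has four output dimensions but varies along only one of them, which is dimensional collapse in miniature.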
Why collapse is a property of the gradient flow
Tian, Chen and Ganguli in 2021 [12] gave a theoretical analysis of the problem for the linear case. Their key result: if both encoders are identical and updated with the same gradient, gradient flow inevitably leads to collapse. The steepest descent direction always points towards the constant solution.
To avoid this, you need asymmetries. The four families of solutions modern architectures use are:
| Strategy | Mechanism | Examples |
|---|---|---|
| Contrastive | Explicitly push positive pairs together and negative pairs apart via InfoNCE-style loss | SimCLR, MoCo |
| Asymmetric architecture | Stop-gradient + EMA target + predictor. Breaks the symmetry that causes collapse | BYOL, SimSiam, JEPA |
| Explicit regularisation | Loss terms that enforce variance and decorrelation of features | VICReg, Barlow Twins |
| Sharpening + centering | Temperature on the target softmax + batch centroid subtraction | DINO, DINOv2 |
JEPA belongs to the asymmetric architecture family and inherits directly from BYOL (Grill et al., DeepMind 2020 [10]), the work that first showed a non-contrastive method can be trained without collapsing. Let us look at why in detail.
The three asymmetries of BYOL/JEPA
Asymmetry 1 — Stop-gradient. The target encoder receives zero gradient. It cannot chase the online encoder; it can only be chased. Without this asymmetry both encoders would converge to the same trivial minimum. Chen and He 2021 [11] showed empirically that removing the stop-gradient in SimSiam causes immediate collapse to the trivial solution. The loss drops to zero and the per-channel standard deviation of the embeddings goes to zero with it.
Asymmetry 2 — EMA. The target encoder parameters are updated as a moving average of the online parameters: φ' ← τφ' + (1−τ)φ. With τ = 0.996, each step moves the target only 0.4% of the way towards the current online weights. This means the target encoder is a "slow" version of the online encoder and provides a stable training signal. SimSiam [11] showed that EMA is not strictly required (just stop-gradient + predictor can suffice in some regimes), but every JEPA-family work has retained it because empirically it stabilises training significantly.
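The EMA rule is short enough to write out. The sketch below applies it to plain Python lists standing in for parameter tensors; `steps_to_close_half_gap` is an illustrative helper that assumes the online encoder stays fixed, which it does not during real training.

```python
import math


def ema_update(target_params, online_params, tau=0.996):
    """phi' <- tau * phi' + (1 - tau) * phi, applied element-wise."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]


def steps_to_close_half_gap(tau):
    """Steps for the target to cover half the distance to a frozen online encoder."""
    return math.log(0.5) / math.log(tau)


target = ema_update([0.0, 10.0], [1.0, 0.0])
half_life = steps_to_close_half_gap(0.996)
print(target, round(half_life))
```

After one step the target has covered 0.4% of the gap in each coordinate. Under the frozen-online assumption it needs on the order of a couple of hundred steps to cover half of it, which gives a feel for how slowly the "ground truth" in latent space drifts.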
Asymmetry 3 — Predictor. The predictor pψ transforms zctx before comparison with ztgt. This transformation is learned and updated by the gradient. Tian, Chen, Ganguli [12] showed analytically — in the linear gradient-flow regime — that removing the predictor causes the dynamics to collapse onto the trivial solution. Empirically the same holds far outside that regime: every published JEPA-style training that removes the predictor collapses. The predictor cannot be the identity function; it has to learn to map online-encoder representations to a target that lags behind it.
The intuition behind the three asymmetries
The online encoder "wants" to produce representations the predictor can easily map onto the target. The target is a lagged version of the online. The only way the predictor stays useful as the target keeps moving is if the encoder produces representations rich enough to be predictable across that lag.
Put differently: the system is in a non-trivial fixed point where online encoder, predictor, and target encoder are in a dynamic equilibrium that requires meaningful representations to sustain itself. Collapse is another fixed point, but unstable under this dynamic.
Why V-JEPA fights a worse collapse
In video the problem is more severe. Two consecutive frames at 30 fps differ by milliseconds and are visually almost identical. If the model learns to "copy" the previous frame's representation, it scores a low loss without learning anything meaningful. The V-JEPA paper does not name this failure mode; it is descriptive shorthand to call it temporal shortcut or temporal collapse. The mitigations are explicit in the paper even if the name is not.
V-JEPA [8] fights this with three additional mechanisms compared to I-JEPA:
- Aggressive masking: up to 90% of the video is masked (against 75% in I-JEPA). This forces the model to use long-range information.
- Spatio-temporal block masking: masks are contiguous blocks extending across several consecutive frames. You cannot guess a masked pixel from its neighbours if those neighbours are also masked.
- Multi-block targets: the predictor has to predict several independent target blocks from the same context. This increases signal diversity and reduces the risk of learning shortcuts specific to one target.
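The masking ideas in the list can be illustrated on a token grid. This is a toy sketch: the block positions are hand-picked to land near 90% masking, whereas V-JEPA samples its blocks randomly, and the 8×14×14 grid assumes 16 frames at 224×224 with tubelets spanning 2 frames and 16×16 pixels.

```python
def tube_mask(frames, rows, cols, top, left, block_rows, block_cols):
    """Mask one spatial block and extend it across every temporal position."""
    masked = set()
    for t in range(frames):
        for r in range(top, min(top + block_rows, rows)):
            for c in range(left, min(left + block_cols, cols)):
                masked.add((t, r, c))
    return masked


FRAMES, ROWS, COLS = 8, 14, 14    # temporal positions x patch rows x patch columns

# Union of two hand-picked blocks (hypothetical positions, chosen for illustration).
mask = (tube_mask(FRAMES, ROWS, COLS, top=0, left=0, block_rows=11, block_cols=14)
        | tube_mask(FRAMES, ROWS, COLS, top=8, left=0, block_rows=6, block_cols=7))

total_tokens = FRAMES * ROWS * COLS
visible_tokens = total_tokens - len(mask)
masked_fraction = len(mask) / total_tokens
print(total_tokens, len(mask), visible_tokens, round(masked_fraction, 3))
```

Because each block spans all temporal positions, a masked patch has no unmasked copy of itself in a neighbouring frame, so copying from the adjacent frame is not available as a shortcut.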
Despite these mitigations, V-JEPA still shows partial dimensional collapse: the effective rank of the representation covariance matrix is significantly lower than the embedding dimension. The problem is not closed, it is managed at an acceptable level.
VICReg: an explicit alternative
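VICReg (Bardes, Ponce, LeCun, ICLR 2022 [14]) is worth mentioning as a conceptually different alternative. Instead of preventing collapse with asymmetry, VICReg prevents it with explicit regularisation. The loss has three terms, weighted by coefficients λ, μ, ν respectively: an invariance term (the distance between the embeddings of two views of the same input), a variance term (a hinge that keeps the standard deviation of each embedding dimension above a threshold), and a covariance term (a penalty on the off-diagonal entries of the covariance matrix, which decorrelates the dimensions).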
VICReg is cleaner mathematically (no architectural tricks, just explicit loss terms) but requires careful tuning of the coefficients λ, μ, ν. JEPA won in popularity because it trains more easily, even if the mechanism is less transparent. The choice between these families is still an open research question.
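For readers who prefer code to prose, here is a minimal pure-Python sketch of a VICReg-style loss on small batches. The three-term structure (invariance, variance, covariance) follows the paper; the normalisation details, the `eps` value, and the tiny list-based batches are simplifications for illustration, and a real implementation would use a tensor library.

```python
import math


def column(batch, j):
    return [row[j] for row in batch]


def mean(values):
    return sum(values) / len(values)


def invariance(za, zb):
    """Mean squared difference between paired embeddings of the two views."""
    return mean([sum((a - b) ** 2 for a, b in zip(ra, rb)) / len(ra)
                 for ra, rb in zip(za, zb)])


def variance_penalty(z, gamma=1.0, eps=1e-4):
    """Hinge that is positive whenever a dimension's std falls below gamma."""
    n, dims = len(z), len(z[0])
    total = 0.0
    for j in range(dims):
        col = column(z, j)
        col_mean = mean(col)
        var = sum((v - col_mean) ** 2 for v in col) / (n - 1)
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / dims


def covariance_penalty(z):
    """Sum of squared off-diagonal covariances, divided by the dimension."""
    n, dims = len(z), len(z[0])
    means = [mean(column(z, j)) for j in range(dims)]
    total = 0.0
    for i in range(dims):
        for j in range(dims):
            if i != j:
                cov = sum((row[i] - means[i]) * (row[j] - means[j]) for row in z) / (n - 1)
                total += cov ** 2
    return total / dims


def vicreg_loss(za, zb, lam=25.0, mu=25.0, nu=1.0):
    return (lam * invariance(za, zb)
            + mu * (variance_penalty(za) + variance_penalty(zb))
            + nu * (covariance_penalty(za) + covariance_penalty(zb)))


collapsed = [[1.0, 1.0]] * 4                                   # every sample identical
spread = [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]]    # varied, uncorrelated
print(vicreg_loss(collapsed, collapsed), vicreg_loss(spread, spread))
```

The collapsed batch has zero invariance loss, exactly like the trivial solution of the JEPA objective, but the variance hinge makes it expensive. That is the whole anti-collapse mechanism, stated as a loss term rather than as an architectural asymmetry.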
LLMs, Diffusion, and JEPA: the three families
The generative AI debate often frames the world as LLMs vs diffusion as if those were the only options. When the task is predicting the future, there are three fundamentally different approaches, and understanding their differences clarifies where JEPA sits.
Autoregressive (LLMs)
LLMs work by autoregressive prediction. Given x1:t, they model p(xt+1 | x1:t) as a categorical distribution over a discrete vocabulary. They generate one token at a time, sampling each from this distribution.
The approach can be applied to images and video by tokenising the input (VQ-VAE, VQGAN, or more recent discrete tokenisers). Models like iGPT, Parti, and NVIDIA Cosmos [16] use this paradigm. Their strength is composability with the transformer techniques matured for language: scaling laws, instruction tuning, RLHF.
Diffusion
Diffusion models (Stable Diffusion, DALL-E 3, Sora) implement an inverse stochastic process. They start from pure noise xT ~ N(0, I) and "denoise" it iteratively across T steps (typically 20–50 in modern models, up to 1000 in original DDPMs). The model is trained to predict the noise added at each step.
The structural difference from LLMs: diffusion models operate non-autoregressively. All pixels are refined simultaneously at each step. This enables exceptional visual quality but makes the model non-interactive in the classical sense: you cannot hand it an action and ask what happens.
JEPA (non-generative)
JEPA does not generate anything. It produces no pixels, no discrete tokens. It produces only latent representations. For this reason it is not directly comparable to LLMs and diffusion as a "generative model"; it is a separate category, that of discriminative self-supervised models.
Three paradigms compared
Only JEPA does not try to reconstruct the original output. Everything else follows from this.
The philosophical difference is deep. LLMs and diffusion try to reconstruct the exact output (the next token, the noise-free image). To do that they must model the full distribution p(x) or p(x | y), including all irrelevant details. JEPA only tries to capture the structure of the input, discarding details by construction.
For applications that care about what will happen rather than what it will look like, JEPA is theoretically more suitable. This is the exact intuition that led Nexar to build BADAS 2.0 on V-JEPA 2 instead of a generative model.
Which architecture to use, when
The structural differences summarised:
| Aspect | World Models | JEPA |
|---|---|---|
| What it predicts | Future pixels (x̂t+1) | Latent representations (ẑt+1) |
| Loss | Pixel MSE + KL + reward | Latent-space MSE |
| Decoder | Yes (required) | No (absent) |
| Action conditioning | Yes, central (f(s, a)) | No (pure encoder) |
| Can generate images | Yes | No |
| Anti-collapse | Not needed (decoder forces non-triviality) | Critical (stop-grad + EMA + predictor) |
| Pretraining compute (ViT-H/14, ImageNet) | ~12,000 GPU-hours (MAE, used here as the pixel-reconstruction baseline) | ~1,200 GPU-hours (I-JEPA): ~10× less |
| Typical use | Planning, RL, robotics | Encoder pretraining, perception |
| Models | DreamerV3, TD-MPC2, Cosmos | I-JEPA, V-JEPA, V-JEPA 2 |
Use a World Model when: you need an internal simulator to plan actions; you want to "dream" trajectories for reinforcement learning; you need visualisable predictions for debug or for explainability to non-technical users.
Use JEPA when: you want to pretrain an efficient encoder for downstream use; you care about classification, detection, or anomaly detection more than generation; your task is one where visual detail is noise (collision anticipation, anomaly detection); you have limited compute and abundant unlabeled data.
Combine them when: you want both. A JEPA-pretrained encoder can sit inside a World Model, replacing the standard VAE. This is exactly what recent work explores — a pure perception module (JEPA) + a dynamics module with action conditioning (World Model). V-JEPA 2-AC [9], the action-conditioned variant Meta released in June 2025, is a concrete step in this direction.
Hardware, VRAM, costs: the real numbers
The architectural discussion stays abstract until you confront the practical problem: what does it actually take to train one of these models? How much VRAM, on what hardware, for how long, with how much data? Answers change radically depending on whether you are doing pre-training from scratch, fine-tuning, or inference only. Worth doing the arithmetic.
The memory formula, from first principles
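For a model with N parameters trained with Adam in mixed precision (BF16 activations + FP32 master weights, the modern standard), GPU memory breaks down as follows: memory ≈ 16N + 34 · L · b · s · h bytes, plus framework overhead. The 16N term is 2N (BF16 weights) + 2N (BF16 gradients) + 4N (FP32 master weights) + 8N (the two FP32 Adam states). The second term is the activation memory with FlashAttention, where L is the number of layers, b the batch size, s the sequence length, and h the hidden dimension.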
Without FlashAttention, an extra term proportional to L · b · a · s² appears (a = attention heads), which dominates for long sequences. FlashAttention is essentially mandatory for video-scale sequence lengths today. Two more techniques can reduce memory by an order of magnitude. Gradient checkpointing reduces activations to O(√L) at the cost of ~30% extra time recomputing. FSDP / ZeRO-3 shards parameters, gradients, and optimizer state across G GPUs, cutting the per-GPU share of the 16N term by roughly a factor of G at the cost of significant inter-GPU traffic.
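To see why the quadratic term matters at video scale, the sketch below compares the two activation terms. The linear term uses the 34 · L · b · s · h estimate from this article; the coefficient 5 on the quadratic term is an assumption, taken from the commonly cited per-layer activation estimate for standard transformer blocks, so treat the absolute numbers as order-of-magnitude only.

```python
def linear_activation_bytes(layers, batch, seq_len, hidden):
    """Activation bytes that grow linearly in sequence length: 34 * L * b * s * h."""
    return 34 * layers * batch * seq_len * hidden


def attention_matrix_bytes(layers, batch, seq_len, heads):
    """Extra bytes for materialised attention matrices: 5 * L * b * a * s^2 (assumed coefficient)."""
    return 5 * layers * batch * heads * seq_len ** 2


def crossover_seq_len(hidden, heads):
    """Sequence length above which the quadratic term exceeds the linear one."""
    return 34 * hidden / (5 * heads)


# ViT-L-like shape: 24 layers, hidden 1024, 16 heads, 1,568 video tokens, batch 1.
linear = linear_activation_bytes(24, 1, 1568, 1024)
quadratic = attention_matrix_bytes(24, 1, 1568, 16)
print(linear / 1e9, quadratic / 1e9, crossover_seq_len(1024, 16))
```

Under these assumptions the crossover sits at a few hundred tokens, so a 196-token image stays in the linear regime while a 1,568-token video clip does not: the attention matrices alone would cost several times the rest of the activations.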
Worked example: fine-tuning V-JEPA 2 ViT-L (300M params)
Hand calculation: V-JEPA 2 ViT-L on 16-frame video at 224×224
Assumptions: ViT-L architecture (24 transformer layers, hidden dim 1024, 16 attention heads, ~300M params), BF16 mixed precision, Adam optimizer, FlashAttention enabled, no gradient checkpointing.
Parameter memory (16N): 300M · 16 = 4.8 GB. Note that 16N is not the weights alone. The raw BF16 weights are only 2N = 600 MB (that is the inference footprint). Training needs 16N because you also hold gradients, the FP32 master copy, and the two Adam states, per the formula above.
Sequence length: V-JEPA uses spatio-temporal tubelets of size 2×16×16, so 16 frames at 224×224 give (16/2) · (224/16)² = 8 · 196 = 1,568 tokens.
Activations per sample: 34 · 24 · 1,568 · 1,024 ≈ 1.31 GB.
With batch 8: 8 · 1.31 ≈ 10.5 GB activations.
Total per GPU: 4.8 + 10.5 + ~10% overhead ≈ 17 GB.
Verdict: comfortably fits on A100 40GB at batch 8. A100 80GB allows batch 24–32. H100 80GB trains roughly 3× faster at BF16, same memory. The Nexar BADAS 2.0 training setup uses 8 A100 80GB with a global batch around 256, which lines up with these numbers.
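The hand calculation above fits in one small function. This is a sketch of the same estimate (16 bytes per parameter, 34 · L · s · h bytes of activations per sample, 10% overhead); the function name and argument order are made up for this example.

```python
def training_memory_gb(params, layers, seq_len, hidden, batch, overhead=0.10):
    """Return (parameter-state GB, activation GB, total GB incl. overhead)."""
    state_bytes = 16 * params                                  # weights, grads, master copy, Adam
    activation_bytes = 34 * layers * seq_len * hidden * batch  # FlashAttention assumed
    total_bytes = (state_bytes + activation_bytes) * (1 + overhead)
    return state_bytes / 1e9, activation_bytes / 1e9, total_bytes / 1e9


# 16 frames at 224x224 with 2x16x16 tubelets.
tokens = (16 // 2) * (224 // 16) ** 2
state_gb, act_gb, total_gb = training_memory_gb(
    params=300e6, layers=24, seq_len=tokens, hidden=1024, batch=8)
print(tokens, round(state_gb, 1), round(act_gb, 1), round(total_gb, 1))
```

Changing `batch` or `seq_len` shows quickly where the budget goes: at this scale activations, not parameters, are the larger share.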
Same calculation for Cosmos-Reason2-2B
Hand calculation: Cosmos-Reason2-2B fine-tuning
Cosmos-Reason2-2B is a vision-language model post-trained from Qwen3-VL-2B-Instruct. It produces text reasoning over video, so its training memory profile is dominated by long autoregressive contexts.
Assumptions: 2B total parameters, Qwen3-VL-2B-like architecture (estimated 28 transformer layers and hidden dim ≈ 2048; exact specs not disclosed by Alibaba), BF16 mixed precision, FlashAttention enabled, video tokenized at fps=4 with moderate context length.
Parameter memory (16N): 2B · 16 = 32 GB (weights + gradients + FP32 master + Adam states, as above). Already exceeds A100 40GB by itself.
Activations: at hidden 2048, 28 layers, sequence 4,096 tokens, batch 4: 34 · 28 · 4 · 4,096 · 2,048 ≈ 32 GB.
Total: 32 + 32 + overhead ≈ 70 GB, at modest batch and sequence. Numbers shift ±20% with the actual hidden dim.
Verdict: A100 80GB minimum, batch must stay small. For serious training, 8 A100 80GB with FSDP or H100 80GB. This is the structural reason BADAS Flash Lite (22M V-JEPA 2) beats Cosmos-Reason2 (2B): at equivalent VRAM cost, JEPA allows for both larger effective models and larger batches.
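To see how sharding and checkpointing change this picture, the sketch below applies the two rules of thumb from earlier to the Cosmos-sized estimate: FSDP / ZeRO-3 divides the 16N parameter state by the number of GPUs, and gradient checkpointing scales activations as √L instead of L. Both are idealised assumptions (real FSDP temporarily gathers full layers, and real checkpointing depends on where the checkpoints are placed), and `per_gpu_memory_gb` is a hypothetical helper, not a library function.

```python
import math


def per_gpu_memory_gb(params, layers, hidden, seq_len, batch_per_gpu,
                      n_gpus=1, checkpointing=False, overhead=0.10):
    """Idealised per-GPU training memory in GB (1 GB = 1e9 bytes)."""
    # 16 bytes/param of training state, sharded evenly across GPUs (assumption).
    state_gb = params * 16 / n_gpus / 1e9
    # 34 bytes per (layer, token, hidden unit) of activations with FlashAttention.
    act_gb = 34 * layers * batch_per_gpu * seq_len * hidden / 1e9
    if checkpointing:
        # Idealised O(sqrt(L)) scaling: keep about sqrt(L) layers of activations.
        act_gb *= math.sqrt(layers) / layers
    return (state_gb + act_gb) * (1 + overhead)


# Estimated Cosmos-Reason2-2B shape from the assumptions above.
cosmos = dict(params=2e9, layers=28, hidden=2048, seq_len=4096, batch_per_gpu=4)

single = per_gpu_memory_gb(**cosmos)
sharded = per_gpu_memory_gb(**cosmos, n_gpus=8)
sharded_ckpt = per_gpu_memory_gb(**cosmos, n_gpus=8, checkpointing=True)

for label, gb in [("1 GPU", single), ("8 GPUs, FSDP", sharded),
                  ("8 GPUs, FSDP + checkpointing", sharded_ckpt)]:
    print(f"{label}: {gb:.1f} GB per GPU")
```

Under these assumptions the single-GPU figure lands near 70 GB, sharding over 8 GPUs brings it to about 40 GB, and adding checkpointing to about 11 GB. Note that sharding only shrinks the parameter state: the activation term is untouched, which is why the per-GPU batch still has to stay small without checkpointing.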
Hardware: when to use what
GPU choice is not just "bigger is better". The relevant axes are VRAM (capacity), bandwidth (for data-intensive training), and numerical format support (FP8 changes the picture on H100). Cloud prices below reflect mid-2026 market rates on specialised GPU providers (Lambda, RunPod, Jarvislabs, Spheron, Thunder Compute); hyperscalers (AWS, GCP, Azure) typically charge 2–6× more. Spot pricing can be 40–60% cheaper than on-demand. Prices drift weekly.
| GPU | VRAM | BF16 TFLOPS | Cloud cost (on-demand) | When to use |
|---|---|---|---|---|
| RTX 4090 | 24 GB | 165 | $0.3–0.6/hr | Inference, dev, fine-tuning <1B params |
| L40S | 48 GB | 362 | $0.8–1.5/hr | Fine-tuning, batch-large inference |
| A100 40GB | 40 GB | 312 | $0.6–1.5/hr | Standard for JEPA ViT-L fine-tuning |
| A100 80GB | 80 GB | 312 | $1.5–2.5/hr | Pre-training, 1–3B parameter models |
| H100 80GB | 80 GB | 989 (BF16) / 1979 (FP8) | $2.0–3.5/hr | Serious training, large models, large batches |
| H200 | 141 GB | 989 | $3.5–6.0/hr | Very large models without FSDP |
| B200 (Blackwell) | 192 GB | ~2250 (dense) | $5–10/hr | Foundation model pre-training |
The A100 → H100 step usually pays for itself for pure training: H100 costs roughly 1.5–2× more per hour but trains roughly 3× faster in BF16 and 6× faster in FP8 (relative to A100 BF16). Payback is fast on large jobs. For inference, A100 and L40S remain competitive.
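The payback argument is just a ratio: what matters is the hourly price multiplied by the hours needed, not the hourly price alone. The sketch below works this out for the price and speed-up ranges quoted above; `relative_job_cost` is an illustrative helper name, and the calculation assumes the job scales perfectly with the quoted speed-up, which real workloads rarely do.

```python
def relative_job_cost(price_ratio, speedup):
    """Cost of a fixed training job on the faster GPU, relative to the baseline.

    price_ratio: hourly price of the new GPU divided by the baseline's price.
    speedup: baseline wall-clock time divided by the new GPU's wall-clock time.
    """
    return price_ratio / speedup


# H100 vs A100 BF16, using the ranges quoted in the text (1.5-2x price, 3x / 6x speed).
for label, speedup in [("BF16", 3.0), ("FP8", 6.0)]:
    low = relative_job_cost(1.5, speedup)
    high = relative_job_cost(2.0, speedup)
    print(f"H100 {label}: {low:.2f}x to {high:.2f}x the A100 job cost")
```

Any result below 1.0 means the more expensive GPU is cheaper per finished job. With the quoted ranges the H100 job costs between a quarter and two thirds of the A100 job.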
Four concrete scenarios, with costs and latencies
Scenario A: V-JEPA pre-training from scratch (industrial)
A serious replication of V-JEPA at the scale of the original paper.
Assumption: the V-JEPA paper does not disclose exact GPU count or wall-clock training time. Numbers below are an order-of-magnitude estimate based on typical FAIR-cluster training for ViT-L on ~2M videos.
Model: ViT-L (~300M params).
Dataset: ~2M videos from public corpora, 16-frame clips.
Hardware: ~128 A100 80GB (estimate).
Time: ~14 days for full training (estimate).
Cost: 128 · $2/hr · 336 hr ≈ $86,000.
V-JEPA 2 [9], trained on over 1M hours of internet video, sits well above this. Replicating V-JEPA 2 at full scale is realistically a 7-figure compute budget. For most adopters, starting from Meta's public checkpoints is the only sensible choice.
Scenario B: V-JEPA 2 fine-tuning on a proprietary task (typical)
The real use case for 95% of adopters.
Model: V-JEPA 2 ViT-L pre-trained + classification head.
Dataset: 200k labelled videos (~2M windowed clips, BADAS scale).
Hardware: 8 A100 80GB, global batch 256.
Time: 5–7 days for 20–30 epochs.
Cost: 8 · $2/hr · 144 hr ≈ $2,300 (144 hr is 6 days, the midpoint of the range).
Cost ratio against full pre-training: ~37× ($86,000 / $2,300). This is why foundation model + fine-tuning is the dominant paradigm.
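The arithmetic behind Scenarios A and B is the same three-factor product, so it is easy to rerun with your own GPU count, hourly rate, or duration. A minimal sketch, using the $2/hr A100 80GB rate and the durations assumed above (`compute_cost_usd` is an illustrative helper name):

```python
def compute_cost_usd(n_gpus, usd_per_gpu_hour, hours):
    """On-demand compute cost: GPUs x hourly rate x wall-clock hours."""
    return n_gpus * usd_per_gpu_hour * hours


# Scenario A: 128 GPUs for 14 days. Scenario B: 8 GPUs for 6 days (midpoint of 5-7).
pretrain = compute_cost_usd(n_gpus=128, usd_per_gpu_hour=2.0, hours=14 * 24)
finetune = compute_cost_usd(n_gpus=8, usd_per_gpu_hour=2.0, hours=6 * 24)
ratio = pretrain / finetune

print(f"pre-training ${pretrain:,.0f}, fine-tuning ${finetune:,.0f}, "
      f"ratio {ratio:.0f}x")

# Sensitivity of Scenario B to the 5-7 day range.
finetune_low = compute_cost_usd(8, 2.0, 5 * 24)
finetune_high = compute_cost_usd(8, 2.0, 7 * 24)
print(f"fine-tuning range ${finetune_low:,.0f} to ${finetune_high:,.0f}")
```

The figures cover GPU rental only; storage, data transfer, failed runs, and hyperparameter sweeps are not included and can easily multiply the total.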
Scenario C: DreamerV3 training on an RL task
Typical for academic robotics or industrial R&D.
Model: DreamerV3 medium variant, ~25M parameters total (WM + actor + critic).
Dataset: collected online via environment interaction, ~1–10M env steps.
Hardware: 1 A100 40GB (sufficient; the bottleneck is environment throughput).
Time: 3–10 days depending on environment parallelisation.
Cost: $150–700.
DreamerV3 is almost always environment-bound, not GPU-bound. Adding GPUs does not help if the environment cannot be parallelised. The DreamerV3 paper reports that all Dreamer agents are trained on a single A100 GPU each, including the Minecraft diamond experiment.
Scenario D: production inference (always)
After training, the model has to run somewhere. The numbers change significantly.
BADAS 2.0 (300M): 34 ms on A100, 41 ms on Jetson Thor (automotive edge).
BADAS Flash Lite (22M): 2.8 ms on A100, 5.9 ms on Jetson Thor.
DreamerV3 inference: <10 ms on a consumer GPU for the policy alone.
Cosmos-Reason2-2B inference: ~500 ms to several seconds per clip, depending on reasoning length, because of autoregression.
For edge deployment, model size is critical. Here JEPA has a structural advantage: a 22M-parameter encoder runs on a Jetson in a few milliseconds, while a 2B-parameter autoregressive model cannot meet a real-time budget there.
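A latency figure is easier to reason about once it is turned into a call rate and compared against an update budget. The sketch below does that for the BADAS latencies quoted above. The 10 Hz target is a hypothetical example, not a figure from Nexar, and the calculation assumes one stream, sequential calls, and no batching or pipelining.

```python
# Latencies in milliseconds as quoted above; the dictionary layout is illustrative.
latencies_ms = {
    ("BADAS 2.0 (300M)", "A100"): 34.0,
    ("BADAS 2.0 (300M)", "Jetson Thor"): 41.0,
    ("BADAS Flash Lite (22M)", "A100"): 2.8,
    ("BADAS Flash Lite (22M)", "Jetson Thor"): 5.9,
}


def max_calls_per_second(latency_ms):
    """Upper bound on sequential inference calls per second for one stream."""
    return 1000.0 / latency_ms


def fits_budget(latency_ms, update_hz):
    """True if one call finishes within a single update period."""
    return latency_ms <= 1000.0 / update_hz


for (model, device), ms in latencies_ms.items():
    rate = max_calls_per_second(ms)
    ok = fits_budget(ms, update_hz=10)  # 10 Hz is a hypothetical target rate
    print(f"{model} on {device}: {rate:.1f} calls/s, fits 10 Hz budget: {ok}")
```

By the same check, a model that needs around 500 ms per clip fits a 2 Hz budget at best, which is the practical meaning of the autoregression penalty for real-time perception.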
Data efficiency: JEPA vs World Model
Data efficiency is a different metric from compute efficiency, and the architectures separate interestingly here.
I-JEPA [7] reaches competitive ImageNet linear probing with a ViT-H/14 trained for 300 epochs in roughly 1,200 GPU-hours. The I-JEPA paper reports being "over 10× more efficient than a ViT-H/14 pretrained with MAE" (Masked Autoencoder, generative reconstruction): MAE needs about 10× more compute for similar linear-probing accuracy on the same architecture. The reason is structural: I-JEPA has no pixel decoder to train and computes its loss in latent space, which makes it significantly more compute- and sample-efficient.
V-JEPA 2 was pre-trained on more than 1M hours of internet video, on the order of tens of millions of clips. Fine-tuning for the BADAS task used 200k labelled videos. For comparison, Sora (diffusion video) is estimated to have been trained on tens of millions of hours of video, at a cost in the hundreds of millions of dollars.
DreamerV3 has a particular data-efficiency profile: it shines with little data because most of the learning happens in the world model's dreams. On standard pixel-based control benchmarks like the DMC Suite (DeepMind Control), it reaches strong performance within 1M environment steps; model-free baselines like PPO typically need an order of magnitude more interactions to reach comparable scores. On Atari with a 100k-step budget, DreamerV3 is among the most sample-efficient agents reported. The world model acts as a data amplifier.
A useful rule of thumb
Self-supervised pre-training (JEPA, MAE): millions of unlabelled samples. More data is better, with saturation around 10M–100M samples.
Supervised fine-tuning: 10k–500k labelled samples. The performance-vs-data curve saturates much earlier.
Reinforcement learning (Dreamer): 100k–10M env steps depending on task complexity. The world model is the efficiency multiplier.
Practical rule: let the foundation model be pre-trained on as much web data as possible, and spend your own effort only on fine-tuning with your domain-specific dataset. This is the winning paradigm in cost/performance terms.
What this means: for builders, investors, decision-makers
Different readers leave this article with different obligations. The technical bottom line is the same; the practical move is not.
If you build
For perceptual tasks that have to run safely and at the edge — collision anticipation, anomaly detection, visual quality control, surgical assistance, drone perception — JEPA is now the default starting point. Take a public V-JEPA 2 checkpoint, fine-tune on your domain data, distil to a small variant for deployment. Resist the temptation to use a 2B+ vision-language model as the perception layer just because it is the easiest API call: the BADAS-vs-Cosmos comparison shows the structural penalty. Reserve World Models for cases where you genuinely need to roll out actions in imagination — robotics control, RL with expensive simulators, agentic policies that have to plan multi-step interventions. For everything in between, the hybrid V-JEPA 2-AC and similar action-conditioned JEPAs are the architecture to watch through 2026.
If you invest
The compute moat for foundation-model-quality perception is concentrating in two places: pre-training on >1M hours of video (Meta, NVIDIA, a handful of others) and high-quality labelled fine-tuning data with privileged access (Nexar's 350k dashcam fleet, Tesla's vehicle fleet, certain medical imaging consortia, the autonomous logistics players). The middle layer — the people who fine-tune and deploy on existing checkpoints — is where margins are healthier and where most defensible companies will be built. Be careful of pitches that promise to "train a foundation model from scratch" outside the very specific niches where that economic argument holds.
If you decide budgets
Three numbers to internalise from this article. Pre-training a serious JEPA from scratch: about $86k of compute, plus team and data costs that dwarf that. Fine-tuning an existing checkpoint to a real task: about $2.3k of compute, days of work. Running an RL world model for robotics on a single GPU: $150–700. The implication is that the right question for most organisations is not "should we train a foundation model?" but "do we have the labelled data and the engineering bandwidth to fine-tune an existing one well, and the inference budget to deploy it at our scale?" The bottleneck is rarely compute and almost always data and integration.
Conclusion
World Models and JEPA are two different answers to the same problem: how to teach machines to predict the future by self-supervision. World Models do it by generating pixels, constrained by a decoder and a reconstruction loss. JEPA does it by staying in latent space, managing the representation collapse problem with carefully calibrated architectural asymmetry.
Pixel reconstruction is more intuitive and produces interpretable output, but it wastes representational capacity on irrelevant details. Latent prediction is more efficient and captures semantic structure, but its output is not directly interpretable and it requires the full anti-collapse arsenal we discussed.
Neither is "better" in absolute terms. They are different tools for different problems. The choice depends on what you need: a simulator to plan with (World Model), an encoder to perceive with (JEPA), or both combined in the hybrid architectures emerging right now. V-JEPA 2-AC and similar action-conditioned variants are the first concrete steps in this direction.
What unites both is the underlying intuition that has driven the field for years: the ability to predict the future is central to intelligence. A system that can only react to the present is structurally limited. The next generation of AI, which will operate in the physical world and have to reason about the consequences of its own actions, will need to imagine. The architectures we are building today are the first serious attempts to give it that ability.
References
1. Ha, D., & Schmidhuber, J. (2018). World Models. arXiv:1803.10122. arxiv.org/abs/1803.10122
2. Hafner, D., Lillicrap, T., Ba, J., & Norouzi, M. (2020). Dream to Control: Learning Behaviors by Latent Imagination. ICLR 2020. arXiv:1912.01603
3. Hafner, D., Lillicrap, T., Norouzi, M., & Ba, J. (2021). Mastering Atari with Discrete World Models. ICLR 2021. arXiv:2010.02193
4. Hafner, D., Pasukonis, J., Ba, J., & Lillicrap, T. (2023). Mastering Diverse Domains through World Models. arXiv:2301.04104
5. Wu, P., Escontrela, A., Hafner, D., Goldberg, K., & Abbeel, P. (2022). DayDreamer: World Models for Physical Robot Learning. CoRL 2022. arXiv:2206.14176
6. LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview preprint. The position paper that introduced the JEPA framework.
7. Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR 2023. arXiv:2301.08243. The I-JEPA paper.
8. Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video. TMLR 2024. arXiv:2404.08471. The V-JEPA paper.
9. Assran, M., Bardes, A., Fan, D., Garrido, Q., et al. (2025). V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. arXiv:2506.09985. The V-JEPA 2 paper used as backbone in BADAS 2.0.
10. Grill, J.-B., Strub, F., Altché, F., et al. (2020). Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. NeurIPS 2020. arXiv:2006.07733. First non-contrastive method that avoids collapse without negative pairs.
11. Chen, X., & He, K. (2021). Exploring Simple Siamese Representation Learning (SimSiam). CVPR 2021. arXiv:2011.10566. BYOL variant without EMA, just stop-gradient + predictor.
12. Tian, Y., Chen, X., & Ganguli, S. (2021). Understanding Self-Supervised Learning Dynamics without Contrastive Pairs. ICML 2021. arXiv:2102.06810. Theoretical analysis of the gradient flow for BYOL/SimSiam.
13. Jing, L., Vincent, P., LeCun, Y., & Tian, Y. (2022). Understanding Dimensional Collapse in Contrastive Self-supervised Learning. ICLR 2022. arXiv:2110.09348. Formal characterisation of dimensional collapse.
14. Bardes, A., Ponce, J., & LeCun, Y. (2022). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. ICLR 2022. arXiv:2105.04906
15. Zbontar, J., Jing, L., Misra, I., LeCun, Y., & Deny, S. (2021). Barlow Twins: Self-Supervised Learning via Redundancy Reduction. ICML 2021. arXiv:2103.03230
16. NVIDIA (Agarwal, N., et al.) (2025). Cosmos World Foundation Model Platform for Physical AI. arXiv:2501.03575. The Cosmos paper underlying Cosmos-Predict and the Cosmos-Reason model family used as baseline by Nexar.
17. Goldshmidt, R., Scott, H., Niccolini, L., & Matzner, H. (2026). Beyond the Beep: Scalable Collision Anticipation and Real-Time Explainability with BADAS-2.0. arXiv:2604.05767. The technical paper describing the BADAS 2.0 system and benchmarks.
18. Nexar AI (2026). BADAS 2.0: A V-JEPA 2 World Model for Collision Anticipation. badas.nexar.app. Public benchmarks and controlled comparison against Cosmos-Reason2-2B.