First Principles Series · Part 2

A fair comparison of World Models and JEPA (since I couldn't find one)

The math, the costs, the production case. The article I would have wanted to read a month ago.

TL;DR for the busy reader

World Models predict the future in pixel space. They are expensive to train and waste capacity on visual detail, but they produce simulators you can plan with. JEPA predicts the future in latent space. It is cheaper, more sample-efficient, harder to interpret, and dominant for perceptual tasks at the edge.

The most concrete public proof is Nexar's BADAS 2.0 for collision anticipation: built on V-JEPA 2 (300M params), it reaches 99.4% Average Precision and 91.3% Early Warning Recall on Nexar's long-tail benchmark, against 94.0% / 48.3% for a fine-tuned NVIDIA Cosmos-Reason2-2B on the same data. The smallest BADAS variant (22M params) still beats Cosmos with 91× fewer parameters.

The cost side is just as informative: serious JEPA pre-training runs around $86k, fine-tuning a public checkpoint around $2.3k, an RL world model on a single GPU around $150–700. The compute moat is mostly in pre-training, which Meta and NVIDIA are running for you. Most teams should fine-tune, not pre-train.

If you follow the AI debate, you have heard about World Models and JEPA. They are two of the most discussed architectures in robotics and autonomous systems. The discussion is often confused, though: the terms are used interchangeably, the differences are not made clear, and the technical details that would let you understand what these systems actually do are almost always missing.

This article is an attempt to clear that up. We will start from the problem both architectures try to solve, then see how they solve it in fundamentally different ways, with concrete numerical examples. After a public production case (BADAS 2.0) that grounds the discussion, we will spend a long section on the representation collapse problem, which is the actual technical heart of JEPA and which is usually dismissed in one line. We will close with the part that is most often left out of similar articles: the hardware, VRAM, and cost numbers you need to actually train these models, and what to do with all of this on Monday morning.

The problem: predicting what comes next

Think about how your brain works when you cross a street. You do not just react to what you see right now. You imagine where the car will be in two seconds. You predict what happens if you speed up. You simulate alternative scenarios in your head. This capacity for mental simulation is what World Models and JEPA try to give to machines.

Why is it important? Because a system that can only react to present stimuli is structurally limited. A robot that grasps an object needs to predict where its hand will be 100 milliseconds from now. A self-driving car needs to anticipate the behaviour of other vehicles. An AI agent that plans needs to reason about the consequences of its actions before acting.

The technical problem is: how do you teach a machine to predict the future when you do not have "ground-truth future" labels? The answer both architectures give is the same. Train by self-supervision. Show the model a sequence, hide part of it, ask the model to guess the hidden part. If it succeeds, it has learned something about the structure of the world.

The two architectures diverge on what to hide and what to ask the model to predict. World Models hide future frames and ask the model to reconstruct them pixel by pixel. JEPA hides patches of an image or video and asks the model to predict only the latent representation of those patches. The difference looks subtle and is actually deep.

The distinction that governs everything else

World Models: predict future pixels → optimise for visual fidelity.

JEPA: predict latent representations → optimise for semantic structure.

This looks like a minor design choice. It changes the compute profile, the data efficiency, the generalisation properties, and the tasks each architecture is suited for.

World Models: simulating the world internally

The concept of World Model was formalised by Ha and Schmidhuber in 2018 in a paper of the same name [1]. The idea is to build an internal simulator of the world that lets an agent "imagine" what happens if it takes a given action, without having to actually take it.

The fundamental equation

A World Model is, in its simplest form, a function that takes the current state of the world and an action, and produces the next state:

    s_{t+1} = f_θ(s_t, a_t) + ε

Deterministic-with-noise World Model equation. ε is stochastic noise capturing irreducible uncertainty. We carry it as an abstract symbol here; in Dreamer's RSSM, below, this noise becomes an explicit probabilistic component of the state, not just an additive term.

Notation used throughout this article

x_t — raw observation at time t (e.g., a 256×256×3 image, so 196,608 numbers).

s_t — state, i.e., the compressed latent representation of the world at time t (e.g., a vector of 256 numbers, produced by the encoder).

a_t — action taken at time t.

r_t — reward signal at time t (RL setting only).

e_φ, f_θ, d_ψ — encoder (params φ), dynamics (params θ), decoder (params ψ).

x̂, ŝ — predicted versions of x and s.

Two notational notes worth flagging up front. In Dreamer's RSSM, the state s_t is itself decomposed into two parts, written as the pair (h_t, z_t) following the paper's convention. The h is the deterministic recurrent state, the z is the stochastic categorical component. This Dreamer-specific z is therefore a sub-component of the latent state, not the full state s_t. In the JEPA section later on, we use z (with subscripts like z_ctx, z_tgt) for JEPA latents, following the JEPA literature's convention. These are conceptually the analogue of s in World Models but the literature names them z. Both notations are kept deliberately, to make this article match what the reader will see in the original papers.

The observation x_t is too large to manipulate directly. A 256×256 RGB image has 196,608 numbers. A 1-second video at 30 fps has nearly 6 million. World Models therefore work in a compressed latent space — the state s_t — produced by an encoder, with three distinct components.

The three components

Figure: World Model architecture. Encoder compresses, Dynamics evolves, Decoder reconstructs. Training is driven by pixel-wise MSE on the reconstructed frame.

Encoder e_φ. Compresses the high-dimensional observation into a compact latent representation. Typically a CNN followed by an MLP for images, a transformer for video. A 256×256 RGB image (196,608 numbers) becomes a vector of 256 or 512 numbers. This compression forces the model to discard noise.

Dynamics Model f_θ. The core of the World Model. It takes the latent state s_t and the action a_t and predicts s_{t+1}. Typically an RNN (GRU, LSTM) or a transformer. Dreamer uses a variant called RSSM (Recurrent State-Space Model) that is worth looking at in detail.

Decoder d_ψ. Reconstructs pixels from the predicted latent state. Its main role is during training: the reconstruction loss (pixel-wise MSE) is the signal that forces encoder and dynamics to preserve visual information.

RSSM: the Dreamer trick

The dynamics model in Dreamer does not produce a single monolithic state. It decomposes the state s_t into a deterministic part h_t (a GRU hidden state that accumulates history) and a stochastic part z_t (a sample from a categorical distribution). The motivation is that the world is intrinsically uncertain, and modelling that uncertainty explicitly improves long-horizon prediction.

Notation reminder. In Dreamer's RSSM we are now operating inside the latent state: s_t = (h_t, z_t). The symbol z_t below refers specifically to the stochastic categorical component, following the DreamerV3 paper. It is one of the two components of the state, not a re-labelling of the encoder output we used earlier in the simple WM diagram.

    h_t = GRU(h_{t-1}, z_{t-1}, a_{t-1})        // deterministic component
    z_t ~ Categorical(softmax(MLP(h_t)))        // stochastic component
    s_t = (h_t, z_t)                            // full state

    prior:      ẑ_t ~ p(z_t | h_t)              // predicted from past h_t
    posterior:  z_t ~ q(z_t | h_t, x_t)         // corrected by observation x_t

RSSM. The state s decomposes into (h, z). The loss includes a KL(posterior ‖ prior) term.

The DreamerV3 [4] total loss is a weighted sum of three terms: reconstruction loss (pixel MSE), KL divergence between posterior and prior (forces the prior to be predictive), and reward loss (a separate MLP predicts the reward from s_t). This decomposition follows directly from the variational lower bound (ELBO) for sequential models. DreamerV3 also relies on KL balancing, a technique introduced in DreamerV2: the KL term is split into two copies with separate stop-gradients, one that trains the prior towards the posterior and one that regularises the posterior towards the prior, and the first is weighted more heavily so that the prior moves faster. It is a small detail in the loss code and one of the reasons DreamerV3 trains stably across very different environments without per-task hyperparameter tuning.
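To make the KL term concrete, here is a minimal sketch that computes KL(posterior ‖ prior) for a single categorical variable. The four-class distributions are hypothetical numbers chosen for illustration, not values from DreamerV3, and the real model sums this quantity over many categorical variables per state.

```python
import math

def categorical_kl(q, p):
    """KL(q || p) in nats for two categorical distributions given as probability lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

# Hypothetical 4-class example.
# The prior is predicted from h_t alone; the posterior has also seen x_t.
prior = [0.25, 0.25, 0.25, 0.25]
posterior = [0.70, 0.10, 0.10, 0.10]

kl = categorical_kl(posterior, prior)
kl_when_equal = categorical_kl(prior, prior)

print(f"KL(posterior || prior) = {kl:.4f} nats")
print(f"KL when the prior already matches = {kl_when_equal:.4f} nats")
```

The value is positive whenever the observation tells the model something the prior did not predict, and zero when the prior was already right. KL balancing does not change this number; it only changes which side receives the gradient, which cannot be shown without an autograd framework.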

Numerical example

Example: a robot observing an object

Input: a 256×256×3 video frame x_t = 196,608 float32 values = 768 KB.

After the encoder: a latent state s_t of dimension 256 = 1 KB.

Compression: 196,608 / 256 = 768×. The encoder must decide what to keep.

Action: a_t = "move arm right" = a vector of 7 values, one per joint of a 7-DOF arm (a redundant manipulator like the Franka Panda, with one joint more than the 6 strictly needed to place and orient the hand in space). Note this is joint space, not the 6-DOF Cartesian pose of the end-effector.

Dynamics Model: concat(s_t, a_t) has dimension 263 → MLP → s_{t+1} of dimension 256.

Decoder: s_{t+1} → 196,608 predicted pixels for the next frame, x̂_{t+1}.

Loss: pixel-wise MSE = (1/196,608) Σ (x̂_i − x_i)².

The 768× compression is the critical point. The encoder is free to choose which 256 numbers to use. If it chooses badly, the decoder will fail to reconstruct and the loss will be high. So the gradient flow forces the encoder to preserve information that matters for pixel reconstruction. That, importantly, includes reflections, textures, and shadows: details that are irrelevant for the decision.
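The arithmetic in this example is easy to check in a few lines. The sketch below recomputes the sizes and the compression ratio, and evaluates a pixel-wise MSE on a tiny four-value "frame". The toy frame values and the helper name `pixel_mse` are illustrative assumptions.

```python
# Sizes from the numerical example (float32 = 4 bytes per value).
H, W, C = 256, 256, 3
LATENT_DIM = 256
ACTION_DIM = 7
BYTES_PER_FLOAT32 = 4

n_pixels = H * W * C                                 # values in one frame
frame_kib = n_pixels * BYTES_PER_FLOAT32 / 1024      # frame size in KiB
latent_kib = LATENT_DIM * BYTES_PER_FLOAT32 / 1024   # latent size in KiB
compression = n_pixels / LATENT_DIM                  # compression ratio
dynamics_input_dim = LATENT_DIM + ACTION_DIM         # concat(s_t, a_t)

def pixel_mse(pred, target):
    """Pixel-wise MSE: mean of squared differences over all values."""
    assert len(pred) == len(target)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Toy 4-value "frame" (hypothetical numbers): one value is off by 0.5.
toy_loss = pixel_mse([0.0, 0.5, 1.0, 0.5], [0.0, 0.5, 0.5, 0.5])

print(n_pixels, frame_kib, latent_kib, compression, dynamics_input_dim, toy_loss)
```

Every value in the frame enters the mean with the same weight, which is exactly why a background pixel counts as much as a pixel on the object being grasped.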

The "dream to learn" loop

Once the world model is trained, Dreamer uses it to generate imagined trajectories. The policy is not trained on real data (slow, expensive) but on the world model's dreams. A real robot typically interacts with the environment at tens of steps per second; the world model can produce on the order of thousands of imagined transitions per second on a single GPU, depending on architecture. This is the source of Dreamer's data efficiency.

Figure: The Dreamer loop: act, imagine, improve. Real experience feeds the world model. The world model feeds the policy. The improved policy generates new real experience.
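The "imagine" step can be sketched with a toy one-dimensional latent state. Everything here is a hypothetical stand-in: `dynamics`, `reward_model`, and `policy` are hand-written functions playing the roles of Dreamer's learned networks, which in the real system are neural networks trained from data.

```python
def dynamics(state, action):
    """Hypothetical stand-in for a learned latent dynamics model f_theta(s, a)."""
    return state + 0.1 * action

def reward_model(state, goal=1.0):
    """Hypothetical stand-in for the learned reward head."""
    return -abs(goal - state)

def policy(state, goal=1.0):
    """Toy policy: push towards the goal with an action clipped to [-1, 1]."""
    return max(-1.0, min(1.0, 10.0 * (goal - state)))

def imagine(start_state, horizon):
    """Roll the world model forward without touching the real environment."""
    state, trajectory = start_state, []
    for _ in range(horizon):
        action = policy(state)
        state = dynamics(state, action)
        trajectory.append((state, action, reward_model(state)))
    return trajectory

dream = imagine(start_state=0.0, horizon=15)
imagined_return = sum(reward for _, _, reward in dream)

print(f"final imagined state: {dream[-1][0]:.3f}")
print(f"imagined return: {imagined_return:.3f}")
```

No environment step happens inside `imagine`: the whole trajectory, including its rewards, comes from the model. In Dreamer these imagined returns are what the policy is optimised against.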

The pixel reconstruction problem

World Models have a structural limitation: the decoder. To train the model you need a loss signal, and that signal is pixel-wise MSE between the predicted frame and the real frame. The gradient flowing back through encoder and dynamics is computed with respect to visual fidelity, not with respect to representation utility.

The practical consequence is that the encoder spends capacity preserving information the decoder can then reconstruct. Reflections, shadows, textures, background details: all of these enter the loss with the same weight as the things that actually matter for decisions.

Example: a glass on a table

For the decision (do not push it too hard) you need:

position of the glass
direction and velocity of your hand
distance from the edge

What the World Model has to predict (pixel reconstruction):

how light reflections on the glass surface change
how shadows move on the table
the exact wood texture
the exact colour of every background pixel

A large fraction of the model's representational capacity is committed to reconstructing detail that the decision does not depend on. This is one reason video generators like Sora produce beautiful videos but are weak as planners.

This observation is what led Yann LeCun to propose JEPA. If the problem is that the decoder forces reconstruction of useless pixels, the solution is to remove the decoder.

JEPA: predicting without reconstructing

In 2022 LeCun published "A Path Towards Autonomous Machine Intelligence" [6], a 60-page position paper criticising the generative approach and proposing an alternative: JEPA, the Joint-Embedding Predictive Architecture.

The idea: instead of predicting future pixels (or masked patches), predict the latent representation of what is hidden. The loss is no longer pixel-wise MSE. It is a distance in embedding space.

    World Model:  s_t → DECODER → x̂_{t+1}        Loss = ‖x̂_{t+1} − x_{t+1}‖²   (pixel)
    JEPA:         z_ctx → PREDICTOR → ẑ_tgt      Loss = ‖ẑ_tgt − z_tgt‖²       (latent)

The structural difference. WM decodes the latent state s_t back to pixels; JEPA only ever compares latents (z_ctx, z_tgt). The technical crux is what computes z_tgt, the reference signal.

The architecture

JEPA uses three components.

Context encoder e_φ. Receives the context (visible patches of the image, or past frames of a video) and produces its representation z_ctx.

Target encoder e_φ'. Receives the target (masked patches, or future frames) and produces z_tgt, which acts as "ground truth" in latent space. Critical detail: the parameters φ' are not updated by the gradient. They are updated as an exponential moving average (EMA) of φ.

Predictor p_ψ. Receives z_ctx and tries to predict ẑ_tgt. Typically much lighter than the encoders (6 transformer layers against the 24 of the encoder for ViT-L).

Figure: JEPA architecture. Two asymmetric encoders, a predictor, no decoder. The loss lives in latent space.

The formal loss, expanded

Let us write it out properly. Let x be an image, M(x) the context (visible patches), N(x) the target (masked patches). Let φ be the online encoder parameters, φ' the target encoder parameters, ψ the predictor parameters.

    z_ctx = e_φ(M(x))                  // gradient flows here
    z_tgt = sg(e_φ'(N(x)))             // sg = stop-gradient
    ẑ_tgt = p_ψ(z_ctx, pos_tgt)        // conditioned on target position embeddings

    L(φ, ψ) = E_{x, M, N} [ ‖p_ψ(e_φ(M(x)), pos_tgt) − sg(e_φ'(N(x)))‖² ]

    update rule:
    φ  ← φ − η · ∇_φ L
    ψ  ← ψ − η · ∇_ψ L
    φ' ← τ · φ' + (1 − τ) · φ          with τ ≈ 0.996 – 0.9995

Full I-JEPA equations. The squared-L2 norm is the I-JEPA choice; V-JEPA [8] uses an L1 norm instead, which empirically helps with video. Stop-gradient and EMA are the core of collapse prevention either way.
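The EMA rule and the two loss variants are simple enough to write out directly. This is a minimal sketch on plain Python lists, with a hypothetical three-parameter "encoder"; a real implementation applies the same update to every tensor of the target network inside a no-gradient context.

```python
def ema_update(target_params, online_params, tau=0.996):
    """phi' <- tau * phi' + (1 - tau) * phi, applied element-wise."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]

def latent_l2_loss(pred, target):
    """Squared L2 distance between predicted and target latents (I-JEPA style)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def latent_l1_loss(pred, target):
    """L1 distance between predicted and target latents (V-JEPA style)."""
    return sum(abs(p - t) for p, t in zip(pred, target))

# Hypothetical setup: the online weights sit at 1.0, the target starts at 0.0
# and trails behind. Count the steps until the target has closed half the gap.
online = [1.0, 1.0, 1.0]
target = [0.0, 0.0, 0.0]

steps_to_half = 0
while target[0] < 0.5:
    target = ema_update(target, online)
    steps_to_half += 1

print(f"steps for the target to close half the gap: {steps_to_half}")
```

With τ = 0.996 the target needs on the order of a couple of hundred steps to absorb half of a change in the online weights, which gives a feel for how slowly the reference signal moves.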

Numerical example: I-JEPA on a 224×224 image

Figure: How a Vision Transformer turns an image into tokens (ViT-L/16). The image is cut into a grid of 16×16-pixel patches. Each patch becomes one token, the same way a word is a token in a sentence. A 224×224 image gives 14×14 = 196 tokens.

Example: I-JEPA, ViT-L/16

Input: a 224×224×3 image. With 16×16 patches, since 224 ÷ 16 = 14, you get a 14×14 grid → 14×14 = 196 patches. Each patch is one token for the transformer.

Masking: sample 4 target blocks, each covering 15–20% of the image. The context is what is left.

Context tokens: around 50 visible patches → encoder produces 50 vectors in ℝ¹⁰²⁴.

Predictor input: 50 context vectors + position embeddings for the masked patches → produces ẑ for each target patch.

Target encoder (EMA): around 146 masked patches → produces 146 "true" vectors in ℝ¹⁰²⁴, no grad.

Loss: MSE averaged over the 146 vectors of dim 1024 = (1/(146·1024)) Σ (ẑ_ij − z_ij)².

What does not happen: no decoder, no pixel reconstruction, no image generation. The network learns without ever producing humanly interpretable output.
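The token counts in this example follow from integer arithmetic on the image and patch sizes. The sketch below recomputes them; the 50 / 146 split is the approximate split used in the example, not a fixed property of I-JEPA, since the masks are sampled per image.

```python
IMAGE_SIZE = 224
PATCH_SIZE = 16
EMBED_DIM = 1024                      # ViT-L hidden size

grid = IMAGE_SIZE // PATCH_SIZE       # patches per side
num_patches = grid * grid             # tokens per image

# Approximate split from the example above.
context_patches = 50
target_patches = num_patches - context_patches

masked_fraction = target_patches / num_patches
loss_terms = target_patches * EMBED_DIM   # scalar terms averaged in the latent MSE

print(f"{grid}x{grid} grid, {num_patches} tokens")
print(f"masked fraction: {masked_fraction:.1%}")
print(f"scalar terms in the loss: {loss_terms:,}")
```

Changing `PATCH_SIZE` or `IMAGE_SIZE` shows how quickly the token count, and with it the cost of attention, grows with resolution.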

BADAS 2.0: JEPA in production

Before we go deeper into the collapse problem, it is worth checking that the JEPA approach actually works at production scale. The most concrete and well-documented public case is BADAS 2.0 by Nexar [17] [18], a collision anticipation system that predicts road incidents before they happen. The Nexar site publishes full benchmarks and direct comparisons with a generative competitor, which is rare and lets us check JEPA's thesis on a safety-critical task.

BADAS 2.0 uses V-JEPA 2 as its backbone: a ViT-L Vision Transformer with 300M parameters, fine-tuned end-to-end on roughly 200,000 labeled videos (~2M windowed clips) collected from a 350,000-dashcam Nexar fleet. It is used by commercial fleets, insurance companies, and AV development programs.

The controlled comparison with Cosmos-Reason2

The most interesting experiment Nexar has published is a controlled comparison. They took NVIDIA Cosmos-Reason2-2B (a 2B-parameter vision-language model post-trained from Qwen3-VL-2B-Instruct, designed for embodied reasoning) and fine-tuned it on the same training data as BADAS 2.0. Same training set, same protocol, same task.

Metric	BADAS 2.0 (V-JEPA 2)	COSMOS-BADAS
Average Precision	99.4%	94.0%
Early Warning Recall	91.3%	48.3%
Training data	2M real-world clips	2M real-world clips (same)
Main model (the 99.4% / 91.3% above)	300M params	2B params
Smallest viable variant	22M params (98.4% AP)	none below 2B (cloud only)
Architecture	JEPA (non-generative)	Autoregressive VLM
Explainability	Native attention maps	Chain-of-thought text

The most telling metric is early warning recall: 91.3% for BADAS against 48.3% for Cosmos. BADAS detects danger ahead of time nine times out of ten. Cosmos does so less than half the time. In the context of collision anticipation, "late" equals "wrong".

One clarification, because Nexar's own published table compresses it. The 99.4% AP and 91.3% recall above belong to the main BADAS 2.0 model, the 300M V-JEPA 2 ViT-L. BADAS also ships distilled variants, and the smallest of them, Flash Lite at 22M parameters, scores a slightly lower 98.4% AP. So the headline accuracy and the headline tiny model are technically two different models in the same family. Both comparisons against Cosmos hold, they are just not the same row.

That 22M variant is the one that matters most. A 22M-parameter model beating a 2B-parameter model on the same task, with 91× fewer parameters, is where the architectural difference stops being a technical detail and becomes a fundamental efficiency gap.

Why JEPA wins here

Two reasons, both worth separating. The first is the prior each model carries into fine-tuning. V-JEPA 2 was pre-trained on more than 1M hours of internet video [9], internalising at scale the regularities of physical motion — objects falling, pedestrians changing direction, vehicles braking. Cosmos-Reason2 was post-trained from a generalist vision-language model with no comparable exposure to long-form video physics. Same fine-tuning data on top, very different priors underneath.

The second reason is structural — what each architecture is even computing. To predict accidents you do not need to reconstruct what the scene will look like in the next frame. You need to capture whether that car is braking, whether that pedestrian is committing to crossing. V-JEPA 2 optimises directly for those semantic invariances, in latent space. Cosmos-Reason2, even after fine-tuning, is still organised around generating tokens of free-form reasoning, with the latency, verbosity, and overconfidence problems that follow.

A detail worth noting on generalisation: BADAS 2.0 also works on scenarios that have nothing to do with road driving, such as drones in flight, cleaning robots, and forklifts. The model learned general physics during V-JEPA 2 pre-training on more than a million hours of video, not just road rules. Nexar's own framing is "Beyond the Road": any machine with a camera that moves through the physical world can use BADAS, not because it was trained on every environment, but because it learned how the physical world works. Yann LeCun, who sits on the Nexar board, comments: "Models don't emerge from abstractions alone. They come from sustained exposure to reality."

That JEPA wins this comparison at all rests on one technical condition: during pre-training, the system did not collapse into a useless trivial solution. This is the topic of the next section, and it is the part of the story most articles skip in a sentence. It is worth slowing down on.

The collapse problem, in detail

This is the section worth reading in full. Representation collapse is the central problem of all non-contrastive self-supervised methods, and understanding it means understanding why JEPA architectures look the way they do and what each component is there for.

The problem, formulated

Consider the JEPA loss with both encoders sharing parameters (no EMA for a moment): L(φ, ψ) = ‖p_ψ(e_φ(M(x))) − e_φ(N(x))‖². What is the trivial minimum?

    If    e_φ(x) = c for every x          (encoder collapsed to a constant)
    then  e_φ(M(x)) = e_φ(N(x)) = c       (both branches output c)
          loss = ‖p_ψ(c) − c‖²            (zero if p_ψ learns the identity at c)

The loss minimum is reached by a completely useless solution.

A neural network minimising this loss without constraints will happily converge to the constant solution. All representations collapse onto the same point, the loss goes to zero, and the model has learned nothing. This is complete collapse.
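The trivial minimum can be reproduced in a few lines. In this sketch the "encoder" is a hypothetical function that ignores its input and the "predictor" is the identity; both are stand-ins for what a collapsed network would compute, not real models.

```python
def collapsed_encoder(patches, c=(0.5, -0.5)):
    """A collapsed encoder: ignores its input and always returns the constant c."""
    return list(c)

def identity_predictor(z):
    """A predictor that has learned the identity at c."""
    return list(z)

def jepa_loss(encoder, predictor, context, target):
    """Squared L2 distance between the predicted and the target latent."""
    z_pred = predictor(encoder(context))
    z_tgt = encoder(target)   # shared weights: no EMA, no stop-gradient
    return sum((p - t) ** 2 for p, t in zip(z_pred, z_tgt))

# Two unrelated hypothetical inputs; the loss cannot tell them apart.
loss_a = jepa_loss(collapsed_encoder, identity_predictor, [1, 2, 3], [4, 5, 6])
loss_b = jepa_loss(collapsed_encoder, identity_predictor, [9, 9, 9], [0, 0, 0])

print(loss_a, loss_b)
```

Both losses are exactly zero although the encoder carries no information about its input. The loss value alone therefore says nothing about whether the representation is useful.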

There is also a more subtle phenomenon, dimensional collapse, characterised by Jing, Vincent, LeCun and Tian in 2022 [13]. Here the representations do not collapse to a single point, but they live in a subspace whose dimension is much smaller than the embedding space. Empirically you detect it by computing the covariance matrix of the representations across a batch and counting how many eigenvalues are significantly above zero. If the encoder outputs vectors in ℝ¹⁰²⁴ but only 30 eigenvalues are non-trivial, there is dimensional collapse.
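A small diagnostic along these lines can be written with the standard library alone. The sketch below builds a hypothetical batch of 6-dimensional embeddings that only vary along 2 directions, then counts the numerical rank of the covariance matrix; for a symmetric positive semi-definite matrix this equals the number of non-zero eigenvalues. On real embeddings you would more likely take the eigenvalue spectrum from a numerical library and apply a threshold; Gaussian elimination is used here only to keep the example dependency-free.

```python
import random

def covariance(rows):
    """Sample covariance matrix of a batch of embedding vectors (one per row)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def numerical_rank(matrix, tol=1e-9):
    """Rank via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

rng = random.Random(0)

# Dimensionally collapsed batch: 6 output dims driven by only 2 latent factors.
basis = [[1, 0, 2, -1, 0, 3], [0, 1, -1, 2, 1, 0]]
collapsed = []
for _ in range(64):
    u, v = rng.gauss(0, 1), rng.gauss(0, 1)
    collapsed.append([u * b0 + v * b1 for b0, b1 in zip(*basis)])

# Healthy batch: every dimension varies independently.
healthy = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(64)]

collapsed_rank = numerical_rank(covariance(collapsed))
healthy_rank = numerical_rank(covariance(healthy))
print(f"collapsed: {collapsed_rank} of 6, healthy: {healthy_rank} of 6")
```

Note that the collapsed batch has non-zero variance in every one of its 6 dimensions, so a per-dimension standard deviation check would not flag it. Only the covariance structure reveals that 4 of the 6 dimensions are redundant.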

Why collapse is a property of the gradient flow

Tian, Chen and Ganguli in 2021 [12] gave a theoretical analysis of the problem for the linear case. Their key result: if both encoders are identical and updated with the same gradient, gradient flow inevitably leads to collapse. The steepest descent direction always points towards the constant solution.

To avoid this, you need asymmetries. The four families of solutions modern architectures use are:

Strategy	Mechanism	Examples
Contrastive	Explicitly push positive pairs together and negative pairs apart via InfoNCE-style loss	SimCLR, MoCo
Asymmetric architecture	Stop-gradient + EMA target + predictor. Breaks the symmetry that causes collapse	BYOL, SimSiam, JEPA
Explicit regularisation	Loss terms that enforce variance and decorrelation of features	VICReg, Barlow Twins
Sharpening + centering	Temperature on the target softmax + batch centroid subtraction	DINO, DINOv2

JEPA belongs to the asymmetric architecture family and inherits directly from BYOL (Grill et al., DeepMind 2020 [10]), the work that first showed a non-contrastive method can be trained without collapsing. Let us look at why in detail.

The three asymmetries of BYOL/JEPA

Asymmetry 1 — Stop-gradient. The target encoder receives zero gradient. It cannot chase the online encoder; it can only be chased. Without this asymmetry both encoders would converge to the same trivial minimum. Chen and He 2021 [11] showed empirically that removing the stop-gradient in SimSiam causes immediate collapse to the trivial solution. The loss drops to zero and the per-channel standard deviation of the embeddings goes to zero with it.

Asymmetry 2 — EMA. The target encoder parameters are updated as a moving average of the online parameters: φ' ← τφ' + (1−τ)φ. With τ = 0.996, the target moves at 0.4% of the online encoder's rate per step. This means the target encoder is a "slow" version of the online encoder and provides a stable training signal. SimSiam [11] showed that EMA is not strictly required (just stop-gradient + predictor can suffice in some regimes), but every JEPA-family work has retained it because empirically it stabilises training significantly.

Asymmetry 3 — Predictor. The predictor p_ψ transforms z_ctx before comparison with z_tgt. This transformation is learned and updated by the gradient. Tian, Chen, Ganguli [12] showed analytically — in the linear gradient-flow regime — that removing the predictor causes the dynamics to collapse onto the trivial solution. Empirically the same holds far outside that regime: every published JEPA-style training that removes the predictor collapses. The predictor cannot be the identity function; it has to learn to map online-encoder representations to a target that lags behind it.

The intuition behind the three asymmetries

The online encoder "wants" to produce representations the predictor can easily map onto the target. The target is a lagged version of the online. The only way the predictor stays useful as the target keeps moving is if the encoder produces representations rich enough to be predictable across that lag.

Put differently: the system is in a non-trivial fixed point where online encoder, predictor, and target encoder are in a dynamic equilibrium that requires meaningful representations to sustain itself. Collapse is another fixed point, but unstable under this dynamic.

Why V-JEPA fights a worse collapse

In video the problem is more severe. Two consecutive frames at 30 fps differ by milliseconds and are visually almost identical. If the model learns to "copy" the previous frame's representation, it scores a low loss without learning anything meaningful. The V-JEPA paper does not name this failure mode; it is descriptive shorthand to call it temporal shortcut or temporal collapse. The mitigations are explicit in the paper even if the name is not.

V-JEPA [8] fights this with three additional mechanisms compared to I-JEPA:

Aggressive masking: up to 90% of the video is masked (against 75% in I-JEPA). This forces the model to use long-range information.
Spatio-temporal block masking: masks are contiguous blocks extending across several consecutive frames. You cannot guess a masked pixel from its neighbours if those neighbours are also masked.
Multi-block targets: the predictor has to predict several independent target blocks from the same context. This increases signal diversity and reduces the risk of learning shortcuts specific to one target.

Despite these mitigations, V-JEPA still shows partial dimensional collapse: the effective rank of the representation covariance matrix is significantly lower than the embedding dimension. The problem is not closed, it is managed at an acceptable level.

VICReg: an explicit alternative

Worth mentioning VICReg (Bardes, Ponce, LeCun, ICLR 2022 [14]) as a conceptually different alternative. Instead of preventing collapse with asymmetry, VICReg prevents it with explicit regularisation. The loss has three terms:

    L_VICReg = λ · I(z, z') + μ · V(z) + ν · C(z)

    where:
    I(z, z') = ‖z − z'‖²                     // invariance, as in JEPA
    V(z)     = Σ_j max(0, γ − std(z_·j))     // variance, per dimension
    C(z)     = Σ_{i≠j} [Cov(z)_ij]²          // covariance, off-diagonal only

VICReg. The V term enforces non-zero variance, C decorrelates dimensions. No EMA, no stop-grad.

VICReg is cleaner mathematically (no architectural tricks, just explicit loss terms) but requires careful tuning of the coefficients λ, μ, ν. JEPA won in popularity because it trains more easily, even if the mechanism is less transparent. The choice between these families is still an open research question.
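The variance and covariance terms written above are straightforward to compute on a small batch. This is a minimal sketch following the simplified formulas in this article (sums rather than means over dimensions). The `eps` inside the square root is an assumption added for numerical stability, and the three toy batches are hypothetical.

```python
import math

def variance_term(batch, gamma=1.0, eps=1e-4):
    """V(z): hinge on the per-dimension standard deviation across the batch."""
    n, d = len(batch), len(batch[0])
    total = 0.0
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / (n - 1)
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total

def covariance_term(batch):
    """C(z): sum of squared off-diagonal entries of the covariance matrix."""
    n, d = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                cov = sum((r[i] - means[i]) * (r[j] - means[j]) for r in batch) / (n - 1)
                total += cov ** 2
    return total

# Three hypothetical batches of 2-dimensional embeddings.
collapsed = [[1.0, 2.0]] * 4                                          # constant output
correlated = [[-2.0, -2.0], [2.0, 2.0], [-1.0, -1.0], [1.0, 1.0]]     # dims are copies
healthy = [[-2.0, -2.0], [-2.0, 2.0], [2.0, -2.0], [2.0, 2.0]]        # spread, decorrelated

for name, batch in [("collapsed", collapsed), ("correlated", correlated), ("healthy", healthy)]:
    print(f"{name:10s} V = {variance_term(batch):.3f}  C = {covariance_term(batch):.3f}")
```

The variance term penalises the constant batch, the covariance term penalises the batch whose two dimensions carry the same information, and only the spread-out, decorrelated batch pays no penalty on either.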

LLMs, Diffusion, and JEPA: the three families

The generative AI debate often frames the world as LLMs vs diffusion, as if those were the only options. When the task is predicting the future, there are three fundamentally different approaches, and understanding their differences clarifies where JEPA sits.

Autoregressive (LLMs)

LLMs work by autoregressive prediction. Given x_{1:t}, they model p(x_{t+1} | x_{1:t}) as a categorical distribution over a discrete vocabulary. They generate one token at a time, sampling each from this distribution.

The approach can be applied to images and video by tokenising the input (VQ-VAE, VQGAN, or more recent discrete tokenisers). Models like iGPT, Parti, and NVIDIA Cosmos [16] use this paradigm. Their strength is composability with the transformer techniques matured for language: scaling laws, instruction tuning, RLHF.

Diffusion

Diffusion models (Stable Diffusion, DALL-E 3, Sora) implement an inverse stochastic process. They start from pure noise x_T ~ N(0, I) and "denoise" it iteratively across T steps (typically 20–50 in modern models, up to 1000 in original DDPMs). The model is trained to predict the noise added at each step.

The structural difference from LLMs: diffusion models operate non-autoregressively. All pixels are refined simultaneously at each step. This enables exceptional visual quality but makes the model non-interactive in the classical sense: you cannot hand it an action and ask what happens.

JEPA (non-generative)

JEPA does not generate anything. It produces no pixels, no discrete tokens. It produces only latent representations. For this reason it is not directly comparable to LLMs and diffusion as a "generative model"; it is a separate category, that of discriminative self-supervised models.

Figure: Three paradigms compared. Only JEPA does not try to reconstruct the original output. Everything else follows from this.

The philosophical difference is deep. LLMs and diffusion try to reconstruct the exact output (the next token, the noise-free image). To do that they must model the full distribution p(x) or p(x | y), including all irrelevant details. JEPA only tries to capture the structure of the input, discarding details by construction.

For applications that care about what will happen rather than what it will look like, JEPA is theoretically more suitable. This is the exact intuition that led Nexar to build BADAS 2.0 on V-JEPA 2 instead of a generative model.

Which architecture to use, when

The structural differences summarised:

Aspect	World Models	JEPA
What it predicts	Future pixels (x̂_t+1)	Latent representations (ẑ_t+1)
Loss	Pixel MSE + KL + reward	Latent-space MSE
Decoder	Yes (required)	No (absent)
Action conditioning	Yes, central (f(s, a))	No (pure encoder)
Can generate images	Yes	No
Anti-collapse	Not needed (decoder forces non-triviality)	Critical (stop-grad + EMA + predictor)
Pretraining compute (ViT-H/14, ImageNet)	~12,000 GPU-hours (MAE)	~1,200 GPU-hours (I-JEPA): ~10× less
Typical use	Planning, RL, robotics	Encoder pretraining, perception
Models	DreamerV3, TD-MPC2, Cosmos	I-JEPA, V-JEPA, V-JEPA 2

Use a World Model when: you need an internal simulator to plan actions; you want to "dream" trajectories for reinforcement learning; you need visualisable predictions for debug or for explainability to non-technical users.

Use JEPA when: you want to pretrain an efficient encoder for downstream use; you care about classification, detection, or anomaly detection more than generation; your task is one where visual detail is noise (collision anticipation, anomaly detection); you have limited compute and abundant unlabeled data.

Combine them when: you want both. A JEPA-pretrained encoder can sit inside a World Model, replacing the standard VAE. This is exactly what recent work explores — a pure perception module (JEPA) + a dynamics module with action conditioning (World Model). V-JEPA 2-AC [9], the action-conditioned variant Meta released in June 2025, is a concrete step in this direction.

Hardware, VRAM, costs: the real numbers

The architectural discussion stays abstract until you confront the practical problem: what does it actually take to train one of these models? How much VRAM, on what hardware, for how long, with how much data? Answers change radically depending on whether you are doing pre-training from scratch, fine-tuning, or inference only. Worth doing the arithmetic.

The memory formula, from first principles

For a model with N parameters trained with Adam in mixed precision (BF16 activations + FP32 master weights, the modern standard), GPU memory breaks down as follows:

    M_weights = 2N        // BF16 weights
    M_grad    = 2N        // BF16 gradients
    M_master  = 4N        // FP32 master weights (numerical stability)
    M_adam    = 8N        // Adam moments m, v in FP32
    ─────────────────────────────────────────────
    M_params  = 16N bytes ≈ 16 GB per 1B params

    M_act   ≈ 34 · L · b · s · h bytes   (with FlashAttention, BF16)
    M_total = M_params + M_act + overhead (~10–20%)

L: number of layers, b: batch size, s: sequence length, h: hidden dim. The constant 34 (bytes per token, per layer, per hidden unit) is from Korthikanti et al. 2022, assuming FlashAttention so that the O(s²) attention matrix is not materialised.

Without FlashAttention, an extra term proportional to L · b · a · s² appears (a = attention heads), which dominates for long sequences. FlashAttention is essentially mandatory for video-scale sequence lengths today. Two more techniques can reduce memory by an order of magnitude. Gradient checkpointing reduces activation memory from O(L) to O(√L) at the cost of ~30% extra time spent recomputing. FSDP / ZeRO-3 shards parameters, gradients, and optimizer state across G GPUs, reducing the per-GPU share of M_params by a factor of ~G at the cost of significant inter-GPU traffic.
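The memory formula translates directly into a small estimator. This is a minimal sketch, not a profiler: the function name `training_memory_gb`, the `shards` argument (a crude stand-in for FSDP / ZeRO-3 sharding of parameter state) and the default 10% overhead are illustrative assumptions, and 1 GB is taken as 10^9 bytes.

```python
def training_memory_gb(n_params, layers, batch, seq_len, hidden,
                       overhead=0.10, shards=1):
    """Rough per-GPU training memory in GB (1 GB = 1e9 bytes).

    Assumes Adam in BF16 mixed precision with FlashAttention
    and no gradient checkpointing.
    """
    # 2N weights + 2N gradients + 4N FP32 master + 8N Adam moments = 16N bytes.
    # `shards` is a hypothetical knob: ZeRO-3-style sharding splits this state
    # across GPUs, while activations stay local to each GPU.
    params_bytes = 16 * n_params / shards
    # ~34 bytes per (layer, token, hidden unit).
    act_bytes = 34 * layers * batch * seq_len * hidden
    total_bytes = (params_bytes + act_bytes) * (1 + overhead)
    return {
        "params_gb": params_bytes / 1e9,
        "activations_gb": act_bytes / 1e9,
        "total_gb": total_bytes / 1e9,
    }


if __name__ == "__main__":
    # Illustrative configuration: 1B parameters, one 1,024-token sample.
    est = training_memory_gb(1e9, layers=24, batch=1, seq_len=1024, hidden=2048)
    print(f"params: {est['params_gb']:.1f} GB, "
          f"activations: {est['activations_gb']:.2f} GB, "
          f"total: {est['total_gb']:.1f} GB")
```

Passing `shards=8` divides only the parameter term, which is why sharding helps most when M_params, not M_act, is the dominant cost.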

Worked example: fine-tuning V-JEPA 2 ViT-L (300M params)

Hand calculation: V-JEPA 2 ViT-L on 16-frame video at 224×224

Assumptions: ViT-L architecture (24 transformer layers, hidden dim 1024, 16 attention heads, ~300M params), BF16 mixed precision, Adam optimizer, FlashAttention enabled, no gradient checkpointing.

Parameter memory (16N): 300M · 16 = 4.8 GB. Note that 16N is not the weights alone. The raw BF16 weights are only 2N = 600 MB (that is the inference footprint). Training needs 16N because you also hold gradients, the FP32 master copy, and the two Adam states, per the formula above.

Sequence length: V-JEPA uses spatio-temporal tubelets of size 2×16×16, so 16 frames at 224×224 give (16/2) · (224/16)² = 8 · 196 = 1,568 tokens.

Activations per sample: 34 · 24 · 1,568 · 1,024 ≈ 1.31 GB.

With batch 8: 8 · 1.31 ≈ 10.5 GB activations.

Total per GPU: 4.8 + 10.5 + ~10% overhead ≈ 17 GB.

Verdict: comfortably fits on A100 40GB at batch 8. A100 80GB allows batch 24–32. H100 80GB trains roughly 3× faster at BF16, same memory. The Nexar BADAS 2.0 training setup uses 8 A100 80GB with a global batch around 256, which lines up with these numbers.
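The hand calculation can be replayed in a few lines, which makes it easy to try other clip lengths or batch sizes. This is a sketch under the same assumptions as the worked example (ViT-L, 2×16×16 tubelets, 10% overhead, 1 GB = 10^9 bytes); the helper names are made up for illustration.

```python
N_PARAMS = 300e6   # ViT-L, approximate
LAYERS = 24
HIDDEN = 1024
OVERHEAD = 0.10    # assumed framework / fragmentation overhead


def tubelet_tokens(frames, height, width, tubelet=(2, 16, 16)):
    """Number of spatio-temporal tokens for a clip."""
    t, ph, pw = tubelet
    return (frames // t) * (height // ph) * (width // pw)


def total_gb(batch, tokens):
    """Per-GPU training memory: 16N parameter state + activations + overhead."""
    params_gb = 16 * N_PARAMS / 1e9
    act_gb = 34 * LAYERS * batch * tokens * HIDDEN / 1e9
    return (params_gb + act_gb) * (1 + OVERHEAD)


tokens = tubelet_tokens(frames=16, height=224, width=224)
print(f"tokens per clip: {tokens}")
for batch in (8, 24, 32):
    print(f"batch {batch:>2}: ~{total_gb(batch, tokens):.1f} GB per GPU")
```

Under these assumptions batch 24 lands just under 40 GB and batch 32 around 51 GB, consistent with placing those batch sizes on the 80 GB card rather than the 40 GB one.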

Same calculation for Cosmos-Reason2-2B

Hand calculation: Cosmos-Reason2-2B fine-tuning

Cosmos-Reason2-2B is a vision-language model post-trained from Qwen3-VL-2B-Instruct. It produces text reasoning over video, so its training memory profile is dominated by long autoregressive contexts.

Assumptions: 2B total parameters, Qwen3-VL-2B-like architecture (estimated 28 transformer layers and hidden dim ≈ 2048; exact specs not disclosed by Alibaba), BF16 mixed precision, FlashAttention enabled, video tokenized at fps=4 with moderate context length.

Parameter memory (16N): 2B · 16 = 32 GB (weights + gradients + FP32 master + Adam states, as above). That alone takes 80% of an A100 40GB before a single activation is stored.

Activations: at hidden 2048, 28 layers, sequence 4,096 tokens, batch 4: 34 · 28 · 4 · 4,096 · 2,048 ≈ 32 GB.

Total: 32 + 32 + overhead ≈ 70 GB, at modest batch and sequence. Numbers shift ±20% with the actual hidden dim.

Verdict: A100 80GB minimum, and the batch must stay small. For serious training, use 8 A100 80GB with FSDP, or H100 80GB. This is one structural reason BADAS Flash Lite (22M, V-JEPA 2) can beat Cosmos-Reason2 (2B): at equivalent VRAM cost, a JEPA encoder leaves far more headroom for model size and batch size.
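Another way to read the two hand calculations is to ask how many samples fit on a given card. The sketch below inverts the memory formula; it gives an upper bound from the formula alone (practical batch sizes are lower), and the Cosmos layer count and hidden dim are the estimates stated in the assumptions, not confirmed specs. The function name `max_batch` is illustrative.

```python
import math


def max_batch(vram_gb, n_params, layers, seq_len, hidden, overhead=0.10):
    """Largest per-GPU batch that fits, according to the 16N + 34*L*b*s*h model."""
    budget = vram_gb * 1e9 / (1 + overhead)       # bytes left after overhead
    free = budget - 16 * n_params                 # bytes left for activations
    per_sample = 34 * layers * seq_len * hidden   # activation bytes per sample
    return max(0, math.floor(free / per_sample))


# V-JEPA 2 ViT-L, 16 frames at 224x224 (1,568 tokens).
vjepa = dict(n_params=300e6, layers=24, seq_len=1568, hidden=1024)
# Cosmos-Reason2-2B with the estimated architecture and a 4,096-token context.
cosmos = dict(n_params=2e9, layers=28, seq_len=4096, hidden=2048)

print("V-JEPA 2 ViT-L on 40 GB:", max_batch(40, **vjepa))
print("Cosmos-2B on 40 GB:     ", max_batch(40, **cosmos))
print("Cosmos-2B on 80 GB:     ", max_batch(80, **cosmos))
```

With these inputs the 2B model does not fit a single sample on a 40 GB card and only a handful on an 80 GB card, while the ViT-L encoder fits a double-digit batch on 40 GB.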

Hardware: when to use what

GPU choice is not just "bigger is better". The relevant axes are VRAM (capacity), bandwidth (for data-intensive training), and numerical format support (FP8 changes the picture on H100). Cloud prices below reflect mid-2026 market rates on specialised GPU providers (Lambda, RunPod, Jarvislabs, Spheron, Thunder Compute); hyperscalers (AWS, GCP, Azure) typically charge 2–6× more. Spot pricing can be 40–60% cheaper than on-demand. Prices drift weekly.

GPU	VRAM	BF16 TFLOPS	Cloud cost (on-demand)	When to use
RTX 4090	24 GB	165	$0.3–0.6/hr	Inference, dev, fine-tuning <1B params
L40S	48 GB	362	$0.8–1.5/hr	Fine-tuning, large-batch inference
A100 40GB	40 GB	312	$0.6–1.5/hr	Standard for JEPA ViT-L fine-tuning
A100 80GB	80 GB	312	$1.5–2.5/hr	Pre-training, 1–3B parameter models
H100 80GB	80 GB	989 (BF16) / 1979 (FP8)	$2.0–3.5/hr	Serious training, large models, large batches
H200	141 GB	989	$3.5–6.0/hr	Very large models without FSDP
B200 (Blackwell)	192 GB	~2250 (dense)	$5–10/hr	Foundation model pre-training

The A100 → H100 step usually pays for itself for pure training: H100 costs roughly 1.5–2× more per hour but, going by peak throughput, trains up to roughly 3× faster in BF16 and 6× faster in FP8 (relative to A100 BF16). Payback is fast on large jobs that keep the GPU compute-bound. For inference, A100 and L40S remain competitive.
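The payback argument is a single division: what matters is cost per job, not cost per hour. A minimal sketch using the price and speedup ratios quoted above, which are peak-throughput figures and therefore optimistic for real workloads; `relative_job_cost` is an illustrative name.

```python
def relative_job_cost(price_ratio, speedup):
    """Cost of a fixed training job on the faster GPU, relative to the slower one.

    price_ratio: hourly price of the faster GPU / hourly price of the slower one.
    speedup:     how many times faster the job finishes on the faster GPU.
    """
    return price_ratio / speedup


for precision, speedup in (("BF16", 3.0), ("FP8", 6.0)):
    for price_ratio in (1.5, 2.0):
        cost = relative_job_cost(price_ratio, speedup)
        print(f"H100 {precision}, {price_ratio}x hourly price: "
              f"{cost:.2f}x the A100 job cost")
```

Any result below 1.0 means the more expensive GPU is the cheaper way to finish the same job; the break-even point is where the realised speedup equals the price ratio.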

Three training scenarios with costs, plus production inference

Scenario A: V-JEPA pre-training from scratch (industrial)

A serious replication of V-JEPA at the scale of the original paper.

Assumption: the V-JEPA paper does not disclose exact GPU count or wall-clock training time. Numbers below are an order-of-magnitude estimate based on typical FAIR-cluster training for ViT-L on ~2M videos.

Model: ViT-L (~300M params).

Dataset: ~2M videos from public corpora, 16-frame clips.

Hardware: ~128 A100 80GB (estimate).

Time: ~14 days for full training (estimate).

Cost: 128 · $2/hr · 336 hr ≈ $86,000.

V-JEPA 2 [9], trained on over 1M hours of internet video, sits well above this. Replicating V-JEPA 2 at full scale is realistically a 7-figure compute budget. For most adopters, starting from Meta's public checkpoints is the only sensible choice.

Scenario B: V-JEPA 2 fine-tuning on a proprietary task (typical)

The real use case for 95% of adopters.

Model: V-JEPA 2 ViT-L pretrained + classification head.

Dataset: 200k labeled videos (~2M windowed clips, BADAS scale).

Hardware: 8 A100 80GB, global batch 256.

Time: 5–7 days for 20–30 epochs.

Cost: 8 · $2/hr · 144 hr ≈ $2,300 (144 hr is 6 days, the midpoint of the range above).

Cost ratio against full pre-training: roughly 37× ($86,000 vs $2,300). This is why foundation model + fine-tuning is the dominant paradigm.
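Both compute bills come from the same three-factor product, so it is worth keeping as a reusable helper when you plug in your own cluster size and hourly rate. A sketch using the Scenario A and B figures; the $2/hr rate and the GPU counts are the estimates stated above, and `compute_cost` is an illustrative name.

```python
def compute_cost(gpus, usd_per_gpu_hour, hours):
    """Cloud compute bill for a training run, ignoring storage and egress."""
    return gpus * usd_per_gpu_hour * hours


pretrain = compute_cost(gpus=128, usd_per_gpu_hour=2.0, hours=14 * 24)  # Scenario A
finetune = compute_cost(gpus=8, usd_per_gpu_hour=2.0, hours=6 * 24)     # Scenario B

print(f"pre-training: ${pretrain:,.0f}")
print(f"fine-tuning:  ${finetune:,.0f}")
print(f"ratio:        {pretrain / finetune:.1f}x")
```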

Scenario C: DreamerV3 training on an RL task

Typical for academic robotics or industrial R&D.

Model: DreamerV3 medium variant, ~25M parameters total (WM + actor + critic).

Dataset: collected online via environment interaction, ~1–10M env steps.

Hardware: 1 A100 40GB (sufficient; the bottleneck is environment throughput).

Time: 3–10 days depending on environment parallelisation.

Cost: $150–700.

DreamerV3 is almost always environment-bound, not GPU-bound. Adding GPUs does not help if the environment cannot be parallelised. The DreamerV3 paper reports that all Dreamer agents are trained on a single A100 GPU each, including the Minecraft diamond experiment.
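"Environment-bound" has a simple quantitative reading: wall-clock time is set by how many environment steps per second you can collect, not by the GPU. The sketch below makes that explicit. The throughput values are hypothetical placeholders, not measurements from the DreamerV3 paper.

```python
SECONDS_PER_DAY = 86_400


def wall_clock_days(env_steps, steps_per_second):
    """Days needed to collect `env_steps` at a sustained environment throughput."""
    return env_steps / steps_per_second / SECONDS_PER_DAY


# Hypothetical throughputs (steps/s) for a slow, a moderate and a
# well-parallelised environment.
for env_steps in (1_000_000, 10_000_000):
    for sps in (5, 20, 40):
        days = wall_clock_days(env_steps, sps)
        print(f"{env_steps:>10,} steps at {sps:>2} steps/s: {days:5.1f} days")
```

Doubling environment throughput halves the run time; adding a second GPU does not change any number in this table.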

Scenario D: production inference (always)

After training, the model has to run somewhere. The numbers change significantly.

BADAS 2.0 (300M): 34 ms on A100, 41 ms on Jetson Thor (automotive edge).

BADAS Flash Lite (22M): 2.8 ms on A100, 5.9 ms on Jetson Thor.

DreamerV3 inference: <10 ms on a consumer GPU for the policy alone.

Cosmos-Reason2-2B inference: ~500 ms to several seconds per clip, depending on reasoning length, because of autoregression.

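For edge deployment, model size is critical. Here JEPA has a structural advantage: a 22M-parameter model runs on a Jetson in single-digit milliseconds, while a 2B-parameter autoregressive model at ~500 ms or more per clip is far outside a real-time budget.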
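Latency numbers are easier to compare once converted to the maximum rate at which a single stream can be processed. A small sketch using the figures listed above; it assumes one inference at a time with no batching or pipelining, and takes the lower bound (500 ms) for Cosmos-Reason2-2B.

```python
# Latencies in milliseconds, as listed in this section.
LATENCY_MS = {
    "BADAS 2.0 (300M), A100": 34.0,
    "BADAS 2.0 (300M), Jetson Thor": 41.0,
    "BADAS Flash Lite (22M), A100": 2.8,
    "BADAS Flash Lite (22M), Jetson Thor": 5.9,
    "Cosmos-Reason2-2B (lower bound)": 500.0,
}


def max_rate_hz(latency_ms):
    """Maximum inferences per second for a single sequential stream."""
    return 1000.0 / latency_ms


for name, latency in LATENCY_MS.items():
    print(f"{name:<38} {latency:>6.1f} ms  ->  {max_rate_hz(latency):6.1f} inferences/s")
```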

Data efficiency: JEPA vs World Model

Data efficiency is a different metric from compute efficiency, and the architectures separate interestingly here.

I-JEPA [7] reaches competitive ImageNet linear probing with a ViT-H/14 trained for 300 epochs in roughly 1,200 GPU-hours. The paper reports this as "over 10× more efficient than a ViT-H/14 pretrained with MAE" (Masked Autoencoder, generative reconstruction), for similar linear probing on the same architecture. The gain follows from the design: there is no pixel decoder to train and the loss is computed in latent space, which on paper makes I-JEPA significantly more compute- and sample-efficient.

V-JEPA 2 was pre-trained on more than 1M hours of internet video, on the order of tens of millions of clips. Fine-tuning on the BADAS task requires only 200k labelled videos. For comparison, Sora (diffusion video) is estimated to have been trained on tens of millions of hours of video, with resources estimated at hundreds of millions of dollars.

DreamerV3 has a particular data-efficiency profile: it shines with little data because most of the learning happens in the world model's dreams. On standard pixel-based control benchmarks like the DMC Suite (DeepMind Control), it reaches strong performance within 1M environment steps; model-free baselines like PPO typically need an order of magnitude more interactions to reach comparable scores. On Atari with a 100k-step budget, DreamerV3 is among the most sample-efficient agents reported. The world model acts as a data amplifier.

A useful rule of thumb

Self-supervised pre-training (JEPA, MAE): millions of unlabeled samples. More data is better, with saturation around 10M–100M samples.

Supervised fine-tuning: 10k–500k labeled samples. The performance-vs-data curve saturates much earlier.

Reinforcement learning (Dreamer): 100k–10M env steps depending on task complexity. The world model is the efficiency multiplier.

Practical rule: let the foundation model be pretrained on as much web data as possible, and limit your own training to fine-tuning on your domain-specific dataset. This is the winning paradigm in cost/performance terms.

What this means: for builders, investors, decision-makers

Different readers leave this article with different obligations. The technical bottom line is the same; the practical move is not.

If you build

For perceptual tasks that have to run safely and at the edge — collision anticipation, anomaly detection, visual quality control, surgical assistance, drone perception — JEPA is now the default starting point. Take a public V-JEPA 2 checkpoint, fine-tune on your domain data, distil to a small variant for deployment. Resist the temptation to use a 2B+ vision-language model as the perception layer just because it is the easiest API call: the BADAS-vs-Cosmos comparison shows the structural penalty. Reserve World Models for cases where you genuinely need to roll out actions in imagination — robotics control, RL with expensive simulators, agentic policies that have to plan multi-step interventions. For everything in between, the hybrid V-JEPA 2-AC and similar action-conditioned JEPAs are the architecture to watch through 2026.

If you invest

The compute moat for foundation-model-quality perception is concentrating in two places: pre-training on >1M hours of video (Meta, NVIDIA, a handful of others) and high-quality labelled fine-tuning data with privileged access (Nexar's 350k dashcam fleet, Tesla's vehicle fleet, certain medical imaging consortia, the autonomous logistics players). The middle layer — the people who fine-tune and deploy on existing checkpoints — is where margins are healthier and where most defensible companies will be built. Be careful of pitches that promise to "train a foundation model from scratch" outside the very specific niches where that economic argument holds.

If you decide budgets

Three numbers to internalise from this article. Pre-training a serious JEPA from scratch: about $86k of compute, plus team and data costs that dwarf that. Fine-tuning an existing checkpoint to a real task: about $2.3k of compute, days of work. Running an RL world model for robotics on a single GPU: $150–700. The implication is that the right question for most organisations is not "should we train a foundation model?" but "do we have the labelled data and the engineering bandwidth to fine-tune an existing one well, and the inference budget to deploy it at our scale?" The bottleneck is rarely compute and almost always data and integration.

Conclusion

World Models and JEPA are two different answers to the same problem: how to teach machines to predict the future by self-supervision. World Models do it by generating pixels, constrained by a decoder and a reconstruction loss. JEPA does it by staying in latent space, managing the representation collapse problem with carefully calibrated architectural asymmetry.

Pixel reconstruction is more intuitive and produces interpretable output, but it wastes representational capacity on irrelevant details. Latent prediction is more efficient and captures semantic structure, but its output is not directly interpretable and it requires the full anti-collapse arsenal we discussed.

Neither is "better" in absolute terms. They are different tools for different problems. The choice depends on what you need: a simulator to plan with (World Model), an encoder to perceive with (JEPA), or both combined in the hybrid architectures emerging right now. V-JEPA 2-AC and similar action-conditioned variants are the first concrete steps in this direction.

What unites both is the underlying intuition that has driven the field for years: the ability to predict the future is central to intelligence. A system that can only react to the present is structurally limited. The next generation of AI, which will operate in the physical world and have to reason about the consequences of its own actions, will need to imagine. The architectures we are building today are the first serious attempts to give it that ability.

References

[1] Ha, D., & Schmidhuber, J. (2018). World Models. arXiv:1803.10122
[2] Hafner, D., Lillicrap, T., Ba, J., & Norouzi, M. (2020). Dream to Control: Learning Behaviors by Latent Imagination. ICLR 2020. arXiv:1912.01603
[3] Hafner, D., Lillicrap, T., Norouzi, M., & Ba, J. (2021). Mastering Atari with Discrete World Models. ICLR 2021. arXiv:2010.02193
[4] Hafner, D., Pasukonis, J., Ba, J., & Lillicrap, T. (2023). Mastering Diverse Domains through World Models. arXiv:2301.04104
[5] Wu, P., Escontrela, A., Hafner, D., Goldberg, K., & Abbeel, P. (2022). DayDreamer: World Models for Physical Robot Learning. CoRL 2022. arXiv:2206.14176
[6] LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview preprint. The position paper that introduced the JEPA framework.
[7] Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR 2023. arXiv:2301.08243. The I-JEPA paper.
[8] Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video. TMLR 2024. arXiv:2404.08471. The V-JEPA paper.
[9] Assran, M., Bardes, A., Fan, D., Garrido, Q., et al. (2025). V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. arXiv:2506.09985. The V-JEPA 2 paper, used as backbone in BADAS 2.0.
[10] Grill, J.-B., Strub, F., Altché, F., et al. (2020). Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. NeurIPS 2020. arXiv:2006.07733. First non-contrastive method that avoids collapse without negative pairs.
[11] Chen, X., & He, K. (2021). Exploring Simple Siamese Representation Learning (SimSiam). CVPR 2021. arXiv:2011.10566. BYOL variant without EMA, just stop-gradient + predictor.
[12] Tian, Y., Chen, X., & Ganguli, S. (2021). Understanding Self-Supervised Learning Dynamics without Contrastive Pairs. ICML 2021. arXiv:2102.06810. Theoretical analysis of the gradient flow for BYOL/SimSiam.
[13] Jing, L., Vincent, P., LeCun, Y., & Tian, Y. (2022). Understanding Dimensional Collapse in Contrastive Self-supervised Learning. ICLR 2022. arXiv:2110.09348. Formal characterisation of dimensional collapse.
[14] Bardes, A., Ponce, J., & LeCun, Y. (2022). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. ICLR 2022. arXiv:2105.04906
[15] Zbontar, J., Jing, L., Misra, I., LeCun, Y., & Deny, S. (2021). Barlow Twins: Self-Supervised Learning via Redundancy Reduction. ICML 2021. arXiv:2103.03230
[16] NVIDIA (Agarwal, N., et al.) (2025). Cosmos World Foundation Model Platform for Physical AI. arXiv:2501.03575. The Cosmos paper underlying Cosmos-Predict and the Cosmos-Reason model family used as baseline by Nexar.
[17] Goldshmidt, R., Scott, H., Niccolini, L., & Matzner, H. (2026). Beyond the Beep: Scalable Collision Anticipation and Real-Time Explainability with BADAS-2.0. arXiv:2604.05767. The technical paper describing the BADAS 2.0 system and benchmarks.
[18] Nexar AI (2026). BADAS 2.0: A V-JEPA 2 World Model for Collision Anticipation. badas.nexar.app. Public benchmarks and controlled comparison against Cosmos-Reason2-2B.

Luigi Simeone is a Technology Executive, Independent Researcher, and Fractional CTO based in Turin, Italy. He holds a PhD in Signal Processing & Nonlinear Dynamics from the University of Southampton and is currently looking for the right problem to solve — preferably one that doesn't fit in a single box.