
I Asked AI to Measure a Building. It Looked Up the Answer Instead

  • Ashish Arora
  • 2 days ago
  • 9 min read
AI Can Look At a Video & Describe a Car Crash. It Can't Tell You How Fast the Car Was Going.

Over the last few months, just like you, I've been using Vision-Language Models (VLMs) quite extensively - more aggressively than ever. Uploading screenshots, analysing dashcam clips, feeding in product images, asking models to "look at this video and tell me what's happening." And I'll be honest - the results felt magical.


"There's a yellow sedan moving left to right across the frame. It appears to be on a residential street. The speed looks moderate - probably around city driving speeds."  

Sounds smart. Sounds right. Sounds like something a person would say. But here's the thing: 'sounds like' and 'actually is' are two very different things. And a recent study from Stanford just proved, with numbers, how wide that gap really is.



As Ehsan Adeli, director of the Stanford Translational AI Lab puts it: "AI produces responses that sound plausible, but on closer analysis prove to be little more than guesswork."


Or think about it this way: AI can accurately describe a coconut falling from a palm tree to a beach below. It can narrate the scene beautifully. But ask it to estimate the coconut's speed? That's where things fall apart. Let me break this down!


Let's do a bit of background work first - What Is a VLM?

If you've ever uploaded a photo to ChatGPT and asked "what's in this image?", you've used a VLM. It is an AI system that can process both visual inputs (images, video) and text together.


Think of it as two brains stitched into one:

  • The Eyes - a vision encoder that takes an image or video frame and converts it into a dense grid of numbers. Every pixel, every edge, every spatial relationship gets captured as a mathematical representation.

  • The Mouth - a large language model (LLM) that takes those visual representations along with your text prompt and generates a response.


The key insight? These two systems are trained together on billions of image-text pairs scraped from the internet. Through this training, the model absorbs an enormous amount of implicit world knowledge - what cars look like, how big they typically are, how fast things usually move, what gravity does to falling objects.


Here's what makes this interesting. When you feed a video to a VLM, the model receives it as a tensor of raw pixel values. Every frame. Every coordinate. Mathematically precise information that no human eye could process at that resolution. In theory, this should make VLMs better than humans at measuring things in videos, right? They have pixel-perfect access to information we can only eyeball.


Hold that thought.


How We Think VLMs Work (The Assumptions)

Knowing what we know about VLMs - that they receive precise pixel data, that they're trained on billions of real-world examples, that they can reason in natural language - a set of natural assumptions emerges. And honestly? They all seem perfectly reasonable:


  • "They actually ground the answers based on the video." They receive raw pixel data - frame by frame, coordinate by coordinate. Surely they're using it to answer the queries, right?

  • "More precision in means more precision out." If the model gets pixel-perfect information that no human could perceive, it should produce more precise answers than a human ever could.

  • "They reason from what you give them." If you tell a VLM "this car is 5.67 meters long" - it should use that specific number as a reference point in its calculations.

  • "They do math, not vibes." When asked to compute velocity, you'd expect the model to actually divide displacement by time. Not just... guess what sounds right.

  • "Step-by-step helps." I'm a big fan of chain-of-thought prompting; it works wonders for math and logic. Surely decomposing a physics problem into steps would help too?


These assumptions feel like common sense. They're the mental model most of us carry. And Stanford just showed that every single one of them is wrong.


Enter QuantiPhy: Stanford's Reality Check

Researchers at Stanford - led by Puyin Li, Tiange Xiang, and Ehsan Adeli, under the guidance of Fei-Fei Li - published QuantiPhy, the first benchmark designed to quantitatively test whether VLMs can reason about physics with numerical accuracy.


The benchmark is beautifully designed. 3,355 video-text instances across 569 videos, combining Blender simulations (with exact ground truth), controlled lab captures using multi-camera stereo rigs, and real-world internet footage. The task: given a video of moving objects and one known physical property (an object's size, velocity, or acceleration), estimate other kinematic properties.


Here's the elegance. In physics, if you know one property and can measure pixel-space trajectories from video, you can compute a scale factor and derive everything else. A VLM with pixel-level access should, in theory, nail this.
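To make that concrete, here's a minimal sketch of the scale-factor idea (my own illustration, not code from the paper; the object length, pixel spans, and timings are invented):

```python
# Sketch of the computation an ideal agent with pixel access could do.
# One known physical property fixes the world-to-pixel scale; everything
# else follows by arithmetic. All numbers below are made up.

def metres_per_pixel(known_length_m: float, length_px: float) -> float:
    """One known physical property pins down the world-to-pixel scale."""
    return known_length_m / length_px

def speed_m_per_s(displacement_px: float, dt_s: float, scale: float) -> float:
    """Convert a pixel-space displacement over time into metres per second."""
    return displacement_px * scale / dt_s

# Say the prompt states the car is 4.5 m long and it spans 150 px on screen.
scale = metres_per_pixel(4.5, 150.0)      # 0.03 m per pixel
# The car travels 250 px between two frames 0.5 s apart.
print(speed_m_per_s(250.0, 0.5, scale))   # 15.0 m/s
```

That's all the elegance amounts to: one anchor measurement, one division, and every other kinematic quantity follows.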


As co-first author Tiange Xiang explains: "Even the very best models rarely perform better than chance on estimating distances, orientations, and sizes of objects in two-dimensional videos. And this is not a trivial shortcoming."


They tested 21 state-of-the-art models. Plus humans. The results?


Shattering the Assumptions

Let's go over our assumptions one by one.


Number 1: "They actually ground the answers based on the video."

The test: Stanford removed the video entirely and gave models only the text prompt - the object description, the physical prior, and the question. No visual input whatsoever.


What should happen? Performance should collapse. Without the video, the model has no way to measure pixel trajectories. It's flying completely blind.


What actually happened? Scores barely dropped. Let that sit for a second. The models were producing nearly the same answers without ever seeing the scene.


Meaning when you upload a video and ask "how fast is this car moving?", the model isn't watching the car move. It's reading "car on a road" and retrieving a memorized statistic about typical car speeds. The video is decorative input, not functional input.

VLMs don't behave like visual measurers. They behave like powerful guessers conditioned on textual hints.
Practical example: I ran a quick test myself. I uploaded two different building photos to a VLM and asked it to estimate the height.  

For a small 4-story residential building, it responded: "About 16 meters. 4 residential floors × typical floor-to-floor height ≈3.0m, plus stilt/parking level ≈3.2m, plus roof parapet ≈0.8–1.0m."

For a tall high-rise, it said: "About 18–20 stories. With typical residential floor-to-floor heights of ~3–3.2m, that puts it around 55–65 meters." Sounds precise. Sounds measured. But look carefully - the exact same memorized constant (≈3.0–3.2m per floor) appears in both answers.

Two completely different buildings — different heights, different proportions, different pixel footprints — and the model reaches into the same pocket and pulls out the same number. Not a single pixel was measured in either case. The model identified "building with floors" — a qualitative observation it's genuinely good at — then multiplied by a textbook value it memorized during training. 

That's not visual measurement. That's sophisticated guessing dressed up as calculation.  

And here's what makes it almost comically transparent: the model doesn't even pretend otherwise. Read the output again. It literally says "typical floor-to-floor height ≈3.0m." That word typical is doing all the work. The model isn't saying "I measured the floor-to-floor distance in this specific image at 3.0m." It's saying "floors are typically this tall, so I'll use that." 
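Strip away the prose and the model's "calculation" in both cases reduces to the same few lines. Here's my reconstruction of its arithmetic (a sketch of the pattern, not the model's actual internals):

```python
# Count a feature, multiply by a memorized "typical" constant. No pixel
# from either photo enters the computation at any point.
TYPICAL_FLOOR_M = 3.0   # the constant doing all the work, both times

def low_rise_guess(floors: int, stilt_m: float = 3.2, parapet_m: float = 0.9) -> float:
    return floors * TYPICAL_FLOOR_M + stilt_m + parapet_m

def high_rise_guess(floors: int) -> float:
    return floors * TYPICAL_FLOOR_M

print(round(low_rise_guess(4), 1))   # 16.1 - the "about 16 meters" answer
print(high_rise_guess(19))           # 57.0 - inside the "55-65 meters" range
```

Two different buildings, one lookup table. Change the photo and only the floor count changes; the constant never does.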

Number 2: "More precision in means more precision out"

Humans took the same test. They can only eyeball things - count rough grid lines, compare objects to familiar references, estimate proportions. No pixel access. No mathematical precision.


Humans - with their imprecise, squint-and-guess approach - outperformed every single AI model, despite VLMs having access to sub-pixel accuracy that should give them a massive advantage. The information was there. The models just weren't using it.


An ideal agent with precise frame-level access to pixel coordinates could recover the world-to-pixel scale and compute target quantities exactly.


Number 3: "They reason from what you give them"

This is where it gets really interesting. Stanford ran a counterfactual test. They took the same videos but multiplied the physical priors by extreme factors. Instead of "the car is 5.67 meters long," they said "the car is 5,670 meters long" (×1000).


If a model is truly reasoning - using the prior as a reference, computing a scale factor, deriving the answer - its prediction should scale proportionally. A 1000× larger car means 1000× larger everything else. The math doesn't care whether the number is realistic.


What happened? The models ignored the absurd input. Instead of propagating the counterfactual prior through the computation, models defaulted to real-world-plausible answers. They saw "car," retrieved "typical car size = ~4.5m," and worked from that, completely overriding the explicit numerical input they were given.


Even when given a numerically precise but altered prior, the outputs remain close to the original physical magnitudes implied by real-world experience, rather than those dictated by the provided priors.
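The proportionality the researchers were checking for is trivial to state in code. A sketch (mine, with invented pixel measurements - not the benchmark's harness):

```python
import math

# If the model truly derived its answer from the stated prior, scaling the
# prior by 1000x must scale every derived quantity by 1000x. The pixel
# span, displacement, and timing below are invented for illustration.

def derived_speed(prior_length_m, length_px, displacement_px, dt_s):
    scale = prior_length_m / length_px       # metres per pixel
    return displacement_px * scale / dt_s    # metres per second

normal = derived_speed(5.67, 189.0, 250.0, 0.5)      # a plausible car
absurd = derived_speed(5670.0, 189.0, 250.0, 0.5)    # the 1000x counterfactual
assert math.isclose(absurd, normal * 1000)           # the math doesn't flinch
print(normal, absurd)
```

The tested models fail exactly this check: feed them the absurd prior and the answer still comes back looking like `normal`.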


Number 4: "They do math, not vibes"

This one is my favorite. Stanford created a Blender-simulated basketball scene - visually realistic indoor court, ball bouncing, the whole deal. But here's the twist: the physics were counterfactual. The ball's acceleration was set to approximately 1 m/s², not standard gravity. The model was asked: what's the ball's acceleration at t=0.5s?


The answer? 9.8 m/s². Earth's gravitational constant. Retrieved instantly. No pixel measurements. No trajectory analysis. No scale computation. The model saw "basketball falling," opened its internal lookup table, and said "gravity = 9.8."
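For contrast, here's what actually measuring that acceleration would look like: a second finite difference over the tracked trajectory. A sketch with invented positions (metres, after applying a scale factor), assuming frames 0.1 s apart:

```python
# Central second difference: given three equally spaced position samples,
# acceleration = (y2 - 2*y1 + y0) / dt^2. The positions below follow
# y(t) = 0.5 * a * t^2 with the counterfactual a = 1 m/s^2, not 9.8.

def acceleration(y0: float, y1: float, y2: float, dt: float) -> float:
    return (y2 - 2 * y1 + y0) / dt ** 2

# y(0.4) = 0.08, y(0.5) = 0.125, y(0.6) = 0.18 (metres, measured downward)
print(acceleration(0.08, 0.125, 0.18, 0.1))   # ≈1.0 m/s^2 - not 9.8
```

Three subtractions and a division. The trajectory in the video contains the answer; the model just never reads it.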


This is exactly what happens when you ask a VLM about speed in a slow-motion video. The visual information clearly shows objects moving slowly - but the model "knows" that cars go 60 km/h and people walk at 5 km/h, so it gives you real-time estimates regardless of the actual footage. The memorized prior wins. Every time.


Bonus finding: Complex scenes actually 'help' the guessing

Here's one that's genuinely counterintuitive. You'd expect VLMs to perform worse in complex, cluttered scenes - more objects, more distractions, harder to isolate what matters. But VLMs actually performed 'better' in complex backgrounds. Why?


Because complex scenes provide more contextual clues to guess from. A car on a highway with lane markings, other vehicles, and road signs gives the model more semantic anchors to retrieve memorized statistics. A simple scene with a ball on a plain background? Fewer clues to guess from. This is further proof the models are guessing from context, not measuring from pixels.


More context = better guesses. If they were truly measuring, scene complexity would only add noise.


Number 5: "Step-by-step helps"

Stanford decomposed the reasoning into explicit steps, and only 3 models showed any improvement. For the rest - including several strong open-source systems - CoT prompting degraded performance, sometimes by a large margin.


Why? Because the models couldn't reliably solve even the intermediate steps. Errors in step 1 (measuring pixels) propagated and amplified through steps 2, 3, and 4. Decomposing the task didn't simplify it; rather, it gave the model more opportunities to fail.


So What's Actually Going On?

Here's the core insight, and it's a big one: VLMs have learned a statistical model of the world - they know roughly how big cars are, how fast people walk, how gravity works, how objects typically behave. When asked a physics question, they don't measure and compute. They retrieve and guess.


Or as Puyin Li, co-first author of the study, puts it bluntly: "Their approach is more like guessing than reasoning. VLMs are very successful guessers."

This works brilliantly for qualitative questions. "Is the car moving fast or slow?" "Is the object falling?" "Which direction is the person walking?" - these are essentially pattern-matching tasks, and VLMs crush them. But it falls apart completely for quantitative questions. "What is the car's speed at t=2.0s in m/s?" - this requires the answer to be about this specific video, not about what cars usually do. And VLMs can't seem to make that distinction.


The information is there. They're just not using it.


If you're building with VLMs - or even just using them regularly like I do - trust them for "what" questions. Be skeptical of "how much" questions. "What objects are in this scene?" - probably reliable. "How far apart are they?" - probably a guess dressed up as a measurement.


The Bottom Line

This isn't just a benchmark curiosity. We're building autonomous vehicles that need to estimate the speed of oncoming traffic. Household robots that need to know how hard to grip an egg versus a butternut squash. Medical imaging tools that need to measure tumors, not guess their size from averages. If the AI powering these systems is retrieving memorized statistics instead of measuring actual sensor data - that's not a model limitation. That's a safety problem.


VLMs have gotten so good at sounding right that we've started trusting them on tasks they fundamentally can't do yet. They can describe, but they can't measure. They can narrate, but they can't compute. The next time a VLM gives you a confident number - a speed, a distance, a height - remember: it probably arrived at that number the same way you'd guess the weight of a stranger at a party. By looking at them and pulling from experience. Not by stepping on a scale. Your AI can describe a car crash in vivid detail. It just can't tell you how fast the car was going. And right now, it doesn't even know the difference.
