Xpeng AI Dr Liu interview

Xpeng’s head of autonomous driving told Electrek that the company is spending roughly 300 million RMB (~$41 million) per month on AI training alone and believes it has already reached parity with Tesla’s FSD v13 — with v14 within reach before the end of summer.

I sat down with Dr. Xianming Liu, head of Xpeng’s General Intelligence Center, the day after he delivered a keynote at CVPR 2026 in Denver — sharing the stage with Tesla’s Ashok Elluswamy and leaders from Nvidia and Waymo at one of the world’s top computer vision conferences.

The conversation covered Xpeng’s VLA 2.0 architecture, the company’s sensor strategy, its Volkswagen licensing deal, and why Dr. Liu believes the entire autonomous driving industry needs to stop treating language models as the answer to self-driving.

Here’s the full interview:

Xpeng’s head of AI Xianming Liu on catching up to Tesla FSD

‘Language Is Poison’

Dr. Liu has become known for a provocative statement: “language is poison” when it comes to autonomous driving. In our interview, he explained the nuance behind that headline.

Xpeng’s first-generation VLA (Vision-Language-Action) model used language tokens as an intermediate step — the system would see the road, translate what it saw into language-like representations, then convert those into driving actions. VLA 2.0, which I tested in Beijing in April and found comparable to Tesla’s FSD v14, removes that middle step entirely.

But Liu clarified that Xpeng hasn’t abandoned language completely. The system still accepts language as input — text prompts and instructions from the driver. What it removes is language as an intermediate output during the actual driving process.

“We still utilize languages as input, so this is a key to improve the generalizability,” Liu said. “You talk to the vehicle and you give some instructions. The vehicle needs to understand how to execute these instructions. But during the driving, we don’t output any language tokens because this is a redundancy or a bottleneck of the model.”

The reasoning is straightforward: the car ingests roughly two billion visual tokens per second from its cameras but only needs maybe 10 or 20 tokens to control the steering wheel and pedals. That’s a massive dimensionality reduction, and adding a language translation step in between just introduces unnecessary computation and latency.

“In order to get a language expression, you need to generate a lot of unnecessary computation to explain it. So that’s why we remove the language as intermediate layers, but we still keep the language as input,” he explained.
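To put that compression in numbers, here is a rough back-of-the-envelope sketch. The two billion input tokens per second and the 10 to 20 output tokens come from Liu's description. The control rate (`CONTROL_HZ`) is not stated in the interview and is an assumed value chosen purely for illustration, so the result is only an order-of-magnitude guide.

```python
def compression_ratio(input_tokens_per_s: float,
                      output_tokens_per_step: float,
                      control_hz: float) -> float:
    """Input tokens consumed for every output token produced."""
    output_tokens_per_s = output_tokens_per_step * control_hz
    return input_tokens_per_s / output_tokens_per_s


INPUT_TOKENS_PER_S = 2e9  # "roughly two billion visual tokens per second"
CONTROL_HZ = 10.0         # ASSUMPTION: hypothetical control rate, not from the interview

# Liu's "maybe 10 or 20 tokens" gives the two ends of the range.
low = compression_ratio(INPUT_TOKENS_PER_S, 20, CONTROL_HZ)
high = compression_ratio(INPUT_TOKENS_PER_S, 10, CONTROL_HZ)
print(f"{low:.0e} to {high:.0e} input tokens per output token")
```

Under that assumed control rate, the model is squeezing on the order of ten million input tokens into each output token, which is the scale of reduction Liu argues leaves no room for a detour through language.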

World Model: The Next Piece of the Puzzle

Liu used his CVPR keynote to unveil something new — Xpeng’s world model, which he frames not as a separate technology from VLA but as “the other side of the same problem.”

VLA 2.0 learns from human driving behavior: it studies how millions of drivers react in real situations and learns to replicate those decisions. The world model learns the physics of the environment: it predicts what will happen next in the scene, how other agents will move, and what the consequences of any given action will be.

“People trying to separate the two concepts, world model and VLA, as two dimensions of the technology, but actually they are the same,” Liu said. “Our goal is trying to build a foundation model powerful enough to understand the world.”

The practical application: Xpeng is now training VLA 2.0 to simultaneously predict what the cameras will see in the near future and what the car should do — combining driving and world prediction in one model. The company expects to deploy this upgrade to production vehicles later this year.

Xpeng has published a series of research papers backing this work, including X-World for controllable video generation, X-Foresight for joint future prediction and planning, and X-Cache, which cuts world model computation by 70% with negligible quality loss. The company also had its “DrivePTS” paper on driving scene generation accepted at CVPR 2026.

Radar Stays for Active Safety, Not for Driving

One detail that often gets lost in Xpeng’s “pure vision” marketing: the P7+, G7, and other recent models still carry three millimeter-wave radars and twelve ultrasonic sensors alongside their cameras. I asked Liu how those fit into an end-to-end architecture.

His answer was clear: they don’t feed into the main driving AI at all.

“We do utilize these other sensors, but they are utilized for the active safety system, which requires an orthogonal system to be totally redundant with the main driving system,” Liu said. The radar and ultrasonics power AEB (automatic emergency braking) and AES (automatic emergency steering) — a completely separate safety layer.
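The idea of an “orthogonal” safety layer can be sketched in a few lines. This is not Xpeng's implementation: the `Command` type, the `arbitrate` function, the time-to-collision rule, and both threshold values are hypothetical, chosen only to show a radar-based check overriding a vision-based planner without sharing any of its inputs.

```python
from dataclasses import dataclass


@dataclass
class Command:
    steer: float  # steering request, radians
    accel: float  # acceleration request, m/s^2 (negative means braking)


def time_to_collision(range_m: float, closing_speed_mps: float) -> float:
    """Seconds until impact if nothing changes; infinite when the gap is not closing."""
    if closing_speed_mps <= 0.0:
        return float("inf")
    return range_m / closing_speed_mps


def arbitrate(planner_cmd: Command,
              radar_range_m: float,
              closing_speed_mps: float,
              ttc_threshold_s: float = 1.5,    # ASSUMPTION: illustrative threshold
              max_brake_mps2: float = -8.0     # ASSUMPTION: illustrative braking limit
              ) -> Command:
    """Pass the vision planner's command through unless radar says a crash is imminent."""
    if time_to_collision(radar_range_m, closing_speed_mps) < ttc_threshold_s:
        # The safety layer only looks at radar data, never at the planner's reasoning.
        return Command(steer=planner_cmd.steer, accel=max_brake_mps2)
    return planner_cmd


cruise = Command(steer=0.0, accel=0.5)
far = arbitrate(cruise, radar_range_m=60.0, closing_speed_mps=10.0)   # TTC = 6 s
near = arbitrate(cruise, radar_range_m=10.0, closing_speed_mps=10.0)  # TTC = 1 s
print(far, near)
```

The point of the design is that the override path depends on a different sensor and a much simpler rule, so a mistake in the main driving model does not automatically become a mistake in the safety check.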

The main driving system is vision-only. Liu’s reasoning comes down to information density and latency: “Camera readout time is a couple milliseconds, pretty quick, and the frequency could be very high. From the information density perspective, camera is one of the best sensors. If you’re using LiDAR, radars, the processing time is pretty long, usually tens or hundreds of milliseconds.”

This puts Xpeng in an interesting middle ground. Tesla famously removed radar and ultrasonic sensors entirely from its vehicles, relying solely on cameras for everything including active safety. Waymo goes in the opposite direction with a full LiDAR suite. Xpeng uses cameras exclusively for the driving brain but keeps radar and ultrasonics as a separate safety net.

When I asked if the vision system could eventually get so good that the redundant safety layer becomes unnecessary, Liu was blunt: “We hope so, but to be honest, it’s not possible. We all make mistakes. System also makes mistakes. Even you can reach 99.9999%, you still have the chance to make mistakes. Adding another layer of redundancy definitely will help.”

He added: “You’re not talking about chatting with ChatGPT and making a mistake — you just say, ‘hey, this is so silly, redo.’ You’re talking about lives.”
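Liu's redundancy argument can be illustrated with simple probability. The 99.9999% figure is his hypothetical. The safety layer's miss rate below is an invented number, and treating the two layers as statistically independent is an assumption (it is what “orthogonal” would ideally deliver, but the interview gives no data on it).

```python
def combined_failure(primary_fail: float, backup_miss: float) -> float:
    """Probability that both layers fail on the same event, assuming independence."""
    return primary_fail * backup_miss


PRIMARY_FAIL = 1e-6  # complement of Liu's hypothetical 99.9999%
BACKUP_MISS = 1e-2   # ASSUMPTION: made-up miss rate for the radar-based safety layer

both = combined_failure(PRIMARY_FAIL, BACKUP_MISS)
print(f"primary alone: {PRIMARY_FAIL:.0e}, with backup: {both:.0e}")
```

Even a fairly imperfect backup cuts the combined failure probability by its own miss rate, which is why a second layer helps no matter how good the first one gets. If the two layers tend to fail in the same situations, the benefit shrinks accordingly.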

300 Million RMB a Month on AI Training

I asked Liu about the scale of Xpeng’s investment in autonomous driving. His answer was startling for a company that delivered roughly 200,000 vehicles last year.

“There’s a lot of jokes on the internet saying I’m asking a lot of budget from the boss,” Liu said. “He said something like 300 million a month RMB for me to train the model. That’s roughly the truth. I do spend a lot of money.”

That’s roughly $41 million per month, or close to $500 million a year, on AI model training alone. It is a significant figure for a company with RMB 47.66 billion ($6.5 billion) in cash at the end of 2025. Liu acknowledged this is unusual for a car company: “As a car company, you can never imagine such a big R&D investment because you can never make the money back. But our company is determined to be a Physical AI company.”
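For readers who want to check the math, the figures above work out as follows. All inputs are taken from the article. The calculation assumes spending stays flat for twelve months and that the exchange rate implied by the article's own conversion holds, neither of which Liu confirmed.

```python
MONTHLY_RMB = 300e6  # "something like 300 million a month RMB"
MONTHLY_USD = 41e6   # the article's dollar conversion
CASH_RMB = 47.66e9   # cash at the end of 2025

implied_rate = MONTHLY_RMB / MONTHLY_USD  # RMB per USD implied by the conversion
annual_rmb = MONTHLY_RMB * 12             # ASSUMPTION: flat spend all year
annual_usd = MONTHLY_USD * 12
share_of_cash = annual_rmb / CASH_RMB     # annual training spend vs. cash on hand

print(f"implied rate: {implied_rate:.2f} RMB/USD")
print(f"annual spend: RMB {annual_rmb / 1e9:.1f}B (${annual_usd / 1e6:.0f}M)")
print(f"share of year-end cash: {share_of_cash:.1%}")
```

At that pace, a year of model training would cost RMB 3.6 billion, or a bit under 8% of the cash Xpeng reported at the end of 2025. Cash on hand is a balance rather than an annual budget, so the ratio is only a sense of scale.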

Xpeng disclosed at CVPR that its training infrastructure has achieved a 4,360% gain in single-job training efficiency over the past 12 months, with GPU utilization climbing from 40% to 90%. VLA 2.0 uses billions of parameters and consumes over four trillion tokens per model iteration.
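It is worth seeing how far the GPU utilization figure goes toward explaining the efficiency figure. The sketch below reads “4,360% gain” as an increase on top of the baseline (a 44.6x multiplier) and assumes the improvements multiply together. Xpeng did not say how the two numbers relate, so both readings are assumptions.

```python
GAIN_PERCENT = 4360  # "4,360% gain in single-job training efficiency"
UTIL_BEFORE = 40     # GPU utilization before, percent
UTIL_AFTER = 90      # GPU utilization after, percent

# ASSUMPTION: a "gain" of X% means new = old * (1 + X/100).
total_multiplier = 1 + GAIN_PERCENT / 100
utilization_multiplier = UTIL_AFTER / UTIL_BEFORE

# ASSUMPTION: independent improvements multiply, so the rest is a simple quotient.
other_multiplier = total_multiplier / utilization_multiplier

print(f"total: {total_multiplier:.1f}x")
print(f"from utilization: {utilization_multiplier:.2f}x")
print(f"from everything else: {other_multiplier:.1f}x")
```

On that reading, better GPU utilization accounts for a 2.25x improvement, leaving roughly a 20x factor that would have to come from elsewhere, such as hardware, software, or model changes the company did not break out.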

Tesla Comparison: ‘Same Philosophy, Different Data’

Liu was diplomatic but specific when comparing Xpeng’s approach to Tesla’s FSD.

“I think we share similar underlying philosophy and principles, which is scaling up,” he said. “No matter Tesla or Xpeng or other companies working on this trajectory are working on the same thing — just following the scaling law, make sure you have a system which is data-driven and you can feed unlimited data into the model.”

The key difference, according to Liu, is data diversity. Chinese roads are significantly more chaotic than American ones — a point I experienced firsthand during my 40-minute VLA 2.0 test drive in Beijing, where I encountered more edge cases than I’d see in weeks of driving in North America.

“In China you have a larger chance to get corner cases and data. This is one advantage,” Liu said. He argued this could give Xpeng a better shot at expanding internationally than Tesla has at bringing FSD to China. It is not easier per se, he said, but “you have more chances because you have more diverse data.”

The Golden Gate Bridge Bet

Xpeng CEO He Xiaopeng made a public bet with Liu last year: if VLA 2.0 doesn’t match Tesla FSD performance by August 30, 2026, Liu has to run naked across the Golden Gate Bridge.

Liu says he’s safe. “I’m pretty confident I don’t need to run,” he told me. “The condition is reaching parity with Tesla FSD in the beginning of this year. Based on test drives, we already hit the goal.”

He claims Xpeng went from FSD v12 parity to “almost v14 or even better than v13” in just a few months, crediting the team’s rapid iteration speed. The August deadline still stands, but Liu seems relaxed about it.

Google’s Pixel Analogy and the Volkswagen Deal

Perhaps the most revealing moment came when Liu described Xpeng’s identity. He compared the company to Google making Pixel phones: the hardware exists primarily to showcase the software and collect data for it.

“Producing cars or manufacturing cars definitely is one of the main reasons we are working now,” he said. “We need physical devices in the real world to make sure we get feedback, we get data. Just like Google is producing Pixel devices just trying to demonstrate, ‘OK, this is what Android can do.’ But on the other hand, we want to make sure we are an AI company.”

This framing puts the Volkswagen VLA 2.0 licensing deal in context. Volkswagen became the first external customer for Xpeng’s autonomous driving technology, validating the company’s push to position itself as an AI platform provider rather than purely a vehicle manufacturer.