Learn AI

Vision-language-action models have reshaped autonomous driving by incorporating language into the decision-making process. However, most existing pipelines use the language modality only for scene description or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions and the corresponding trajectories. We then propose a unified Vision-Language-World-Action model, Vega, for instruction-based g

arXiv · cs.RO · cs.AI · 1 day ago

Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving

Human driving behavior is inherently personal, shaped by long-term habits and influenced by short-term intentions. Individuals differ in how they accelerate, brake, merge, yield, and overtake across diverse situations. However, existing end-to-end autonomous driving systems either optimize for generic objectives or rely on fixed driving modes, lacking the ability to adapt to individual preferences or interpret natural language intent. To address this gap, we propose Drive My Way (DMW), a personalized Vision-Language-Action (VLA) driving framework that aligns with users' long-term driv

arXiv · cs.AI · cs.CL · 1 day ago

Training the Knowledge Base through Evidence Distillation and Write-Back Enrichment

The knowledge base in a retrieval-augmented generation (RAG) system is typically assembled once and never revised, even though the facts a query requires are often fragmented across documents and buried in irrelevant content. We argue that the knowledge base should be treated as a trainable component and propose WriteBack-RAG, a framework that uses labeled examples to identify where retrieval succeeds, isolate the relevant documents, and distill them into compact knowledge units that are indexed alongside the original corpus. Because the method modifies only the corpus, it can be applied once

arXiv · cs.CV · cs.AI · 1 day ago

PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve

arXiv · cs.CV · cs.AI · 1 day ago

PixelSmile: Toward Fine-Grained Facial Expression Editing

Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) dataset with continuous affective annotations and establish FFE-Bench to evaluate structural confusion, editing accuracy, linear controllability, and the trade-off between expression editing and identity preservation. We propose PixelSmile, a diffusion framework that disentangles expression semantics via fully symmetric joint training. PixelSmile combines intensity supervision with contrastive learning to produce stronger and more distinguis

arXiv · cs.AI · cs.MM · 1 day ago

Back to Basics: Revisiting ASR in the Age of Voice Agents

Automatic speech recognition (ASR) systems have achieved near-human accuracy on curated benchmarks, yet still fail in real-world voice agents under conditions that current evaluations do not systematically cover. Without diagnostic tools that isolate specific failure factors, practitioners cannot anticipate which conditions, in which languages, will cause what degree of degradation. We introduce WildASR, a multilingual (four-language) diagnostic benchmark sourced entirely from real human speech that factorizes ASR robustness along three axes: environmental degradation, demographic shift, and l

From the Labs

OpenAI · 3 days ago

Inside our approach to the Model Spec

Learn how OpenAI’s Model Spec serves as a public framework for model behavior, balancing safety, user freedom, and accountability as AI systems advance.

HuggingFace · 8 days ago

Build a Domain-Specific Embedding Model in Under a Day

Hacker News · about 3 hours ago

CERN uses tiny AI models burned into silicon for real-time LHC data filtering

OpenAI · about 13 hours ago

STADLER reshapes knowledge work at a 230-year-old company

Learn how STADLER uses ChatGPT to transform knowledge work, saving time and accelerating productivity across 650 employees.

Hacker News · about 16 hours ago

Namespace: We've raised $23M to build the compute layer for code

Hacker News · about 18 hours ago

Show HN: Open-Source Animal Crossing–Style UI for Claude Code Agents

We posted here on Monday and got some great feedback. We’ve implemented a few of the most requested updates:
- iMessage channel support (agents can text people and you can text agents). Other channels are simple to extend.
- A built-in browser (agents can navigate and interact with websites)
- Scheduling (run tasks on a timer / cron / in the future)
- Built-in tunneling so that the agents can share local stuff with you over the internet
- More robust MCP and Skills support so anyone can e