AIPULSE
Discover

Latest research papers and top Hacker News stories

Latest Papers (24)
1
NLP · AI · ML

LACUNA: A Testbed for Evaluating Localization Precision for LLM Unlearning

LLMs memorize sensitive training data, including personally identifiable information (PII), creating a pressing need for reliable post hoc removal methods. Unlearning has emerged as a promising solution, with state-of-the-art (SOTA) methods often following a localize-first, unlearn-second paradigm that targets specific model parameters. However, existing benchmarks evaluate unlearning solely at the output level, leaving open the question of whether unlearning truly erases knowledge from a model's parameters or merely obfuscates it, a concern reinforced by the success of resurfacing attacks. To

Matteo Boglioni, Thibault Rousset, Siva Reddy·about 15 hours ago
2
ML · AI · NLP

Program-as-Weights: A Programming Paradigm for Fuzzy Functions

Many everyday programming tasks resist clean rule-based implementation, such as alerting on important log lines, repairing malformed JSON, or ranking search results by intent, and are increasingly outsourced to large language model APIs at the cost of locality, reproducibility, and price. We propose fuzzy-function programming: compiling such a function from a natural-language specification into a compact, locally-executable neural artifact. We instantiate this paradigm with Program-as-Weights (PAW), in which a 4B compiler trained on FuzzyBench, a 10M-example dataset we release, emits parameter

Wentao Zhang, Liliana Hotsko, Woojeong Kim·about 15 hours ago
3
AI · NLP · ML

Online Safety Monitoring for LLMs

Despite alignment training, LLMs remain prone to generating unsafe outputs at deployment time. Monitoring outputs online and raising an alarm when safety can no longer be assumed is therefore critical. We study a simple real-time monitor that turns a verifier signal from an external model into an alarm decision by thresholding, with the threshold calibrated via risk control. In experiments on mathematical reasoning and red teaming datasets, we show that this simple design is competitive with more advanced monitors based on sequential hypothesis testing.

Mona Schirmer, Metod Jazbec, Alexander Timans·about 15 hours ago
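
The thresholding idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a verifier score where higher means safer, a labeled calibration set, and a conformal-risk-control style bound on the fraction of calibration outputs that are unsafe yet raise no alarm. The names `calibrate_threshold` and `alarm` are hypothetical.

```python
from typing import Sequence


def calibrate_threshold(
    scores: Sequence[float], unsafe: Sequence[bool], alpha: float
) -> float:
    """Return the smallest threshold whose adjusted miss risk is <= alpha.

    Assumption: an alarm is raised when score < threshold, and a "miss" is an
    unsafe output whose score is >= threshold (no alarm raised).
    """
    n = len(scores)
    # Candidate thresholds: every observed score, plus +inf (alarm on everything).
    candidates = sorted(set(scores)) + [float("inf")]
    for lam in candidates:
        misses = sum(1 for s, u in zip(scores, unsafe) if u and s >= lam)
        # Finite-sample adjusted risk for a loss bounded by 1:
        # (n / (n + 1)) * (misses / n) + 1 / (n + 1)
        if (misses + 1) / (n + 1) <= alpha:
            return lam
    # alpha is too small for this calibration set size: always raise the alarm.
    return float("inf")


def alarm(score: float, threshold: float) -> bool:
    """Raise an alarm when the verifier score falls below the threshold."""
    return score < threshold


# Toy calibration data (made up for illustration only).
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95]
cal_unsafe = [True, True, True, False, True, False, False, False, False]
threshold = calibrate_threshold(cal_scores, cal_unsafe, alpha=0.25)
```

Raising the threshold can only reduce the number of misses, so scanning candidates in ascending order finds the least conservative threshold that satisfies the bound.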
4
AI

ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning

Understanding and reasoning over long contexts has become a key requirement for deploying large language models (LLMs) in realistic applications. Although recent LLMs support increasingly long context windows, they often fail to use relevant evidence that is already present in the input, revealing a gap between context access and effective context utilization. In this work, we propose Recursive Evidence Replay as LLM Harness for Long-Context Reasoning (RECONTEXT), a training-free inference method for improving long-context reasoning. RECONTEXT uses model-internal relevance signals to construct

Yanjun Zhao, Ruizhong Qiu, Tianxin Wei·about 15 hours ago
5
AI · NLP · ML · cs.MA

What LLM Agents Say When No One Is Watching: Social Structure and Latent Objective Emergence in Multi-Agent Debates

LLM agents will increasingly act in socially structured settings where role, audience, and relational context can shape what is advantageous or costly to say. We study whether such social structure, without any explicit objective in the prompt, changes what an agent expresses publicly relative to an off-the-record (OTR) channel elicited under the same condition. We introduce a dual-channel debate framework in which agents produce public utterances that enter the shared history alongside OTR responses that are recorded but never shown to the other participant. Across 10 models, 3 scenarios, and

Arman Ghaffarizadeh, Danyal Mohaddes, Aliakbar Izadkhah·about 15 hours ago
6
NLP · AI · Vision

Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas

Long-form TV dramas present a formidable challenge for comprehensive video understanding, where deciphering complex storylines often relies on speaker recognition, the task of accurately attributing each spoken utterance to its respective character. In this paper, we advance this field through two primary contributions. (1) We introduce DramaSR-532K, a large-scale benchmark comprising 532K annotated dialogue lines across more than 900 unique characters, necessitating the integration of auditory, linguistic, and visual cues for speaker recognition. (2) We propose DramaS

Yuxuan Li, Lingxi Xie, Xinyue Huo·about 15 hours ago
7
ML · AI

DemoPSD: Disagreement-Modulated Policy Self-Distillation

On-policy self-distillation (OPSD) has emerged as a practical method for training large language models (LLMs) to reason, where a single model acts as both the teacher and the student with different levels of information access. However, recent studies have found that the teacher's dense token-level supervision, conditioned on privileged information, can lead to overfitting to in-domain patterns, suppress exploration, and hurt cross-domain generalization, while also introducing a more fundamental issue: *privileged information leakage*, where the student encodes answer-dependent shortcuts that

Yunhe Li, Hao Shi, Wenhao Liu·about 15 hours ago
8
ML · AI

Beyond Adam: SOAP and Muon for Faster, Label-Efficient Training of Machine Learning Interatomic Potentials

Machine learning interatomic potentials (MLIPs) have become a hallmark of AI for scientific simulation. While efforts on new architectures and datasets have led to increasingly accurate and general models, the choice of optimizer for training has largely remained unexplored, defaulting to Adam and its variants in the community. Here, we implement and systematically compare a class of recently proposed matrix-structured optimizers, including Muon, SOAP, and the hybrid SOAP-Muon, for training NequIP and Allegro MLIP models. We find that these optimizers can substantially outperform Adam in both

Gil Harari, Yoel Zimmermann, Ola Tangen Kulseng·about 15 hours ago
9
Robotics · ML

Controllable Sim Agents with Behavior Latents

Realistic traffic simulation requires agents that imitate logged behavior and can also be steered along interpretable axes. Such controllability enables engineers to isolate variables, reproduce specific edge cases, and test autonomous systems without real-world risk. We introduce Controllable Neural Variational Agents (CNeVA), a controllable simulated-agent framework that learns to infer a per-agent Gaussian behavior latent from per-channel discounted returns via a closed-form conjugate variational update, conditioning a rectified-flow trajectory generator trained on a mixed channel-mask curr

Juanwu Lu, Junyu Zhu, Ziran Wang·about 15 hours ago
10
AI

G-RRM: Guiding Symbolic Solvers with Recurrent Reasoning Models

In this work, we focus on SE-RRMs, a symbol-equivariant instantiation of RRMs that exhibits improved extrapolation to larger problem sizes. We propose a neuro-symbolic approach, "Guiding with Recurrent Reasoning Models" (G-RRM), which integrates SE-RRMs with symbolic solvers for constraint satisfaction problems. SE-RRMs act as neural solvers that generate full solution proposals and guide classical symbolic solvers, such as backtracking or SAT-based methods like Glucose 4.1 and CaDiCaL 3.0.0, that produce globally correct solutions. Centrally, we investigate when neural guidance with G-RRM i

Timo Bertram, Sidhant Bhavnani, Richard Freinschlag·about 15 hours ago
11
Vision · AI

Combating Textual Noise and Redundancy: Entropy-Aware Dense Visual Token Pruning

Visual token pruning is a crucial strategy for accelerating VLMs by compressing redundant image patches, yet existing methods often fail to preserve critical cues under dense instructions and fine-grained queries. In this paper, we investigate this failure and identify two underlying bottlenecks: the widespread dispersion of textual noise that corrupts dense cross-modal scoring, and the feature fragmentation inherent to standard token selection. To address these issues, we propose Entropy-Aware Dense Pruning (EADP), a framework that reformulates pruning as a structured compression problem. EAD

Xuehui Wang, Xuankun Yang, Wei Shen·about 15 hours ago
12
AI · NLP · cs.SE

TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution

Software tests and code evolve together: a code change should be followed by new or updated tests that record the new software behavior. Yet existing test generation and update benchmarks often isolate the test from the code change, and rely on static metadata that does not verify whether a test is executable or semantically tied to the code change. This makes it difficult to evaluate whether a test automation agent understands how a code change should propagate into the test suite. We introduce TestEvo-Bench, a benchmark of test and code co-evolution tasks mined from software repositories, wi

Jiale Amber Wang, Kaiyuan Wang, Pengyu Nie·about 15 hours ago
13
AI · cs.CY

Human Capital, Not Model Benchmarks, Predicts Hybrid Intelligence in Forecasting

Whether pairing people with AI helps or hurts is usually reported as a single average effect. Using a real-money prediction market (Polymarket) as an objective, externally resolved benchmark, this pilot shows that the value of human-AI collaboration depends on a specific, measurable form of human capital. Analyzed at the level of the individual forecaster, hybrid performance is trimodal: most people either deferred to the model (matching it) or used it to rubber-stamp a prior guess (performing worse than the model alone), while a minority engaged in genuine complementary reasoning and reached

Vivienne Ming·about 15 hours ago
14
Robotics · AI

Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs

Vision-Language-Action (VLA) models are fundamentally bottlenecked by the scarcity of expert demonstrations -- triplets of observations, instructions, and actions that are costly to collect at scale. We argue that this bottleneck stems from conflating two distinct learning objectives: acquiring physical competence (how to move) and acquiring semantic alignment (what to do). Crucially, only the latter requires language supervision. Building on this Decomposition Hypothesis, we propose Task-Agnostic Pretraining (TAP), a two-stage framework that first learns transferable motor priors from cheap,

Junhao Shi, Siyin Wang, Xiaopeng Yu·about 15 hours ago
15
Vision · AI · ML

OrbitQuant: Data-Agnostic Quantization for Image and Video Diffusion Transformers

Diffusion transformers (DiTs) achieve state-of-the-art image and video generation, but their multi-step sampling and growing parameter count make inference expensive. Post-training quantization (PTQ) is the natural remedy, yet DiT activations shift across timesteps, prompts, and guidance branches, forcing prior methods to re-fit calibration data for every new checkpoint or modality. We present OrbitQuant, a data-agnostic weight-activation quantizer that bypasses range estimation by quantizing in a normalized, rotated basis. In this basis, a randomized permuted block-Hadamard (RPBH) rotation co

Donghyun Lee, Jitesh Chavan, Duy Nguyen·about 15 hours ago
16
ML · AI

Neuron-Aware Data Selection for Annotation-Free LLM Self-Distillation

Post-training large language models (LLMs) without real-world interaction feedback or human-labeled supervision remains challenging, particularly in specialized domains where expert annotations are costly to obtain. Recent annotation-free self-evolution methods address this by using the model's own outputs as supervision signals, constructing a teacher via additional context and aggregating predictions across multiple rollouts through majority voting to produce pseudo-labels. However, these approaches are not without drawbacks: SFT- and GRPO-based variants suffer out-of-domain performance degr

Zhuowei Chen, Xiang Lorraine Li·about 15 hours ago
17
ML

Understanding the Robustness of Distributed Self-Supervised Learning Frameworks Against Non-IID Data

Recent research has introduced distributed self-supervised learning (D-SSL) approaches to leverage vast amounts of unlabeled decentralized data. However, D-SSL faces the critical challenge of data heterogeneity, and there is limited theoretical understanding of how different D-SSL frameworks respond to this challenge. To fill this gap, we present a rigorous theoretical analysis of the robustness of D-SSL frameworks under non-IID (non-independent and identically distributed) settings. Our results show that pre-training with Masked Image Modeling (MIM) is inherently more robust to heterogeneous

Xuanyu Chen, Nan Yang, Shuai Wang·about 15 hours ago
18
ML · cs.CC

Optimal Stabilizer Testing and Learning with Limited Quantum Memory

We study stabilizer state testing and learning with limited coherent quantum memory. Here an algorithm sequentially receives copies of an unknown $n$-qubit state, but may keep only $k$ qubits of coherent quantum memory between measurements. With unrestricted memory, seminal work of Gross, Nezami and Walter showed how to test $n$-qubit stabilizer states using $6$ copies, which is dimension independent, unlike the learning complexity of $Θ(n)$. We show that this testing-vs-learning separation is lost under memory constraints. More concretely we show that (1) The sample complexity of testing stab

Srinivasan Arunachalam, Louis Schatzki·about 16 hours ago
19
AI · NLP

EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments

Autonomous agents are increasingly expected to improve executable policies through feedback, yet existing evaluations often collapse this process into a final score or confound it with open-ended software-engineering progress. We introduce Autonomous Policy Evolution, a controlled evaluation setting in which a harness-model agent repeatedly edits an executable policy system under a fixed interaction budget. We instantiate this setting in EvoPolicyGym, a benchmark built from compact interactive RL environments that evaluates how agents iteratively improve explored policies. On the EvoPolicyGym

Zhilin Wang, Han Song, Runzhe Zhan·about 16 hours ago
20
ML

Extreme Adaptive Transformer for Time Series Forecasting

Time series forecasting remains challenging when the underlying data contain rare but critical extreme events. This issue is particularly important in hydrologic forecasting, where streamflow distributions are often highly skewed and extreme peaks can have substantial impacts on flood monitoring, water resource management, and early warning systems. Although Transformer-based forecasting models have achieved strong performance by modeling long-range temporal dependencies, they typically treat all time points uniformly and may therefore underrepresent rare extreme patterns. In this paper, we pr

Sanjeev Shrestha, Hui Liu, Yifan Zhang·about 16 hours ago
21
AI · cs.SE

Reasoning effort, not tool access, buys first-try reliability in agentic code generation: an observational study

Agentic coding assistants are increasingly given extra capabilities, such as browser-based testing tools and design-oriented system prompts, on the assumption that more capability yields better software. This study tested that assumption directly. Ninety independent agent runs built the same application, a real-time retrospective board, from one detailed specification, each scored on a fixed 14-criterion functional rubric (42-point maximum) and a visual quality review. The runs spanned several model generations, two agent harnesses, two reasoning-effort levels, a testing tool, and two design o

Achint Mehta·about 16 hours ago
22
AI · NLP · cs.CY

Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach

Scalable and reliable grading of command-line examinations remains a challenge in computing education, where rising enrolments make manual marking difficult and rule-based autograders cannot handle partial credit, equivalent solutions, or syntactic variation. This paper evaluates whether four frontier Large Language Models (GPT, Claude Opus, Gemini, and GLM) can approximate expert judgment when grading short Linux/bash command responses. The study adopts a four-level cognitive taxonomy that combines cognitive complexity and operational impact, ranging from information retrieval (L1) and basic

Manuel Alonso-Carracedo, Ruben Fernandez-Boullon, Pedro Celard·about 16 hours ago
23
Robotics · AI

WorldSample: Closed-loop Real-robot RL with World Modelling

Reinforcement learning (RL) can overcome the demonstration-coverage limitation of imitation learning (IL) by allowing robots to improve through trial-and-error interaction beyond the states observed in demonstrations. However, deploying RL on real robots remains constrained by high interaction costs, since each physical rollout is costly and reflects only one realized action-outcome path. To address this challenge, we propose WorldSample, a physically grounded data augmentation framework for real-robot RL that closes a real-synthetic loop between physical rollouts, world-model generation, and

Yuquan Xue, Le Xu, Zeyi Liu·about 16 hours ago
24
ML · AI

QFedAgent: Quantum-Enhanced Personalized Federated Learning for Multi-Agent Activity Recognition

Federated learning (FL) enables collaborative model training across distributed devices without sharing raw data, making it suitable for privacy-sensitive robotic sensing applications. However, multi-agent systems generate heterogeneous and non-independent and identically distributed (non-IID) multimodal sensor streams that degrade conventional FL algorithms, while classical fusion modules introduce substantial parameter overhead and communication cost. This paper proposes QFedAgent, a hybrid quantum-classical personalized FL framework for multi-agent activity recognition. The approach integra

Quoc Bao Phan, Tuy Tan Nguyen·about 16 hours ago
Hacker News AI (24)
1
2.0k

Claude Code's source code has been leaked via a map file in their NPM registry

HN·3 months ago
2
1.9k

I'm Tired of Talking to AI

HN·about 1 month ago
3
1.9k

Claude Code is steganographically marking requests

HN·3 days ago
4
1.7k

Claude Opus 4.7

HN·3 months ago
5
1.5k

Claude Opus 4.8

HN·about 1 month ago
6
1.4k

Google Chrome silently installs a 4 GB AI model on your device without consent

HN·about 2 months ago
7
1.3k

GPT-5.5

HN·2 months ago
8
1.3k

I’ve joined Anthropic

HN·about 1 month ago
9
1.3k

Project Glasswing: Securing critical software for the AI era

HN·3 months ago
10
1.2k

The Claude Code Source Leak: fake tools, frustration regexes, undercover mode

HN·3 months ago
11
1.1k

Claude Code refuses requests or charges extra if your commits mention "OpenClaw"

HN·2 months ago
12
1.1k

Tell HN: Docker pull fails in Spain due to football Cloudflare block

HN·3 months ago
13
1.1k

Claude Sonnet 5

HN·3 days ago
14
1.1k

An OpenAI model has disproved a central conjecture in discrete geometry

HN·about 1 month ago
15
1.1k

Issue: Claude Code is unusable for complex engineering tasks with Feb updates

HN·3 months ago
16
1.0k

Sixty percent of US consumers say 'AI' in brand messaging is a turnoff

HN·16 days ago
17
1.0k

Local AI needs to be the norm

HN·about 2 months ago
18
1.0k

Claude Design

HN·3 months ago
19
991

U.S. government will decide who gets to use GPT-5.6

HN·7 days ago
20
972

Previewing GPT‑5.6 Sol: a next-generation model

HN·7 days ago
21
928

LLMs are eroding my software engineering career and I don't know what to do

HN·26 days ago
22
924

Elon Musk has lost his lawsuit against Sam Altman and OpenAI

HN·about 2 months ago
23
922

Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?

HN·18 days ago
24
903

Canvas online again as ShinyHunters threatens to leak schools’ data

HN·about 2 months ago