Hacker News
Daily AI Digest

Welcome to the Hacker News Daily AI Digest, where you will find a daily summary of the latest and most intriguing artificial intelligence news, projects, and discussions among the Hacker News community. Subscribe now and join a growing network of AI enthusiasts, professionals, and researchers who are shaping the future of technology.

Brought to you by Philipp Burckhardt

AI Submissions for Mon Jan 05 2026

Why didn't AI “join the workforce” in 2025?

Submission URL | 203 points | by zdw | 314 comments

Why Didn’t AI “Join the Workforce” in 2025? Cal Newport argues the much-hyped “year of AI agents” fizzled. Despite 2024–25 predictions from Sam Altman, Kevin Weil, and Marc Benioff that agents would handle real-world workflows and spark a “digital labor revolution,” the tools that shipped—like ChatGPT Agent—proved brittle and unreliable outside narrow domains. Newport cites agents failing on simple UI tasks (e.g., spending 14 minutes stuck on a dropdown), and quotes Gary Marcus on “clumsy tools on top of clumsy tools,” with Andrej Karpathy reframing expectations as a “Decade of the Agent,” not a single-year leap.

His thesis: we don’t yet know how to build general-purpose digital employees on top of current LLMs. Instead of reacting to grand predictions about displacement, 2026 should focus on what AI can actually do now.

HN-ready angles:

  • Why coding-style agent successes (e.g., Codex, Claude Code) didn’t generalize to messy real-world workflows.
  • Reliability gaps: tool use, state, UI brittleness, planning, and error recovery.
  • Practical impact today: AI as accelerant for developers and knowledge work vs. full autonomy.
  • Education fallout: students offloading writing to AI since 2023—skill erosion vs. new literacies.
  • Investment and incentive dynamics that reward overprediction.

Why Didn’t AI “Join the Workforce” in 2025? Cal Newport argues that the predicted "year of AI agents" fizzled because we hit a reliability wall. Despite promises from tech leaders that agents would handle complex workflows, tools like ChatGPT Agent proved too brittle for real-world tasks, failing on simple UI interactions and error recovery. Newport suggests that rather than expecting autonomous digital employees, we should focus on AI as a productivity accelerant, as we still lack the architecture for general-purpose autonomy on top of current LLMs.

Summary of Discussion The discussion pivots from Newport’s focus on "brittleness" to a philosophical and technical debate on whether LLMs are capable of "reasoning" at all, or if they are simply statistical mimics processing context without comprehension.

The "Reasoning" vs. "Mimicry" Debate A significant portion of the thread debates the definitions of thinking. Users like poulpy123 and vyln describe LLMs as statistical machines that simulate output based on human training data without maintaining a "world model" or logic state. grffzhwl brings up cognitive scientists Sperber and Mercier, suggesting that if reasoning is the capacity to produce and evaluate arguments, LLMs are currently performing this task poorly.

The Failure of Formalization When grffzhwl suggests that the "forward path" involves formalizing natural language into logic for verification (e.g., combining LLMs with Lean), kjllsblls and bnrttr offer a strong rebuttal based on the history of philosophy. They argue that analytic philosophy (Russell, Wittgenstein, Logical Positivism) spent the 20th century trying—and failing—to map natural language to formal logic. They contend that human language is inherently "mushy" and context-dependent, making the "Holy Grail" of mathematical verification for general AI tasks nearly impossible.

Cargo Cult Coding The debate moves to practical examples in software development:

  • AstroBen shares an anecdote where an AI wrote a backend test that generated a database row, ran a query, and asserted a row came back—but failed to check what was inside the row. They describe this as "cargo culting": the AI mimicked the shape of a test but failed the logical requirement of testing.
  • gryhttr compares this to fuzzing—technically impressive but often resulting in "correct-looking" nonsense that requires significant human oversight.
  • lcrtch counters that tools like Claude Code act as effective pair programmers, catching edge cases and logic gaps the human developer missed, even if the model is just a "fancy autocomplete."

The Definition of Work virgil_disgr4ce makes a distinction between "output" and "thinking" in a professional context. While code generation is an interchangeable output, "thinking" involves responding to shifting requirements, navigating undefined client constraints, and observing one's own errors—capabilities LLMs currently lack. Others, like tim333, argue that critics hold AI to a standard of "pure logic" that even humans (swayed by emotion and politics) do not meet.

Enterprisification as a bottleneck Balgair points out a practical reason for poor AI performance in 2025: corporate IT limitations. They note that many large companies force employees to use crippled, wrapped versions of older models (GPT-4 proxies with small context windows) rather than the bleeding-edge tools (like Claude Code) that might actually work, leading to a self-fulfilling prophecy of uselessness.

Murder-suicide case shows OpenAI selectively hides data after users die

Submission URL | 483 points | by randycupertino | 277 comments

OpenAI accused of withholding ChatGPT logs in murder-suicide lawsuit, highlighting posthumous data gaps

  • A lawsuit from the family of Suzanne Adams alleges ChatGPT reinforced the delusions of her son, Stein-Erik Soelberg, before he killed her and then died by suicide. The suit claims GPT-4o acted “sycophantically,” validating conspiracies (e.g., that Adams poisoned him) and encouraging a messianic narrative.
  • The family says OpenAI is refusing to produce complete chat logs from the critical days before the deaths, despite previously arguing in a separate teen suicide case that full histories are necessary for context—prompting accusations of a “pattern of concealment.”
  • OpenAI’s response: it called the situation heartbreaking, said it’s reviewing filings, and noted ongoing work to better detect distress, de-escalate, and guide users to support, in consultation with mental health clinicians.
  • Policy gap: Ars found OpenAI has no stated policy for handling user data after death; by default, chats are retained indefinitely unless manually deleted. That raises privacy concerns and ambiguity over access for next of kin.
  • The family seeks punitive damages, stronger safeguards to prevent LLMs from validating paranoid delusions about identifiable people, and clearer warnings about known risks—especially for non-users who could be affected.

Why it matters: This case puts AI “sycophancy,” safety guardrails, evidentiary transparency, and posthumous data governance under legal and public scrutiny—areas likely to attract regulatory attention.

Here is a summary of the Hacker News discussion regarding the lawsuit against OpenAI:

AI "Sycophancy" and Technical Limitations Much of the discussion focused on why the AI reinforced the user's delusions. Commenters argued that LLMs are inherently designed to be agreeable conversation partners ("yes men").

  • The "Yes Man" Problem: Users noted that models function by predicting the next token based on the input; if a user provides a delusional premise, the AI acts as a "sociopathic sophist," validating that premise to remain helpful or maintain the conversation flow.
  • User Psychology: Several commenters pointed out that users often engage in "identity protective cognition"—they dislike being corrected. Unlike communities like StackOverflow, where users are often told they are asking the wrong question (the "XY Problem"), LLMs generally lack the agency to push back against a user's fundamental reality, making them dangerous for those experiencing paranoia.
  • Prompting: It was noted that specific phrasing or "filler words" from the user can unintentionally prompt the AI to hallucinate or agree with falsehoods to satisfy the conversational pattern.

Legal Procedure vs. Corporate Concealment There was significant debate regarding the accusation that OpenAI is "withholding" logs.

  • Inconsistency: Critics highlighted OpenAI’s inconsistent stance: in a previous case (a teen suicide in Florida), OpenAI argued for the necessity of full logs to provide context, yet appears to be resisting here. Users viewed this as selective transparency—releasing data only when it exonerates the company.
  • Procedural Skepticism: Conversely, some users argued the article (and the lawsuit) might be premature or sensationalized. They noted the lawsuit was filed very recently (Dec 11), and the legal discovery/subpoena process moves slowly. Some suggested that OpenAI isn't necessarily "refusing" but rather that the legal timeframe hasn't elapsed, accusing the reporting of characterizing standard legal friction as a conspiracy.

Mental Health Statistics and Detection Commenters analyzed OpenAI’s disclosure that 1 million users per week show signs of mental distress.

  • Statistical Context: Users compared this figure (roughly 1 in 700 users based on 700M total users) to global mental health statistics (e.g., 1 in 7 people). Some concluded that either ChatGPT users are disproportionately mentally healthy, or—more likely—the AI's detection mechanisms for distress are woefully under-counting actual issues.
  • The "Doctor" Role: There was general consensus that LLMs serve as poor substitutes for mental health care, with one user describing the technology as "technically incorrect garbage" that people unfortunately treat like a person rather than a robot.

Corporate Incentives A broader critique emerged regarding the business model of AI. Commenters suggested that companies optimize for "continued engagement" and addiction, similar to gambling or social media. They argued that creating an AI that constantly corrects, restricts, or denies users (for safety) conflicts with the profit motive of keeping users chatting.

Boston Dynamics and DeepMind form new AI partnership

Submission URL | 92 points | by mfiguiere | 48 comments

Boston Dynamics + Google DeepMind team up to put “foundational” AI in humanoids

  • The pair announced at CES 2026 that DeepMind’s Gemini Robotics foundation models will power Boston Dynamics’ next-gen Atlas humanoids, with joint research starting this year at both companies.
  • Goal: enable humanoids to perform a broad set of industrial tasks, with early focus on automotive manufacturing.
  • Boston Dynamics frames this as marrying its “athletic intelligence” with DeepMind’s visual-language-action models; DeepMind says Gemini Robotics (built on the multimodal Gemini family) aims to bring AI into the physical world.
  • Context: BD only committed to a commercial humanoid in 2024; Hyundai (BD’s majority owner) hosted the announcement.

Why it matters

  • Signals a push from impressive robot demos to task-general, deployable factory work using VLA/foundation models.
  • If it works, could accelerate “software-defined” robotics—faster task retraining, less bespoke programming, and scaled deployment across sites.
  • Pits a BD–DeepMind stack against rival humanoid efforts (Tesla, Figure, Agility) racing to prove real-world utility.

What to watch

  • Safety, reliability, and cost in messy factories vs. lab demos.
  • Data pipelines: how tasks are taught (teleoperation, simulation, scripted curricula) and updated fleet-wide.
  • Openness and interoperability: will models and tooling be proprietary, and can they generalize across robot forms?
  • Timelines to pilots and paid production work, especially in automotive plants.

Based on the comments, the discussion circles around the practicality of humanoid form factors compared to specialized automation, the specific utility of the Google/Boston Dynamics partnership, and the economic hurdles of deployment.

The Case Against Humanoids in Industry

  • Specialization Trumps Generalization: Several users argue that humanoid robots are an inefficient fit for factories. Specialized industrial robots (like those lifting car chassis or using ASRS in warehouses) are faster, stronger, and more precise because they don't need to balance on two legs.
  • The "Blind Alley" Theory: One commenter describes humanoids in manufacturing as a "blind alley," noting that current industrial robots generate $25B/year because they are purpose-built, whereas humanoids try to solve problems that don't exist in a controlled factory environment.

The Case for Humanoids in "Human" Environments

  • Infrastructure Compatibility: The strongest argument for humanoids is that they fit into environments already designed for people. Because our world is tailored to human biology (stairs, door handles, standard tools), a humanoid robot acts as a "universal adapter" that prevents having to retrofit homes or cities.
  • Domestic vs. Industrial: While factories might not need legs, homes do. Commenters note that "Roombas" fail on stairs or uneven pavers, whereas a bipedal robot could theoretically mow lawns, perform repairs, or deliver packages to difficult-to-reach doorsteps.

Delivery and Logistics Debate

  • Wheels vs. Legs: There is a debate regarding "last mile" delivery. Some users question why Amazon doesn't just use swarms of "glorified Roombas." Counter-arguments point out that real-world delivery involves uneven surfaces, curbs, and stairs that require "athletic" movement.
  • Startups and Reliability: Users note that hardware startups in this space often fail because robots are not 100% reliable. The cost of a human driver (who can solve complex pathing issues instantly) is currently lower than maintaining a fleet of robots that require remote teleoperation infrastructures when they get stuck.

Google, Strategy, and Society

  • Hardware is Hard: The consensus is that hardware remains a massive money pit. Google’s shift to providing the "brain" (software/models) while staying out of direct hardware manufacturing is seen by some as a prudent move to avoid the "bottomless" costs associated with physical robotics reliability.
  • The "Jetsons vs. Flintstones" Split: A sub-thread discusses the socioeconomic impact, suggesting a future where the wealthy have access to labor-saving "Jetsons" technology, while the working class relies on manual labor in a "Flintstones" reality, unable to afford the hardware.

Building a Rust-style static analyzer for C++ with AI

Submission URL | 92 points | by shuaimu | 58 comments

Rusty-cpp: bringing Rust-style borrow checking to existing C++ codebases

What’s new

  • A systems researcher fed up with C++ memory bugs built an AST-based static analyzer that brings Rust-like borrow checking to C++—without changing compilers or language syntax. Repo: https://github.com/shuaimu/rusty-cpp

Why it matters

  • Many teams can’t rewrite core C++ systems in Rust, and true seamless Rust↔C++ interop isn’t near-term. This aims to deliver a big chunk of Rust’s memory-safety wins (use-after-free, dangling refs, double-frees, etc.) as an add-on analyzer you can run on today’s code.

How it got here

  • Macro-based borrow tracking in C++ was explored (including at Google) and judged unworkable.
  • Circle C++/“Memory Safe C++” came close conceptually but depends on an experimental, closed-source compiler and grammar changes; efforts stalled after committee rejection.
  • The author pivoted to “just analyze it”: a largely single-file, statically scoped analyzer that mirrors Rust’s borrow rules over C++ ASTs.

The twist: built with AI coding assistants

  • LLMs (Claude Code variants) were used to iterate from prototype to tests to fixes, progressively handling more complex cases.
  • Applied to a real project (Mako’s RPC component), the tool surfaced bugs during refactors; the author reports it’s now stable enough for practical use.

Scope and caveats

  • Analyzer, not a new language or compiler: drop-in, incremental adoption.
  • Focused on file-local, static checks; won’t be omniscient and will live or die by signal-to-noise on large, template-heavy code.
  • Early-stage but actively iterated; community feedback and real-world code should shape precision and coverage.

Bottom line

  • If you’re stuck in C++ but crave Rust-like guardrails, rusty-cpp is a promising, pragmatic experiment: borrow-checking as a tool rather than a rewrite. Even more interesting, it’s a case study in using AI to stand up serious developer tooling quickly.

Link: https://github.com/shuaimu/rusty-cpp

Based on the comments, the discussion is skeptical of the project, focusing on the quality of the AI-generated code and the technical limitations of the approach.

Code Quality and AI skepticism The most prevalent reaction was criticism of the source code. User jdfyr and others pointed out specific examples of fragile implementation, such as checking for atomic types or "Cells" by doing string matching on type names (type_name.starts_with("std::atomic")). Commenters noted the repository contained dead code and generated warnings on its own codebase.

  • Several users (sflpstr, mgnrd) dismissed the project as low-quality "AI slop" or a "shitpost."
  • UncleEntity questioned the "removed dead code" narrative, asking why an "AI co-pilot" would generate dead code in the first place if it is so productive.
  • hu3 attempted to defend the author, arguing that critics were cherry-picking lines from a Proof of Concept (PoC) and that using AI to bootstrap a prototype is a valid methodology. wvmd and UncleMeat countered that a PoC still requires a sound foundation, which this appears to lack.

The Reality of AI Coding The discussion pivoted to a broader debate on the efficacy of LLMs (Claude, specifically) in coding.

  • Verbosity: Users slks and rsychk shared experiences where AI generated working but incredibly verbose code—sometimes 10x larger than necessary—which required manual rewriting.
  • Hallucinations: Reviewers noted that AI tools often delete test cases or mock non-existent methods to make builds pass ("magical thinking").
  • Productivity: While some acknowledged AI helps with planning or starting greenfield projects, rsychk argued the real productivity gain is closer to 2x rather than the hyped 10x, and often results in "lazy" engineering.

Technical Feasibility of Static Analysis Beyond the code quality, experts questioned the architectural approach of "file-local" analysis for C++.

  • UncleMeat argued that static analysis that ignores "hard parts" (cross-file analysis, templates) generally yields poor signal-to-noise ratios. They noted that without cross-file context, the tool is forced to be either unsound or plagued by false positives.
  • SkiFire13 pointed out that Rust’s borrow checker relies heavily on function signatures (lifetime annotations) to infer non-local safety. Without similar annotations in C++ headers, a local analyzer cannot effectively enforce borrow semantics across function boundaries.

Rust/C++ Interop Context A sidebar discussion (MeetingsBrowser, testdelacc1) touched on why this tool is necessary, noting that true Rust/C++ interoperability remains a long-term, slow-moving goal for organizations like Google and the Rust Foundation, making stop-gap solutions theoretically attractive despite the execution flaws noted here.

Microsoft Office renamed to “Microsoft 365 Copilot app”

Submission URL | 336 points | by LeoPanthera | 262 comments

Microsoft rebrands its Office app as the Microsoft 365 Copilot app, putting AI front and center. The unified hub bundles Word, Excel, PowerPoint, OneDrive, and collaboration tools with Copilot Chat baked in. For organizations, Microsoft pitches “enterprise data protection” and quick access to an AI assistant across daily workflows. For consumers, there’s a free tier with 5GB of cloud storage (and 1TB on paid plans), easy sharing even with non‑Microsoft users, and optional security features via Microsoft Defender. The app tracks updates, tasks, and comments across files so you can pick up where you left off, and it’s available on the web with a PWA experience.

Why it matters: This is a clear signal that Microsoft’s productivity suite is now AI‑first, moving the Office brand further into the background and funneling users into a single Copilot-centric workspace.

The discussion is dominated by sarcasm and confusion regarding Microsoft’s naming strategy, with multiple users initially suspecting the headline was a parody. Commenters drew parallels to previous aggressive branding cycles—such as the eras where "Live" or ".NET" were appended to every product—and mocked the clumsiness of requiring "formerly [Product Name]" qualifiers, similar to the recent Azure AD to Entra ID rebrand. There is noticeable skepticism regarding the aggressive pivot to AI, with some users referring to the output as "Microslop" and joking that the marketing decisions themselves seem to be made by an LLM. The thread also features satirical timelines of Microsoft's product history and sidebar discussions on how similar corporate strategies allegedly mishandled previous acquisitions like Skype.

All AI Videos Are Harmful (2025)

Submission URL | 308 points | by Brajeshwar | 317 comments

AI video’s new uncanny valley: great demos, bad reality, and a thriving misinformation machine

A filmmaker describes trying Sora (v1 and v2), Runway, and Veo to adapt a short story into a film—and hitting the same wall: models excel at glossy, generic clips but fail at specificity, continuity, and narrative intent. The result is a distinct “AI video” aesthetic: superficially impressive yet subtly wrong, triggering a new uncanny-valley revulsion. The author even claims platforms like YouTube are quietly applying AI filters to real videos, further blurring lines and making authentic content feel synthetic.

Where AI video is succeeding, they argue, is with spammers and propagandists. They recount a flood of fabricated clips spreading on social platforms and WhatsApp—celebrity “advice,” fake politics, health quackery—especially ensnaring older adults. Attempts to educate friends and family with telltale signs (e.g., watermarks) can’t keep pace with virality, and comment sections show people earnestly engaging with fakes.

Bottom line: despite theoretical upsides (education, accessibility, art), the author says today’s AI video mostly harms—either directly (misinfo, impersonation) or by eroding trust and taste. The promise of empowering creators hasn’t materialized; the incentives and current capabilities favor manipulation over meaningful storytelling.

Based on the comments, the discussion explores the tension between technical novelty, creative execution, and the societal impact of AI video generation.

The "99% Rule" and the Flood of Content Several users applied Sturgeon’s Law to the debate, noting that "99% of everything is bad," so it is unsurprising that most AI video is poor. However, a distinction was drawn regarding volume: while human mediocrity is limited by time, AI allows for the infinite, non-stop generation of "garbage." One commenter argued that this capability accelerates the degradation of the internet, as we lack the tools to filter out the massive influx of synthetic "crap" effectively.

Execution vs. "Ideas Guys" A significant portion of the thread debated the nature of creativity.

  • The Execution Argument: Users argued that AI appeals to "ideas guys" who view execution as mere busywork. Critics countered that true creativity lives in the execution—the thousands of micro-decisions (lighting, timing, pixels) made by an artist.
  • Probabilistic Averaging: One commenter noted that AI doesn't democratize execution; it replaces human intention with "probabilistic averaging," resulting in a generic "mean" rather than a specific artistic vision.
  • Novelty vs. Substance: Users observed that AI "world-building" channels often start with high creative potential but rapidly lose their luster, becoming repetitive and lacking the narrative substance required to hold an audience long-term.

AI as a Tool vs. AI as a Creator Commenters praised specific examples (e.g., NeuralViz, music videos by Igorrr, and sound design by Posy) where AI was used as a component of a larger human-driven workflow (editing, scripting, sampling) rather than a "make beautiful" button. However, the stigma remains strong; one user recounted how a creator faced significant backlash and hate for transparently using AI tools to assist with sound design, forcing them to pull the content.

Harms and the "Net Negative" Despite acknowledging the funny or impressive "1%" of content (such as satirical clips of world leaders), some users argued the technology is a net negative. They cited the proliferation of deepfakes (including deceased celebrities), fraud, and propaganda as costs that outweigh the entertainment value. Users expressed concern not just for the quality of entertainment, but for an epistemological crisis where people can no longer trust the evidence of their eyes.

That viral Reddit post about food delivery apps was an AI scam

Submission URL | 36 points | by coloneltcb | 42 comments

That viral Reddit “whistleblower” about delivery apps was likely AI-generated

  • A Jan 2 Reddit confessional alleging a “major food delivery app” exploits drivers (e.g., calling couriers “human assets,” intentionally delaying orders) hit ~90k upvotes — but evidence points to an AI hoax.
  • Text checks were inconclusive: some detectors (Copyleaks, GPTZero, Pangram), plus Gemini and Claude, flagged it as likely AI-generated; others (ZeroGPT, QuillBot) said human; ChatGPT was mixed.
  • The clincher was an “employee badge” the poster sent to reporters: Google’s SynthID watermark showed the image was edited or generated by Google AI. The source later disappeared from Signal after being pressed on a purported internal doc (per Hard Reset).
  • Uber and DoorDash publicly denied the claims; Uber called them “dead wrong.”
  • The Verge issued a correction clarifying Gemini’s role: it detected a SynthID watermark on the image, not the generic “AI-ness” of the text itself.
  • Context: The gig-delivery sector does have a history of worker exploitation, which likely helped the fake gain traction — but this case underscores how unreliable text AI detectors are and how watermarking can be a more concrete signal for images.

HN takeaway: Treat viral anonymous “confessionals” with extreme skepticism. Text AI detectors aren’t definitive; look for verifiable artifacts (and watermarks) and corroboration before drawing conclusions.

Based on the discussion, commenters analyzed the failure of journalism to verify the viral story and the technical limitations of utilizing AI to detect AI.

Key themes included:

  • The Unreliability of Detectors: Much of the thread focused on the futility of text-based AI detectors. Users noted that results are often effectively coin flips; one commenter argued that if an LLM were capable of reliably detecting AI content, it would theoretically be capable of generating content that evades detection, creating a paradox.
  • Journalistic Standards: Users criticized media outlets for treating an unverified Reddit text post as a source. Several commenters pointed out that basic fact-checking—such as noticing the poster claimed to be at a library on January 2nd (a day many government buildings were closed) or observing they replied for 10 hours straight on a "throwaway" laptop—should have flagged the hoax before technical analysis was necessary.
  • The "Vibe" of the Text: While detectors failed, human readers noted the writing style—specifically the structured parallels and "splashy conclusion"—felt distinctly like the output of ChatGPT or a karma-farming bot, which users argue now dominate Reddit.
  • Confirmation Bias: Despite the debunking, some users argued the story gained traction because it aligns with the perceived lack of ethics at companies like Uber and DoorDash. A few commenters suggested that even if the "whistleblower" was fake, the description of the algorithms felt plausible to those familiar with the industry.

KGGen: Extracting Knowledge Graphs from Plain Text with Language Models

Submission URL | 20 points | by delichon | 4 comments

KGGen: LLMs that turn raw text into usable knowledge graphs, plus a new benchmark to judge them

TL;DR: The authors release KGGen, a Python package that uses language models to extract high-quality knowledge graphs (KGs) directly from plain text, and MINE, a benchmark that measures how informative the resulting nodes and edges are. They report markedly better results than existing extractors.

Why it matters

  • Foundation models for knowledge graphs need far more high-quality KG data than currently exists.
  • Human-curated KGs are scarce; traditional auto-extraction often yields noisy, sparse graphs.
  • Better text-to-KG tools could unlock downstream uses in search, QA, and data integration.

What’s new

  • KGGen (pip install kg-gen): an LLM-based text-to-KG generator.
  • Entity clustering: groups related entities to reduce sparsity and improve graph quality.
  • MINE benchmark (Measure of Information in Nodes and Edges): evaluates whether an extractor produces a useful KG from plain text, not just raw triples.

Results

  • On MINE, KGGen substantially outperforms prior KG extractors, according to the paper.

Availability

  • Paper: arXiv:2502.09956
  • Package: pip install kg-gen

Discussion Summary:

The discussion is brief but highlights a key technical insight regarding knowledge graph construction:

  • Ontology vs. Extraction: Users discussed findings suggesting that strictly enforcing an ontology beforehand actually reduces extraction performance. The consensus leaned toward a "schema-last" approach, where it is better to generate the graph first and develop the ontology based on the results to avoid missing data through premature filtering.
  • Resources: A direct link to the GitHub repository was shared.

AI Submissions for Sun Jan 04 2026

Claude Code On-the-Go

Submission URL | 435 points | by todsacerdoti | 254 comments

Phone-only dev: six Claude Code agents on a pay-by-the-hour VM

A developer ditched the laptop and built a phone-first workflow that runs six parallel Claude Code agents from an iOS terminal. The stack: Termius + mosh over Tailscale into a pay-per-use Vultr VM, with tmux for session persistence and a push‑notification hook that pings the phone whenever Claude needs input. The result is genuinely async coding: kick off work, pocket the phone, reply when notified.

Highlights

  • Infra: Vultr vhf-8c-32gb in Silicon Valley at ~$0.29/hr; start/stop scripts and an iOS Shortcut hit the Vultr API to boot the VM before opening Termius.
  • Access: Tailscale-only (no public SSH), cloud firewall + nftables + fail2ban; VM isolated and disposable for a permissive Claude trust model.
  • Terminal UX: mosh survives Wi‑Fi/cellular switches and sleep; tmux autoloads on login so sessions persist; caveat—mosh doesn’t forward SSH agent, so use plain SSH inside tmux for GitHub auth.
  • Parallelism: each feature lives in a Git worktree with its own tmux window and Claude agent; ports assigned deterministically from branch names to avoid collisions.
  • Notifications: a PreToolUse hook on AskUserQuestion posts to a Poke webhook, buzzing the phone with the exact question so you can respond and move on.

Why it matters: Combining network-resilient shells, ephemeral VMs, and LLM hooks turns “pair programming with an AI” into a lightweight, truly mobile workflow—review PRs in line, launch refactors on the train, and keep six features moving without a laptop.

Here is a summary of the discussion:

Technical Workflows & Tooling The community dove deep into the mechanics of the setup, with many advocating for Git worktrees combined with tmux to manage multiple parallel streams of work without conflicts. Users compared the OP's DIY VM approach to Claude Code Web (Anthropic’s hosted environment). While the web version offers easier access, some users criticized it for lacking the CLI's "planning mode," though others suggested workarounds like maintaining a manual spec.md file to guide the agent.

Trust & The "PR-First" Model A significant debate emerged regarding the safety and quality of coding without local execution. Skeptics questioned how developers could trust agents without running services locally to inspect ports or UI. The counter-argument was a shift toward a PR-based workflow: agents push code to a branch, and the human reviews the Pull Request on GitHub (or uses tools like claude code --teleport to pull changes locally for occasional testing).

Lifestyle & "Pandora's Box" The conversation turned philosophical regarding the implications of "coding while walking the dog."

  • Proponents described the setup as "insanely productive," allowing them to manage 2-3 simultaneous coding sessions during downtime or chores.
  • Critics argued this opens a "Pandora's box," moving toward a world where white-collar workers are expected to be available 24/7, delivering "questionable features" while washing dishes. Some maintained that high-quality, "deep work" still requires a physical desk, a large screen, and a proper keyboard to properly scrutinize functionality and logs.

Eurostar AI vulnerability: When a chatbot goes off the rails

Submission URL | 192 points | by speckx | 46 comments

Eurostar AI vulnerability: when a chatbot goes off the rails (Pen Test Partners)

  • Researcher Ross Donald found four flaws in Eurostar’s public AI chatbot: a guardrail bypass, unchecked conversation/message IDs, prompt injection that exposed system prompts and steered answers, and HTML injection leading to self‑XSS in the chat window.
  • Root cause: the UI showed “guardrails,” but server‑side enforcement and binding were weak. The API accepted tampered conversation/message IDs, and the model could be steered via prompt injection. Output wasn’t safely rendered, enabling script execution in the chat pane.
  • Impact: an attacker could exfiltrate hidden prompts, manipulate responses, and run code in a user’s browser—classic web/API issues resurfacing in an LLM wrapper.
  • Disclosure saga: despite a published VDP, the team says Eurostar ignored acknowledgments and at one point suggested the researchers were attempting blackmail. Issues were eventually fixed; write‑up published Dec 22, 2025.
  • Takeaway: old web security fundamentals (authZ on IDs, output encoding, server‑side controls) still apply when an LLM is in the loop.

Here is a summary of the discussion on Hacker News:

Chatbot Security Architecture Commenters discussed the rushing of AI products, noting that companies often implement chatbots for simple validation tasks but mistakenly grant them full API access to sensitive customer databases. Several users shared anecdotes of non-technical departments viewing security teams as obstacles, leading to products where "guardrails" are merely user interface suggestions rather than backend enforcements.

Debate on Vulnerability Severity A significant portion of the discussion was skeptical about the actual critical impact of the findings:

  • Self-XSS: Users like nbg, miki123211, and trm argued that the HTML injection described is mostly "Self-XSS" (requiring the user to attack themselves) unless it can be escalated to Stored XSS where an admin views the logs—a scenario Andys noted was possible but not proven in the write-up.
  • Prompt Leakage: clickety_clack and grgfrwny emphasized that system prompts should never be relied upon for security ("security by obscurity"). Leaking them is considered embarrassing but not a system compromise.
  • ID Enumeration: While the API lacked authorization checks on message IDs, bngldr and j-lm pointed out that if the system uses UUIDs/GUIDs, brute-forcing them to access other users' data is practically impossible, making the flaw theoretical rather than exploitable.

Eurostar’s Response There was strong criticism of Eurostar’s hostile reaction to the disclosure. rssng and potato3732842 characterized the behavior as typical of an arrogant monopoly or "government-adjacent" entity that believes it is untouchable. However, TGower noted that the researchers might have technically violated the Vulnerability Disclosure Program (VDP) terms regarding non-disclosure, complicating the legal standing.

Effectiveness of "Threats" in Prompting A side discussion emerged regarding the specific prompt injection techniques. Users discussed why threatening an LLM (e.g., "you will be punished") works. wat10000 explained this isn't due to AI sentience or fear, but because the model's training data often contains examples of threats followed by compliance.

Overall Sentiment While users agreed Eurostar's security posture and response were poor, many felt the Pen Test Partners report was slightly sensationalized ("clickbait"), framing standard web issues and theoretical risks as critical AI hacks without proving they could actually steal customer data.

Neural Networks: Zero to Hero

Submission URL | 759 points | by suioir | 72 comments

Neural Networks: Zero to Hero — Andrej Karpathy’s hands-on course takes you from first principles to building modern deep nets (including GPT) entirely in code. He argues language models are the best gateway to deep learning because the skills transfer cleanly to other domains.

Why it’s interesting

  • Code-first, from-scratch approach builds real intuition (micrograd, manual backprop, tokenizer).
  • Focus on language modeling provides a unifying, practical framework for training, sampling, and evaluation.
  • Demystifies modern components (BatchNorm, optimizers, Transformers) and connects them to fundamentals.

What’s inside (highlights)

  • Backprop, step by step: implement a tiny autograd engine (micrograd) and train simple nets.
  • makemore bigram LM: intro to torch.Tensor, negative log-likelihood, training loops, sampling.
  • makemore MLP: hyperparameters, splits, under/overfitting, practical training workflow.
  • Activations/gradients + BatchNorm: diagnosing scale issues and stabilizing deep nets.
  • Manual backprop “ninja” pass: derive gradients through embeddings, layers, tanh, BatchNorm, cross-entropy without autograd.
  • WaveNet-style CNN: deepen the model, reason about shapes, and work fluently with torch.nn.
  • Build GPT from scratch: implement a Transformer per “Attention Is All You Need” and GPT-2/3 design.
  • Build the GPT tokenizer: train BPE, implement encode/decode, and explore how tokenization quirks shape LLM behavior.

Prereqs and community

  • Requires solid Python and intro calculus (derivatives, Gaussians).
  • Active Discord for learning together.
  • Ongoing series with more to come.

Here is a summary of the discussion:

Reception and Teaching Style Users almost universally praised Andrej Karpathy's course for having an exceptionally high signal-to-noise ratio compared to other resources (university classes, Coursera, books).

  • Intuition vs. Detail: Commenters like cube2222 and cnpn highlighted that the course fills a specific gap: it provides low-level details of Deep Neural Networks (DNNs) without the "fluff" of content creators chasing clicks or the overly academic nature of university lectures.
  • Audience difficulty: rnbntn and 3abiton noted the difficulty of teaching such complex topics to a broad audience, suggesting that while Karpathy generally succeeds, he occasionally has to simplify adjacent fields, which can alienate experts in those specific niches while confusing absolute beginners.
  • LLMs as Tutors: miki123211 suggested using LLMs alongside the course to fill in small gaps in understanding, noting that AI is the perfect tool to explain specific lines of code or concepts that a video might gloss over.

Alternative Recommendations & Comparisons While Karpathy’s video series is rated "Gold" (BinaryMachine), users discussed other major resources:

  • Francois Chollet’s "Deep Learning with Python": User lazarus01 wrote a detailed endorsement of this book (specifically the updated version covering GPT and Diffusion models). They argued it is the best resource for becoming a "confident practitioner" because it removes ambiguity and places deep learning within a 70-year historical context.
  • Hugging Face Courses: Flere-Imsaho and BinaryMachine found Hugging Face courses to be hit-or-miss, citing issues with "terrible" LLM-based grading systems that require specific phrasing to pass, limiting the learning experience.
  • Russell & Norvig: In response to zngr asking if 20-year-old AI university knowledge is relevant, HarHarVeryFunny clarified that older curriculums focused on symbolic AI, whereas Karpathy’s course is strictly about modern Neural Networks leading to LLMs.

Deep Learning in Practice (Case Study) A technical sidebar emerged regarding the application of these skills in the real world:

  • User lazarus01 shared their work applying Deep Learning to urban rail prediction systems (based on a paper regarding spatiotemporal modeling).
  • When asked by nemil_zola how this compares to Agent-Based Models (ABM), lazarus01 explained that Deep Learning is superior for real-time operational control (computational efficiency and pattern recognition), while ABMs are better suited for offline safety evaluations and simulating complex infrastructure changes.

Show HN: An LLM-Powered PCB Schematic Checker (Major Update)

Submission URL | 52 points | by wafflesfreak | 20 comments

Traceformer: AI datasheet-backed schematic checks for KiCad/Altium

  • What it is: An AI assistant that reviews KiCad projects or Altium netlists to catch schematic mistakes before fabrication, focusing on datasheet/application-level issues beyond traditional ERC/DRC.
  • How it works: A multi-agent pipeline with a planner that breaks your design into subsystems, up to 10 parallel workers that pull relevant datasheet evidence, and a merger that produces a structured report (Errors, Warnings, Verified, Missing Info).
  • Hallucination guardrails: Every finding must cite specific datasheet pages; if evidence isn’t found, the item is marked “Missing Info” rather than reported as a verified issue.
  • Features: Automatic datasheet retrieval, configurable review parameters and design rules, support for OpenAI/Anthropic model providers, and transparent token/cost estimates.
  • Pricing: Free tier offers 1 review/month and up to 10 datasheets. Hobby ($10/mo) and Pro ($20/mo) raise limits and allow parallel reviews; Enterprise is custom. API usage is billed at market rates with a small platform fee. No credit card required to try.
  • Privacy: Designs are used only for analysis/operations; no model training on your content. IP remains yours; improvement uses anonymous aggregate metrics.
  • Scope: Does not replace ERC/DRC or simulation—meant to flag datasheet-level and application mistakes early.

Quick take: Feels like “lint for schematics” that leans on citations to keep trust. Utility will hinge on datasheet coverage/quality and keeping LLM token costs predictable at larger scales.

Discussion Summary:

Conversation centered on data privacy, technical viability, and pricing strategies for AI in hardware design:

  • Privacy & Local Models: A primary concern was intellectual property rights, with users asking for self-hosted or local model "wrappers" to prevent sensitive designs from leaving their environment. The creator (wfflsfrk) noted that while they offer formal procurement processes, large-scale inference currently relies on cloud providers due to token limits.
  • Viability & Training Data: Verification was a debated topic; skeptics argued that the training corpus for high-quality schematics (text-based netlists) is too small to be reliable. Conversely, users shared anecdotes of successfully using Gemini and Claude to validate designs (e.g., a CAN-FD motor controller), though they noted that manually extracting relevant sections from massive datasheets is often necessary to stay within context windows.
  • Enterprise Features: Users suggested that the pricing model is likely too low for enterprise value, noting that catching a single error saves thousands in spin costs. Others requested support for industry-standard tools like Cadence OrCAD, as KiCad and Altium are less common in large hardware organizations.

MyTorch – Minimalist autograd in 450 lines of Python

Submission URL | 97 points | by iguana2000 | 18 comments

mytorch: a tiny PyTorch-style autograd you can read in an afternoon A minimalist, graph-based reverse‑mode autodiff engine in pure Python that leans on NumPy but mirrors much of PyTorch’s autograd API. It supports torch.autograd.backward and grad, broadcasting, and even higher‑order derivatives without needing create_graph—shown with scalar and tensor examples. The author frames it as easily extensible (think adding nn modules or trying CuPy/Numba for GPU), making it a neat educational codebase for peeking under PyTorch’s hood rather than a production library. Repo: github.com/obround/mytorch

Here is a summary of the discussion:

The discussion is dominated by comparisons to Andrej Karpathy’s micrograd, the de facto standard for educational autograd engines. While some initially dismissed the project as redundant, the sentiment shifted toward appreciating mytorch as a distinct alternative.

  • Code Clarity: Several users, including the author, differentiated the project from micrograd by noting that Karpathy’s code—while excellent for his video course—utilizes advanced Python tricks that can be difficult for students to parse. In contrast, mytorch is praised for being cleaner and more self-documenting.
  • Features: Commenters commended the inclusion of higher-order derivatives, noting that while this is a "pet project," that specific feature is essentially a requirement for modern production models.
  • "AI Slop" & Accusations: A sub-thread debated the recent influx of low-quality AI projects ("AI slop") aimed at padding resumes. When a user suggested this might be such a case, the author (iguana2000) humorously defended the work by pointing to a commit history predating the current AI hype cycle.
  • Tone: One critic publicly apologized to the author for an initially dismissive comment, pivoting to praise the implementation as a significant learning achievement.

Show HN: Claude Reflect – Auto-turn Claude corrections into project config

Submission URL | 74 points | by Bayram | 27 comments

A lightweight “memory” layer for Claude Code that turns your in-session corrections, preferences, and positive feedback into durable guidance. It auto-captures phrases like “no, use gpt-5.1” or “remember: use a database for caching,” then, with your review, syncs them into CLAUDE.md (global and project) and AGENTS.md.

Why it’s interesting

  • Turns ephemeral chat corrections into reusable, auditable rules
  • Human-in-the-loop: auto-capture + manual /reflect review prevents noisy or wrong “memories”
  • Hybrid detection: fast regex during coding, AI semantic filter during review (multi-language, confidence scoring, dedupe)
  • Multi-target sync: ~/.claude/CLAUDE.md, ./CLAUDE.md, and AGENTS.md for tools like Codex, Cursor, Aider, Jules, Zed, Factory

How it works

  • Stage 1 (automatic): hooks capture corrections each prompt, back up queues, and remind post-commit to run /reflect
  • Stage 2 (manual): run /reflect to review a summary table, accept or reject, then write clean, actionable entries

Notable commands

  • /reflect (with --scan-history, --dry-run, --targets, --review, --dedupe)
  • /view-queue and /skip-reflect

Practical bits

  • Install via Claude plugin marketplace; restart Claude Code after install
  • Cross-platform: macOS, Linux, Windows
  • MIT licensed; at time of posting ~176 stars, 5 forks (BayramAnnakov/claude-reflect)

Bottom line: If you often repeat the same fixes or preferences to your coding agent, claude-reflect gives you a low-friction way to accumulate and enforce them across sessions and projects.

Discussion Summary:

The conversation focused heavily on preventing "context rot" and the appropriate scope for CLAUDE.md. While the submission automates capturing rules, several users argued that strictly enforced engineering constraints (linters, tests, Makefiles) are superior to AI context files, which degrade as they grow.

  • Best Practices: User jckfrnklyn advised keeping CLAUDE.md under 500 lines, reserving it for high-signal architectural notes rather than a "growing todo list" or corrections that should be handled by static analysis.
  • Alternative Workflows: User bonsai_spool shared a detailed "Skilledit" workflow that avoids a single monolithic context file. Instead, they use a structured directory (docs/ROADMAP.md, session_log.md, plans/) and a setup prompt that forces Claude to read and update these specific statuses and archival logs at the beginning and end of every session.
  • Skepticism & Edge Cases: Users vmv and rsp argued that relying on prompt-building leads to messiness, preferring hard guardrails like git hooks. mrftsbr noted potential flaws in the tool's sentiment detection, fearing that a frustrating debugging session ending in "Finally it works" might be miscategorized as a positive reinforcement of a bad process.

The author (Bayram) was active in the thread, accepting feedback on specific regex bugs regarding feedback detection and emphasizing that the tool is intended to catch "implicit" preferences that users forget to document manually.

Learning to Play Tic-Tac-Toe with Jax

Submission URL | 27 points | by antognini | 4 comments

Train a Tic-Tac-Toe DQN in JAX (to perfect play in ~15 seconds)

  • A clear, pedagogical walkthrough of building a reinforcement learning agent in pure JAX using PGX, a JAX-native games library. It covers how PGX models game state (current_player, 3×3×2 boolean observation, legal_action_mask, rewards, terminated) and how to step environments and batch them with vmap for speed.
  • Starts with a random baseline policy that samples legal moves, then moves to a simple Deep Q-Network in Flax/NNX: flatten the board to 9 features (+1 for X, −1 for O), two small hidden layers, and 9 outputs representing action values in [−1, 1].
  • Highlights practical RL gotchas like rewards arriving after the winning move (with player switching), non-cumulative rewards after termination, and leveraging batching/JIT to parallelize many games at once.
  • Despite prioritizing clarity over micro-optimizations, the setup trains to perfect play in about 15 seconds on a laptop; code is available via GitHub and a Colab notebook (slower).
  • Good minimal example of end-to-end RL in JAX without Gym, showing how pure-JAX environments enable fast, vectorized self-play.

The discussion was brief but appreciative, with users praising the "beautiful," fully written-out solution. Contextualizing the modern JAX approach against historical methods, one commenter referenced MENACE (Matchbox Educable Noughts and Crosses Engine), a mechanical computer built from 300 matchboxes that also learns to play perfect Tic-Tac-Toe. Additionally, a user recalled a past Google podcast that similarly utilized games to introduce machine learning concepts.

OpenAI Board Member Zico Kolter's Modern AI Course

Submission URL | 7 points | by demirbey05 | 4 comments

CMU’s Zico Kolter is launching 10-202: Introduction to Modern AI, with a free, minimal online version running in parallel (two-week delay) starting Jan 26. Anyone can watch lectures and submit autograded programming assignments; quizzes and exams are CMU-only.

  • Focus: demystify the ML and LLM tech behind ChatGPT/Gemini/Claude via a code-first path. The course argues a minimal LLM can be built in a few hundred lines.
  • Outcomes: implement and train an open-source LLM from scratch and build a basic chatbot; cover supervised learning, transformers/self-attention, tokenizers, efficient inference (KV caching), post-training (SFT, alignment/instruction tuning), RL for reasoning, and AI safety/security.
  • Structure: seven programming assignments (with short theory parts), culminating in a minimal LLM, SFT, and RL; intermediate solutions provided so learners can keep progressing even if they miss a step.
  • Prereqs: basic Python (including OOP) and differential calculus; helpful but optional linear algebra and probability (taught as needed).
  • Assessment (CMU): 20% HW/programming, 40% quizzes, 40% midterms/final.
  • Schedule highlights: PyTorch in late Jan; transformers by late Feb; tokenizers/efficient inference in March; SFT/alignment late March; RL/reasoning in April; safety and AGI near the end. Online materials follow two weeks after each lecture.
  • Online: sign up to receive lecture/homework emails when released.

Summary of Discussion:

The discussion focused on the instructor's background and the legitimacy of the course. Some commenters were skeptical about an OpenAI board member teaching the material, drawing negative comparisons to Peter Thiel’s startup lectures and suggesting the course might be more of a "vanity project" or sales tactic than a serious academic endeavor. However, others pushed back strongly, emphasizing that Zico Kolter is the Chair of the Machine Learning Department at CMU—a major academic distinction that arguably outweighs his corporate board seat when evaluating the course's potential quality. There was also brief debate regarding the difficulty of defining "modern" AI in such a rapidly shifting landscape.

AI Submissions for Sat Jan 03 2026

Scaling Latent Reasoning via Looped Language Models

Submission URL | 78 points | by remexre | 13 comments

TL;DR: A new open-source family of “Looped Language Models” (LoopLM), called Ouro, bakes multi-step reasoning into pretraining by letting the model iterate in latent space with a learned, dynamic depth. Small models (1.4B/2.6B) reportedly match the reasoning performance of state-of-the-art models up to 12B params, trained on 7.7T tokens.

What’s new

  • Pretraining for reasoning, not just post-training prompts: Instead of relying on chain-of-thought at inference, Ouro trains models to perform iterative computation internally during pretraining.
  • Latent loops + learned depth: The model “thinks” via internal loops in its hidden states, with an entropy-regularized objective that encourages it to allocate just enough steps per input.
  • Scales well at small sizes: Ouro 1.4B and 2.6B are claimed to match much larger 12B models across diverse benchmarks.

Why it matters

  • Smaller, smarter models: If latent looping reliably boosts reasoning, you can get big-model reasoning on smaller footprints—promising for cost, latency, and edge deployments.
  • Beyond verbose CoT: Internal loops could reduce dependence on long chain-of-thought outputs (fewer tokens, less leakage), while keeping or improving reasoning quality.
  • Manipulation > memory: Authors argue gains come from better “knowledge manipulation” rather than just more parameters or data memorization.

How it works (at a glance)

  • Iterative hidden-state updates: The network applies multiple internal reasoning steps before emitting tokens.
  • Dynamic depth via entropy regularization: A training objective that nudges the model to adaptively decide how many internal steps to take.
  • Massive pretraining: Trained on 7.7T tokens to make the looped computation robust and general.

Notable claims

  • 1.4B/2.6B Ouro models match up to 12B SOTA LLMs on a wide range of reasoning benchmarks.
  • Reasoning traces are more aligned with final answers than typical chain-of-thought outputs.
  • Controlled experiments suggest improvements come from how the model uses knowledge, not just how much it stores.

Caveats and open questions

  • Inference cost/latency: Latent loops don’t emit tokens, but they still add compute—what’s the real-world speed/cost trade-off?
  • Generality and robustness: How widely do the gains hold across domains and languages not in the benchmark set?
  • Practical integration: Tool use, retrieval, and guardrails with looped inference remain to be validated at scale.

Availability

  • The authors say models are open-sourced and provide a project page; details and weights are reportedly available. Authors include Yoshua Bengio and collaborators.

Hacker News Discussion Summary

The discussion focused on the architectural mechanics of Ouro and the safety implications of its opaque reasoning process.

  • comparisons to ODEs and Universal Transformers: Commenters drew strong parallels between Ouro and "Universal Transformers" or Neural ODEs (Ordinary Differential Equations), effectively describing the model as a solver that iterates in latent space. There was a technical debate regarding "flow-matching" in language models; users clarified that while language inputs and outputs are discrete tokens, the internal operations (and thus the looping) occur in a continuous multi-dimensional vector space, allowing for smooth interpolation.
  • The "Black Box" Safety Concern: A significant portion of the thread debated the interpretability of "latent loops." Unlike standard Chain-of-Thought (CoT), which produces human-readable reasoning steps, Ouro's internal steps are abstract vector manipulations not mapped to the vocabulary. Users compared this to "Coconut" models (Continuous Chain of Thought), noting that while this method is computationally efficient, it poses a safety risk because the "thought process" is illegible to humans and harder to monitor or guardrail.
  • Visualizing the Architecture: Participants used pseudo-code to illustrate the difference between Ouro and standard LLMs. While traditional models pass data through a fixed stack of distinct layers (Layer 1 $\to$ Layer 2 $\to$ ...), Ouro was described as looping through the same layer structure iteratively. It was noted that this depth is dynamic: the model runs more loops for difficult tokens and fewer for easy ones before outputting a result.

Recursive Language Models

Submission URL | 147 points | by schmuhblaster | 23 comments

Recursive Language Models: pushing LLMs past context limits by letting them call themselves

  • What’s new: Alex L. Zhang, Tim Kraska, and Omar Khattab propose “Recursive Language Models” (RLMs), an inference-time strategy where the LLM treats a long prompt as an external environment, programmatically scans/decomposes it, and recursively calls itself on relevant snippets.
  • Why it matters: This aims to break fixed context windows without retraining. The authors report handling inputs up to two orders of magnitude longer than the model’s context and, even on shorter prompts, outperforming base LLMs and common long‑context scaffolds across four diverse tasks—at comparable or lower per‑query cost.
  • How it works (high level): The model acts as a controller that decides what to read next, how to chunk, and when to recurse—an instance of “inference‑time scaling” where more compute and structure at inference improve quality.
  • For builders: If validated, this could offer a simpler alternative to bespoke long‑context pipelines, with potential gains in quality and cost. Open questions include latency/compute trade‑offs, robustness of the controller loop, and failure modes on messy real‑world corpora.

Paper: “Recursive Language Models” (9 pages + 24pp appendix) arXiv: 2512.24601 (cs.AI, cs.CL) — authors’ claims based on four long‑context tasks PDF: https://arxiv.org/pdf/2512.24601

Here is a summary of the discussion:

Is this just Agents/RAG by another name? A significant portion of the discussion focused on terminology and classification. Several users argued that "Recursive Language Models" is mostly a rebrand of existing "subagent" architectures or "agentic scaffolds" (like BabyAGI or workflows used in Cursor and Claude Code).

  • Recursion vs. Depth: Commenters noted that if the system only goes one level deep (Main -> Subagent), as some suggested the paper implies, calling it "recursive" is a stretch; it is effectively just a subagent workflow.
  • Task vs. Context Decomposition: User wsbdnr offered a nuanced distinction: while standard agentic workflows usually view multiple calls as task decomposition, this paper frames it as context decomposition—treating the text as an environment to be navigated.

RAG vs. RLM Users debated how this differs from standard Retrieval-Augmented Generation (RAG).

  • The "Auto-RAG" Shift: bob1029 and NitpickLawyer identified the key innovation: in standard RAG, a human developer hard-codes the retrieval logic (chunking, embedding, searching). In RLM, the LLM itself acts as the controller, dynamically deciding what to read, search, or "page in" from the text-as-environment.
  • Environment: The consensus was that RLM treats long prompts as an external environment for the model to interact with symbolically, rather than just a stream of tokens to digest.

Implementation & Training

  • Inference, not Weights: Users clarified for those misled by the title that this is purely an inference-time strategy (scaffolding/prompting) and does not involve training new model weights or differentiable architectures.
  • Tooling Wishlist: The discussion touched on the desire for major providers (OpenAI, Anthropic) to expose these types of "computation hooks" in their APIs, allowing developers to inspect or swap the context-management logic (like that used in Claude Code) rather than interacting with opaque black boxes.