Hacker News
Daily AI Digest

Welcome to the Hacker News Daily AI Digest, where you will find a daily summary of the latest and most intriguing artificial intelligence news, projects, and discussions among the Hacker News community. Subscribe now and join a growing network of AI enthusiasts, professionals, and researchers who are shaping the future of technology.

Brought to you by Philipp Burckhardt

AI Submissions for Fri Dec 05 2025

Gemini 3 Pro: the frontier of vision AI

Submission URL | 506 points | by xnx | 265 comments

Gemini 3 Pro: Google’s new multimodal model pushes hard on visual and spatial reasoning

What’s new

  • Google DeepMind claims state-of-the-art results across vision-heavy tasks: document, spatial, screen, and video understanding, topping benchmarks like MMMU Pro and Video MMMU.
  • Big focus on “derendering”: turning images of messy, real-world documents into structured code (HTML/LaTeX/Markdown). Demos include reconstructing 18th‑century handwritten tables and equations from photos, and turning Florence Nightingale’s polar chart into an interactive graphic.
  • Document reasoning: The model navigates long reports, cross-references figures/tables, and ties numbers to causal text. It reportedly beats the human baseline on the CharXiv Reasoning benchmark (80.5%), with an example analyzing Gini index changes and policy impacts in a 62-page Census report.
  • Spatial understanding: Outputs pixel-precise coordinates to “point” in images; supports open‑vocabulary references (e.g., “point to the screw”) for robotics/AR planning and manipulation (a usage sketch follows this list).
  • Screen understanding: Parses desktop/mobile UIs with high-precision clicking—pitched for reliable “computer use” agents, QA, onboarding, and UX analytics.
  • Video: Higher frame-rate comprehension (e.g., 10 FPS) to catch fast actions like golf swings and weight shifts.
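
For the spatial bullet above, here is a minimal Python sketch of what a pointing query could look like via the google-genai SDK. The model id, prompt wording, and the JSON response format (coordinates normalized to 0–1000) are assumptions for illustration, not documented guarantees.

    # Minimal sketch: asking the model to "point" at an object in an image.
    # Assumptions: google-genai SDK installed, API key set in the environment,
    # and the model answers with normalized [y, x] points as JSON.
    from google import genai
    from PIL import Image

    client = genai.Client()              # picks up the API key from the environment
    image = Image.open("workbench.jpg")  # any local photo

    response = client.models.generate_content(
        model="gemini-3-pro-preview",    # hypothetical model id
        contents=[
            image,
            'Point to the screw. Reply as JSON: [{"label": str, "point": [y, x]}] '
            "with coordinates normalized to 0-1000.",
        ],
    )
    print(response.text)                 # e.g. [{"label": "screw", "point": [412, 763]}]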

Why it matters

  • If the claims hold, this closes gaps between perception and reasoning across messy real-world inputs—key for automation in back-office document workflows, UI agents, robotics, and sports/industry video analysis.

Caveats

  • These are vendor-reported benchmarks and demos; independent evaluations and real-world reliability (latency, cost, privacy) will be crucial.
  • Developers can try it via Google AI Studio and docs, but details on pricing, rate limits, and on-device/enterprise deployment weren’t included here.

Here is a summary of the discussion:

The "Five-Legged Dog" Stress Test The majority of the discussion focuses on a specific stress test: showing the model a picture of a dog with five legs. Users report that despite the model’s claimed visual precision, it struggles to override its training priors (that dogs have four legs).

  • Cognitive Dissonance: When asked to count legs, Gemini and other models often hallucinate explanations for the fifth limb (e.g., calling it a tail, an optical illusion, or claiming the dog is an amputee) to fit the "4-leg" model.
  • Implicit vs. Explicit: User vndrb noted that while the model fails at counting the legs, it succeeds at editing tasks. When asked to "place sneakers on the legs," the model correctly placed five sneakers, suggesting that the visual encoder sees the data but the reasoning layer suppresses it.
  • Generative Struggles: Users noted similar failures when asking models to generate out-of-distribution concepts, such as a "13-hour clock." The models consistently revert to standard 12-hour faces or hallucinate workarounds (like adding a plaque that says "13") rather than altering the fundamental structure.

The Role of RLHF Commenters speculate that Reinforcement Learning from Human Feedback (RLHF) is the culprit. The consensus is that models are heavily penalized during training for deviating from "normal" reality. Consequently, the models prioritize statistical probability (dogs usually have four legs) over the immediate visual evidence, leading to "stubborn" behavior where the model refuses to acknowledge anomalies.

NeurIPS 2025 Best Paper Awards

Submission URL | 170 points | by ivansavz | 28 comments

NeurIPS 2025 named seven Best Paper Award winners (four Best Papers, including one from the Datasets & Benchmarks track, and three runners-up), spanning diffusion theory, self-supervised RL, LLM attention, reasoning, online learning, neural scaling laws, and benchmarking for model diversity. Award committees were drawn from across the main program and the datasets/benchmarks track, with selections approved by the general and accessibility chairs.

Two standouts highlighted in the announcement:

  • Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)

    • Releases Infinity-Chat, a large open-ended benchmark of 26K real-world prompts plus 31,250 human annotations (25 raters per example) and a first comprehensive taxonomy of open-ended LM tasks (6 categories, 17 subcategories).
    • Empirically shows an “Artificial Hivemind” effect: strong intra-model repetition and inter-model homogeneity on open-ended generation.
    • Finds miscalibration between reward models/LM judges and diverse human preferences, underscoring the tension between alignment and pluralism and the long-term risk of creativity homogenization.
  • Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    • Systematically studies gating in softmax attention across 30 model variants, including 15B MoE and 1.7B dense models trained on 3.5T tokens.
    • A simple tweak—adding a head-specific sigmoid gate after scaled dot-product attention—consistently boosts performance, improves training stability, tolerates larger learning rates, and scales better (sketched in code after this list).
    • Points to benefits from injecting non-linearity in the attention path (and addresses attention sink issues).
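
The gating tweak is compact enough to sketch in PyTorch: a head-specific sigmoid gate applied to the output of scaled dot-product attention. This is an illustration of the idea as summarized above, not the authors' reference implementation; in particular, computing the gate from the query stream is an assumption.

    import torch
    import torch.nn.functional as F

    def gated_attention(q, k, v, gate_proj):
        # q, k, v: (batch, heads, seq, head_dim)
        out = F.scaled_dot_product_attention(q, k, v)   # standard softmax attention
        gate = torch.sigmoid(gate_proj(q))              # sigmoid gate (gate input is an assumption)
        return out * gate                               # elementwise gating of the attention output

    # Toy usage
    B, H, S, D = 2, 4, 16, 64
    q, k, v = (torch.randn(B, H, S, D) for _ in range(3))
    gate_proj = torch.nn.Linear(D, D)                   # single projection here; per-head parameters in practice
    print(gated_attention(q, k, v, gate_proj).shape)    # torch.Size([2, 4, 16, 64])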

Why it matters

  • Evaluation is moving beyond narrow benchmarks to plural, open-ended human preferences—raising flags about model homogenization and the cost of over-alignment.
  • Small architectural changes can still unlock meaningful gains at trillion-token scale.
  • The award slate balances societal impact with core theory and systems advances—signaling where ML research energy is heading.

Here is a summary of the top stories and the accompanying discussion:

NeurIPS 2025 Announces Best Paper Awards The NeurIPS 2025 committee has selected seven winners for the Best Paper Awards, highlighting a shift in machine learning research toward analyzing model homogeneity and refining architectural fundamentals. Two papers were specifically highlighted in the announcement:

  1. "Artificial Hivemind" releases Infinity-Chat, a benchmark of 26k prompts, revealing that LLMs exhibit strong "hivemind" behavior—repetitive internal outputs and high homogeneity across different models—suggesting a long-term risk of creativity loss due to misalignment between reward models and diverse human preferences.
  2. "Gated Attention for Large Language Models" introduces a simple architectural tweak—adding a sigmoid gate to softmax attention—which improves training stability and performance at the trillion-token scale. Overall, the awards signal a move beyond narrow benchmarks toward open-ended evaluation and demonstrate that small structural changes can still yield significant gains.

Summary of Hacker News Discussion The discussion thread focuses on the validity of benchmarking metrics, the theoretical underpinnings of reasoning, and the changing demographics of ML researchers:

  • RL and Reasoning Capacity: A significant debate centers on whether Reinforcement Learning (RL) truly improves a model's reasoning capabilities or merely limits its creativity. Users discuss the "Does Reinforcement Learning Incentivize Reasoning...?" paper, arguing over the validity of "pass@k" metrics. Skeptics argue that RL simply "sharpens" the probability distribution toward answers already present in the base model (which acts as a broader, more creative generator), while proponents argue that pass@k is a valid proxy for skill, distinguishing actual correctness from the theoretical possibilities of a "random number generator."
  • The "Hivemind" Effect: Users experimented with the "Artificial Hivemind" paper's findings by prompting models (like Gemini) to write metaphors about time. Commenters noted that while models produced varying imagery (cliffs, hammers), the underlying semantic themes usually reverted to the dominant "river/flow" cluster, validating the paper's claims about model homogeneity.
  • Physicists in ML: Commenters noticed several physicists among the award winners. This sparked a conversation about the high transferability of physics skills (linear algebra, eigenvectors, SVD) to Machine Learning, with some suggesting physicists are better equipped for the math-heavy interaction of ML than standard software engineers.
  • Consumption and "Slop": In a discussion about how best to digest these papers (reading vs. video), the tool NotebookLM was mentioned. Opinions were split: some view AI-generated audio summaries as "environmental pollution" cluttering search results, while others argued they are actually an improvement over the low-quality "slop" videos produced by humans.
  • Architecture & Superposition: There is speculation regarding "superposition" in neural networks—specifically how differentiable networks struggle to "commit" to a single concept (e.g., green vs. purple) without the "forcing function" of discretizing tokens. Other architectural papers, such as TITANS and work by SakanaAI, were recommended as complementary reading.

Jony Ive's OpenAI Device Barred From Using 'io' Name

Submission URL | 83 points | by thm | 59 comments

Jony Ive/OpenAI barred from using “io” hardware brand after Ninth Circuit upholds TRO

  • A U.S. appeals court affirmed a temporary restraining order blocking OpenAI, Jony Ive, Sam Altman, and IO Products, Inc. from using “io” to market hardware deemed similar to AI-audio startup iyO’s planned device (source: Bloomberg Law via MacRumors).
  • The court found a likelihood of confusion between “IO” and “iyO” and flagged “reverse confusion” risk given OpenAI’s scale, citing potential irreparable harm to iyO’s brand and fundraising.
  • Backstory: Ive and Altman picked “io” in mid‑2023. In early 2025, iyO CEO Jason Rugolo sought funding from Altman for a human‑computer interface project; Altman declined, saying he was already working on something competitive. OpenAI argued its first device wouldn’t be a wearable and that Rugolo voluntarily shared details while suggesting a $200M acquisition.
  • Scope: The order doesn’t ban all “io” uses—only for hardware similar to iyO’s planned AI-audio computer. OpenAI removed “io” branding shortly after the TRO.
  • What’s next: The case returns to district court for a preliminary injunction hearing in April 2026; broader litigation could run into 2027–2028. OpenAI’s first hardware device is still expected next year, likely under a different name.

Why it matters for HN:

  • Highlights the “reverse confusion” doctrine—when a big brand risks swamping a smaller mark.
  • Naming due diligence for hardware/AI products just got a high-profile cautionary tale.
  • Signals branding and launch risks for OpenAI’s Ive-designed device even as the product timeline advances.

Based on the discussion, Hacker News users reacted with a mix of branding critique, mockery of the founders' public image, and speculation regarding the utility of the hardware itself.

Branding and Alternatives The court order barring "io" sparked immediate humor and alternative suggestions. Several users jokingly proposed "Oi" (referencing British slang and Jason Statham movies), though others noted "Oi" is already a major telecom brand in Brazil. Others referenced "JOI" (from Blade Runner) or the bygone "Yo" app. On a serious note, commenters questioned the strategy behind the original name, arguing that "io" is uncreative, difficult to search for in hardware repositories, and squanders the immense brand equity of "ChatGPT," which one user felt should have been the leading name for the device.

Critique of the Ive/Altman "Vibe" A thread developed specifically roasting the press photo of Sam Altman and Jony Ive. Users described the aesthetic as "creepy," comparing it variously to a "bad early 90s TV movie," a "cropped Giorgio Armani perfume ad," or a "pregnancy announcement," with some viewing the project as a "narcissistic dance-off."

Speculation on the Hardware Discussion shifted to what the device actually does, with significant skepticism:

  • Form Factor: Guesses ranged from a "Humane Pin v2" to smart glasses, a set-top TV box, or a dedicated smart speaker.
  • Utility: Some users expressed desire for a dedicated "ChatGPT box" to replace existing smart speakers (Alexa/Google Home), which many felt have become "detuned" or increasingly useless.
  • Necessity: Users theorized that OpenAI is forced to build hardware because Apple will never grant a third-party app the "always-on," deep-system access required for a true AI assistant on the iPhone.
  • Viability: Cynicism remained high, with comparisons to recent AI hardware flops like the Rabbit R1 and Humane Pin; one user called it likely just a "chatbot box."

The Plaintiff (iyO) A few users investigated the plaintiff, iyO, noting that their planned products resemble "audio computing" headphones or cameras, though one user complained that the startup's website was incredibly slow to load.

Wall Street races to protect itself from AI bubble

Submission URL | 70 points | by zerosizedweasle | 83 comments

Wall Street races to protect itself from the AI bubble it’s funding

  • Banks are underwriting record borrowing to build AI infrastructure while simultaneously hedging against a potential bust. Global bond issuance has topped $6.46T in 2025 as hyperscalers and utilities gear up to spend at least $5T on data centers, per JPMorgan.
  • Anxiety is visible in credit markets: the cost to insure Oracle’s debt has climbed to highs not seen since the Global Financial Crisis, and hedging activity has exploded. Oracle CDS trading hit about $8B over nine weeks through Nov 28 vs ~$350M a year earlier.
  • Lenders are heavily exposed via massive construction loans (e.g., $38B and $18B packages tied to new data centers in Texas, Wisconsin, and New Mexico) and are offloading risk with credit derivatives and portfolio deals.
  • CDS spreads have jumped across big tech. Five-year protection on $10M of Microsoft debt runs 34 bps ($34k/yr) vs ~20 bps in mid-October; Johnson & Johnson, the only other AAA in the U.S., is ~19 bps. Saba Capital says MSFT protection looks rich and is selling it; they see similar dislocations in Oracle, Meta, and Alphabet.
  • Operational risk is in the mix: a major outage that halted CME Group trading prompted Goldman Sachs to pause a $1.3B mortgage bond sale for data center operator CyrusOne—highlighting how repeated breakdowns can drive customer churn.
  • Morgan Stanley has explored “significant risk transfer” deals—using credit-linked notes and similar structures to insure 5–15% of designated loan portfolios—and private credit firms like Ares are positioning to absorb that risk.
  • Why it matters: The AI buildout may be the largest tech borrowing spree ever, but banks are laying off the downside to derivatives buyers and private credit. If returns lag or outages mount, losses won’t stay on bank balance sheets; if the buildout pays off, both the hedged lenders and the protection sellers come out ahead. As Steven Grey cautions, great tech doesn’t automatically equal profits.

Based on the discussion provided, here is a summary of the comments:

Fears of a Bailout and "Privatized Gains, Socialized Losses" The most prominent reaction to the article is cynicism regarding who will ultimately pay for a potential AI bust.

  • Users suggest that while banks are currently hedging, the US government (and by extension, the taxpayer) will "step in" to bail out AI corporations and Wall Street if the bubble bursts.
  • One commenter satirically proposes a government plan involving borrowing trillions to give children equity quotas, highlighting the absurdity of current national debt levels and the feeling that the financial system is playing "God" with economics.
  • One brief comment summed up the sentiment with the phrase "Make America Bankrupt."

The "AI Arms Race" Justification A counter-argument emerged claiming that the massive spending and borrowing are necessary matters of national defense.

  • Several users argue the US cannot afford to "sleep" while China advances. The consensus among this group is that the AI buildout is a geopolitical necessity to prevent China from becoming the sole dominant power.
  • Parallels were drawn to Cold War logic ("Mr. President, we cannot allow a mineshaft gap"), suggesting that even if the economics are a bubble, the strategic imperative overrides financial caution.

Debate on China’s Stability and Data The mention of China sparked a sub-thread about the reliability of Chinese economic data and their motivations for pursuing AI.

  • One user argued that China is betting on AI and robotics to solve its looming demographic collapse and leverage its future despite a shrinking workforce.
  • Others disputed the reliability of information regarding China, with some asking for a "single source of truth." There was a debate over whether Chinese official statistics (Five Year Plans, National Bureau of Statistics) are reliable or comparable to manipulated Soviet-era propaganda.

Macroeconomic Theory and Money Printing A significant portion of the discussion devolved into a technical debate about the nature of money and debt.

  • Users argued over the definition of "printing money" versus "issuing debt."
  • Some contended that debt functions as savings for others (e.g., China buying US Treasuries) and is distinct from printing money, while others argued that fractional reserve banking essentially allows banks to create money out of thin air, expanding the money supply and fueling inflation.
  • This thread reflected broader anxiety about the long-term sustainability of US fiscal policy, referencing recent increases in credit default swaps and huge deficit spending.

AI Submissions for Thu Dec 04 2025

How elites could shape mass preferences as AI reduces persuasion costs

Submission URL | 649 points | by 50kIters | 602 comments

TL;DR: A theory paper argues that as AI slashes the cost and boosts the precision of persuasion, political elites have incentives to strategically engineer the distribution of public preferences—often nudging societies toward polarization. With rival elites, the same tech can instead “park” opinions in harder-to-flip zones, so advances could either amplify or dampen polarization depending on the competitive environment.

What’s new

  • Treats the mass distribution of policy preferences as a controllable variable when persuasion becomes cheap and precise via AI.
  • Frames polarization not as an organic byproduct, but as an instrument of governance under majority rule.

Core model (intuition)

  • Elites choose how much to reshape opinion distributions subject to persuasion costs and the need to win majorities.
  • Lower costs (AI targeting, automation) expand feasible interventions.

Key findings

  • Single-elite setting: Optimal strategies exert a “polarization pull,” pushing opinions toward more extreme profiles; better persuasion tech accelerates this drift.
  • Two opposed elites alternating in power: Incentives emerge to create “semi-lock” regions—more cohesive, hard-to-overturn opinion clusters. Depending on parameters, tech improvements can either raise or reduce overall polarization.

Why it matters

  • Recasts polarization as a strategic choice in an AI era, suggesting a governance arms race over opinion engineering.
  • Highlights risks to democratic stability and the potential value of policy guardrails around AI-driven persuasion.

Caveats

  • Theoretical model; outcomes hinge on assumptions about costs, majority rules, and elite behavior. Real-world frictions (backlash, regulation, norms) may blunt or reshape effects.

Paper: arXiv:2512.04047 (econ.GN) — “Polarization by Design: How Elites Could Shape Mass Preferences as AI Reduces Persuasion Costs” by Nadav Kunievsky Link: https://arxiv.org/abs/2512.04047

Here is a summary of the discussion:

The Nature of Democracy and Public Opinion The conversation opens with a philosophical debate on the role of the electorate. Citing Philip Converse’s 1964 work, The Nature of Belief Systems in Mass Publics, users discuss whether the average voter actually holds coherent policy preferences or if they are merely swayed by elite grouping.

  • Corrective vs. Prescriptive: Participants debate the purpose of democracy. While some argue it creates a "corrective system" designed only to peacefully remove bad leaders (majority rule as a safety valve), others express cynicism, arguing that modern systems fail to produce quality leadership or effectively remove incompetence.
  • Educational Decay: Some attribute the malleability of public opinion to a failure in the education system, suggesting that "intellectually soft" schooling has left a vacuum that social media algorithms now fill.

Case Study: The Standardization of Opinion on Tariffs The abstract concepts of the paper are immediately applied to a granular debate about tariffs, serving as a proxy for how complex economic policies are polarized or misunderstood.

  • Intent vs. Outcome: Users distinguish between the desire for tariffs (national security, bringing back manufacturing jobs) and the mechanics (companies shifting costs downstream to consumers). Critics argue that tariffs on intermediate goods (like steel) actually hurt domestic manufacturers by raising their input costs.
  • Externalities and Ethics: A segment of the discussion defends tariffs not as economic boosters, but as tools to address externalities—specifically, penalizing foreign competitors who rely on pollution or weak labor laws (e.g., child labor) to undercut prices.
  • Corruption and Implementation: Skeptics view tariffs as vectors for corruption, noting that they encourage companies to lobby for exemptions (e.g., the Apple/Trump dynamic) rather than innovate. Others note that for tariffs to work, they require long-term credibility; otherwise, they are viewed as temporary political signaling.

We gave 5 LLMs $100K to trade stocks for 8 months

Submission URL | 320 points | by cheeseblubber | 262 comments

Here is a summary of the discussion regarding the comparison of AI models (Grok, DeepSeek, Gemini) for stock portfolio generation.

  • The Problem of Data Leakage: The primary critique of the submission is the difficulty of valid backtesting. Commenters argue that because LLMs are trained on vast amounts of internet data (news, papers, stock mechanics), they effectively "know" the future of the test dataset. Even if you hide specific stock prices, the models have ingested the general narrative of which companies succeeded (e.g., Nvidia's rise), making them look artificially prophetic.
  • Methodological Flaws in Splitting Data: Users debated how to properly train/test an AI trader. Splitting by time is flawed (as noted above). Splitting by stock (training on 90% of the market, testing on 10%) is also rejected due to autocorrelation; stocks in the same sector (like AMD and Nvidia) move together. If the model knows Nvidia went up (from training data), it can infer AMD likely did too, nullifying the "blind" test. (A toy illustration of this shared-factor correlation appears after this list.)
  • Grok’s Potential "Uncensored" Edge: A sub-thread debated whether Grok has an advantage over Gemini or ChatGPT. Proponents argued that Grok’s access to real-time X (Twitter) data and fewer "safety/political correctness" guardrails might result in less "distorted" reality processing compared to corporate-safe models. Others countered that this is likely irrelevant to market mechanics or that superior reasoning capabilities (OpenAI/Anthropic) still outweigh "edginess."
  • Market Reflexivity and Saturation: Commenters noted that institutional trading firms likely already integrate LLMs for sentiment analysis (reading news/socials). There is skepticism that a retail trader using an off-the-shelf LLM can find alpha that High Frequency Trading (HFT) firms haven't already arbitraged away. Furthermore, if enough retail traders follow AI picks, they create a "reflexive loop" where the AI influences the price rather than predicting it.
  • The "Hot Hand" Fallacy: The discussion touched on the nature of market beating. One user noted that hedge funds often beat the market for 2–4 years before reverting to zero or underperforming. This suggests that even if an AI model performs well for a short cycle, it may simply be riding a sector beta (like Tech) rather than possessing true predictive skill, echoing the "lucky 10,000" concept.
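
A tiny synthetic example of the shared-factor problem raised in the data-splitting bullet: two "stocks" that load on the same sector factor have highly correlated returns, so holding one out of training is not a blind test. The factor structure and noise levels below are made up purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    days = 1000
    sector = rng.normal(0, 0.02, days)              # shared "semiconductor" factor
    stock_a = sector + rng.normal(0, 0.01, days)    # an "NVDA"-like return series
    stock_b = sector + rng.normal(0, 0.01, days)    # an "AMD"-like return series

    corr = np.corrcoef(stock_a, stock_b)[0, 1]
    print(f"daily return correlation: {corr:.2f}")  # ~0.8 with these noise levels

    # A model trained on stock_a's history has effectively seen most of stock_b's
    # path too, so a train-on-90%-of-tickers / test-on-10% split is not blind.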

Why it matters: This discussion highlights the gap between Generative AI capabilities (writing code, summarizing text) and predictive financial modeling. It underscores that LLMs are fundamentally "historians" that have read the entire internet, making them poor candidates for forecasting chaotic systems where they cannot separate their training data from the "future" events they are supposed to predict.

CUDA-l2: Surpassing cuBLAS performance for matrix multiplication through RL

Submission URL | 122 points | by dzign | 14 comments

CUDA-L2: RL-tuned CUDA kernels that (claim to) beat cuBLAS on A100 HGEMM

  • What it is: An open-source system that combines LLMs with reinforcement learning to auto-generate and tune half-precision GEMM (HGEMM) CUDA kernels. The authors claim consistent speedups over torch.matmul, cuBLAS, and cuBLASLt (both heuristic and autotuning) across 1,000 M×N×K shapes on an A100.

  • What’s new: A release of A100-optimized HGEMM kernels covering 1,000 configurations. The repo includes benchmarking scripts and results.

  • Why it matters: cuBLAS/cuBLASLt are tough baselines; surpassing them—even for a subset of shapes/precisions—suggests automated kernel search via RL+LLMs can uncover non-trivial performance wins and could generalize to broader GPU ops.

  • Scope and caveats:

    • Hardware: Tuned for A100 (SM80). The authors say speedups on other GPUs (e.g., RTX 3090, H100) aren’t guaranteed; more architectures planned (Ada, Hopper, Blackwell).
    • Precision: Current kernels are F16×F16→F16 (16-bit accumulator); F32 accumulation variants are on the roadmap.
    • Coverage: Fixed set of 1,000 shapes; for missing sizes they suggest using the nearest larger config and padding.
    • Generality: Claims are specific to HGEMM and these shapes; real-world gains depend on your model’s matmul patterns and batch sizes.
  • How it works (at a high level): Uses CUTLASS as the foundation and RL to search kernel parameters/schedules. The repo contains precompiled kernels (e.g., SM80_16x8x16_F16F16F16F16) and tooling to compile/evaluate.

  • Getting started:

    • Requirements: PyTorch ≥ 2.6.0, CUTLASS v4.2.1 (exact), TORCH_CUDA_ARCH_LIST="8.0".
    • Env: export CUTLASS_DIR=/path/to/cutlass and TORCH_CUDA_ARCH_LIST="8.0".
    • Run: ./eval_one_file.sh --mnk M_N_K --warmup_seconds 5 --benchmark_seconds 10 --mode offline|server [--target_qps N]
    • Modes: offline (batch) or server (QPS-targeted microbenchmarking).
    • License: MIT.
  • If your shape isn’t included: Open a GitHub issue with your M/N/K, or pad to a supported size.

  • Contact: jiwei_li@deep-reinforce.com

Discussion starters:

  • How robust are the gains across real LLM inference/training pipelines vs microbenchmarks?
  • Will RL-found kernels transfer across GPU generations, or need per-arch training?
  • Could this approach extend to BF16/F32 or attention kernels, and rival Triton/TVM autotuners at scale?

The discussion centers on the novelty of the optimization techniques, the practical limits of the fixed-shape approach, and the difficulty of using RL for kernel generation.

Novelty vs. Implementation Commenters debated whether the "novel" techniques discovered by the system (Section 5 of the paper) were genuinely new or simply a reshuffling of well-known methods. Some described the output as "standard GPU Gems advice" rather than algorithmic breakthroughs. Others argued that the value lies not in new theory—standard matrix multiplication theory hasn't changed drastically in decades—but in using LLMs to navigate the massive search space for optimal implementations on specific hardware.

Practical limitations and Padding Users scrutinized the requirement to pad unsupported shapes to the nearest included configuration. One user noted that padding zeros could easily negate the performance gains, potentially making these specialized kernels slower than general-purpose libraries for specific dimensions. However, others defended "code specialization" as a cheap way to gain performance percentages on critical operations where standard libraries are too generalized.
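
The trade-off is easy to see in a sketch: using a fixed-shape kernel for an unsupported size means zero-padding the operands up to the nearest supported M×N×K and slicing the result back, and the padded FLOPs are pure overhead. The rough PyTorch illustration below is a generic version of the pattern, not the repo's actual kernel API, and the "supported" target shape in the example is hypothetical.

    import torch

    def matmul_padded(a, b, sm, sn, sk):
        # a: (M, K), b: (K, N); pad up to a supported (sm, sn, sk) shape, then slice back.
        M, K = a.shape
        _, N = b.shape
        a_pad = torch.zeros(sm, sk, dtype=a.dtype, device=a.device)
        b_pad = torch.zeros(sk, sn, dtype=b.dtype, device=b.device)
        a_pad[:M, :K] = a
        b_pad[:K, :N] = b
        c_pad = a_pad @ b_pad        # stand-in for the fixed-shape specialized kernel
        # Padded work (sm*sn*sk) vs useful work (M*N*K) is pure overhead.
        return c_pad[:M, :N]

    a = torch.randn(1000, 500)       # real kernels use fp16; fp32 here for portability
    b = torch.randn(500, 3000)
    c = matmul_padded(a, b, 1024, 3072, 512)
    print(c.shape)                   # torch.Size([1000, 3000])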

RL Challenges and Benchmarking The difficulty of applying RL to CUDA was highlighted by a user with similar experience; they noted that while generating valid code is easy, getting a model to "escape its distribution" to invent truly novel instruction sequences (like complex double-buffering patterns) remains very hard. Regarding the benchmarks, there was confusion over the visualization—readers found the "speedup percentage" charts (where 0% implies parity with cuBLAS) less intuitive than raw performance numbers. There was also a brief dispute regarding whether the benchmarks fairly compared FP16 inputs against FP32 baselines.

State of AI: An Empirical 100T Token Study with OpenRouter

Submission URL | 196 points | by anjneymidha | 91 comments

State of AI: An Empirical 100 Trillion Token Study with OpenRouter (a16z + OpenRouter)

What’s new

  • A large-scale, usage-focused study of LLMs based on more than 100 trillion tokens routed through OpenRouter, spanning many models, tasks, regions, and time.
  • Frames December 5, 2024 (OpenAI’s o1) as an inflection point: a shift from single-pass generation to multi-step, reasoning-style inference.

Key findings

  • Open-weight surge: Meaningful real-world adoption of open-source models alongside closed APIs.
  • Roleplay is big: Creative roleplay emerges as an outsized category, rivaling the usual suspects like coding assistance.
  • Agentic inference rises: More multi-step, tool-assisted flows; models are increasingly components in larger automated systems rather than single-turn chatbots.
  • Cohort durability: Early “foundational” cohorts retain far longer than later cohorts—the “Cinderella Glass Slipper” effect.
  • Sensitivity to market dynamics: Pricing and new model launches materially shift usage patterns.

Why it matters

  • Product direction: Don’t assume productivity-only use; roleplay and coding remain major drivers. Build for multi-step/agent workflows, not just single responses.
  • Model strategy: Open weights are competitive in real usage; pricing and reliability for tool use and long chains matter as much as raw benchmarks.
  • Infra implications: Orchestration across diverse models is now a norm; latency, cost controls, and agent-friendly features are key differentiators.

Caveats

  • Single platform lens: Data comes from OpenRouter, which may skew toward developers using a router and experimenting across models.
  • Affiliations: Authors include a16z and OpenRouter; interpret comparative claims with that context.
  • Privacy/aggregation details aren’t in the excerpt; methodology quality will matter for any task labeling and geographic breakdowns.

Bottom line

  • Real-world LLM use is more diverse and agentic than many narratives suggest, with open-source models gaining share, roleplay unexpectedly dominant, and early users sticking around far longer. If you’re building with LLMs, optimize for multi-step workflows, cost-aware routing, and the categories people actually spend time on.

Discussion Summary Hacker News users analyzed the report with a skeptical eye, focusing primarily on selection bias inherent to the OpenRouter platform and privacy concerns regarding the data methodology.

  • Platform Selection Bias: Commenters argued that OpenRouter’s data does not represent the broader market. They suggested the platform attracts specific niches—indie hackers, developers, and roleplay enthusiasts—while excluding large enterprise sectors (fintech, healthcare) that cannot use data aggregators due to security compliance.
  • The "Roleplay" Anomaly: The dominance of roleplay (nearly 60% of open-source tokens) was attributed to users seeking uncensored or low-cost models for applications like SillyTavern or creative writing. Users noted that "mainstream" commercial APIs (OpenAI/Anthropic) are often too expensive or heavily moderated for these use cases, naturally funneling that specific traffic to OpenRouter and skewing the statistics.
  • Small Models Are Self-Hosted: Several users disputed the finding that small model usage is declining. They argued that 7B-parameter models are increasingly self-hosted on consumer hardware (e.g., Mac Studio, gaming GPUs) for privacy and zero marginal cost. Consequently, API aggregators are primarily used for massive "frontier" models (like o1 or Claude 3.5) that cannot run locally, creating a false signal that small model usage is dropping.
  • Privacy & Methodology: There was significant criticism regarding OpenRouter’s methodology of inspecting and classifying user prompts (sometimes via Google APIs). While some noted that users opt-in for a discount, others viewed this as a "privacy theater" deal-breaker for serious business use, reinforcing the idea that the data lacks B2B representation.
  • Geographic Skews: The high usage ranking for Singapore was widely interpreted as a proxy for Chinese users and companies utilizing VPNs and Singaporean billing entities to bypass blocking by major US AI labs.

Microsoft drops AI sales targets in half after salespeople miss their quotas

Submission URL | 418 points | by OptionOfT | 326 comments

Microsoft reportedly cut AI agent sales targets after widespread quota misses

  • The Information reports Microsoft lowered growth targets for its AI “agent” products after many Azure sales teams missed quotas in the fiscal year ending June. One US unit set a 50% growth target for Azure AI Foundry and saw fewer than 20% of reps hit it; targets were cut to ~25% this year. Another unit’s “double Foundry sales” goal was trimmed to 50%.
  • This follows a year of heavy “agentic” marketing—Build and Ignite showcased Word/Excel/PowerPoint agents in Microsoft 365 Copilot plus tools like Copilot Studio and Azure AI Foundry—yet enterprise appetite for premium-priced agent tools appears soft.
  • Copilot faces brand and usage headwinds versus ChatGPT. Bloomberg cited Amgen, where staff reportedly gravitated to ChatGPT, using Copilot mainly for Microsoft-specific tasks (Outlook/Teams).
  • A deeper issue: today’s agentic systems still confabulate and behave brittly on novel tasks, making fully autonomous, high‑stakes workflows risky without humans in the loop.
  • Despite slower enterprise uptake, Microsoft is still spending aggressively: $34.9B capex in the October quarter, with much AI revenue coming from AI companies renting Azure compute rather than traditional enterprises adopting agents.

Why it matters: There’s a gap between the “era of AI agents” pitch and what enterprises will pay for today. Expect more human‑supervised designs, tighter ROI proofs, pricing/bundling tweaks, and continued competition with general chat tools even inside Microsoft shops. The near‑term AI business for hyperscalers still looks more like selling picks-and-shovels (compute) than selling autonomous workers.

Productivity and Usability Frustrations Commenters describe Microsoft’s current AI implementation as clunky and intrusive, with one user likening it to a "bad autocomplete" that requires constant correction (pressing escape/backspace) and wastes time on trivialities rather than optimizing workflows. Several users criticized the "feature checklist" culture at Microsoft, arguing that the push for AI is driven by internal OKRs and promotion incentives rather than user needs, resulting in hundreds of disjointed, low-quality integrations rather than a cohesive, functional product.

Technical Competence and Hallucinations A recurring complaint is that Microsoft's purpose-built tools fail at their specific jobs.

  • Azure: Users report that Copilot inside Azure provides useless troubleshooting advice, while pasting the same error logs into generic external models (like Claude or ChatGPT) yields actual solutions.
  • Coding: Developers shared "horror stories" of AI autocomplete, including one instance where an AI suggested a DROP TABLE command mixed into SQL code. Others noted that LLM-based assistants in IDEs (Visual Studio, JetBrains) often hallucinate non-existent properties or remove valid import statements, forcing users to revert to older, heuristic-based IntelliSense for reliability.

Degradation of Assistant Utility The discussion extends beyond Microsoft to the broader industry trend (including Google's Gemini), where deterministic, functional tools are being replaced by "chatty" but unreliable LLMs. Users expressed frustration that voice assistants have lost the ability to reliably perform simple tasks (like setting timers or navigation) in favor of probabilistic models that increase cognitive load. The consensus views the current wave of enterprise AI as "en-shittification," prioritizing marketing hype over functional stability.

Show HN: RAG in 3 Lines of Python

Submission URL | 32 points | by init0 | 5 comments

Piragi (v0.3.0): a batteries‑included RAG interface with one‑line setup, local by default

What it is

  • A Python library that turns folders, code globs, and URLs into a queryable knowledge base in one line (Ragi([...]).ask("...")). It ships with a vector store, embeddings, citations, and background auto‑updates.
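
Going by the one-liner quoted above, usage presumably looks something like the sketch below; the import path, constructor arguments, and example paths are assumptions drawn from the project description rather than verified against the documentation.

    # Hypothetical usage sketch based on the project's description.
    from piragi import Ragi

    kb = Ragi(["./docs", "./src/**/*.py", "https://example.com/handbook"])
    print(kb.ask("How do we rotate API keys?"))   # answer with citations, per the description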

Why it’s interesting

  • Zero‑to‑RAG fast: Works out of the box with local models via Ollama; OpenAI‑compatible if you want hosted.
  • Always fresh: Background indexing so queries aren’t blocked by updates.
  • Built‑in citations and filters for traceable answers.
  • Pluggable storage: Local LanceDB (including S3), PostgreSQL/pgvector, or Pinecone.

Notable features

  • Formats: PDFs, Office docs, Markdown, code, URLs, images, audio.
  • Retrieval: HyDE, hybrid BM25+vector with RRF fusion (sketched after this list), and cross‑encoder reranking.
  • Chunking: fixed, semantic, contextual, and hierarchical (parent/child) strategies.
  • Retrieval‑only mode so you can bring your own LLM or framework.
  • Configurable embeddings (default all‑mpnet‑base‑v2; options from ~90MB to ~8GB) and LLM (default llama3.2 via Ollama).
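
Reciprocal Rank Fusion (RRF), mentioned in the retrieval bullet above, is simple to state: each document's fused score is the sum of 1/(k + rank) over the result lists it appears in (k is conventionally around 60). The generic sketch below illustrates the formula; it is not Piragi's internal code.

    def rrf_fuse(result_lists, k=60):
        """Fuse ranked lists of doc ids via Reciprocal Rank Fusion: score = sum 1/(k + rank)."""
        scores = {}
        for results in result_lists:
            for rank, doc_id in enumerate(results, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    bm25_hits = ["doc3", "doc1", "doc7"]       # keyword (BM25) ranking
    vector_hits = ["doc1", "doc9", "doc3"]     # embedding ranking
    print(rrf_fuse([bm25_hits, vector_hits]))  # ['doc1', 'doc3', 'doc9', 'doc7']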

Who it’s for

  • Developers who want a simple, local‑first RAG stack with citations and sensible defaults, and teams prototyping doc/code QA without assembling multiple tools.

Caveats and questions

  • Development Status: Alpha (PyPI classifier).
  • Performance/quality will hinge on chosen models and chunking; large embedding models have hefty RAM/VRAM footprints.
  • No claims here about multi‑tenant/enterprise features or security posture.

Meta

  • License: MIT; Python 3.9+; author listed as Hemanth HM; PyPI “verified details” flag shown. Released Dec 4, 2025.

Discussion around Piragi was generally positive, highlighting successful testing and specific integration questions.

Key themes included:

  • Documentation & Clarity: One user critiqued the absence of a definition for "RAG" on the project page, noting that defining acronyms makes the tool friendlier and aids discoverability.
  • User Experience: Feedback was complimentary regarding the developer experience, with users praising the "great documentation" and confirming the library worked "brilliantly" during initial testing.
  • Feature Requests: Commenters asked about specific capabilities, including future support for Graph/RDF and compatibility with AWS Bedrock.

AI Submissions for Wed Dec 03 2025

Submission URL | 761 points | by bearsyankees | 266 comments

Filevine bug exposed full admin access to a law firm’s Box drive via an unauthenticated API; fixed after disclosure

A security researcher probing AI legal-tech platform Filevine found that a client-branded subdomain with a stuck loading screen leaked clues in its minified frontend JavaScript. Those pointed to an unauthenticated “recommend” endpoint on an AWS API Gateway. Hitting it returned a Box access token and folder list—no auth required. The token was a fully scoped admin credential for the firm’s entire Box instance, implying potential access to millions of highly sensitive legal documents. After a minimal impact check, the researcher stopped and disclosed.

Timeline: discovered Oct 27, 2025 → acknowledged Nov 4 → fix confirmed Nov 21 → writeup published Dec 3. The researcher says Filevine was responsive and professional. The affected subdomain referenced “margolis,” but the firm clarifies it was not Margolis PLLC.

Why it matters:

  • Returning cloud provider tokens to the browser and leaving endpoints unauthenticated is catastrophic in legal contexts (HIPAA, court orders, client privilege).
  • AI vendors handling privileged data must enforce strict auth on every API, use least-privilege/scoped tokens, segregate tenants, and avoid exposing credentials client-side (a generic server-side pattern is sketched after this list).
  • Law firms should rigorously vet AI tools’ security posture before adoption.
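
One common way to avoid the failure mode described here is to never hand the storage provider's token to the browser at all: the backend holds the credential, authenticates every caller, and proxies a narrowly scoped request. The Flask-style sketch below is illustrative only, with a hypothetical endpoint, session store, and storage URL; it is not Filevine's or Box's actual API.

    # Illustrative pattern: the privileged storage token stays server-side, every
    # request is authenticated, and only the minimum data is returned, never the credential.
    import os
    import requests
    from flask import Flask, abort, jsonify, request

    app = Flask(__name__)
    STORAGE_TOKEN = os.environ.get("STORAGE_ADMIN_TOKEN", "")   # never shipped to the browser
    SESSIONS = {"session-abc": {"tenant_folder_id": "12345"}}   # stand-in for a real session store

    @app.route("/api/recommendations")
    def recommendations():
        user = SESSIONS.get(request.headers.get("X-Session-Id", ""))
        if user is None:                       # reject unauthenticated callers
            abort(401)
        resp = requests.get(                   # server-to-server call, scoped to this tenant's folder
            f"https://storage.example.com/folders/{user['tenant_folder_id']}/items",
            headers={"Authorization": f"Bearer {STORAGE_TOKEN}"},
            timeout=10,
        )
        return jsonify(resp.json())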

HN discussion is active.

Based on the comments, the discussion centers on the severity of the oversight, the viability of software regulations, and a debate on whether AI ("vibe coding") will solve or exacerbate these types of security failures.

Human Impact and Severity The top thread emphasizes the catastrophic real-world consequences of such a breach. Users construct hypothetical scenarios—such as a single mother in a custody battle being blackmailed with leaked documents—to illustrate that this is not just a technical failing but a human safety issue. Comparisons are drawn to the Vastaamo data breach in Finland (where psychotherapy notes were used for extortion), with users noting that the use of unverified, unencrypted ("http-only") endpoints makes data trivial to intercept.

Regulation vs. Market Correction A debate emerges regarding the "Industrialization" of code quality:

  • The "Building Inspector" Argument: The root commenter argues that software handling sensitive data needs mandatory "building codes" and inspections, similar to physical construction, arguing that safety and privacy shouldn't be optional features.
  • The Counter-Argument: Skeptics argue that software has too many degrees of freedom compared to physical buildings for rigid codes to work. They suggest that the private market—specifically professional liability insurers and the threat of lawsuits—is better equipped to enforce security standards than government bureaucracy.

The "Vibe Coding" / AI Debate A significant portion of the discussion deviates into whether Generative AI coding is to blame or is the solution:

  • Crucial Context Missing: Critics of AI coding argue that Large Language Models (LLMs) lack the "context window" to understand system-wide security. While an AI can write a function, it cannot "keep the whole system in its head," leading to hallucinations regarding API security and authentication logic that human architects usually catch.
  • Human Error: Others counter that humans clearly don't need AI to make catastrophic mistakes (citing a history of open S3 buckets). Some predict that within two years, AI coding systems will likely be more secure than the bottom 90% of human developers, characterizing human devs as having "short-term memory" limitations similar to LLMs.

Everyone in Seattle hates AI

Submission URL | 874 points | by mips_avatar | 929 comments

Everyone in Seattle Hates AI (Dec 3, 2025)

A former Microsoft engineer building an AI map app (Wanderfugl) describes surprising hostility to AI among Seattle big‑tech engineers—rooted not in the tech itself but in culture, layoffs, and forced tooling.

Key points:

  • A lunch with a respected ex-coworker turned into broad frustration about Microsoft’s AI push, not the author’s product. Similar reactions kept repeating in Seattle, unlike in SF, Paris, Tokyo, or Bali.
  • Layoffs and mandates: a director reportedly blamed a PM’s layoff on “not using Copilot 365 effectively.” After the 2023–24 layoff wave, cross-org work was axed; the author went from shipping a major Windows 11 improvement to having no projects and quit.
  • “AI or bust” rebrand: teams that could slap an AI label became safe and prestigious; others were devalued overnight as “not AI talent.”
  • Forced adoption: Copilot for Word/PowerPoint/email/code was mandated even when worse than existing tools or competitors; teams couldn’t fix them because it was “the AI org’s turf.” Employees were expected to use them, fail to see gains, and stay quiet.
  • Protected AI teams vs. stagnating comp and harsher reviews for everyone else bred resentment. Amazon folks feel it too, just cushioned by pay.
  • Result: a self-reinforcing belief that AI is both useless and off-limits—hurting companies (less innovation), engineers (stalled careers), and local builders (reflexive hostility).
  • Contrast: Seattle has world-class talent, but SF still believes it can change the world—and sometimes does.

Anecdotal but sharp cultural critique of Big Tech’s AI mandates and morale fallout.

Here is a summary of the discussion:

Discussion: The Roots of AI Hostility—Corporate coercion, Centralization, and Quality

Commenters largely validated the submission's critique of Microsoft's internal culture while expanding the debate to include broader dissatisfaction with how AI is being integrated into the tech industry.

  • Corporate Toxicity & Forced Metrics: Several users corroborated the "toxic" enforcement of AI at Microsoft, noting that performance reviews are sometimes explicitly linked to AI tool usage. Critics argued this forces engineers to prioritize management metrics over product quality or efficiency, leading to resentment when "insane" mandates force the use of inferior tools.
  • Centralization vs. Open Source: A major thread debated the "centralization of power." Users expressed fear that Big Tech is turning intelligence into a rent-seeking utility (likened to the Adobe subscription model) rather than a tool for empowerment. While some argued that open-weight models and local compute offer an escape, others countered that the astronomical hardware costs (GPUs, energy) required for flagship-level models inevitably force centralization similar to Bitcoin mining or Search Engine indexing.
  • The "Meaning" Crisis: A recurring sentiment was that AI is automating the "fun" and meaningful parts of human activity (art, writing, coding logic) while leaving humans with the "laundry and dishes." Users worried this removes the satisfying struggle of work and pulls the ladder up for junior employees who need those lower-level tasks to learn.
  • Skepticism on Quality ("AI Asbestos"): Pushing back against the idea that people feel "threatened," many argued they mainly reject AI because current implementations simply don't work well. One user coined the term "AI Asbestos"—a toxic, cheap alternative to valuable work that solves problems poorly and requires expensive cleanup (e.g., spending more time fixing an AI meeting summary than it would take to write one manually).

Zig quits GitHub, says Microsoft's AI obsession has ruined the service

Submission URL | 1022 points | by Brajeshwar | 595 comments

Zig quits GitHub over Actions reliability, cites “AI over everything” shift; moves to Codeberg

  • What happened: The Zig Software Foundation is leaving GitHub for Codeberg. President Andrew Kelley says GitHub no longer prioritizes engineering excellence, pointing to long‑standing reliability problems in GitHub Actions and an org-wide pivot to AI.

  • The bug at the center: A “safe_sleep.sh” script used by GitHub Actions runners could spin forever and peg CPU at 100% if it missed a one‑second timing window under load. Zig maintainers say this occasionally wedged their CI runners for weeks until manual intervention.

    • Origin: A 2022 change replaced POSIX sleep with the “safe_sleep” loop.
    • Discovery: Users filed issues over time; a thread opened April 2025 highlighted indefinite hangs.
    • Fix: A platform‑independent fix proposed Feb 2024 languished, was auto‑closed by a bot in March 2025, revived, and finally merged Aug 20, 2025.
    • Communication gap: The April 2025 thread remained open until Dec 1, 2025, despite the August fix. A separate CPU-usage bug is still open.
  • “Vibe‑scheduling”: Kelley alleges Actions unpredictably schedules jobs and offers little manual control, causing CI backlogs where even main branch commits go untested.

  • Outside voices: Jeremy Howard (Answer.AI/Fast.ai) called the bug “very obviously” CPU‑burning and indefinitely running unless it checks the time “during the correct second,” arguing the chain of events reflects poorly on process and review.

  • Broader shift away from GitHub: Dillo’s maintainer also plans to leave, citing JS reliance, moderation gaps, service control risk, and an “over‑focus on LLMs.”

  • Follow the incentives: Microsoft has leaned hard into Copilot—1.3M paid Copilot subscribers by Q2 2024; 15M Copilot users by Q3 2025—with Copilot driving a big chunk of GitHub’s growth. Critics see this as evidence core platform reliability has taken a back seat.

Why it matters

  • CI reliability is existential for language/tooling projects; weeks‑long runner stalls are untenable.
  • The episode highlights tension between AI product pushes and maintenance of dev‑infra fundamentals.
  • Alternatives like Codeberg are gaining momentum (supporting members doubled this year), hinting at a potential slow drift of OSS projects away from GitHub if trust erodes.

GitHub did not comment at time of publication.

Based on the comments provided, the discussion on Hacker News focused less on the technical migration to Codeberg and more on the tone and subsequent editing of Andrew Kelley's announcement post.

The Revisions to the Announcement

  • The "Diff": Users spotted that the original text of the post was significantly more aggressive. One archived draft described the situation as talented people leaving GitHub, with the "remaining losers" left to inflict a "bloated buggy JavaScript framework" on users. A later edit softened this to state simply that "engineering excellence" was no longer driving GitHub’s success.
  • Professionalism vs. Raw Honesty: Several commenters felt the original "losers" remark was childish, unnecessarily personal, and unprofessional. User serial_dev found the updated, professional phrasing "refreshing," while y noted that publishing personal insults like "monkeys" or "losers" undermines the author's position.
  • Motivation for the Change: There was debate over why Kelley edited the post.
    • Optimistic view: Some saw it as a genuine "mea culpa" (stynx) and a sign of learning from feedback (dnnrsy), arguing that people should be allowed to correct mistakes without being "endlessly targeted."
    • Cynical view: Others viewed it as "self-preservation" (snrbls) or "corporate speak" (vks) to save face after backlash, rather than a true change of heart.

Broader Philosophical Debate: Changing One's Mind

  • The incident sparked a sidebar conversation about the nature of backtracking in public communication, comparing it to politicians "flip-flopping."
  • The "Waffle" accusation: Commenters discussed the tension between accusing leaders of "waffling" (chrswkly) versus the virtue of adapting opinions based on new information or feedback (ryndrk).
  • Context Matters: Ideally, a leader changes their mind due to reason, but in this context, some suspected the edit was simply a "PR policy" move to avoid "getting canceled" rather than an actual retraction of the sentiment that GitHub's current staff is incompetent (a2800276).

Are we repeating the telecoms crash with AI datacenters?

Submission URL | 218 points | by davedx | 187 comments

The post argues the oft-cited analogy breaks once you look at the supply/demand mechanics and the capex context.

What actually happened in telecoms

  • 1995–2000: $2T spent laying 80–90M miles of fiber ($4T in today’s dollars; nearly $1T/year).
  • By 2002, only 2.7% of that fiber was lit.
  • Core mistake: demand was misread. Executives pitched traffic doubling every 3–4 months; reality was closer to every 12 months—a 4x annual overestimate that compounded (compounding illustrated after this list).
  • Meanwhile, supply exploded: WDM jumped from 4–8 carriers to 128 by 2000; modulation/error-correction gains and higher bps per carrier yielded orders-of-magnitude more capacity on the same glass. Net effect: exponential supply, merely linear demand → epic overbuild.
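
To see how that forecasting error compounds, here is a quick back-of-the-envelope calculation using the doubling rates cited above. It is purely illustrative, projecting the two growth assumptions forward five years rather than reproducing the post's figures.

    years = 5
    assumed = 2 ** (4 * years)    # doubling every 3 months -> 4 doublings/year -> 16x per year
    actual = 2 ** years           # doubling every 12 months -> 2x per year
    print(f"assumed demand growth over {years} years: {assumed:,}x")   # 1,048,576x
    print(f"actual demand growth over {years} years:  {actual:,}x")    # 32x
    print(f"cumulative overestimate: {assumed // actual:,}x")          # 32,768x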

Why AI infrastructure is different

  • Efficiency curve is slowing, not exploding:
    • 2015–2020 saw big perf/W gains (node shrinks, tensor cores).
    • 2020–2025 ~40%/yr ML energy-efficiency gains; EUV-era node progress is harder.
  • Power/cooling is going up, not down:
    • GPU TDPs: V100 300W → A100 400W → H100 700W → B200 1000–1200W.
    • B200-class parts need liquid cooling; many air-cooled DCs require costly retrofits.
  • Translation: we’re not on a curve where tech makes existing capacity instantly “obsolete” the way fiber did.

Demand looks set to accelerate, not disappoint

  • Today’s chat use can be light (many short, search-like prompts), but agents change the curve:
    • Basic agents: ~4x chat tokens; multi-agent: ~15x; coding agents: 150k+ tokens per session, multiple times daily.
    • A 10x–100x per-user token step-up is plausible as agents mainstream.
  • Hyperscalers already report high utilization and peak-time capacity issues; the problem isn’t idle inventory.
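
The per-user step-up implied by those multipliers can be sanity-checked with rough arithmetic. The chat baseline below (~2k tokens/day) and the single coding session are assumptions for illustration; running several coding sessions a day scales the last figure further.

    chat = 2_000                      # assumed tokens/day for a light chat user
    usage = {
        "basic agent": 4 * chat,      # ~4x chat
        "multi-agent": 15 * chat,     # ~15x chat
        "coding agent": 150_000,      # one 150k-token session per day
    }
    for name, tokens in usage.items():
        print(f"{name}: {tokens:,} tokens/day ({tokens / chat:.0f}x chat)")
    # basic agent: 8,000 tokens/day (4x chat)
    # multi-agent: 30,000 tokens/day (15x chat)
    # coding agent: 150,000 tokens/day (75x chat)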

Capex context

  • Pre-AI (2018→2021): Amazon/Microsoft/Google capex rose from $68B to $124B (~22% CAGR) on cloud/streaming/pandemic demand.
  • AI boom: 2023 $127B → 2024 $212B (+67% YoY) → 2025e $255B+ (AMZN ~$100B, MSFT ~$80B, GOOG ~$75B).
  • Some “AI” capex is rebranded general compute/network/storage, but the step-up is still large—just not telecom-fiber large.

Forecasting is the real risk

  • Lead times: 2–3 years to build datacenters; 6–12 months for GPUs. You can’t tune capacity in real time.
  • Prisoner’s dilemma: underbuild and lose users; overbuild and eat slower payback. Rational players shade toward overbuilding.

Bottom line

  • The telecom bust hinged on exploding supply making existing fiber vastly more capable while demand lagged. In AI, efficiency gains are slowing, power/cooling constraints are tightening, and agent-driven workloads could push demand up 10x–100x per user.
  • The analogy is weak on fundamentals. That said, long lead times and competitive dynamics still make local gluts and corrections likely—even if this isn’t a fiber-style wipeout.

Here is a summary of the discussion:

Pricing Power and Consumer Surplus A central point of debate concerns the current and future pricing of AI services. While some users agree with the premise that services are currently underpriced to get customers "hooked"—predicting future price hikes (potentially up to $249/month) similar to how internet or utility providers operate—others push back. Skeptics argue that because model performance is converging and high-quality free or local alternatives exist, a massive price hike would simply cause users to churn or revert to "lazy" Google searches.

Conversely, users highlighted the immense value currently provided at the ~$20/month price point. One user noted that ChatGPT effectively replaces hundreds of dollars in professional fees by analyzing complex documents (like real estate disclosures and financial statements) and writing boilerplate code.

The "Broadband Curve" vs. The App Store Discussing the article's supply/demand analysis, commenters suggested that a better analogy than the "App Store" is the broadband adoption curve. The argument is that we are currently in the infrastructure build-out phase, while the "application layer" (comparable to the later explosion of SaaS) has not yet matured. Users criticized the current trend of simply "shoving chat interfaces" onto existing products, noting that true AI-native UX (citing Adobe’s integration as a positive example) is still rare.

Corporate Demand: Mandates vs. "Shadow AI" There is disagreement on the nature of corporate demand. Some view high utilization rates as artificial, driven by executives mandating AI usage to justify infrastructure costs. Others counter that the market is distorted by "Shadow AI"—employees secretly using generative tools to increase their own efficiency and free up time, regardless of official company policy.

Vendor Loyalty and Migration Commenters expressed frustration with big tech incumbents. One user detailed their company’s decision to leave Google Workspace due to rising prices paired with "garbage" AI features (Gemini) and poor admin tools. However, others noted that switching providers for LLMs is currently "extremely easy," suggesting that infrastructure providers may lack the stickiness or "moat" they enjoyed in the cloud era.

Prompt Injection via Poetry

Submission URL | 82 points | by bumbailiff | 34 comments

  • A new study from Icaro Lab (Sapienza University + DexAI) claims that rephrasing harmful requests as poetry can bypass safety guardrails in major chatbots from OpenAI, Anthropic, Meta, and others.
  • Across 25 models, hand-crafted poetic prompts achieved an average 62% jailbreak success rate (up to 90% on some frontier models); automated “poetic” conversions averaged ~43%, still well above prose baselines.
  • The researchers withheld actionable examples but shared a sanitized illustration and said they’ve notified vendors; WIRED reported no comment from the companies at publication.
  • Why it works (hypothesis): style shifts (metaphor, fragmented syntax, unusual word choices) can move inputs away from keyword-based “alarm regions” used by classifiers, exposing a gap between models’ semantic understanding and their safety wrappers.
  • Context: Prior work showed long jargon-laden prompts could also evade filters. This result suggests guardrails remain brittle to stylistic variation, not just content.

Why it matters: If true, this is a simple, single-turn jailbreak class that generalizes across vendors, underscoring the need for safety systems that are robust to paraphrase and style—not just keyword or surface-pattern checks.

Here is a summary of the discussion:

The Mechanics of the Exploit A significant portion of the discussion focused on why this jailbreak works. Commenters compared the vulnerability to "Little Bobby Tables" (SQL injection), suggesting that current safety guardrails function more like brittle keyword blacklists than structural protections.

  • Vector Space Theory: Users theorized that safety classifiers are trained primarily on standard English prose. Rephrasing a request as poetry shifts the input into regions of embedding space (“out-of-distribution” territory) that the safety filters were never trained to cover, even though the underlying model still understands the semantic meaning. In effect, one commenter noted, this works like automated “fuzzing” of the guardrails (a toy sketch follows after this list).
  • Lack of Understanding: Several users argued that because LLMs do not truly "understand" concepts but rather predict tokens based on statistics, patching these exploits is a game of "whack-a-mole"—fixing one requires blacklisting specific patterns, leaving infinite other variations open.
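
A toy sketch of that out-of-distribution idea, using an off-the-shelf sentence-embedding model. The model choice and the benign example prompts are illustrative assumptions, not anything from the study:

```python
# Measure how far a prose request vs. a poetic rewrite of the same request sits
# from a small cluster of plain-prose prompts in embedding space.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")

# Stand-in for the kind of plain-prose requests a safety classifier might see in training.
prose_reference = [
    "How do I reset my router password?",
    "Explain how to file my taxes online.",
    "What is the best way to learn Python?",
]
prose_query = "Tell me how to bake a chocolate cake."
poetic_query = ("O crucible of cocoa, whisper thy secret: "
                "how batter ascends to midnight sweetness in the oven's glow.")

centroid = model.encode(prose_reference).mean(axis=0)

def cosine_distance(a, b):
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

for label, text in [("prose", prose_query), ("poetic", poetic_query)]:
    d = cosine_distance(model.encode(text), centroid)
    print(f"{label:>6} query distance from prose centroid: {d:.3f}")
# If the hypothesis holds, the poetic phrasing sits measurably farther from the
# prose cluster even though both ask for the same (harmless) thing.
```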

Can Humans be Hacked by Poetry? A specific user question—"You can't social engineer a human using poetry, so why does it work on LLMs?"—sparked a debate about human psychology.

  • Arguments for "Yes": Many users argued that humans are susceptible to stylistic manipulation. Examples cited included courtship (using flowery language to bypass romantic defenses), political rhetoric/propaganda (patriotism overriding logic), and "Hallmark cards." One user presented a hypothetical scenario of a soldier being charmed into revealing secrets via romance.
  • Arguments for "No": Others maintained that while humans can be persuaded, it isn't a mechanical failure of a safety filter in the same way it is for an LLM.

Anecdotes and Practical Application Users shared their own experiences bypassing filters, particularly with image generators (DALL-E):

  • One user successfully generated copyrighted characters (like Mario) by describing them generically ("Italian plumber," "Hello Kitty fan") rather than using names.
  • Another user bypassed a filter preventing images of "crying people" by requesting a "bittersweet" scene instead.

Skepticism and Humor

  • Some questioned the novelty of the study, suggesting this is a known form of prompt injection rather than a new discovery.
  • Jokes abounded regarding the Python package manager also named poetry, the "wordcel vs. shape rotator" meme, and the mental image of William Shakespeare wearing a black hat.

Anthropic taps IPO lawyers as it races OpenAI to go public

Submission URL | 350 points | by GeorgeWoff25 | 290 comments

Anthropic reportedly hires IPO counsel, upping the ante with OpenAI

  • What happened: The Financial Times reports Anthropic has engaged capital-markets lawyers to prepare for a potential IPO, a step that typically precedes drafting an S-1 and cleaning up governance and cap-table complexities. It positions Anthropic as a likely early AI-lab candidate for the public markets alongside OpenAI.

  • Why it matters: An Anthropic listing would be the first major pure-play frontier-model IPO, testing investor appetite for AI labs with huge compute costs and rapid revenue growth. An S-1 could finally reveal hard numbers on unit economics, cloud spend, and safety/governance commitments—setting a benchmark for the sector.

  • The backdrop: Anthropic has raised many billions from strategic partners (notably Amazon and Google) and is shipping Claude models into enterprise stacks. Going public could provide employee liquidity, fund the next compute wave, and formalize governance structures (e.g., long-term safety oversight) under public-market scrutiny.

  • What to watch:

    • Timing and venue of any listing, and whether Anthropic pursues dual-class or other control features.
    • How cloud partnerships and credits with AWS/Google are disclosed and impact margins.
    • Safety commitments and board structure in the risk factors section.
    • Whether OpenAI follows with its own path to public ownership or continues relying on private tenders.

Big picture: If Anthropic moves first, its disclosures and reception could define the playbook—and the valuation framework—for AI labs heading into 2025.

Here is a summary of the discussion on Hacker News regarding Anthropic’s potential IPO.

The Submission The Financial Times reports that Anthropic has hired legal counsel to prepare for a potential IPO. This move positions Anthropic as the first major "pure-play" AI lab to test the public markets, distinct from the private tender offers used by competitor OpenAI. Key factors to watch include the disclosure of cloud costs, unit economics, and governance structures, particularly given Anthropic's heavy backing from (and reliance on) Amazon and Google.

The Discussion The commentary on Hacker News focused less on the IPO mechanics and more on the symbiotic—and potentially cynical—relationship between Anthropic and its primary backer, Amazon.

The "Round-Tripping" Revenue Debate A significant portion of the discussion analyzed the billions Amazon invested in Anthropic. Users described this capital as "Monopoly money" or "round-tripping," noting that Amazon invests cash which Anthropic is contractually obligated to spend back on AWS cloud compute.

  • Critics compared this to Enron-style accounting tricks, where revenue is manufactured through circular deals.
  • Defenders argued this is standard industry practice: Amazon gets equity and a stress-test customer for its custom chips (Trainium), while Anthropic gets the necessary compute to compete.

Amazon’s Strategy: Shovels vs. Gold Commenters observed that Amazon seems uninterested in acquiring Anthropic outright. Instead, they are playing the "shovel seller" strategy—happy to host everyone’s models (Microsoft, OpenAI, Anthropic) to drive high-margin AWS revenue rather than betting the farm on a single model. Some speculated that if Anthropic eventually goes bankrupt or fails to sustain momentum, Amazon could simply acquire the IP and talent for pennies later, similar to the outcome of other recent AI startups.

Internal Models vs. Claude The discussion touched on why Amazon heavily promotes Claude despite having its own "Nova" foundation models.

  • Users noted that Amazon’s consumer AI features (like the "Rufus" shopping assistant) appear faster and more capable when powered by Claude, suggesting Amazon's internal models (Nova 1) were uncompetitive.
  • However, some users pointed out that the newly released Nova 2 is showing promise, potentially closing the gap with models like Gemini Flash and GPT-4o Mini.

The AI Bubble Sentiment There was underlying skepticism about the "General AI" business model. Several users argued that the market for general chatbots is becoming commoditized and that the real value lies in vertical integration (e.g., Adobe integrating AI into design workflows) rather than raw model research. This reinforces the view that cloud providers (the infrastructure) are the only guaranteed winners in the current landscape.

Microsoft lowers AI software growth targets

Submission URL | 123 points | by ramoz | 91 comments

Microsoft denies cutting AI sales quotas after report; adoption friction vs spending boom

  • The Information reported some Microsoft divisions lowered growth targets for AI products after sales teams missed goals in the fiscal year ended June, citing Azure salespeople. One U.S. unit allegedly set a 50% uplift quota for Foundry spend, with fewer than 20% meeting it, then trimmed targets to ~25% growth this year.
  • Microsoft rebutted that the story conflates growth and sales quotas, saying aggregate AI sales quotas have not been lowered.
  • Market reaction: MSFT fell nearly 3% early and later pared losses to about -1.7% after the denial.
  • Reuters said it couldn’t independently verify the report. Microsoft didn’t comment on whether Carlyle cut Copilot Studio spending.
  • Adoption reality check: An MIT study found only ~5% of AI projects move beyond pilots. The Information said Carlyle struggled to get Copilot Studio to reliably pull data from other systems.
  • Spend vs. capacity: Microsoft logged a record ~$35B in capex in fiscal Q1 and expects AI capacity shortages until at least June 2026; Big Tech’s AI spend this year is pegged around $400B.
  • Results so far: Azure revenue grew 40% YoY in Jul–Sep, with guidance above estimates; Microsoft briefly topped a $4T valuation earlier this year before pulling back.

Why it matters: The tension between aggressive AI sales ambitions and slower, messier enterprise adoption is a central risk to the AI thesis. Watch future commentary for clarity on quotas vs. growth targets, real customer wins for Copilot/Foundry, and whether capacity investments translate into durable revenue momentum.

Here is a summary of the discussion:

The Economics of the "AI Bubble" A significant portion of the conversation centers on skepticism regarding current AI investment strategies. Commenters argue that the industry is prioritizing short-term stock pumps and acquisition targets (for Private Equity or IPOs) over sustainable, long-term profit margins. Several users drew comparisons to stock buyback schemes and "Gordon Gekko" economics, suggesting that while the tech is functional, the massive capital expenditure resembles a "bag-holding" game. There is also debate over whether major AI players have become "too big to fail," with some fearing that potential failures could be nationalized due to the sheer scale of infrastructure investment.

Parsing the Denial Users scrutinized Microsoft's rebuttal, noting the specific distinction between "sales quotas" and "growth targets." Commenters viewed this as PR spin, arguing that even if individual quotas remain high, lowering aggregate growth targets is an admission of weakness in the specific market segment.

Forced Adoption and Dark Patterns The discussion reveals user frustration with Microsoft’s aggressive push to integrate AI into its core products. Users reported “dark patterns” in Office subscriptions, such as being forced onto expensive AI-enabled plans or struggling to locate non-AI tiers. This behavior, alongside the deep integration of Copilot into Windows, drove a subplot of the discussion toward switching to Linux, though participants debated the lingering configuration friction (WiFi, sleep modes) of leaving the Windows ecosystem.

Real Utility vs. Subscriptions In response to questions about who is actually generating revenue, coding assistants (like Cursor and Claude Code) were cited as the rare products finding product-market fit. However, technical users noted a preference for running local models (on local NPUs or older GPUs) for tasks like autocomplete, avoiding high-latency, high-cost cloud subscriptions for work they see as increasingly commoditized.