AI Submissions for Fri Dec 05 2025
Gemini 3 Pro: the frontier of vision AI
Submission URL | 506 points | by xnx | 265 comments
Gemini 3 Pro: Google’s new multimodal model pushes hard on visual and spatial reasoning
What’s new
- Google DeepMind claims state-of-the-art results across vision-heavy tasks: document, spatial, screen, and video understanding, topping benchmarks like MMMU Pro and Video MMMU.
- Big focus on “derendering”: turning images of messy, real-world documents into structured code (HTML/LaTeX/Markdown). Demos include reconstructing 18th‑century handwritten tables, transcribing equations from photos, and turning Florence Nightingale’s polar chart into an interactive graphic.
- Document reasoning: The model navigates long reports, cross-references figures/tables, and ties numbers to causal text. It reportedly beats the human baseline on the CharXiv Reasoning benchmark (80.5%), with an example analyzing Gini index changes and policy impacts in a 62-page Census report.
- Spatial understanding: Outputs pixel-precise coordinates to “point” in images; supports open‑vocabulary references (e.g., “point to the screw”) for robotics/AR planning and manipulation (a minimal API sketch follows this list).
- Screen understanding: Parses desktop/mobile UIs with high-precision clicking—pitched for reliable “computer use” agents, QA, onboarding, and UX analytics.
- Video: Higher frame-rate comprehension (e.g., 10 FPS) to catch fast actions like golf swings and weight shifts.
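Below is a minimal sketch of what such a pointing request could look like through the google-genai Python SDK. The model id ("gemini-3-pro-preview"), the prompt wording, and the 0-1000 normalized [y, x] coordinate convention are illustrative assumptions, not details confirmed in the post.

```python
from google import genai
from google.genai import types

client = genai.Client()  # picks up the API key from the environment

# Attach the image to the request as raw bytes
with open("workbench.jpg", "rb") as f:
    image = types.Part.from_bytes(data=f.read(), mime_type="image/jpeg")

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # hypothetical model id
    contents=[
        image,
        'Point to the screw. Reply with JSON like '
        '[{"label": "...", "point": [y, x]}], coordinates normalized to 0-1000.',
    ],
)
print(response.text)  # parse the JSON and rescale to pixel coordinates
```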
Why it matters
- If the claims hold, this closes gaps between perception and reasoning across messy real-world inputs—key for automation in back-office document workflows, UI agents, robotics, and sports/industry video analysis.
Caveats
- These are vendor-reported benchmarks and demos; independent evaluations and real-world reliability (latency, cost, privacy) will be crucial.
- Developers can try it via Google AI Studio and docs, but details on pricing, rate limits, and on-device/enterprise deployment weren’t included here.
Here is a summary of the discussion:
The "Five-Legged Dog" Stress Test The majority of the discussion focuses on a specific stress test: showing the model a picture of a dog with five legs. Users report that despite the model’s claimed visual precision, it struggles to override its training priors (that dogs have four legs).
- Cognitive Dissonance: When asked to count legs, Gemini and other models often hallucinate explanations for the fifth limb (e.g., calling it a tail, an optical illusion, or claiming the dog is an amputee) to fit the four-leg prior.
- Implicit vs. Explicit: User vndrb noted that while the model fails at counting the legs, it succeeds at editing tasks. When asked to "place sneakers on the legs," the model correctly placed five sneakers, suggesting the visual encoder sees the data but the reasoning layer suppresses it.
- Generative Struggles: Users noted similar failures when asking models to generate out-of-distribution concepts, such as a "13-hour clock." The models consistently revert to standard 12-hour faces or hallucinate workarounds (like adding a plaque that says "13") rather than altering the fundamental structure.
The Role of RLHF
Commenters speculate that Reinforcement Learning from Human Feedback (RLHF) is the culprit. The consensus is that models are heavily penalized during training for deviating from "normal" reality. Consequently, the models prioritize statistical probability (dogs usually have four legs) over the immediate visual evidence, leading to "stubborn" behavior where the model refuses to acknowledge anomalies.
NeurIPS 2025 Best Paper Awards
Submission URL | 170 points | by ivansavz | 28 comments
NeurIPS 2025 named seven Best Paper Award winners (four Best Papers, including one from the Datasets & Benchmarks track, and three runners-up), spanning diffusion theory, self-supervised RL, LLM attention, reasoning, online learning, neural scaling laws, and benchmarking for model diversity. Award committees were drawn from across the main program and the Datasets & Benchmarks track, with selections approved by the general and accessibility chairs.
Two standouts highlighted in the announcement:
- Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)
- Releases Infinity-Chat, a large open-ended benchmark of 26K real-world prompts plus 31,250 human annotations (25 raters per example) and a first comprehensive taxonomy of open-ended LM tasks (6 categories, 17 subcategories).
- Empirically shows an “Artificial Hivemind” effect: strong intra-model repetition and inter-model homogeneity on open-ended generation.
- Finds miscalibration between reward models/LM judges and diverse human preferences, underscoring the tension between alignment and pluralism and the long-term risk of creativity homogenization.
- Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
- Systematically studies gating in softmax attention across 30 model variants, including 15B MoE and 1.7B dense models trained on 3.5T tokens.
- A simple tweak, adding a head-specific sigmoid gate after scaled dot-product attention, consistently boosts performance, improves training stability, tolerates larger learning rates, and scales better (a minimal code sketch follows this list).
- Points to benefits from injecting non-linearity in the attention path (and addresses attention sink issues).
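As background, here is a minimal PyTorch sketch of the idea as described above: a sigmoid gate applied to the output of scaled dot-product attention before the output projection. The gate's granularity (per channel here) and its input (the layer's input activations) are assumptions for illustration, not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    """Self-attention with a sigmoid gate on the attention output (sketch)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.gate = nn.Linear(d_model, d_model)   # gate values per head/channel
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq, head_dim)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, t, d)
        # non-linearity after attention: sigmoid gate computed from the input,
        # multiplied elementwise into the attention output before projection
        return self.proj(attn * torch.sigmoid(self.gate(x)))
```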
Why it matters
- Evaluation is moving beyond narrow benchmarks to plural, open-ended human preferences—raising flags about model homogenization and the cost of over-alignment.
- Small architectural changes can still unlock meaningful gains at trillion-token scale.
- The award slate balances societal impact with core theory and systems advances—signaling where ML research energy is heading.
Here is a summary of the top stories and the accompanying discussion:
NeurIPS 2025 Announces Best Paper Awards
The NeurIPS 2025 committee has selected seven winners for the Best Paper Awards, signaling a shift in machine learning research toward analyzing model homogeneity and refining architectural fundamentals. Two papers were specifically highlighted in the announcement:
- "Artificial Hivemind" releases Infinity-Chat, a benchmark of 26k prompts, revealing that LLMs exhibit strong "hivemind" behavior—repetitive internal outputs and high homogeneity across different models—suggesting a long-term risk of creativity loss due to misalignment between reward models and diverse human preferences.
- "Gated Attention for Large Language Models" introduces a simple architectural tweak—adding a sigmoid gate to softmax attention—which improves training stability and performance at the trillion-token scale. Overall, the awards signal a move beyond narrow benchmarks toward open-ended evaluation and demonstrate that small structural changes can still yield significant gains.
Summary of Hacker News Discussion
The discussion thread focuses on the validity of benchmarking metrics, the theoretical underpinnings of reasoning, and the changing demographics of ML researchers:
- RL and Reasoning Capacity: A significant debate centers on whether Reinforcement Learning (RL) truly improves a model's reasoning capabilities or merely limits its creativity. Users discuss the "Does Reinforcement Learning Incentivize Reasoning...?" paper, arguing over the validity of "pass@k" metrics (a short sketch of the estimator follows this list). Skeptics argue that RL simply "sharpens" the probability distribution toward answers already present in the base model (which acts as a broader, more creative generator), while proponents argue that pass@k is a valid proxy for skill, distinguishing actual correctness from the theoretical possibilities of a "random number generator."
- The "Hivemind" Effect: Users experimented with the "Artificial Hivemind" paper's findings by prompting models (like Gemini) to write metaphors about time. Commenters noted that while models produced varying imagery (cliffs, hammers), the underlying semantic themes usually reverted to the dominant "river/flow" cluster, validating the paper's claims about model homogeneity.
- Physicists in ML: Commenters noticed several physicists among the award winners. This sparked a conversation about the high transferability of physics skills (linear algebra, eigenvectors, SVD) to Machine Learning, with some suggesting physicists are better equipped for the math-heavy side of ML than standard software engineers.
- Consumption and "Slop": In a discussion about how best to digest these papers (reading vs. video), the tool NotebookLM was mentioned. Opinions were split: some view AI-generated audio summaries as "environmental pollution" cluttering search results, while others argued they are actually an improvement over the low-quality "slop" videos produced by humans.
- Architecture & Superposition: There is speculation regarding "superposition" in neural networks—specifically how differentiable networks struggle to "commit" to a single concept (e.g., green vs. purple) without the "forcing function" of discretizing tokens. Other architectural papers, such as TITANS and work by SakanaAI, were recommended as complementary reading.
Jony Ive's OpenAI Device Barred From Using 'io' Name
Submission URL | 83 points | by thm | 59 comments
Jony Ive/OpenAI barred from using “io” hardware brand after Ninth Circuit upholds TRO
- A U.S. appeals court affirmed a temporary restraining order blocking OpenAI, Jony Ive, Sam Altman, and IO Products, Inc. from using “io” to market hardware deemed similar to AI-audio startup iyO’s planned device (source: Bloomberg Law via MacRumors).
- The court found a likelihood of confusion between “IO” and “iyO” and flagged “reverse confusion” risk given OpenAI’s scale, citing potential irreparable harm to iyO’s brand and fundraising.
- Backstory: Ive and Altman picked “io” in mid‑2023. In early 2025, iyO CEO Jason Rugolo sought funding from Altman for a human‑computer interface project; Altman declined, saying he was already working on something competitive. OpenAI argued its first device wouldn’t be a wearable and that Rugolo voluntarily shared details while suggesting a $200M acquisition.
- Scope: The order doesn’t ban all “io” uses—only for hardware similar to iyO’s planned AI-audio computer. OpenAI removed “io” branding shortly after the TRO.
- What’s next: The case returns to district court for a preliminary injunction hearing in April 2026; broader litigation could run into 2027–2028. OpenAI’s first hardware device is still expected next year, likely under a different name.
Why it matters for HN:
- Highlights the “reverse confusion” doctrine—when a big brand risks swamping a smaller mark.
- Naming due diligence for hardware/AI products just got a high-profile cautionary tale.
- Signals branding and launch risks for OpenAI’s Ive-designed device even as the product timeline advances.
Based on the discussion, Hacker News users reacted with a mix of branding critique, mockery of the founders' public image, and speculation regarding the utility of the hardware itself.
Branding and Alternatives
The court order barring "io" sparked immediate humor and alternative suggestions. Several users jokingly proposed "Oi" (referencing British slang and Jason Statham movies), though others noted "Oi" is already a major telecom brand in Brazil. Others referenced "JOI" (from Blade Runner) or the bygone "Yo" app. On a serious note, commenters questioned the strategy behind the original name, arguing that "io" is uncreative, difficult to search for in hardware repositories, and squanders the immense brand equity of "ChatGPT," which one user felt should have been the leading name for the device.
Critique of the Ive/Altman "Vibe"
A thread developed specifically roasting the press photo of Sam Altman and Jony Ive. Users described the aesthetic as "creepy," comparing it variously to a "bad early 90s TV movie," a "cropped Giorgio Armani perfume ad," or a "pregnancy announcement," with some viewing the project as a "narcissistic dance-off."
Speculation on the Hardware
Discussion shifted to what the device actually does, with significant skepticism:
- Form Factor: Guesses ranged from a "Humane Pin v2" to smart glasses, a set-top TV box, or a dedicated smart speaker.
- Utility: Some users expressed desire for a dedicated "ChatGPT box" to replace existing smart speakers (Alexa/Google Home), which many felt have become "detuned" or increasingly useless.
- Necessity: Users theorized that OpenAI is forced to build hardware because Apple will never grant a third-party app the "always-on," deep-system access required for a true AI assistant on the iPhone.
- Viability: Cynicism remained high, with comparisons to other recent AI hardware flops like the Rabbit R1 and Humane Pin; one user called it likely just a "chatbot box."
The Plaintiff (iyO)
A few users investigated the plaintiff, iyO, noting that their planned products resemble "audio computing" headphones or cameras, though one user complained that the startup's website was incredibly slow to load.
Wall Street races to protect itself from AI bubble
Submission URL | 70 points | by zerosizedweasle | 83 comments
Wall Street races to protect itself from the AI bubble it’s funding
- Banks are underwriting record borrowing to build AI infrastructure while simultaneously hedging against a potential bust. Global bond issuance has topped $6.46T in 2025 as hyperscalers and utilities gear up to spend at least $5T on data centers, per JPMorgan.
- Anxiety is visible in credit markets: the cost to insure Oracle’s debt has climbed to highs not seen since the Global Financial Crisis, and hedging activity has exploded. Oracle CDS trading hit about $8B over nine weeks through Nov 28 vs ~$350M a year earlier.
- Lenders are heavily exposed via massive construction loans (e.g., $38B and $18B packages tied to new data centers in Texas, Wisconsin, and New Mexico) and are offloading risk with credit derivatives and portfolio deals.
- CDS spreads have jumped across big tech. Five-year protection on $10M of Microsoft debt runs ~34 bps ($34k/yr) vs ~20 bps in mid-October; Johnson & Johnson, the only other AAA in the U.S., is ~19 bps. Saba Capital says MSFT protection looks rich and is selling it; they see similar dislocations in Oracle, Meta, and Alphabet (a quick bps-to-premium sketch follows this list).
- Operational risk is in the mix: a major outage that halted CME Group trading prompted Goldman Sachs to pause a $1.3B mortgage bond sale for data center operator CyrusOne, highlighting how repeated breakdowns can drive customer churn.
- Morgan Stanley has explored “significant risk transfer” deals—using credit-linked notes and similar structures to insure 5–15% of designated loan portfolios—and private credit firms like Ares are positioning to absorb that risk.
- Why it matters: The AI buildout may be the largest tech borrowing spree ever, but banks are laying off downside to derivatives buyers and private credit. If returns lag or outages mount, losses won’t stay on bank balance sheets; if the boom pays off, the protection sellers pocket the premiums. As Steven Grey cautions, great tech doesn’t automatically equal profits.
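As a quick sanity check on those quotes, a CDS spread in basis points converts to an annual premium by multiplying the notional by the spread (1 bp = 0.01%). The sketch below just reproduces the article's Microsoft figures:

```python
def annual_cds_premium(notional: float, spread_bps: float) -> float:
    """Annual cost of protection: notional * spread, with 1 bp = 0.01%."""
    return notional * spread_bps / 10_000

print(annual_cds_premium(10_000_000, 34))  # current quote -> 34000.0 (~$34k/yr)
print(annual_cds_premium(10_000_000, 20))  # mid-October   -> 20000.0 (~$20k/yr)
```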
Based on the discussion provided, here is a summary of the comments:
Fears of a Bailout and "Privatized Gains, Socialized Losses"
The most prominent reaction to the article is cynicism regarding who will ultimately pay for a potential AI bust.
- Users suggest that while banks are currently hedging, the US government (and by extension, the taxpayer) will "step in" to bail out AI corporations and Wall Street if the bubble bursts.
- One commenter satirically proposes a government plan involving borrowing trillions to give children equity quotas, highlighting the absurdity of current national debt levels and the feeling that the financial system is playing "God" with economics.
- One brief comment summed up the sentiment with the phrase "Make America Bankrupt."
The "AI Arms Race" Justification A counter-argument emerged claiming that the massive spending and borrowing are necessary matters of national defense.
- Several users argue the US cannot afford to "sleep" while China advances. The consensus among this group is that the AI buildout is a geopolitical necessity to prevent China from becoming the sole dominant power.
- Parallels were drawn to Cold War logic ("Mr. President, we cannot allow a mineshaft gap"), suggesting that even if the economics are a bubble, the strategic imperative overrides financial caution.
Debate on China’s Stability and Data
The mention of China sparked a sub-thread about the reliability of Chinese economic data and China's motivations for pursuing AI.
- One user argued that China is betting on AI and robotics to cope with its looming demographic collapse and secure its future despite a shrinking workforce.
- Others disputed the reliability of information regarding China, with some asking for a "single source of truth." There was a debate over whether Chinese official statistics (Five Year Plans, National Bureau of Statistics) are reliable or comparable to manipulated Soviet-era propaganda.
Macroeconomic Theory and Money Printing
A significant portion of the discussion devolved into a technical debate about the nature of money and debt.
- Users argued over the definition of "printing money" versus "issuing debt."
- Some contended that debt functions as savings for others (e.g., China buying US Treasuries) and is distinct from printing money, while others argued that fractional reserve banking essentially allows banks to create money out of thin air, expanding the money supply and fueling inflation.
- This thread reflected broader anxiety about the long-term sustainability of US fiscal policy, referencing recent increases in credit default swaps and huge deficit spending.