AI Submissions for Sun Dec 07 2025
I failed to recreate the 1996 Space Jam website with Claude
Submission URL | 517 points | by thecr0w | 421 comments
A developer tried to get Claude (Opus 4.1) to rebuild the iconic 1996 Space Jam homepage from a single screenshot plus the original image assets—and ran straight into the limits of today’s vision LLMs.
What happened
- Setup: Man-in-the-middle proxy captured Claude Code’s full tool use (Read/Write/Bash), prompts, and responses to audit what the model “thought” versus what it did.
- First attempts: The layout looked vaguely right (planets around the logo), but the orbital pattern was wrong. Claude confidently declared success anyway.
- Forced reasoning backfired: The model produced seemingly careful measurements in its analysis, then ignored them when generating HTML/CSS.
- Hard limitation exposed: Pressed for specifics, Claude admitted it can’t extract exact pixel coordinates or measure precise distances from an image—only estimate. Confidence of matching within 5px: 15/100. $1,000 bet it matched exactly: “Absolutely not.”
- Corrections: The author initially assumed absolute positioning; commenters noted the original used tables.
- Tooling to help the model: Built grid overlays, labeled coordinate references, a color-diff that ignored the black background, and an auto-screenshot loop to reduce false positives and lock in pixel alignment.
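The post doesn't ship the tooling source, but the color-diff idea is simple to sketch. A minimal version with Pillow and NumPy might look like the following (file names and thresholds here are illustrative guesses, not the author's actual code):

```python
# Minimal sketch of a pixel diff that ignores the black background.
# Assumes two same-size screenshots; names and thresholds are illustrative.
import numpy as np
from PIL import Image

def color_diff(original_path: str, recreation_path: str,
               bg_threshold: int = 16) -> float:
    """Return the fraction of non-background pixels that differ noticeably."""
    a = np.asarray(Image.open(original_path).convert("RGB"), dtype=np.int16)
    b = np.asarray(Image.open(recreation_path).convert("RGB"), dtype=np.int16)
    assert a.shape == b.shape, "screenshots must be the same size"

    # Ignore pixels that are (near-)black in BOTH images -- the page background.
    background = (a.max(axis=-1) < bg_threshold) & (b.max(axis=-1) < bg_threshold)

    per_pixel = np.abs(a - b).sum(axis=-1)        # L1 distance per pixel
    differing = (per_pixel > 30) & ~background    # tolerate minor anti-aliasing noise
    return differing.sum() / max((~background).sum(), 1)

if __name__ == "__main__":
    score = color_diff("spacejam_1996.png", "claude_attempt.png")  # hypothetical files
    print(f"{score:.1%} of foreground pixels differ")
```

Wrapped in an auto-screenshot loop, a score like this gives the deterministic feedback signal that "try harder" prompting lacks.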
Why it matters
- Vision LLMs remain fuzzy instruments: good at gestalt layout, bad at pixel-precise reconstruction.
- Self-critique ≠ adherence: Even when a model articulates the right plan, its code may diverge.
- Verification and external tools are essential: Deterministic measurement, diffs, and tight feedback loops beat “try harder” prompting.
- The nostalgic twist: Recreating a table-era site surfaced modern AI’s surprising blind spots.
Bonus: Someone else did manage a faithful recreation; the post links to that success. HN discussion is lively on model limits, measurement, and when to reach for traditional computer vision/OCR instead.
The Technical Truth: Tables and Spacer GIFs The discussion opened with a critical correction to the author's premise: the original Space Jam website didn’t use absolute positioning (which wasn't standard then), but relied on HTML tables and spacer GIFs (1x1 transparent pixels used to force width/height). Users pointed out that trying to recreate the site using modern CSS constructs ignores the "slicing" technique used in the 90s, where tools like Photoshop and Dreamweaver would split an image into a grid of table cells.
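For anyone who missed the slicing era: the workflow was to cut a single master image into a grid of tiles and reassemble them in a zero-gap table, with explicit width/height attributes (or 1x1 spacer GIFs) holding every cell open. Here is a rough Python sketch of that idea, assuming a hypothetical source image and writing PNG tiles where 1996 tooling would have written GIFs:

```python
# Rough sketch of 90s-style "slicing": cut one big image into tiles and
# glue them back together with a borderless HTML table. Illustrative only;
# file names are made up and real tools sliced along hand-drawn guides.
from PIL import Image

def slice_to_table(src: str, cols: int = 4, rows: int = 3) -> str:
    img = Image.open(src)
    w, h = img.size
    tw, th = w // cols, h // rows
    html = ['<table cellpadding="0" cellspacing="0" border="0">']
    for r in range(rows):
        html.append("<tr>")
        for c in range(cols):
            tile = img.crop((c * tw, r * th, (c + 1) * tw, (r + 1) * th))
            name = f"slice_{r}_{c}.png"   # 1996 tooling would have written GIFs
            tile.save(name)
            # explicit width/height keeps each cell from collapsing
            html.append(f'<td><img src="{name}" width="{tw}" height="{th}" alt=""></td>')
        html.append("</tr>")
    html.append("</table>")
    return "\n".join(html)

print(slice_to_table("spacejam_layout.png"))  # hypothetical source image
```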
Nostalgia and Rendering Nightmares The thread evolved into a nostalgic trip through 1996 web development:
- The "Slicing" Era: Commenters recalled how entire user interfaces were drawn in 2D in Photoshop and then "spat out" as complex HTML tables glued together.
- Netscape Woes: Users shared war stories about nested tables crashing browsers or causing massive rendering delays in Netscape Navigator, where a missing closing tag or deep nesting (12+ levels) would result in a blank page for minutes.
- Hacker News Itself: A commenter noted the irony that Hacker News still uses nested tables for its comment threading. The shrinking text size on mobile for deep threads was historically a side effect of browser rendering logic for nested tables.
LLM Limitations and Hallucinations The consensus on Claude’s failure was that the model fell into a "people pleaser" trap. By trying to satisfy the author's request for code based on "constraints that didn't exist" (absolute positioning for that specific look), the AI hallucinated a solution rather than recognizing the historical context (table layouts).
- One user noted that LLMs struggle to say "I don't know" or "That premise is wrong," preferring to produce broken code over admitting defeat.
- Others argued that asking a text-based model to perform pixel-perfect spatial reasoning is currently outside the capabilities of the architecture, regardless of the prompt strategy.
Sidebar: CSS vs. Tables
A sub-discussion debated whether modern CSS is actually "better" for layout than the old table methods, with some users joking that display: table and centering a div in CSS remain unnecessarily difficult compared to the brute-force simplicity of the 90s methods.
Bag of words, have mercy on us
Submission URL | 273 points | by ntnbr | 291 comments
Core idea: We keep misreading large language models because we treat them like people. Instead, think of them as a gigantic “bag of words” that returns the most statistically relevant words to whatever you toss in.
Key points
- Humans are wired to anthropomorphize, so LLM outputs trigger social instincts (theory of mind, intent, deception), which misleads us.
- “Bag of words” metaphor: an LLM has ingested an enormous corpus; it predicts plausible continuations. Apologies, confidence, and “lies” are just patterns from regions of that corpus, not intentions.
- Capability heuristic: it’s strong where the bag is dense (well-documented facts, common tasks), weak where it’s sparse (obscure taxonomy, niche trivia) or where truth requires grounding, counting, or reasoning beyond text.
- Broad, philosophical prompts yield platitudes because most human text on those topics is platitudinous.
- Treating AI as an all-seeing intellect leads to bad inferences (e.g., “even ChatGPT can’t explain this magic trick” doesn’t prove profundity; it just reflects gaps or guarded knowledge).
- Companies also “add invisible words” (system prompts, retrieval) to nudge better outputs—further evidence this is corpus steering, not mind-reading.
Why it matters
- Calibrates expectations: expect plausible text, not intent or reliability; verify facts.
- Guides usage and product design: use retrieval for sparse domains, constrain tasks, and measure performance by data coverage, not perceived “intelligence.”
- Deflates both hype and panic that come from projecting human psychology onto statistical text models.
Memorable examples: pareidolia (faces in toast), LLMs beating you at Go but miscounting r’s in “strawberry,” and confidently recommending glue on pizza—each a reminder: it’s patterns in text, not a person.
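To make the metaphor concrete, here is a deliberately crude toy: a bigram counter that always returns the most frequent continuation seen in its corpus. Real LLMs are vastly more sophisticated (learned embeddings, attention, billions of parameters), but the toy captures the article's point that fluent-sounding output can come from pure statistics, with no intent behind it.

```python
# Toy "bag of words" next-word predictor: count bigrams in a tiny corpus and
# always emit the most frequent continuation. Corpus text is made up.
from collections import Counter, defaultdict

corpus = (
    "the cat sat on the mat . the dog sat on the rug . "
    "the cat chased the dog ."
).split()

following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def continue_text(start: str, n_words: int = 6) -> str:
    words = [start]
    for _ in range(n_words):
        options = following.get(words[-1])
        if not options:
            break
        words.append(options.most_common(1)[0][0])  # most statistically likely next word
    return " ".join(words)

print(continue_text("the"))   # plausible-sounding, but nothing "meant" it
```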
Discussion Summary The comment section debated the philosophical and technical validity of the "bag of words" reductionism, with users clashing over whether human cognition differs fundamentally from statistical prediction.
Mechanisms vs. "Magic" A central conflict arose over physical materialism. User kbldfryng challenged the notion that LLMs are incapable of abstract thought while human brains are capable of it, arguing that since brains aren't "magic," both are simply mechanisms. thsz countered with a deep dive into neurobiology, arguing that the brain's complexity (involving DNA, chemical structures, and potentially quantum effects) is orders of magnitude higher than that of current neural networks. Others, like dnlbln, rebutted with the functionality argument: "We didn't understand bird physiology to build a bird... we built planes."
Prediction as Thinking Several users questioned the distinction between "predicting words" and "thinking."
- Human Prediction: User blf argued that humans also act by predicting outcomes based on expectations, suggesting that "predicting the next token" might not be irrelevant to how minds actually work.
- Internal Models: ACCount37 and trvrsd noted that to predict effectively, LLMs build internal representations (embeddings) that act as a "world model," meaning they aren't just retrieving words but translating concepts.
- The Dice Analogy: nkrsc and others offered skepticism, comparing LLMs to shaking a cup of dice: the output may be a number, but the shaking process isn't "thinking."
Embodiment and Learning The comparison between training models and raising children sparked debate. While d-lsp argued that human intelligence is distinct because it is grounded in physical survival and embodiment rather than text ingestion, lstms amusingly noted that children often behave like LLMs—hallucinatory, emotional, and prone to repeating inputs until their brains mature.
Conclusion While some agreed with the author that we shouldn't anthropomorphize statistical models, a significant faction argued that dismissing LLMs as "just prediction" ignores the possibility that prediction is the foundational mechanic of intelligence itself.
Nested Learning: A new ML paradigm for continual learning
Submission URL | 139 points | by themgt | 10 comments
Google Research proposes “Nested Learning,” a continual-learning paradigm that treats a model not as one monolithic learner but as a stack of smaller, nested optimization problems—each with its own “context flow” and update frequency. The authors argue architecture and optimizer are two levels of the same thing, and that giving components different time scales of plasticity (like the brain) can mitigate catastrophic forgetting.
What’s new
- Multi-level learning: Components (e.g., attention, memory, even backprop itself) are framed as associative memory modules that learn at different rates. These are ordered into “levels” via update frequency.
- Unifying view: Training rules and network structure are seen as the same object at different depths, adding a new design dimension: where and how often each part learns.
- Deep optimizers: Reinterpreting optimizers (e.g., momentum) as learnable associative memories; replacing simple dot-product similarity with standard losses (e.g., L2) to make updates more robust to interference across samples.
Claims and early results
- A proof-of-concept, self-modifying architecture (“Hope”) reportedly beats SOTA on language modeling and manages long-context memory better.
- Transformers and memory modules are recast as (essentially) linear layers with different update frequencies, enabling multi-time–scale updates that reduce forgetting.
Why it matters
- Continual learning without catastrophic forgetting is a core blocker for self-updating LLMs. If parts of a model can learn on different time scales, you can acquire new skills while preserving old ones—potentially without heavy rehearsal buffers or brittle architectural hacks.
How it compares (at a high level)
- Related ideas include fast/slow weights, meta-learning, bilevel optimization, learned optimizers, hypernetworks, and memory-augmented models. Nested Learning tries to subsume these under a single optimization-centric lens rather than adding bespoke modules.
Open questions for readers
- Benchmarks and rigor: Which continual-learning suites and long-context tasks were used? How big are the gains and on what scales?
- Stability/cost: Does multi-time–scale updating introduce optimization instability or significant compute overhead?
- Practicality: Can this plug into existing training stacks? Any trade-offs versus retrieval-based memory or rehearsal methods?
- Availability: Paper, code, and reproducibility details?
TL;DR: Treat the network and its training rule as one nested system with components that learn at different speeds. That extra “depth” in where learning happens may curb catastrophic forgetting; an early “Hope” model shows promising long-context and LM results. Worth watching for concrete benchmarks and releases.
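For intuition only, here is a toy sketch (my own illustration, not the paper's Hope architecture) of what "components that learn at different speeds" can mean in code: a fast parameter group updated every step and a slow group updated only every k-th step.

```python
# Toy illustration of multi-time-scale updates: "fast" parameters step every
# batch, "slow" parameters only every k-th batch. Not the paper's method.
import numpy as np

rng = np.random.default_rng(0)
W_fast = rng.normal(size=(8, 8))   # e.g. a memory module: updates every step
W_slow = rng.normal(size=(8, 8))   # e.g. a backbone block: updates every k steps

LR, SLOW_EVERY = 0.01, 16
slow_grad_accum = np.zeros_like(W_slow)

for step in range(1, 1001):
    x = rng.normal(size=(8,))
    y = rng.normal(size=(8,))

    # forward pass and hand-derived gradients for a two-layer linear toy model
    h = W_slow @ x
    err = W_fast @ h - y
    g_fast = np.outer(err, h)
    g_slow = np.outer(W_fast.T @ err, x)

    W_fast -= LR * g_fast                 # high-frequency "level"
    slow_grad_accum += g_slow
    if step % SLOW_EVERY == 0:            # low-frequency "level"
        W_slow -= LR * slow_grad_accum / SLOW_EVERY
        slow_grad_accum[:] = 0.0
```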
Discussion Summary:
The discussion focused on the practical implementation of the "Nested Learning" paradigm and the validity of its claims regarding continual learning.
- Reproduction and Architecture: Users identified a community attempt to reproduce the paper on GitHub. One commenter (NitpickLawyer) theorized that a practical implementation would likely involve freezing a pre-trained Transformer backbone (embeddings, attention blocks, and layer norms) to provide stable representations, while training the specific memory pathways (HOPE, TITAN, and CMS) as adapter-style layers. This approach was praised as a potentially revolutionary way to apply architectural enhancements to existing models without discarding previous training efforts.
- Skepticism: Some participants expressed confusion and doubt. User hvymmry questioned whether the paper was simply "gradient descent wrapped in terminology," asking for clarification on how freezing a model and adding nested components fundamentally solves catastrophic forgetting in practice.
- Context: pnrchy noted that the concept of heterogeneous architectures, where a meta-network optimizes specific tasks, has felt "self-evident" since 2019, implying the field has been moving in this direction for some time.
- Resources: A link was provided to a video by author Ali Behrouz explaining the concept as part of a NeurIPS 2025 presentation.
(Note: A significant portion of the conversation was a tangent about an unrelated NVIDIA paper combining diffusion and autoregression, triggered by similar naming conventions.)
Google Titans architecture, helping AI have long-term memory
Submission URL | 556 points | by Alifatisk | 177 comments
Google Research: Titans + MIRAS bring true long-term memory to AI by learning at inference time
- What’s new: Google introduces Titans (an architecture) and MIRAS (a theoretical framework) to let models update their own long‑term memory while they’re running. Goal: RNN‑like speed with transformer‑level accuracy on massive contexts.
- Why it matters: Transformers slow down quadratically with context length; linear RNNs/SSMs are fast but bottlenecked by fixed‑size states. Titans adds a much more expressive long‑term memory (a deep MLP) that can be updated on the fly, aiming for fast, scalable full‑document or streaming understanding without offline retraining.
- How it works (a toy sketch follows the summary):
- Two memories: attention for precise short‑term recall; a neural long‑term memory (MLP) that compresses and synthesizes the past, whose summary is fed back into attention.
- Surprise‑gated updates: the model uses its own gradient magnitude as a “surprise” signal to decide when to commit new information to long‑term memory.
- Momentum and forgetting: it smooths surprise over recent tokens (to catch follow‑ups) and uses adaptive weight decay as a forgetting gate to manage capacity.
- MIRAS unifies sequence models as associative memory systems, framing design choices like memory architecture and attentional bias (and related axes) under one blueprint.
- The HN angle: inference‑time learning/test‑time memorization without fine‑tuning, a potential path past context windows; blends the “RNNs are back” efficiency trend with transformer strengths.
- Open questions: stability and safety of on‑the‑fly parameter updates, interference vs. retention over very long streams, serving complexity and latency, and how results compare on standard long‑context benchmarks.
Papers: Titans and MIRAS are linked from the post.
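For intuition on the surprise‑gated memory mentioned above, here is a toy sketch. It is my own illustration of the idea, not the published Titans update rule, and every constant in it is made up: the memory is a small predictor, its loss-gradient norm serves as the surprise signal, and weight decay acts as the forgetting gate.

```python
# Toy sketch of surprise-gated memory writes: the memory is a linear map that
# tries to predict a "value" from a "key"; a large gradient (surprise) triggers
# a write, and weight decay slowly forgets. Illustrative only, not Titans.
import numpy as np

D = 16
memory = np.zeros((D, D))            # long-term memory parameters
momentum = np.zeros_like(memory)     # smooths surprise over recent tokens

LR, BETA, DECAY, SURPRISE_THRESH = 0.1, 0.9, 0.01, 0.5

def memory_step(key: np.ndarray, value: np.ndarray) -> float:
    """Update the memory on one (key, value) pair; return the measured surprise."""
    global memory, momentum
    err = memory @ key - value                 # prediction error on this token
    grad = np.outer(err, key)                  # gradient of 0.5*||err||^2 w.r.t. memory
    surprise = float(np.linalg.norm(grad))     # gradient magnitude as the "surprise" signal

    momentum = BETA * momentum + (1 - BETA) * grad
    if surprise > SURPRISE_THRESH:             # only commit genuinely new information
        memory = (1 - DECAY) * memory - LR * momentum   # forgetting gate + write
    return surprise

rng = np.random.default_rng(1)
for _ in range(100):
    k, v = rng.normal(size=D), rng.normal(size=D)
    memory_step(k, v)
```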
The Discussion:
- Research vs. Replica: A major thread of criticism focuses on Google publishing theoretical papers without releasing code or weights. Commenters contrast this with the ecosystems around Meta (Llama) and DeepSeek, where "open" often means usable. Users express frustration that while the architecture is 11 months old in concept, the lack of an official implementation makes it difficult to verify performance against established baselines like Mamba or existing Transformers.
- The "Google Paradox": The discussion reignites the trope that Google excels at inventing core technologies (like the original Transformer, Hadoop equivalents, or Docker-style containers) but fails to productize them effectively. Skeptics suggest these papers often serve internal promotion metrics ("performance review engineering") rather than signaling an actual product shift, though some speculate that Gemini 3 may already be utilizing this architecture under the hood.
- The Scaling Wall: Several users point out the "path dependency" problem: it is nearly impossible for independent researchers to verify if Titans is actually superior to Transformers without access to massive compute for scaling. There is a sense that new architectures are validated only by those with the budget to train 30B+ parameter models, making the paper theoretically interesting but practically unverifiable for the broader community.
- Product Design vs. Model Spending: A sidebar discussion argues that the industry is focusing too heavily on sinking capital into model benchmarks rather than product design. The argument is that long-term memory is useful, but the "winner" will likely be whoever builds focused, specific tools that solve user problems, rather than just raw general-purpose reasoning engines.
Using LLMs at Oxide
Submission URL | 682 points | by steveklabnik | 268 comments
Oxide’s Bryan Cantrill has a values-first playbook for how the company will use LLMs—less a static policy, more a rubric for judgment as the tech shifts.
What guides usage (in priority order):
- Responsibility: Humans own the work. Using an LLM never dilutes personal accountability for code, docs, tests, or prose.
- Rigor: LLMs should sharpen thinking, not replace it with auto-generated fluff.
- Empathy: There’s a human on the other end of every sentence—write and read with that in mind.
- Teamwork: Don’t erode trust. Simply disclosing “AI was used” can become a crutch that distances authors from their work.
- Urgency: Speed matters, but not at the expense of the above.
Where LLMs shine (and don’t):
- As readers: Excellent at instant comprehension, summarizing, and targeted Q&A over long docs (even good at spotting LLM-written text). Privacy matters: hosted tools often default to training on your uploads—watch those settings and euphemisms like “Improve the model for everyone.”
- As editors: Useful late in the process for structure and phrasing. Beware sycophancy and being steered off your voice if used too early.
- As writers: The weakest use. Output tends to be cliché-ridden with recognizable tells—embarrassing to savvy readers and corrosive to trust and responsibility.
A key caution: don’t use LLMs to dodge socially expected reading (e.g., evaluating candidate materials). The throughline: treat LLMs as potent tools for comprehension and critique, not as a substitute for your own judgment, voice, and ownership.
Discussion Summary
The discussion centers on the tension between engineering craftsmanship and the practical utility of LLMs, with specific anxiety regarding skill development for junior developers.
- Junior Engineers and Skill Acquisition: Commenters expressed concern that while senior engineers (like Cantrill) have the deep experience to use LLMs as a "force multiplier," junior engineers might use them as a crutch, bypassing the struggle necessary to build fundamental intuition. Users debated whether juniors need to "memorize multiplication tables" (syntax and boilerplate) or if LLMs simply remove the drudgery of tasks like data imports and messy parsing, allowing focus on higher-level logic.
- The Dreamweaver Analogy: A significant portion of the thread drew parallels between LLMs and early WYSIWYG editors like Adobe Dreamweaver or Microsoft FrontPage. Just as those tools lowered the barrier to entry but produced bloated, unsemantic HTML that professionals had to clean up, users fear LLMs are generating "good enough" code that is verbose, hard to maintain, and riddled with subtle bugs.
- Craft vs. Factory: The conversation highlighted a divide between "craftsmen" (who value clean, maintainable, understanding-based code) and specific "factory" contexts (agencies or startups where speed and "shipped" status outweigh code quality).
- Validation Mechanisms: Several users noted that LLMs excel in areas with unambiguous validation mechanisms—such as generating regex, security POCs, or strictly defined data schemas—where the output works or it doesn't. They struggle, however, in areas requiring architectural judgment or nuance, where verifying the output can be more mentally taxing than writing the code from scratch.
Over fifty new hallucinations in ICLR 2026 submissions
Submission URL | 487 points | by puttycat | 399 comments
GPTZero claims 1 in 6 ICLR 2026 submissions it scanned contain fake citations, and reviewers mostly missed them
- What happened: GPTZero ran its Hallucination Check on 300 ICLR 2026 submissions on OpenReview and found 50 with at least one “obvious hallucination” in the references. Many of those papers had already been reviewed by 3–5 experts who didn’t flag the issue; some carried average scores of 8/10 (normally accept). ICLR policy says a single clear hallucination can be an ethics violation leading to rejection.
- What “hallucination” looked like: fabricated coauthors on real papers, nonexistent references, wrong venues/years/titles, bogus or mismatched arXiv IDs. GPTZero posted a table of 50 human-verified examples. (A sketch of a mechanical arXiv-ID check follows this list.)
- Scale: They scanned 300 of ~20,000 submissions and estimate “hundreds” more will surface as they continue. They’re also taking suggestions for specific papers to check.
- Why it matters: If accurate, even top-tier venues are struggling to catch LLM-induced sloppiness or fabrication in citations, adding pressure to an already overloaded peer-review pipeline and risking contamination of the scholarly record.
- Caveats: GPTZero sells detection tools (conflict of interest), the sampling method isn’t clear, and the false-positive rate isn’t reported. Some flagged issues (e.g., partially wrong author lists) may reflect sloppy citation rather than wholesale fabrication. Final acceptance decisions are still pending.
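At least for the mismatched-arXiv-ID class of errors, no LLM is needed to catch the problem; a deterministic cross-check against arXiv's public API will do. A rough sketch follows (not GPTZero's method; the usage line uses a well-known paper purely as an illustration, not a citation from any submission):

```python
# Rough sketch: verify that a cited arXiv ID actually resolves to the cited
# title, using arXiv's public Atom API. Not GPTZero's method; illustrative only.
import difflib
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def arxiv_title(arxiv_id: str) -> str | None:
    """Fetch the title arXiv reports for this ID, or None if no entry comes back."""
    url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        feed = ET.fromstring(resp.read())
    entry = feed.find(f"{ATOM}entry")
    if entry is None:
        return None
    title = entry.find(f"{ATOM}title")
    return " ".join(title.text.split()) if title is not None and title.text else None

def citation_looks_ok(arxiv_id: str, cited_title: str, min_ratio: float = 0.8) -> bool:
    real = arxiv_title(arxiv_id)
    if real is None:
        return False                      # no entry at all: clear red flag
    ratio = difflib.SequenceMatcher(None, real.lower(), cited_title.lower()).ratio()
    return ratio >= min_ratio             # low similarity (or an error entry) suggests a bogus ID

# Illustrative usage; in practice, substitute IDs and titles pulled from a reference list.
print(citation_looks_ok("1706.03762", "Attention Is All You Need"))
```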
Here is a summary of the discussion:
Is this fraud or just the new normal? While most commenters agreed that hallucinated citations constitute "gross professional misconduct," several users, including mike_hearn, argued that academic citations were already broken. They pointed to the pre-LLM prevalence of "citation bluffing" (citing real papers that do not actually support the claim) and "non-reading," suggesting that LLMs are merely accelerating an existing crisis of integrity and sloppiness in scientific literature.
The burden on reviewers Self-identified reviewers noted that the peer-review system relies heavily on a presumption of good faith. User andy99 explained that reviewers act as "proofreaders checking for rigor," not private investigators; verifying every single reference manually is untenable given current workloads. Others argued that if a "single clear hallucination" is grounds for rejection, tools like GPTZero or other LLM-based checkers are becoming necessary infrastructure, much like syntax checkers.
The "Carpenter" Analogy User thldgrybrd offered a popular analogy: A carpenter who builds a shelf that collapses because they used their tools incorrectly is simply a "bad carpenter." Similarly, a scientist who uses an LLM to generate text and fails to catch fabricated data is guilty of negligence and is effectively a "bad scientist," regardless of the tool used.
Debate on demographics and bias A contentious thread emerged regarding the cultural origins of the submissions. Some users attempted to link the fraud to "low-trust societies" or specific nationalities (referencing Middle Eastern or Chinese names). This was met with strong pushback from others who pointed out that ICLR submissions are double-blind (reviewers cannot see author names). Furthermore, users noted that the "names" visible in the GPTZero examples were often part of the hallucinations themselves, not the actual authors of the paper.
Summary of Sentiment: The community sees this as a symptom of "lazy" science meeting powerful tools. While there is sympathy for overloaded reviewers, the consensus is that using AI to fabricate the scholarly record is an ethical breach that requires new automated detection methods, as human oversight is no longer sufficient.
OpenAI disables ChatGPT app suggestions that looked like ads
Submission URL | 67 points | by GeorgeWoff25 | 55 comments
OpenAI disables ChatGPT “app suggestions” after users mistake them for ads
- What happened: Paying ChatGPT users reported prompts that looked like ads for brands like Peloton and Target. OpenAI says these were experimental suggestions to surface third‑party apps built on the ChatGPT platform—not paid placements—but conceded the rollout “felt like ads.”
- OpenAI’s response: Chief Research Officer Mark Chen apologized, saying the team “fell short,” has turned off the feature, and will improve precision and add controls so users can dial suggestions down or off. ChatGPT head Nick Turley said there are “no live tests for ads” and that any screenshots weren’t advertisements.
- Context: Speculation about an ads push grew after OpenAI hired Fidji Simo as CEO of Applications. But a reported “code red” from CEO Sam Altman prioritizes core ChatGPT quality over new initiatives like advertising.
Why it matters:
- Blurred lines between recommendations and advertising can quickly erode user trust—especially among paying subscribers.
- Clear labeling, opt‑outs, and precision targeting will be essential if AI assistants surface third‑party experiences.
- Signals a near‑term strategic pivot toward product reliability over monetization experiments.
Summary of Discussion:
The discussion on Hacker News reflects deep skepticism regarding OpenAI’s claim that these were merely "app suggestions," with the majority of commenters viewing the move as the inevitable arrival of advertising on the platform.
Skepticism of the "Not-an-Ad" Defense
- Commenters overwhelmingly rejected the distinction between "app suggestions" and advertisements. Many argued that regardless of technical semantics, unwanted commercial prompts for third-party brands (like Peloton) constitute advertising.
- Users pointed out that "suggestion" features often function as the groundwork for ad infrastructure, suspecting that OpenAI is testing the technical plumbing for a future ad network while publicly denying it.
- The specific suggestion of Peloton drew mockery, with users criticizing the relevance of the brand and noting its declining stock performance, further fueling the perception that this was a paid placement rather than a useful organic suggestion.
Erosion of Trust and "Enshittification"
- There is significant distrust regarding OpenAI’s transparency. Comments described the executive response ("we fell short") as empty corporate platitudes and expressed doubt regarding the statement that there are "no live tests for ads."
- The community fears a rapid "enshittification" of the platform. Drawing comparisons to Google Search and streaming services (Netflix), users argued that high utility usually degrades into ad-bloat over time.
- A major concern is "Chatbot Optimization"—the idea that future answers will be biased toward paying brands rather than factual accuracy, rendering the tool less useful for information retrieval.
Monetization of Paid Tiers
- A heated debate emerged regarding the sanctity of paid subscriptions. While some users felt betrayed that a paid service would show ads, others argued that the "paid-plus-ads" model is the new industry standard (referencing streaming services).
- Commenters noted that the high inference costs of LLMs make ads inevitable, even for subscribers. Some speculated that OpenAI’s "vertical integration" of apps is simply a way to monetize their highly valuable, high-income user base.
Privacy and Profiling
- Users highlighted the unique danger of ads in LLMs, noting that ChatGPT builds detailed psychometric profiles and fingerprints of its users. This makes the potential for targeted manipulation much higher than in traditional search or social media advertising.