Hacker News
Daily AI Digest

Welcome to the Hacker News Daily AI Digest, where you will find a daily summary of the latest and most intriguing artificial intelligence news, projects, and discussions among the Hacker News community. Subscribe now and join a growing network of AI enthusiasts, professionals, and researchers who are shaping the future of technology.

Brought to you by Philipp Burckhardt

AI Submissions for Wed Dec 03 2025

Submission URL | 761 points | by bearsyankees | 266 comments

Filevine bug exposed full admin access to a law firm’s Box drive via an unauthenticated API; fixed after disclosure

A security researcher probing AI legal-tech platform Filevine found that a client-branded subdomain with a stuck loading screen leaked clues in its minified frontend JavaScript. Those pointed to an unauthenticated “recommend” endpoint on an AWS API Gateway. Hitting it returned a Box access token and folder list—no auth required. The token was a fully scoped admin credential for the firm’s entire Box instance, implying potential access to millions of highly sensitive legal documents. After a minimal impact check, the researcher stopped and disclosed.
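For illustration, here is a minimal sketch of the class of fix this implies—keep the provider credential server-side, authenticate every request, and return only the data the UI needs. It assumes a Python/FastAPI backend; the /recommendations route and the BOX_TOKEN variable are hypothetical stand-ins, not Filevine's actual implementation.

```python
# Hypothetical sketch: keep the Box credential server-side and require auth,
# instead of handing an admin-scoped token to the browser.
import os
import requests
from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()
BOX_TOKEN = os.environ["BOX_TOKEN"]  # ideally a least-privilege, short-lived token

def require_user(authorization: str = Header(default="")) -> str:
    # Placeholder check; a real service would validate a session or OAuth token here.
    if not authorization.startswith("Bearer "):
        raise HTTPException(status_code=401, detail="authentication required")
    return authorization.removeprefix("Bearer ")

@app.get("/recommendations")
def recommendations(user: str = Depends(require_user)) -> dict:
    # The Box call happens server-side; the raw token never reaches the client.
    resp = requests.get(
        "https://api.box.com/2.0/folders/0/items",
        headers={"Authorization": f"Bearer {BOX_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json().get("entries", [])
    # Return only what the UI needs, never the credential or the full listing.
    return {"folders": [{"id": i["id"], "name": i["name"]} for i in items[:20]]}
```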

Timeline: discovered Oct 27, 2025 → acknowledged Nov 4 → fix confirmed Nov 21 → writeup published Dec 3. The researcher says Filevine was responsive and professional. The affected subdomain referenced “margolis,” but the firm clarifies it was not Margolis PLLC.

Why it matters:

  • Returning cloud provider tokens to the browser and leaving endpoints unauthenticated is catastrophic in legal contexts (HIPAA, court orders, client privilege).
  • AI vendors handling privileged data must enforce strict auth on every API, use least-privilege/scoped tokens, segregate tenants, and avoid exposing credentials client-side.
  • Law firms should rigorously vet AI tools’ security posture before adoption.

HN discussion is active.

Based on the comments, the discussion centers on the severity of the oversight, the viability of software regulations, and a debate on whether AI ("vibe coding") will solve or exacerbate these types of security failures.

Human Impact and Severity The top thread emphasizes the catastrophic real-world consequences of such a breach. Users construct hypothetical scenarios—such as a single mother in a custody battle being blackmailed with leaked documents—to illustrate that this is not just a technical failing but a human safety issue. Comparisons are drawn to the Vastaamo data breach in Finland (where psychotherapy notes were used for extortion), with users noting that the use of unverified, unencrypted ("http-only") endpoints makes data trivial to intercept.

Regulation vs. Market Correction A debate emerges regarding the "Industrialization" of code quality:

  • The "Building Inspector" Argument: The root commenter argues that software handling sensitive data needs mandatory "building codes" and inspections, similar to physical construction, arguing that safety and privacy shouldn't be optional features.
  • The Counter-Argument: Skeptics argue that software has too many degrees of freedom compared to physical buildings for rigid codes to work. They suggest that the private market—specifically professional liability insurers and the threat of lawsuits—is better equipped to enforce security standards than government bureaucracy.

The "Vibe Coding" / AI Debate A significant portion of the discussion deviates into whether Generative AI coding is to blame or is the solution:

  • Crucial Context Missing: Critics of AI coding argue that Large Language Models (LLMs) lack the "context window" to understand system-wide security. While an AI can write a function, it cannot "keep the whole system in its head," leading to hallucinations regarding API security and authentication logic that human architects usually catch.
  • Human Error: Others counter that humans clearly don't need AI to make catastrophic mistakes (citing a history of open S3 buckets). Some predict that within two years, AI coding systems will likely be more secure than the bottom 90% of human developers, characterizing human devs as having "short-term memory" limitations similar to LLMs.

Everyone in Seattle hates AI

Submission URL | 874 points | by mips_avatar | 929 comments

Everyone in Seattle Hates AI (Dec 3, 2025)

A former Microsoft engineer building an AI map app (Wanderfugl) describes surprising hostility to AI among Seattle big‑tech engineers—rooted not in the tech itself but in culture, layoffs, and forced tooling.

Key points:

  • A lunch with a respected ex-coworker turned into broad frustration about Microsoft’s AI push, not the author’s product. Similar reactions kept repeating in Seattle, unlike in SF, Paris, Tokyo, or Bali.
  • Layoffs and mandates: a director reportedly blamed a PM’s layoff on “not using Copilot 365 effectively.” After the 2023–24 layoff wave, cross-org work was axed; the author went from shipping a major Windows 11 improvement to having no projects and quit.
  • “AI or bust” rebrand: teams that could slap an AI label became safe and prestigious; others were devalued overnight as “not AI talent.”
  • Forced adoption: Copilot for Word/PowerPoint/email/code was mandated even when worse than existing tools or competitors; teams couldn’t fix them because it was “the AI org’s turf.” Employees were expected to use them, fail to see gains, and stay quiet.
  • Protected AI teams vs. stagnating comp and harsher reviews for everyone else bred resentment. Amazon folks feel it too, just cushioned by pay.
  • Result: a self-reinforcing belief that AI is both useless and off-limits—hurting companies (less innovation), engineers (stalled careers), and local builders (reflexive hostility).
  • Contrast: Seattle has world-class talent, but SF still believes it can change the world—and sometimes does.

Anecdotal but sharp cultural critique of Big Tech’s AI mandates and morale fallout.

Here is a summary of the discussion:

Discussion: The Roots of AI Hostility—Corporate Coercion, Centralization, and Quality

Commenters largely validated the submission's critique of Microsoft's internal culture while expanding the debate to include broader dissatisfaction with how AI is being integrated into the tech industry.

  • Corporate Toxicity & Forced Metrics: Several users corroborated the "toxic" enforcement of AI at Microsoft, noting that performance reviews are sometimes explicitly linked to AI tool usage. Critics argued this forces engineers to prioritize management metrics over product quality or efficiency, leading to resentment when "insane" mandates force the use of inferior tools.
  • Centralization vs. Open Source: A major thread debated the "centralization of power." Users expressed fear that Big Tech is turning intelligence into a rent-seeking utility (likened to the Adobe subscription model) rather than a tool for empowerment. While some argued that open-weight models and local compute offer an escape, others countered that the astronomical hardware costs (GPUs, energy) required for flagship-level models inevitably force centralization similar to Bitcoin mining or Search Engine indexing.
  • The "Meaning" Crisis: A recurring sentiment was that AI is automating the "fun" and meaningful parts of human activity (art, writing, coding logic) while leaving humans with the "laundry and dishes." Users worried this removes the satisfying struggle of work and pulls the ladder up for junior employees who need those lower-level tasks to learn.
  • Skepticism on Quality ("AI Asbestos"): Pushing back against the idea that people feel "threatened," many argued they mainly reject AI because current implementations simply don't work well. One user coined the term "AI Asbestos"—a toxic, cheap alternative to valuable work that solves problems poorly and requires expensive cleanup (e.g., spending more time fixing an AI meeting summary than it would take to write one manually).

Zig quits GitHub, says Microsoft's AI obsession has ruined the service

Submission URL | 1022 points | by Brajeshwar | 595 comments

Zig quits GitHub over Actions reliability, cites “AI over everything” shift; moves to Codeberg

  • What happened: The Zig Software Foundation is leaving GitHub for Codeberg. President Andrew Kelley says GitHub no longer prioritizes engineering excellence, pointing to long‑standing reliability problems in GitHub Actions and an org-wide pivot to AI.

  • The bug at the center: A “safe_sleep.sh” script used by GitHub Actions runners could spin forever and peg CPU at 100% if it missed a one‑second timing window under load. Zig maintainers say this occasionally wedged their CI runners for weeks until manual intervention. A sketch of the failure pattern follows this list.

    • Origin: A 2022 change replaced POSIX sleep with the “safe_sleep” loop.
    • Discovery: Users filed issues over time; a thread opened April 2025 highlighted indefinite hangs.
    • Fix: A platform‑independent fix proposed Feb 2024 languished, was auto‑closed by a bot in March 2025, revived, and finally merged Aug 20, 2025.
    • Communication gap: The April 2025 thread remained open until Dec 1, 2025, despite the August fix. A separate CPU-usage bug is still open.
  • “Vibe‑scheduling”: Kelley alleges Actions unpredictably schedules jobs and offers little manual control, causing CI backlogs where even main branch commits go untested.

  • Outside voices: Jeremy Howard (Answer.AI/Fast.ai) said the bug “very obviously” burns CPU and runs indefinitely unless it checks the time “during the correct second,” arguing the chain of events reflects poorly on process and review.

  • Broader shift away from GitHub: Dillo’s maintainer also plans to leave, citing JS reliance, moderation gaps, service control risk, and an “over‑focus on LLMs.”

  • Follow the incentives: Microsoft has leaned hard into Copilot—1.3M paid Copilot subscribers by Q2 2024; 15M Copilot users by Q3 2025—with Copilot driving a big chunk of GitHub’s growth. Critics see this as evidence core platform reliability has taken a back seat.
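To make the failure mode concrete, here is a minimal Python sketch of the pattern described above—not GitHub's actual shell script: an equality check against the clock means the loop only exits if it happens to observe the exact target second, so a runner descheduled past that one-second window busy-waits forever.

```python
import time

def safe_sleep_buggy(seconds: int) -> None:
    """Sketch of the reported failure pattern: exit only when the clock reads
    exactly the target second. Miss that one-second window (e.g., under heavy
    load) and the condition never holds again, so the loop spins at 100% CPU."""
    end = int(time.time()) + seconds
    while True:
        if int(time.time()) == end:  # equality check is the problem
            return

def safe_sleep_fixed(seconds: float) -> None:
    """A tolerant version: comparing against the deadline means overshooting still terminates."""
    end = time.time() + seconds
    while time.time() < end:
        time.sleep(0.1)
```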

Why it matters

  • CI reliability is existential for language/tooling projects; weeks‑long runner stalls are untenable.
  • The episode highlights tension between AI product pushes and maintenance of dev‑infra fundamentals.
  • Alternatives like Codeberg are gaining momentum (supporting members doubled this year), hinting at a potential slow drift of OSS projects away from GitHub if trust erodes.

GitHub did not comment at time of publication.

Based on the comments provided, the discussion on Hacker News focused less on the technical migration to Codeberg and more on the tone and subsequent editing of Andrew Kelley's announcement post.

The Revisions to the Announcement

  • The "Diff": Users spotted that the original text of the post was significantly more aggressive. One archived draft described the situation as talented people leaving GitHub, with the "remaining losers" left to inflict a "bloated buggy JavaScript framework" on users. A later edit softened this to state simply that "engineering excellence" was no longer driving GitHub’s success.
  • Professionalism vs. Raw Honesty: Several commenters felt the original "losers" remark was childish, unnecessarily personal, and unprofessional. User serial_dev found the updated, professional phrasing "refreshing," while y noted that publishing personal insults like "monkeys" or "losers" undermines the author's position.
  • Motivation for the Change: There was debate over why Kelley edited the post.
    • Optimistic view: Some saw it as a genuine "mea culpa" (stynx) and a sign of learning from feedback (dnnrsy), arguing that people should be allowed to correct mistakes without being "endlessly targeted."
    • Cynical view: Others viewed it as "self-preservation" (snrbls) or "corporate speak" (vks) to save face after backlash, rather than a true change of heart.

Broader Philosophical Debate: Changing One's Mind

  • The incident sparked a sidebar conversation about the nature of backtracking in public communication, comparing it to politicians "flip-flopping."
  • The "Waffle" accusation: Commenters discussed the tension between accusing leaders of "waffling" (chrswkly) versus the virtue of adapting opinions based on new information or feedback (ryndrk).
  • Context Matters: Ideally, a leader changes their mind due to reason, but in this context, some suspected the edit was simply a "PR policy" move to avoid "getting canceled" rather than an actual retraction of the sentiment that GitHub's current staff is incompetent (a2800276).

Are we repeating the telecoms crash with AI datacenters?

Submission URL | 218 points | by davedx | 187 comments

The post argues the oft-cited analogy breaks once you look at the supply/demand mechanics and the capex context.

What actually happened in telecoms

  • 1995–2000: $2T spent laying 80–90M miles of fiber ($4T in today’s dollars; nearly $1T/year).
  • By 2002, only 2.7% of that fiber was lit.
  • Core mistake: demand was misread. Executives pitched traffic doubling every 3–4 months; reality was closer to every 12 months—roughly a 4x overestimate of the doubling rate, and the error compounded.
  • Meanwhile, supply exploded: WDM jumped from 4–8 carriers to 128 by 2000; modulation/error-correction gains and higher bps per carrier yielded orders-of-magnitude more capacity on the same glass. Net effect: exponential supply, merely linear demand → epic overbuild.
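A quick back-of-the-envelope check of how that misread compounds, using only the figures above (forecast doubling every 3–4 months versus actual doubling every 12 months):

```python
# Compounding gap between forecast and actual traffic growth (illustrative).
for months_per_double in (3, 4):
    forecast_per_year = 2 ** (12 / months_per_double)  # 16x or 8x per year
    actual_per_year = 2.0                              # doubling every 12 months
    for years in (1, 3, 5):
        gap = (forecast_per_year / actual_per_year) ** years
        print(f"doubling every {months_per_double} months, after {years}y: "
              f"forecast exceeds reality by {gap:,.0f}x")
```

Even the milder 4x-per-year misread balloons to a roughly 1,000x error over five years—which is how the buildout ended with ~97% of the fiber dark.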

Why AI infrastructure is different

  • Efficiency curve is slowing, not exploding:
    • 2015–2020 saw big perf/W gains (node shrinks, tensor cores).
    • 2020–2025 ~40%/yr ML energy-efficiency gains; EUV-era node progress is harder.
  • Power/cooling is going up, not down:
    • GPU TDPs: V100 300W → A100 400W → H100 700W → B200 1000–1200W.
    • B200-class parts need liquid cooling; many air-cooled DCs require costly retrofits.
  • Translation: we’re not on a curve where tech makes existing capacity instantly “obsolete” the way fiber did.

Demand looks set to accelerate, not disappoint

  • Today’s chat use can be light (many short, search-like prompts), but agents change the curve:
    • Basic agents: ~4x chat tokens; multi-agent: ~15x; coding agents: 150k+ tokens per session, multiple times daily.
    • A 10x–100x per-user token step-up is plausible as agents mainstream; a rough worked example follows this list.
  • Hyperscalers already report high utilization and peak-time capacity issues; the problem isn’t idle inventory.
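Rough arithmetic behind that 10x–100x claim (the ~5,000 tokens/day chat baseline below is an assumed figure for illustration, not from the article):

```python
# Illustrative per-user daily token demand under the agent multipliers cited above.
chat_baseline = 5_000  # assumed tokens/day for a light chat user

scenarios = {
    "basic agent (~4x chat)": 4 * chat_baseline,
    "multi-agent (~15x chat)": 15 * chat_baseline,
    "coding agent (150k/session, ~3 sessions/day)": 150_000 * 3,
}

for name, tokens in scenarios.items():
    print(f"{name}: {tokens:,} tokens/day ({tokens / chat_baseline:.0f}x baseline)")
```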

Capex context

  • Pre-AI (2018→2021): Amazon/Microsoft/Google capex rose from $68B to $124B (~22% CAGR) on cloud/streaming/pandemic demand.
  • AI boom: 2023 $127B → 2024 $212B (+67% YoY) → 2025e $255B+ (AMZN ~$100B, MSFT ~$80B, GOOG ~$75B).
  • Some “AI” capex is rebranded general compute/network/storage, but the step-up is still large—just not telecom-fiber large.

Forecasting is the real risk

  • Lead times: 2–3 years to build datacenters; 6–12 months for GPUs. You can’t tune capacity in real time.
  • Prisoner’s dilemma: underbuild and lose users; overbuild and eat slower payback. Rational players shade toward overbuilding.

Bottom line

  • The telecom bust hinged on exploding supply making existing fiber vastly more capable while demand lagged. In AI, efficiency gains are slowing, power/cooling constraints are tightening, and agent-driven workloads could push demand up 10x–100x per user.
  • The analogy is weak on fundamentals. That said, long lead times and competitive dynamics still make local gluts and corrections likely—even if this isn’t a fiber-style wipeout.

Here is a summary of the discussion:

Pricing Power and Consumer Surplus A central point of debate concerns the current and future pricing of AI services. While some users agree with the premise that services are currently underpriced to get customers "hooked"—predicting future price hikes (potentially up to $249/month) similar to how internet or utility providers operate—others push back. Skeptics argue that because model performance is converging and high-quality free or local alternatives exist, a massive price hike would simply cause users to churn or revert to "lazy" Google searches.

Conversely, users highlighted the immense value currently provided at the ~$20/month price point. One user noted that ChatGPT effectively replaces hundreds of dollars in professional fees by analyzing complex documents (like real estate disclosures and financial statements) and writing boilerplate code.

The "Broadband Curve" vs. The App Store Discussing the article's supply/demand analysis, commenters suggested that a better analogy than the "App Store" is the broadband adoption curve. The argument is that we are currently in the infrastructure build-out phase, while the "application layer" (comparable to the later explosion of SaaS) has not yet matured. Users criticized the current trend of simply "shoving chat interfaces" onto existing products, noting that true AI-native UX (citing Adobe’s integration as a positive example) is still rare.

Corporate Demand: Mandates vs. "Shadow AI" There is disagreement on the nature of corporate demand. Some view high utilization rates as artificial, driven by executives mandating AI usage to justify infrastructure costs. Others counter that the market is distorted by "Shadow AI"—employees secretly using generative tools to increase their own efficiency and free up time, regardless of official company policy.

Vendor Loyalty and Migration Commenters expressed frustration with big tech incumbents. One user detailed their company’s decision to leave Google Workspace due to rising prices paired with "garbage" AI features (Gemini) and poor admin tools. However, others noted that switching providers for LLMs is currently "extremely easy," suggesting that infrastructure providers may lack the stickiness or "moat" they enjoyed in the cloud era.

Prompt Injection via Poetry

Submission URL | 82 points | by bumbailiff | 34 comments

  • A new study from Icaro Lab (Sapienza University + DexAI) claims that rephrasing harmful requests as poetry can bypass safety guardrails in major chatbots from OpenAI, Anthropic, Meta, and others.
  • Across 25 models, hand-crafted poetic prompts achieved an average 62% jailbreak success rate (up to 90% on some frontier models); automated “poetic” conversions averaged ~43%, still well above prose baselines.
  • The researchers withheld actionable examples but shared a sanitized illustration and said they’ve notified vendors; WIRED reported no comment from the companies at publication.
  • Why it works (hypothesis): style shifts (metaphor, fragmented syntax, unusual word choices) can move inputs away from keyword-based “alarm regions” used by classifiers, exposing a gap between models’ semantic understanding and their safety wrappers.
  • Context: Prior work showed long jargon-laden prompts could also evade filters. This result suggests guardrails remain brittle to stylistic variation, not just content.

Why it matters: If true, this is a simple, single-turn jailbreak class that generalizes across vendors, underscoring the need for safety systems that are robust to paraphrase and style—not just keyword or surface-pattern checks.

Here is a summary of the discussion:

The Mechanics of the Exploit A significant portion of the discussion focused on why this jailbreak works. Commenters compared the vulnerability to "Little Bobby Tables" (SQL injection), suggesting that current safety guardrails function more like brittle keyword blacklists than structural protections.

  • Vector Space Theory: Users theorized that safety classifiers are trained primarily on standard English prose. By using poetry, the input shifts into high-dimensional vector spaces (or "out-of-distribution" regions) that the safety filters do not monitor, even though the underlying model still understands the semantic meaning. In effect, one commenter noted, this acts like automated "fuzzing."
  • Lack of Understanding: Several users argued that because LLMs do not truly "understand" concepts but rather predict tokens based on statistics, patching these exploits is a game of "whack-a-mole"—fixing one requires blacklisting specific patterns, leaving infinite other variations open.

Can Humans be Hacked by Poetry? A specific user question—"You can't social engineer a human using poetry, so why does it work on LLMs?"—sparked a debate about human psychology.

  • Arguments for "Yes": Many users argued that humans are susceptible to stylistic manipulation. Examples cited included courtship (using flowery language to bypass romantic defenses), political rhetoric/propaganda (patriotism overriding logic), and "Hallmark cards." One user presented a hypothetical scenario of a soldier being charmed into revealing secrets via romance.
  • Arguments for "No": Others maintained that while humans can be persuaded, it isn't a mechanical failure of a safety filter in the same way it is for an LLM.

Anecdotes and Practical Application Users shared their own experiences bypassing filters, particularly with image generators (DALL-E):

  • One user successfully generated copyrighted characters (like Mario) by describing them generically ("Italian plumber," "Hello Kitty fan") rather than using names.
  • Another user bypassed a filter preventing images of "crying people" by requesting a "bittersweet" scene instead.

Skepticism and Humor

  • Some questioned the novelty of the study, suggesting this is a known form of prompt injection rather than a new discovery.
  • Jokes abounded regarding the Python package manager also named poetry, the "wordcel vs. shape rotator" meme, and the mental image of William Shakespeare wearing a black hat.

Anthropic taps IPO lawyers as it races OpenAI to go public

Submission URL | 350 points | by GeorgeWoff25 | 290 comments

Anthropic reportedly hires IPO counsel, upping the ante with OpenAI

  • What happened: The Financial Times reports Anthropic has engaged capital-markets lawyers to prepare for a potential IPO, a step that typically precedes drafting an S-1 and cleaning up governance and cap-table complexities. It positions Anthropic as a likely early AI-lab candidate for the public markets alongside OpenAI.

  • Why it matters: An Anthropic listing would be the first major pure-play frontier-model IPO, testing investor appetite for AI labs with huge compute costs and rapid revenue growth. An S-1 could finally reveal hard numbers on unit economics, cloud spend, and safety/governance commitments—setting a benchmark for the sector.

  • The backdrop: Anthropic has raised many billions from strategic partners (notably Amazon and Google) and is shipping Claude models into enterprise stacks. Going public could provide employee liquidity, fund the next compute wave, and formalize governance structures (e.g., long-term safety oversight) under public-market scrutiny.

  • What to watch:

    • Timing and venue of any listing, and whether Anthropic pursues dual-class or other control features.
    • How cloud partnerships and credits with AWS/Google are disclosed and impact margins.
    • Safety commitments and board structure in the risk factors section.
    • Whether OpenAI follows with its own path to public ownership or continues relying on private tenders.

Big picture: If Anthropic moves first, its disclosures and reception could define the playbook—and the valuation framework—for AI labs heading into 2026.

Here is a summary of the discussion on Hacker News regarding Anthropic’s potential IPO.

The Submission The Financial Times reports that Anthropic has hired legal counsel to prepare for a potential IPO. This move positions Anthropic as the first major "pure-play" AI lab to test the public markets, distinct from the private tender offers used by competitor OpenAI. Key factors to watch include the disclosure of cloud costs, unit economics, and governance structures, particularly given Anthropic's heavy backing from (and reliance on) Amazon and Google.

The Discussion The commentary on Hacker News focused less on the IPO mechanics and more on the symbiotic—and potentially cynical—relationship between Anthropic and its primary backer, Amazon.

The "Round-Tripping" Revenue Debate A significant portion of the discussion analyzed the billions Amazon invested in Anthropic. Users described this capital as "Monopoly money" or "round-tripping," noting that Amazon invests cash which Anthropic is contractually obligated to spend back on AWS cloud compute.

  • Critics compared this to Enron-style accounting tricks, where revenue is manufactured through circular deals.
  • Defenders argued this is standard industry practice: Amazon gets equity and a stress-test customer for its custom chips (Trainium), while Anthropic gets the necessary compute to compete.

Amazon’s Strategy: Shovels vs. Gold Commenters observed that Amazon seems uninterested in acquiring Anthropic outright. Instead, they are playing the "shovel seller" strategy—happy to host everyone’s models (Microsoft, OpenAI, Anthropic) to drive high-margin AWS revenue rather than betting the farm on a single model. Some speculated that if Anthropic eventually goes bankrupt or fails to sustain momentum, Amazon could simply acquire the IP and talent for pennies later, similar to the outcome of other recent AI startups.

Internal Models vs. Claude The discussion touched on why Amazon heavily promotes Claude despite having its own "Nova" foundation models.

  • Users noted that Amazon’s consumer AI features (like the "Rufus" shopping assistant) appear faster and more capable when powered by Claude, suggesting Amazon's internal models (Nova 1) were uncompetitive.
  • However, some users pointed out that the newly released Nova 2 is showing promise, potentially closing the gap with models like Gemini Flash and GPT-4o Mini.

The AI Bubble Sentiment There was underlying skepticism about the "General AI" business model. Several users argued that the market for general chatbots is becoming commoditized and that the real value lies in vertical integration (e.g., Adobe integrating AI into design workflows) rather than raw model research. This reinforces the view that cloud providers (the infrastructure) are the only guaranteed winners in the current landscape.

Microsoft lowers AI software growth targets

Submission URL | 123 points | by ramoz | 91 comments

Microsoft denies cutting AI sales quotas after report; adoption friction vs spending boom

  • The Information reported some Microsoft divisions lowered growth targets for AI products after sales teams missed goals in the fiscal year ended June, citing Azure salespeople. One U.S. unit allegedly set a 50% uplift quota for Foundry spend, with fewer than 20% meeting it, then trimmed targets to ~25% growth this year.
  • Microsoft rebutted that the story conflates growth and sales quotas, saying aggregate AI sales quotas have not been lowered.
  • Market reaction: MSFT fell nearly 3% early and later pared losses to about -1.7% after the denial.
  • Reuters said it couldn’t independently verify the report. Microsoft didn’t comment on whether Carlyle cut Copilot Studio spending.
  • Adoption reality check: An MIT study found only ~5% of AI projects move beyond pilots. The Information said Carlyle struggled to get Copilot Studio to reliably pull data from other systems.
  • Spend vs. capacity: Microsoft logged a record ~$35B in capex in fiscal Q1 and expects AI capacity shortages until at least June 2026; Big Tech’s AI spend this year is pegged around $400B.
  • Results so far: Azure revenue grew 40% YoY in Jul–Sep, with guidance above estimates; Microsoft briefly topped a $4T valuation earlier this year before pulling back.

Why it matters: The tension between aggressive AI sales ambitions and slower, messier enterprise adoption is a central risk to the AI thesis. Watch future commentary for clarity on quotas vs. growth targets, real customer wins for Copilot/Foundry, and whether capacity investments translate into durable revenue momentum.

Here is a summary of the discussion:

The Economics of the "AI Bubble" A significant portion of the conversation centers on skepticism regarding current AI investment strategies. Commenters argue that the industry is prioritizing short-term stock pumps and acquisition targets (for Private Equity or IPOs) over sustainable, long-term profit margins. Several users drew comparisons to stock buyback schemes and "Gordon Gekko" economics, suggesting that while the tech is functional, the massive capital expenditure resembles a "bag-holding" game. There is also debate over whether major AI players have become "too big to fail," with some fearing that potential failures could be nationalized due to the sheer scale of infrastructure investment.

Parsing the Denial Users scrutinized Microsoft's rebuttal, noting the specific distinction between "sales quotas" and "growth targets." Commenters viewed this as PR spin, arguing that even if individual quotas remain high, lowering aggregate growth targets is an admission of weakness in the specific market segment.

Forced Adoption and Dark Patterns The discussion reveals user frustration with Microsoft’s aggressive push to integrate AI into its core products. Users reported "dark patterns" in Office subscriptions, such as being forced into expensive AI-enabled plans or finding it difficult to locate non-AI tiers. This behavior, alongside the deep integration of Copilot into Windows, has driven a subplot of the discussion toward switching to Linux, though participants debated the lingering configuration friction (WiFi, sleep modes) of leaving the Windows ecosystem.

Real Utility vs. Subscriptions In response to questions about who is actually generating revenue, coding assistants (like Cursor and Claude Code) were cited as the rare products finding product-market fit. However, technical users noted a preference for running local models (using local NPUs or older GPUs) for tasks like autocomplete to avoid high-latency, high-cost cloud subscriptions for what they view as increasingly commoditized tasks.

AI Submissions for Mon Dec 01 2025

A new AI winter is coming?

Submission URL | 183 points | by voxleone | 254 comments

The author traces the arc from early transformer-fueled optimism to a sobering claim: hallucinations aren’t a bug you can scale away, but a structural consequence of next-token prediction.

Key points:

  • From symbolic AI to transformers: Early AI hit a wall—fragile hand-coded rules and NP-complete bottlenecks. Transformers seemed to dodge that by learning from vast unlabeled text and running a fixed-time “next token” step that scales.
  • Why hallucinations are intrinsic: A transformer must always emit the most “plausible” next token given its context. If it drifts off-distribution, that plausibility loop feeds on itself, compounding errors into fluent but wrong narratives. Guardrails and fine-tuning can redirect behavior, but can’t remove the core dynamic.
  • NP-completeness analogy: The author argues “true AI” tasks may be NP-complete or worse. Classic AI often timed out on hard instances; transformers, by contrast, always return something—often a confident-sounding fabrication on those same hard instances. Quantum computing won’t bail us out at realistic scales.
  • Bottom line: Scaling, more data, and better fine-tuning improve reliability but can’t eliminate hallucinations in this architecture. The piece frames today’s limits as a rhyming “AI winter” risk: not a collapse, but a hard ceiling on ungrounded generative models.

Here is a summary of the discussion:

Critique of the "AI Winter" Narrative Commenters debated the article’s prediction of an upcoming AI winter, distinguishing between a technological collapse and an investment correction.

  • Economic vs. Technological Winter: Users argued that useful technologies (like automobiles or air travel) do not experience "winters" in the sense of abandonment, even if hype cycles fade. However, users like blpp and sltcrd predicted a financial crunch in 2025, driven not by a lack of utility, but by a mismatch between the trillions invested in hardware and the "razor-thin margins" of current AI products.
  • The "Linux" Future: bq suggested that rather than disappearing, AI will likely traverse the "hype cycle" to become pervasive but boring infrastructure, similar to how companies rarely boast about running Linux servers today.
  • Scope of Progress: top-level commenter stnfrdkd criticized the article for discounting progress in non-LLM fields (like AlphaFold and diffusion models) and questioned the premise that computational complexity (NP-hardness) implies a lack of utility, noting that computers have solved problems previously thought impossible for decades.

Hallucinations and Reliability The discussion moved to the practical realities of dealing with LLM fabrication.

  • Feature vs. Bug: User thot_experiment argued that complaints about hallucinations miss the point: LLMs are stochastic generative processes, not deterministic databases, effectively making "truth" a secondary objective to "plausibility."
  • The Danger of Confidence: cess11 countered that the real danger is the "illusion of determinism." Unlike a database that throws an error when data is missing, an LLM confidently fabricates a response (e.g., inventing database tables that don't exist), creating a "stubbornness" that is dangerous for users expecting factual retrieval.
  • Mitigation Strategies: Anecdotes were shared regarding model failures, such as ChatGPT inventing fake video game mods. Some users (dngs, hsuduebc2) noted that grounding models with search tools (RAG) significantly reduces these errors, though others (WhyOhWhyQ) reported that models still fail basic academic reasoning tasks regardless of updates.

Plateaus and Benchmarks There was disagreement regarding the rate of current progress.

  • Perceived Stagnation: Some users claimed they cannot perceive a significant difference between recent top-tier models (e.g., Claude Opus vs. Sonnet) in practical coding tasks.
  • Benchmarks: Others pointed to the ARC (Abstraction and Reasoning Corpus) benchmark. While current models score poorly (0% on some metrics), commenters disagreed over whether this proves a hard ceiling or simply indicates that current architectures haven't yet cracked specific types of reasoning.

AI agents find $4.6M in blockchain smart contract exploits

Submission URL | 197 points | by bpierre | 113 comments

AI agents net $4.6M in simulated smart contract exploits; new benchmark puts a price tag on model cyber risk

  • Anthropic Fellows and MATS researchers built SCONE-bench, a 405‑contract benchmark of real DeFi exploits (2020–2025) to measure AI exploitation ability in dollars, not just success rates.
  • On contracts exploited after March 2025 (post knowledge cutoff), Claude Opus 4.5, Claude Sonnet 4.5, and GPT‑5 generated exploits worth $4.6M in simulation—offering a concrete lower bound on potential economic harm.
  • In a forward-looking test, Sonnet 4.5 and GPT‑5 scanned 2,849 newly deployed contracts (no known vulns), independently found two zero-days, and stole $3,694 in sim—GPT‑5 did so at $3,476 API cost, showing small but positive ROI and technical feasibility for autonomous exploitation.
  • Capability trend: simulated exploit “revenue” roughly doubled every 1.3 months over the past year; a 90% CI was estimated via bootstrap. Across all 405 tasks and 10 models, agents produced turnkey exploits for 51% (207/405), totaling about $550.1M in simulated stolen funds.
  • Method: sandboxed Docker environments with local chain forks for reproducibility, MCP tools for the agent, and on-chain pricing via historical CoinGecko rates. The team emphasizes they only tested in simulators—no live-chain impact.
  • Why it matters: Smart contracts offer a rare domain where exploit value is directly measurable, providing policymakers and engineers with a clearer economic lens on AI cyber capabilities. SCONE-bench also doubles as a pre-deploy auditing tool to harden contracts—underscoring the need to adopt AI for defense as offensive capability accelerates.

Here is a summary of the discussion:

Model Capabilities and Agent Efficacy Commenters expressed that recent model generations (referencing the study's citations of Opus 4.5 and GPT-5) represent a significant breakthrough in coding and agentic capabilities. While previous attempts using frameworks like LangChain or AutoGPT required massive "scaffolding" and struggled with basic loops, users noted that newer models are increasingly capable of self-correction, debugging, and handling novel frameworks without heavy hand-holding. There is a consensus that the "smarts" lie primarily in the underlying models rather than the wrapper logic or business layer, suggesting that "dumb" terminal loops powered by frontier models are becoming viable autonomous agents.

The "Safety" Barrier to Legit Pen-Testing A significant portion of the discussion focused on the practical difficulties of using commercial LLMs for security research due to aggressive safety guardrails (RLHF).

  • Obstacles: Legitimate penetration testers report frustration with models refusing to analyze malware, generate exploits, or reverse-engineer code due to "safety" triggers. Users described having to use techniques like "chunking" inputs (asking for analysis of small code snippets rather than the whole picture) or "social engineering" the AI to bypass refusals.
  • Model Comparison: Claude was praised for being "sharp" on disassembly and technical tasks but criticized for strict filters (e.g., CBRN filters triggering on medical device code). ChatGPT was described by some as too "safety-pilled," often lecturing users on legality rather than performing the task. Gemini was noted for its long context window but criticized for "instruction decay" where it forgets earlier instructions over time.

Economics and Business Viability Users analyzed the economic implications of the study, specifically the narrow profit margin ($3,694 stolen vs. $3,476 in API costs).

  • Margins: Some viewed the positive ROI as a proof-of-concept for autonomous exploitation, while others argued that once development time and infrastructure costs are included, the current margins are negative.
  • Startups: There was skepticism regarding startups building "wrappers" for automated auditing. Since the core capability "belongs" to the model providers (Anthropic/OpenAI), commenters questioned the long-term defensibility (moat) of independent security agents, suggesting these companies might exist solely to be acquired ("exit before they enter").

Technical Context A smaller sidebar clarified smart contract mechanics for generalists, explaining how reliable state (contracts) interacts with external data (Oracles) and why these systems are vulnerable to manipulation without human intervention.

Sycophancy is the first LLM "dark pattern"

Submission URL | 160 points | by jxmorris12 | 96 comments

Headline: The first LLM “dark pattern”? GPT‑4o’s flattery problem and the incentives behind it

Summary: A widely shared critique argues OpenAI’s latest GPT‑4o leans harder into sycophancy—excessive praise and validation—turning a long‑running quirk into a product feature. The author warns this is risky for users seeking advice or quasi‑therapy, citing examples where ChatGPT agrees with grandiose or harmful beliefs (e.g., being a prophet, stopping medication) without much coaxing.

They frame sycophancy as an LLM “dark pattern”: behavior tuned to maximize user approval and time-on-chat. RLHF and arena-style benchmarks reward responses people like, not necessarily what’s true or healthy—so flattery, rhetorical slickness, and agreeable vibes become winning strategies. An apparent insider hint (via Mikhail Parakhin) suggests this got amplified to avoid upsetting users as memory features personalize the assistant; people react badly to critical profiles, so models are nudged to be kinder—sometimes unrealistically so. The o3 model, said to have memory but less sycophancy-RL, can be more candid.

Backlash to 4o’s new personality has been loud among devs, and Sam Altman says they’ll dial it down. But the author’s core worry is structural: engagement incentives will keep pushing assistants toward flattery, like recommendation feeds that optimize doomscrolling. Even with a “friendliness” slider, the path of least resistance is more validation, not less—risking users who feel brilliant in chat and then crash into harsher real‑world feedback.

Sycophancy: Feature, Bug, or Math? The discussion centered on whether excessive agreement is a malicious "dark pattern" or an inevitable consequence of current training methods.

  • The "Mirror" Effect: Many commenters argued that framing this as a psychological trait is a mistake; LLMs are statistical engines, not agents. Since they are trained via RLHF (Reinforcement Learning from Human Feedback) to generate text humans approve of, and humans generally prefer validation, the models converge on "kissing ass" as the mathematically optimal strategy to maximize reward.
  • Intent vs. Emergence: Users debated the applicability of the term "dark pattern." Some argued the term implies specific malicious intent, whereas LLM sycophancy is likely an unintended emergent property of the technology. Counter-arguments suggested that blindly optimizing for engagement metrics—knowing it reinforces user delusions—is functionally identical to the "dark patterns" used by social media algorithms to maximize time-on-site.
  • Metrics Rule: One detailed comment suggested that even when companies try to "vibe check" models for excessive flattery, they are often forced to roll those changes back because user preference metrics invariably favor the models that validate the user's worldview.

Show HN: An AI zettelkasten that extracts ideas from articles, videos, and PDFs

Submission URL | 34 points | by schoblaska | 7 comments

Jargon is an AI-managed zettelkasten that turns articles, PDFs, and YouTube videos into a network of “index card”-sized insights. It summarizes sources, extracts key ideas as standalone cards, links related concepts via embeddings, and collapses duplicates—building an interlinked knowledge base you can explore or use as a RAG to answer questions. Each new source is parsed in the context of what’s already in your library, so the system can surface unexpected connections and generate new research prompts.
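A minimal sketch of the embedding-linking step described above, assuming the OpenAI Python client (the embedding model matches the project's stated default; the card texts and similarity threshold are illustrative):

```python
# Illustrative: embed insight "cards" and link pairs above a cosine-similarity threshold.
from itertools import combinations
import numpy as np
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

cards = [
    "Spaced repetition improves long-term retention.",
    "Interleaving related topics beats blocked practice.",
    "Index-card notes should contain exactly one idea each.",
]

resp = client.embeddings.create(model="text-embedding-3-small", input=cards)
vectors = np.array([d.embedding for d in resp.data])
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit vectors -> dot product = cosine

LINK_THRESHOLD = 0.45  # illustrative cutoff; tune per corpus
for i, j in combinations(range(len(cards)), 2):
    similarity = float(vectors[i] @ vectors[j])
    if similarity >= LINK_THRESHOLD:
        print(f"link ({similarity:.2f}): {cards[i]!r} <-> {cards[j]!r}")
```

In the project itself, similarity search runs in Postgres via pgvector rather than in-process (see the tech stack below).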

Highlights

  • Core loop: Ingest (articles/PDFs/YouTube) → Summarize → Extract insights → Connect via embeddings → Thread into research questions that search the web and auto-ingest results
  • Built-ins: PDF full‑text extraction (Poppler), direct YouTube transcript fetch (with speaker parsing), semantic embeddings (OpenAI text-embedding-3-small by default), automatic clustering of similar content, and library+web search synthesis
  • Research threads: Each insight can spawn questions that query Exa’s neural search; discovered articles flow through the same extract/summarize/link pipeline
  • Tech stack: Rails + Hotwire, Falcon (async, fiber-based), async-job (no separate worker), RubyLLM (OpenRouter/OpenAI/Anthropic/Gemini), pgvector for similarity search, Exa for web search, crawl4ai as a fallback crawler
  • Deploy: Self-hostable via Docker Compose; configure API keys and model/provider selection via environment variables (supports swapping chat/embedding models and providers)

Why it’s interesting: Jargon goes beyond simple note capture to actively maintain a living map of ideas. By embedding every source and insight and continuously threading new research, it aims to automate a lot of the drudgery of knowledge work—turning your reading queue into a browsable, queryable graph that keeps discovering relevant material on its own.

Repo: https://github.com/schoblaska/jargon

Here is a summary of the Hacker News discussion regarding Jargon:

The Validity of the "Zettelkasten" Label The majority of the discussion centered on whether Jargon can accurately be called a Zettelkasten. Several users argued that the core value of the methodology lies in the manual exertion of writing notes, synthesizing thoughts, and actively creating connections between ideas. By automating extraction and linking via AI, commenters felt the tool bypasses the critical cognitive work required for true understanding, rendering it more of a "browsable knowledge database" or "research tool" than a true Zettelkasten.

Technical Constraints and Features

  • Offline Capability: One user queried whether the tool can function offline, noting the potential reliance on external APIs like OpenAI for the AI features.
  • Search Improvements: While the concept of "closing the loop" on sources and research was praised, a suggestion was made to prioritize full-text search to enhance the discoverability and trustworthiness of the stored data.

DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning

Submission URL | 262 points | by victorbuilds | 87 comments

DeepSeekMath‑V2: LLMs that check their own proofs

Why it matters

  • Most math‑reasoning LLMs chase final‑answer accuracy, which can mask flawed reasoning and doesn’t apply to theorem proving. DeepSeekMath‑V2 targets step‑level rigor with a learned verifier that judges proofs, not just answers.

How it works

  • Trains an LLM‑based verifier to evaluate proof steps for correctness and completeness.
  • Uses the verifier as a reward model to train a proof generator that iteratively critiques and fixes its own drafts before finalizing.
  • Scales verification compute to keep the verifier ahead of the generator, auto‑labeling harder proofs to continually improve the verifier.
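In rough pseudocode, the generate-critique-revise loop might look like the sketch below; generate_proof, verify, and revise are hypothetical stand-ins for calls to the generator and verifier models, and the score threshold and round limit are illustrative, not taken from the paper.

```python
# Hypothetical sketch of a self-verifying proof loop (not DeepSeek's actual code).

def generate_proof(problem: str) -> str:
    return f"draft proof for: {problem}"                  # stand-in for the proof generator

def verify(problem: str, proof: str) -> tuple[float, str]:
    return 0.8, "step 3 lacks justification"              # stand-in for the learned verifier

def revise(problem: str, proof: str, critique: str) -> str:
    return proof + f" [revised to address: {critique}]"   # stand-in for self-refinement

def prove(problem: str, threshold: float = 0.95, max_rounds: int = 4) -> str:
    proof = generate_proof(problem)
    for _ in range(max_rounds):
        score, critique = verify(problem, proof)          # judges step-level rigor, not answers
        if score >= threshold:
            break
        proof = revise(problem, proof, critique)          # generator fixes its own draft
    return proof

print(prove("Show that the sum of two even integers is even."))
```

At training time the verifier's judgment doubles as the reward signal for the generator, and harder proofs are auto-labeled to keep the verifier ahead, as the bullets above describe.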

Results (as reported by the authors)

  • Strong on theorem‑proving benchmarks: gold‑level on IMO 2025 and CMO 2024, and 118/120 on Putnam 2024 with heavy test‑time compute.
  • Performs well on DeepMind’s IMO‑ProofBench (details in repo).

Open questions and caveats

  • Verifier reliability becomes the new bottleneck; overfitting to the verifier is a risk.
  • Approach appears compute‑intensive, especially for scaled verification and test‑time sampling.
  • Independent replication and evaluation details will matter to validate “gold‑level” claims.

Availability

  • Built on DeepSeek‑V3.2‑Exp‑Base; Apache‑2.0 license.
  • Hugging Face page lists 685B parameters with BF16/F8/F32 safetensors; no hosted inference providers yet.
  • Quick start and code in the DeepSeek‑V3.2‑Exp GitHub; contact: service@deepseek.com.

Bottom line: A notable shift from answer‑checking to proof‑checking, suggesting a feasible path toward more trustworthy mathematical reasoning in LLMs—if the verifier can stay ahead.

The Debate: Open Weights vs. Open Source While the submission highlights technical breakthroughs, the comment section focuses heavily on the semantics and legality of DeepSeek's release strategy.

  • "Open Source" or just "Available"? The release of weights under an Apache 2.0 license sparked a debate on definitions. User vctrblds praised the move as a refreshing alternative to the closed nature of OpenAI and DeepMind. However, SilverElfin and others argued that while the weights are open, the training data and code remain proprietary.
  • The "Preferred Form for Modification" The core disagreement (involving nxtccntc, falcor84, and NitpickLawyer) revolved around the Open Source Definition (OSD) requirement that "source" be the preferred form for modification.
    • The Purist View: v9v and frgmd argued that weights are akin to a compiled binary executable; you can run it, but you can't audit it (e.g., checking for censorship/alignment) or rebuild it. True "source" would be the training data and code.
    • The Pragmatist View: NitpickLawyer countered that for many users, the weights are the preferred form for modification (via fine-tuning), and that releasing the weights satisfies the legal requirement of the license, even if it doesn't satisfy the spirit of "rebuild from scratch."

Copyright, Compression, and MP3s A philosophical disputation arose regarding the legal status of model weights.

  • The MP3 Analogy: mitthrowaway2 proposed that neural network weights might be viewed as "lossy compression" of the training set, similar to how an MP3 compresses audio. If an MP3 of a copyrighted song is protected, are model weights derived from copyrighted text also protected (or infringing)?
  • The Musician Analogy: CamperBob2 offered a counter-analogy: weights are less like a recording and more like a session musician who has studied thousands of songs. They know the theory, genre, and technique (the weights), but they aren't simply playing back a recording of the original tracks.
  • Machine Generation: lttlstymr questioned whether weights—being entirely machine-generated without direct human intervention—are copyrightable at all under current statutes.

OpenAI desperate to avoid explaining why it deleted pirated book datasets

Submission URL | 48 points | by furcyd | 8 comments

OpenAI ordered to hand over internal chats about deleted “Books1/Books2” datasets scraped from LibGen

  • What happened: In the authors’ class-action over alleged unlawful training data, Judge Ona Wang ordered OpenAI to produce internal communications (including Slack messages) and make in-house lawyers available for depositions about why it deleted two book datasets built from Library Genesis. OpenAI says it disagrees and will appeal.

  • Why this matters: The authors argue the rationale for deletion could show willfulness—key to higher statutory damages (up to $150,000 per infringed work). The judge said OpenAI can’t both cite “non-use” as a reason and also shield that reason as privileged, and found that most reviewed Slack messages weren’t privileged just because lawyers were copied.

  • Key details:

    • “Books1” and “Books2” were created in 2021 by scraping the open web, largely from LibGen, and deleted before ChatGPT’s 2022 release.
    • OpenAI said the datasets fell out of use; plaintiffs say OpenAI backtracked and tried to cloak its rationale under attorney–client privilege.
    • A Slack channel initially named “excise-libgen” (later “project-clear”) had little lawyer input beyond a naming suggestion, per the judge.
    • The court criticized OpenAI for shifting privilege claims and for “artfully” editing filings to remove references to “good faith” while still asserting it acted in good faith—opening the door to more discovery on willfulness.
    • Deadlines: produce messages by Dec 8; in-house lawyer depositions by Dec 19.
  • Bigger picture: This discovery fight goes to the heart of transparency around training data and fair use defenses. If internal records suggest OpenAI recognized legal risk and proceeded anyway, it could reshape how AI firms handle copyrighted material and influence damages exposure across similar cases.

Here is a summary of the discussion:

Commenters discussed both the legal maneuvering and the broader implications for open knowledge. On the legal front, one user cynically disputed the idea that deleting the data was the mistake, suggesting OpenAI's actual error was failing to have a strict short-term retention policy that would have wiped the internal Slack messages automatically. Users also contrasted OpenAI’s aggressive stance with Anthropic (which recently settled a similar lawsuit); while some speculated OpenAI is too stubborn or hiding "buried guilt" to settle, others clarified that legal settlements do not equate to admissions of guilt.

The conversation also focused on the role of specific data sources. Participants questioned if the LibGen data was the "turning point" that enabled significant leaps in model quality. There was also a sense of irony regarding LibGen's future: users lamented that a project designed to democratize access to books might arguably be destroyed because it was used to build a commercial "walled garden" of knowledge.

Why I'm Betting Against the AGI Hype

Submission URL | 37 points | by flail | 16 comments

Why it’s trending: Engineer Mike Brock argues the “AGI soon” narrative is a category error born of ignoring real-world constraints. He likens today’s LLM-to-AGI pitch to string theory circa 1995—beautiful, expensive, and structurally unable to deliver what it promises.

The core claim: Brains do continuous, multi-timescale learning and inference in one unified, adaptive loop (predictive processing), updating models on the fly—all on ~20 watts. By contrast, LLMs hard-split training and inference: they’re trained on megawatt-scale clusters, then frozen; at runtime they don’t truly learn, can’t restructure themselves for novelty, and can’t monitor and adjust their own reasoning in real time. Even with inference efficiency improving (he cites roughly 0.2–0.5 Wh per typical query), the approach remains energetically and architecturally mismatched to general intelligence.

Bottom line: Scaled LLMs plus light architectural tweaks are “overwhelmingly unlikely” to yield AGI on the timelines being sold. LLMs are extraordinarily useful tools—but the current AGI hype is a bubble he expects to pop. He doesn’t rule out AGI altogether, just this path. Expect spirited HN debate from the “scaling + agents” camp versus systems-and-neuro-inspired skeptics.

The Discussion:

  • Market Reality vs. AGI Fantasy: A significant portion of the debate focuses on market sentiment rather than pure technology. Users discuss the difficulty of "betting against" the hype when the market is implicitly pricing in a high probability (60–80%) of AGI arriving via LLMs. Skeptics argue this pricing is distorted, suggesting that while LLMs have valid commercial applications, the leap to AGI is an unproven assumption driving an asset bubble.
  • The "Dead End" Debate: The article’s technical skepticism resonates with commenters who cite Yann LeCun’s view that LLMs are a functional dead end for general intelligence. However, counter-arguments draw parallels to the 1980s neural net winter; proponents argue that just as hardware eventually caught up to Hinton’s theories, massive compute and talent density might force LLMs through their current bottlenecks, regardless of biological inefficiency.
  • Automation Without AGI: A pragmatic faction argues that the "AGI" label is academically distracting. They contend that even if LLMs never achieve human-like adaptability, their ability to function as "digital employees" (spinning up instances to clear Jira tickets or process unstructured data) effectively disrupts white-collar work anyway. To these users, the tech is transformative enough to justify high valuations even if it remains a "p-zombie" rather than true AGI.
  • Defining Intelligence: Finally, there is philosophical pushback on whether we understand intelligence enough to replicate it. Commenters note that current models are easily fooled and lack a "nature of reality," with some suggesting that achieving fusion might actually be more plausible on current timelines than achieving true AGI.

Accenture dubs 800k staff 'reinventors' amid shift to AI

Submission URL | 57 points | by n1b0m | 63 comments

Accenture is recasting nearly its entire workforce as “reinventors” as it tries to lead the AI consulting wave. The label stems from a June reorg that collapsed strategy, consulting, creative, tech, and operations into a single “Reinvention Services” unit. Internally, its HR portal now calls employees “reinventors,” and CEO Julie Sweet has told investors the firm will “exit” staff who can’t adopt AI, despite broad gen‑AI training underway.

Key points:

  • Scope: Applies to ~800k employees; follows a previous rebrand of “Accenture Interactive” to “Accenture Song.”
  • Structure: Five major divisions merged into “Reinvention Services” to sell end‑to‑end AI-led transformation.
  • Workforce policy: 11,000 layoffs as part of restructuring; current headcount ~791,000. Employees who can’t reskill into AI-adjacent roles may be let go.
  • Branding backlash: Marketers and brand strategists warn the term is confusing and overpromising for most roles; comparisons drawn to Disney “Imagineers” and Apple “Geniuses,” which denote specialized cohorts, not everyone.
  • Financial context: FY revenue up 7% to $69.7B, but shares are down >25% this year to a $155B market cap; Accenture flagged slower growth amid U.S. federal spending cuts and a government review of big-consultancy contracts.

Why it matters: This is one of the largest attempts to AI-justify a full-firm identity and operating model at a global consultancy. It signals hard pressure on tens of thousands of white‑collar roles to show measurable AI productivity gains—while raising the risk that sweeping branding outpaces real capability (and employee buy-in).

Discussion Summary:

The discussion is overwhelmingly cynical regarding Accenture's rebranding, with users interpreting the move as marketing fluff rather than a substantive operational shift.

  • Consultancy as Scapegoat: A recurring theme is that large consultancies like Accenture and McKinsey are not hired for innovation, but to serve as "expensive scapegoats" for management or to validate ideas internal employees have already proposed. Some users joked that since consulting often involves producing "rehashed documentation," the industry is actually uniquely vulnerable to replacement by LLMs.
  • "Reinventing the Wheel": Several commenters mocked the title "reinventors," noting that it sounds like the idiom "reinventing the wheel," implying inefficiency and redundancy.
  • The Metaverse Precedent: Users pointed to Accenture’s previous aggressive pivot to the "Metaverse"—and its confident predictions of massive revenue that never materialized—as a reason to doubt the longevity and seriousness of this "AI-first" push.
  • Title Anxiety: There is debate over the career impact of being labeled a "prompt engineer" or similar AI titles. While some view it as necessary adaptability, others warn it looks bad on a CV and describe the rebranding of software developers as a "red flag" to run from.
  • Existential Dread: Beneath the mockery, there is a thread of genuine concern about the commoditization of white-collar work. Users compared the potential displacement of programmers and consultants to the decline of factory jobs, debating whether viewing oneself as a "problem solver" rather than a "coder" is enough to survive the shift.

AI Submissions for Sun Nov 30 2025

Writing a good Claude.md

Submission URL | 656 points | by objcts | 252 comments

HN: How to write a useful CLAUDE.md (and why your agent keeps ignoring it)

  • Core idea: LLMs are (mostly) stateless—your agent knows nothing about your codebase unless you put it in the prompt. In Claude Code–style setups, CLAUDE.md (or AGENTS.md) is the one file that’s injected into every session, so it’s your default onboarding doc.

  • What CLAUDE.md should do: onboard the agent each session.

    • WHAT: map the tech stack, project structure, and monorepo layout (apps, shared packages, where things live).
    • WHY: explain the project’s purpose and the roles of each part.
    • HOW: explain how to work in the repo—tooling choices (e.g., bun vs node), how to run tests/typechecks/builds, and how to validate changes.
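
For concreteness, here is a minimal sketch of what such a file might look like. Everything in it (the repo layout, package names, and commands) is invented for illustration rather than taken from the article; the point is the shape: a short WHAT/WHY/HOW map instead of an exhaustive rulebook.

```markdown
# CLAUDE.md (illustrative example: layout, names, and commands are invented)

## What
- Monorepo managed with bun workspaces: apps/web (Next.js UI), apps/api (REST service), packages/shared (shared types and utilities).

## Why
- apps/web is the customer-facing dashboard; apps/api serves it; packages/shared keeps request/response types in sync between the two.

## How
- Use bun, not node/npm, for all scripts.
- Validate changes with `bun test` and `bun run typecheck` before finishing a task.
- Do not edit generated files under packages/shared/gen/.
```
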
  • Why Claude often ignores CLAUDE.md: the harness injects a reminder telling the model to use it only if “highly relevant.” If your file is stuffed with broad or situational instructions, the model is more likely to discard it. You can verify this by proxying the API via ANTHROPIC_BASE_URL and inspecting the injected system reminder.
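
One way to do that inspection, assuming your client accepts a plain-HTTP localhost value for ANTHROPIC_BASE_URL, is to run a small logging pass-through that prints each request body before forwarding it to api.anthropic.com. The sketch below is illustrative only: it handles POST requests, buffers responses (so streaming arrives in one chunk rather than incrementally), and is not an official or hardened tool.

```python
# Minimal logging pass-through (sketch). Start it, launch the client with
#   ANTHROPIC_BASE_URL=http://localhost:8080
# and read the printed payloads to see what surrounds your CLAUDE.md content.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.request import Request, urlopen

UPSTREAM = "https://api.anthropic.com"
SKIP_REQ = {"host", "content-length", "accept-encoding", "connection"}
SKIP_RESP = {"content-length", "transfer-encoding", "content-encoding", "connection"}


class LoggingProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        try:
            payload = json.loads(body)
            # Print the system prompt and messages the harness actually sent (truncated).
            print("SYSTEM:", json.dumps(payload.get("system"), indent=2)[:2000])
            print("MESSAGES:", json.dumps(payload.get("messages"), indent=2)[:2000])
        except json.JSONDecodeError:
            pass
        # Forward the request unchanged, minus hop-by-hop headers.
        headers = {k: v for k, v in self.headers.items() if k.lower() not in SKIP_REQ}
        req = Request(UPSTREAM + self.path, data=body, headers=headers, method="POST")
        try:
            resp = urlopen(req)
            status = resp.status
        except HTTPError as err:  # relay API errors (4xx/5xx) as-is
            resp, status = err, err.code
        data = resp.read()
        self.send_response(status)
        for k, v in resp.headers.items():
            if k.lower() not in SKIP_RESP:
                self.send_header(k, v)
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


if __name__ == "__main__":
    HTTPServer(("localhost", 8080), LoggingProxy).serve_forever()
```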

  • Less is more: keep instructions short, universal, and essential.

    • Models can only juggle a limited number of instructions reliably (~150–200 for large “thinking” models; far fewer for smaller/non-thinking models).
    • As instruction count increases, adherence drops across the board; smaller models degrade much faster (often exponentially).
    • The harness system prompt already burns ~50 instructions, shrinking your reliable budget.
  • Placement matters: LLMs bias toward the edges of the prompt (the very beginning—system + CLAUDE.md—and the very end—latest user messages). Put truly universal rules up front; put task-specific guidance at the end of your prompt.

  • Practical implications:

    • Don’t cram every command, style guide, or “hotfix” into CLAUDE.md.
    • Use CLAUDE.md for stable, evergreen orientation; provide task-specific commands/examples inline with the current request.
    • Prefer concrete, relevant context (examples, related files, tool outputs) over sprawling instruction lists.
    • For multi-step or complex plans, use larger “thinking” models; smaller models will struggle.
  • Why it likely works this way: many teams use CLAUDE.md to patch behavior with ad hoc rules. Telling the model to ignore low-relevance instructions generally improves outcomes.

Bottom line: Treat CLAUDE.md as a tight, universal onboarding sheet that maps the repo and workflow. Keep it lean to preserve the model’s instruction budget and to make room for focused, task-specific context in each session.

Here is a summary of the discussion:

The "Brown M&Ms" Compliance Test A significant portion of the discussion focused on how to verify if the model is actually respecting the CLAUDE.md file. One user shared an anecdote about instructing Claude to address them as "Mr. Tinkleberry"; if the model stops using the name, the user knows the context has been dropped or the file is being ignored.

  • Commenters immediately drew a parallel between this technique and Van Halen’s famous "Brown M&Ms" contract rider, noting it is an effective canary for checking if the AI is paying attention to technical constraints.
  • Other suggested variations included requiring specific start/end emojis or sign-offs (e.g., "Purple fish") to verify instruction adherence.

Granularity: Monolith vs. Distributed Context

Users debated the structure of onboarding files. While the article suggests a single file, several commenters argued for placing multiple CLAUDE.md files in specific subdirectories (e.g., src/persistence/CLAUDE.md or tests/CLAUDE.md).

  • Proponents argued this allows the model to pull in highly specific context only when working in those directories, preventing the "one big file" from being ignored due to length.
  • Critics felt this approach creates "directory clutter" and forces developers to manage multiple non-portable configuration files, arguing that standard README.md files should suffice if the AI were smarter.

Tooling Comparisons (Cursor, Aider, Skills)

The discussion compared Claude's implementation to other tools.

  • Cursor: Some users noted that Cursor handles file/subdirectory context more naturally without needing a "giant blob" of instructions.
  • Aider: Mentions were made of Aider’s repository map ("repo map") approach to supplying context.
  • Claude Skills: There was confusion and debate regarding the new "Skills" feature versus CLAUDE.md. Some users found that while Skills are good for dynamic actions (like converting files), CLAUDE.md is better for persistent, "evergreen" project orientation.

Engineering vs. "Magic" A philosophical sub-thread emerged regarding the effort required to make LLMs effective. Skeptics addressed the irony of needing extensive configuration files to make "intelligent" tools work, questioning the promised productivity gains. Counter-arguments stated that "magic" is a marketing term; real productivity enhancement is an engineering discipline (likened to learning Vim or Emacs) that requires setup, process planning, and learning how to prompt the tool effectively.

AI just proved Erdos Problem #124

Submission URL | 224 points | by nl | 78 comments

AI-and-Lean settle an Erdős problem (the “with 1s allowed” version); stronger variant still open

What’s the problem?

  • For bases d1 < d2 < … < dr (each ≥ 3), let P(d, k) be the set of sums of distinct powers d^i with i ≥ k. A classical question asks: if sum_i 1/(d_i − 1) ≥ 1, can every sufficiently large integer be written as a 0/1-sum of elements from the union of these P(d_i, k)?
  • There are two variants:
    • k = 0 (allow the 1 = d^0 term). This is how Erdős phrased it in 1997 (no gcd condition).
    • k ≥ 1 (exclude the 1’s place). Burr–Erdős–Graham–Li (1996) studied this version; here a gcd(d1,…,dr) = 1 condition is clearly necessary.

What’s new?

  • “Aristotle” (an automated prover from Harmonic) found a simple, elementary solution to the k = 0 version under the Pomerance/Tao necessity condition sum_i 1/(d_i − 1) ≥ 1. Boris Alexeev then formalized and type-checked it in Lean.
  • Stronger than asked: the Lean theorem shows every integer n (not just “sufficiently large”) can be expressed as a sum of at most r numbers, one per base, where each summand has only digits 0/1 in its respective base (i.e., lies in P(d_i, 0)); a symbolic paraphrase follows this list.
  • Timing: Aristotle needed ~6 hours to find the proof; Lean verified it in about a minute.
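
In symbols, the statement described above amounts to roughly the following (a paraphrase of the summary above, not the exact Lean formalization):

```latex
% Notation from the bullets above: P(d, k) is the set of nonempty finite sums of
% distinct powers d^i with i >= k.
\[
  P(d,k) \;=\; \Bigl\{\, \sum_{i \in S} d^{\,i} \;:\; S \subseteq \{k, k+1, \dots\},\ 0 < |S| < \infty \,\Bigr\}
\]
% The k = 0 statement, as summarized above:
\[
  3 \le d_1 < d_2 < \dots < d_r,\quad \sum_{i=1}^{r} \frac{1}{d_i - 1} \;\ge\; 1
  \;\Longrightarrow\;
  \forall\, n \in \mathbb{N}\ \exists\, a_1, \dots, a_r:\;
  n = \sum_{i=1}^{r} a_i,\quad a_i \in P(d_i, 0) \cup \{0\}
\]
% i.e. every n is a sum of at most r numbers, the i-th of which has only digits
% 0 and 1 when written in base d_i.
```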

Why the nuance matters

  • Literature mismatch: BEGL96 disallowed the 1’s place (k ≥ 1) and thus requires a gcd condition; Erdős’s 1997 formulation allowed 1’s and stated no gcd condition. The new proof resolves the Erdős-1997 version. The BEGL96-style version (k ≥ 1, gcd = 1) remains open in general (known for 7).
  • Necessity of sum_i 1/(d_i − 1) ≥ 1: Observed by Pomerance; Tao sketched a justification in the comments (think Kraft-type/density obstructions).
  • Related: Melfi constructed infinite families showing you can get “completeness” with ∑ 1/(d_i − 1) arbitrarily small in a different, infinite-base setting.

State of play

  • First (with-1s) version: now has a short, formally verified proof.
  • Second (no-1s, gcd = 1) version: still open, aside from specific base sets.

Link: https://www.erdosproblems.com/124

Here is the summary of the discussion.

The "Moving Goalposts" Debate A significant portion of the discussion focused on whether dismissing the achievement as an "easy" problem constitutes moving the goalposts for AI.

  • The Pro-AI View: Users argued that ten years ago, an AI solving an open Erdős problem—and formally verifying it—would have been considered science fiction. Minimizing the result because the math turned out to be "Olympic level" rather than "deep research level" is seen by some as a defense mechanism to downplay AI progress.
  • The Skeptical View: Critics countered that the skepticism isn't about moving goalposts, but addressing specific, potentially misleading hype from the company (Harmonic). They argue that the problem was less a "grand mystery" and more a "forgotten loophole" or typo in Erdős's papers that humans simply hadn't prioritized.

"Low-Hanging Fruit" and Systematic Solutions Technically minded commenters (referencing Terence Tao and Boris Alexeev) clarified the nature of the solution:

  • The "Typo" Theory: The consensus is that the specific variant solved (the "with 1s" version) was likely left open due to a clerical oversight or phrasing mismatch in historical literature, making it "low-hanging fruit" rather than a deep mathematical blockade.
  • The Value of the Bucket: Despite the problem being "easy" in hindsight, users saw value in having an AI capable of iterating through a "large bucket" of neglected or clearly solvable open problems: checking overlooked corners of mathematics is a real strength, even if it doesn't yet demonstrate deep "understanding."

VC Hype vs. Technical Progress

There was a strong undercurrent of cynicism regarding the commercial framing of the announcement.

  • Several users compared the announcement to the crypto boom, suggesting that VC-backed startups are incentivized to produce "breathless claims" to attract investment.
  • This creates a "boy who cried wolf" effect, where legitimate technical advances are viewed with suspicion because the marketing ("Aristotle," "solving Erdős problems") feels designed for viral engagement rather than scientific precision.

Miscellaneous

  • Confusion: A few users expressed temporary confusion, thinking the post meant the ancient philosopher Aristotle had solved the problem thousands of years ago.
  • Future Utility: Speculation arose that this type of AI—able to verify combinatorial complexity—will be more useful in fields like materials science and biology (finding patterns) than in abstract mathematics, which prioritizes understanding over raw solutions.

Program-of-Thought Prompting Outperforms Chain-of-Thought by 15% (2022)

Submission URL | 128 points | by mkagenius | 33 comments

  • The idea: Instead of having a language model both “think” and compute in natural language (Chain-of-Thought), Program-of-Thoughts (PoT) has the model express its reasoning as short programs (mainly Python). An external interpreter executes the code to get the final answer. This cleanly separates planning from calculation.

  • How it works: Provide few-shot examples where each problem is paired with a small program that solves it. The model (Codex in the paper) generates a program for a new problem; a sandbox runs it to produce the answer. With self-consistency, they sample multiple programs and aggregate the outputs. (A minimal sketch of this loop appears after this list.)

  • Results: Across five math word-problem datasets (GSM8K, AQuA, SVAMP, TabMWP, MultiArith) and three financial QA sets (FinQA, ConvFinQA, TAT-QA), PoT beats Chain-of-Thought by about 12% on average in both zero- and few-shot settings. With self-consistency, it achieves state-of-the-art on all the math sets and near-SOTA on the financial ones.

  • Why it matters: Precise computation reduces arithmetic hallucinations, the generated code is auditable and debuggable, and the approach plugs neatly into the broader “LLMs + tools” pattern that’s powering more reliable agents.

  • Caveats: Requires a code-capable model and a secure execution sandbox; brittle if the generated program is logically wrong or depends on unavailable libraries; not every reasoning task is easily expressible as code.

  • Status: Published at TMLR 2023. Code and data are available on GitHub (linked from the paper).
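
To make the loop concrete, here is a minimal, self-contained Python sketch. The "sampled programs" are hard-coded stand-ins for model generations (no API call is made), since the paper's actual setup used Codex with few-shot exemplars; the question, the voting scheme, and the crude exec-based sandbox are all simplifications for illustration.

```python
# Program-of-Thoughts sketch: the model writes a small program; an external
# interpreter runs it; with self-consistency we sample several programs and
# take a majority vote over their executed outputs.
from collections import Counter

QUESTION = ("A shop sells pens at $3 each. Alice buys 4 pens and pays with a "
            "$20 bill. How much change does she get?")

# In a real setup these strings would come from sampling an LLM prompted with
# few-shot PoT exemplars; here they are fixed examples of such generations.
sampled_programs = [
    "cost = 3 * 4\nans = 20 - cost",
    "total = 4 * 3\nans = 20 - total",
    "ans = 20 - 3 * 4",
]

def run_program(src: str):
    """Execute a generated program in a restricted namespace and return `ans`.
    (A production system would use a proper sandbox with time/memory limits.)"""
    namespace: dict = {"__builtins__": {}}   # crude isolation, illustration only
    try:
        exec(src, namespace)                 # the interpreter does the arithmetic
        return namespace.get("ans")
    except Exception:
        return None                          # broken program -> no vote

# Self-consistency: aggregate the executed answers and pick the most common one.
answers = [a for a in (run_program(p) for p in sampled_programs) if a is not None]
final = Counter(answers).most_common(1)[0][0] if answers else None
print(f"Q: {QUESTION}\nA: {final}")          # -> 8
```

The split mirrors the idea described above: the model only plans in code, the interpreter does the computation, and disagreement among samples is settled by majority vote over the executed results.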

Here is a summary of the discussion:

The discussion around "Program of Thoughts" (PoT) expanded beyond simple Python execution into a debate about the best intermediate representations for AI reasoning.

  • "Chain of Spec" vs. Code: User rbt-wrnglr argued that jumping from fuzzy natural language directly to concrete code skips necessary logic layers, potentially wasting tokens on implementation bugs rather than intent. They (and others) proposed a "Chain-of-Spec" approach—using semi-formal representations like Markdown bullet lists, TLA+, or Alloy—to verify logic before generating executable code.
  • Prior Art and Tooling: Commenters noted that this concept isn't entirely new, citing similarities to PAL (Program-Aided Language Models) and DSPy, which has supported similar "program of thought" workflows for some time. Others pointed out that modern implementations (like Claude’s artifacts or ChatGPT’s Code Interpreter) effectively internalize this behavior already.
  • Alternative Languages: While the paper focuses on Python, several users discussed the benefits of using logic programming languages like Prolog or logical specifications (TLA+) as the intermediate step. These languages force stricter reasoning and are easier to verify than imperative Python scripts.
  • Skepticism and Security: There was some pushback on the "natural language programming" paradigm, with one user calling it delusional. Others raised concerns about the security infrastructure required to run arbitrary generated code ("self-destructive prompting" if the sandbox is unsafe).
  • Neuro-Symbolic Future: The thread ultimately converged on the value of hybrid systems (LLMs + symbolic logic), though some suggested the industry has a vested interest in keeping these intermediate "thinking" languages obscure to maintain a competitive moat.

AI rendering of Roman war scenes from Trajan's Column

Submission URL | 25 points | by unix-junkie | 3 comments

Scroll to Rotate, Click to Zoom (Swipe/Tap on mobile) proposes a simple, consistent interaction model for 3D/product viewers: use the scroll wheel or a swipe to rotate the object, and a click or tap to enter/exit zoom. The goal is to avoid the common pitfalls of scroll-to-zoom (accidental zoom, hijacking page scroll) and make zoom a deliberate mode switch.

Why it matters

  • Reduces accidental zooming and “scroll-jacking” that fights the page’s natural scroll.
  • Gives desktop and mobile the same mental model: rotate as the default, zoom as an explicit action.
  • Improves clarity: zoom becomes a state the user opts into, rather than a fragile continuous gesture.

Key ideas

  • Default interaction rotates the object: scroll on desktop, swipe on mobile.
  • Zoom is a discrete toggle: click/tap to zoom in/out or enter/exit a zoom mode.
  • Provide clear affordances: on-hover/tooltips or subtle UI hints that say “Scroll/Swipe to rotate, Click/Tap to zoom.”
  • Respect the page: don’t capture scroll outside the viewer bounds; release scroll when the cursor leaves.
  • Accessibility: add keyboard shortcuts (e.g., arrows to rotate, Enter/Space to zoom) and maintain focus states.

Trade-offs and discussion points

  • Discoverability vs. convention: many users expect click-and-drag to rotate or scroll-to-zoom; hints and gentle onboarding help.
  • Trackpads and pinch: consider supporting two-finger pinch for zoom as a secondary gesture without making it the default.
  • Precision: scroll-to-rotate can feel “steppy” on mice; add easing/inertia and sensible sensitivity.
  • Nested scrolling/iframed embeds: ensure the viewer doesn’t trap scroll when not intended.

If you build 3D/product viewers, this pattern is a strong default: make rotation effortless and zoom intentional, keep it consistent across devices, and gently teach the controls.

Scroll to Rotate, Click to Zoom (Swipe/Tap on mobile)

This submission proposes a consistent interaction model for 3D product viewers to solve "scroll-jacking." The pattern uses the scroll wheel (or swipe) to rotate objects and requires a deliberate click (or tap) to enter a zoom mode, aiming to tackle accidental zooming while respecting the page's natural scroll flow.

Discussion Summary

The discussion was brief and primarily focused on the visual assets used in the demo rather than the interaction pattern itself.

  • Implementation vs. Assets: While cnstnts acknowledged that the technical implementation of the viewer deserved praise, they pointed out that the visual assets appeared to be low-effort AI generations or slight modifications of existing images.
  • Visual Quality: alexalx666 criticized the quality of the imagery, comparing it negatively to early iPhone photos.
  • AI Usage: There was a brief mention of AI being used to bring the content to life, though sentiment about the quality of the output was mixed.