Hacker News
Daily AI Digest

Welcome to the Hacker News Daily AI Digest, where you will find a daily summary of the latest and most intriguing artificial intelligence news, projects, and discussions among the Hacker News community. Subscribe now and join a growing network of AI enthusiasts, professionals, and researchers who are shaping the future of technology.

Brought to you by Philipp Burckhardt

AI Submissions for Sat Feb 21 2026

How Taalas “prints” LLM onto a chip?

Submission URL | 306 points | by beAroundHere | 167 comments

Taalas “prints” Llama 3.1 8B onto an ASIC, claims 17,000 tokens/sec and 10x gains in cost and power

TL;DR: A 2.5-year-old startup, Taalas, built a fixed-function ASIC that hardwires Llama 3.1 8B’s weights into silicon, reportedly hitting ~17k tokens/sec with 3/6-bit quantization, while being ~10x cheaper to run and ~10x more energy-efficient than GPU inference.

How it works

  • No HBM/DRAM loop: Instead of shuttling weights over a memory bus each step, the model’s 32 layers are physically laid out on-chip. Inputs stream through layer-by-layer logic with pipeline registers; activations don’t round-trip to external memory.
  • Weights in silicon: The weights are “engraved” as transistors; Taalas hints at a “magic multiplier” that can store 4-bit data and perform its multiply in what they describe as a single-transistor element, enabling dense, low-power compute-in-memory–style MACs.
  • Minimal SRAM: On-chip SRAM is used for KV cache and to host LoRA adapters; there’s no external DRAM/HBM.
  • One model per chip: It’s a fixed-function device (think cartridge/CD-ROM). To target a new model, they customize only the top metal layers over a generic base fabric, which they say let them map Llama 3.1 8B in ~2 months.

Why it matters

  • Smashes the memory wall: By eliminating weight fetches over a memory bus, the design attacks the core bandwidth/latency bottleneck in today’s GPU LLM inference.
  • Throughput and efficiency: If the 17k tok/s and 10x cost/power claims hold, inference economics—especially at the edge or at massive scale—could shift sharply away from general-purpose GPUs for stable, high-volume models.

Caveats and open questions

  • Flexibility: It’s essentially one-model-per-chip; updating architectures or sizes requires a respin.
  • Quality trade-offs: Real-world accuracy with 3/6-bit quantization isn’t detailed; effects across tasks and long contexts remain to be seen.
  • Practical limits: KV cache size, max context length, batching, sampling features, and how the “single-transistor multiplier” works (analog vs. digital, precision, variability) are not fully explained.
  • Manufacturing/yield: Customizing top metal layers is faster than a full new chip, but still slower and riskier than software updates.

Here is a summary of the discussion:

Feasibility and quantization trade-offs Commenters crunched the numbers on the claim of packing ~8B coefficients into 53B transistors, concluding the math theoretically holds up if the device relies on aggressive quantization (likely 3-bit or "double FP4"). While some users were excited by the prospect of "model-to-VHDL" synthesis, others worried that hardwiring such strong quantization into silicon would permanently degrade model quality, making the chip useless for tasks requiring higher precision.

The inevitable hardware cycle Many users viewed this as a predictable evolution of computing, drawing parallels to the transition from CPU to GPU to ASIC in Bitcoin mining, or the move from software rendering to hardware acceleration in 3D graphics. While some suggested FPGAs as a middle ground, others argued FPGAs lack the efficiency/scaling needed to compete with GPUs or ASICs in this specific domain.

The "Inflexibility" bottleneck The primary skepticism revolved on the risk of obsolescence. With LLM architectures and weights changing almost daily, users noted that a fixed-function chip could become e-waste before it hits the market. Big tech companies likely haven't pursued this yet because they are constrained by fab capacity and cannot afford to bet on a model that might be outdated in six months.

Killer use-case: Edge and Latency Despite the flexibility concerns, users identified a strong niche for this tech: local inference.

  • Latency: Eliminating the 50-200ms network overhead of the cloud allows for sub-100ms response times, enabling real-time voice and video agents that current GPUs can't serve efficiently over the web.
  • Stable Appliances: It was suggested these chips are perfect for "frozen" models running on drones, phones, or appliances (e.g., a smart fridge) where the model doesn't need to be State-of-the-Art, just functional and offline.

Show HN: Llama 3.1 70B on a single RTX 3090 via NVMe-to-GPU bypassing the CPU

Submission URL | 321 points | by xaskasdf | 82 comments

NTransformer: runs Llama 70B on a single RTX 3090 by streaming layers over PCIe

What’s new

  • A C++/CUDA LLM inference engine that keeps only a subset of layers in VRAM and streams the rest from RAM/NVMe, enabling 70B models on a 24GB GPU. No PyTorch or cuBLAS; GGUF models with multiple quantizations supported.

How it works

  • 3-tier adaptive caching: VRAM-resident layers (no I/O), pinned RAM (H2D only), and NVMe/mmap fallback, auto-sized from your hardware.
  • NVMe direct I/O: a userspace driver reads weights straight into GPU-accessible pinned memory, overlapping disk, PCIe DMA, and compute (SLEP streaming).
  • Layer skipping: cosine-similarity–based calibration can skip ~20 of 80 layers per token at 0.98 threshold, with minimal quality loss.
  • Self-speculative decoding: uses resident layers as a draft model; no second model required.

Performance highlights (author’s tests, RTX 3090 + 48GB RAM)

  • Llama 3.1 8B Q8_0 (resident): ~48.9 tokens/s using ~10GB VRAM.
  • Llama 3.1 70B:
    • Q6_K tiered: ~0.2 tok/s at ~23.1GB VRAM (26 layers in VRAM, rest in RAM).
    • Q4_K_M tiered: ~0.3 tok/s at ~22.9GB VRAM (36 layers in VRAM).
    • Q4_K_M + layer skip: ~0.5 tok/s (fastest reported).
  • Claims up to 83x speedup over naive mmap streaming; bottleneck is PCIe H2D bandwidth (Gen3 x8 ~6.5 GB/s).

Caveats and setup

  • Linux + CUDA 13.1, gcc-14, CC 8.0+ GPU (3090 tested). Optional NVMe on a separate PCIe slot for best results.
  • For NVMe-direct mode, the setup script performs invasive system changes: disables IOMMU, patches NVIDIA DKMS for recent kernels, tweaks CUDA headers, and binds NVMe via VFIO with “unsafe noiommu” mode. Not recommended on multi-tenant/production systems; missteps can break your GPU driver.

Why it matters

  • A clever, low-level approach that makes 70B models usable on consumer GPUs by trading speed for capacity and I/O orchestration. Great for experimentation and edge cases where VRAM is the limiting factor—just be mindful of the heavy-duty system tweaks and modest 70B throughput.

Based on the discussion, here is a summary of the community's reaction:

Performance vs. Practicality Discussion focused heavily on whether 0.2–0.5 tokens/second is usable.

  • Chat vs. Batch: Most users agreed this is too slow for interactive chat, but several (like umairnadeem123) noted it is viable for automated background tasks (batch processing) where latency doesn't matter, offering a private, fixed-cost alternative to APIs.
  • Better Alternatives: Users like flrdtn pointed out that standard CPU offloading (system RAM + GPU) is currently faster than this method, citing ~1.5 t/s on a Ryzen 7950X + 3090.
  • Small Models: Some argued that for interactive use, a high-quality 8B model entirely in VRAM offers a better experience than a crippled 70B model.

Hardware Bottlenecks & Apple Comparisons

  • The Apple Factor: MarcLore and others drew comparisons to Apple’s M-series chips (Unified Memory), which handle 70B models natively with much higher throughput, though at a higher hardware entry price.
  • Author’s Constraints: The author (xsksdf) clarified that their benchmarks are severely bottlenecked by their specific hardware setup—a B450 motherboard limiting the GPU to PCIe 3.0 x8 speeds. A modern PCIe 4.0/5.0 x16 setup would likely yield significantly higher throughput.

The "Why" (PlayStation 2 Origins) In a surprising reveal, the author explained that this project stems from their background in retro-gaming development. They previously built a transformer engine for the PlayStation 2 (PS2-LLM), where the console's tiny 32MB RAM and 4MB VRAM forced them to master DMA (Direct Memory Access) and layer streaming. They simply applied the same "extreme constraint" logic to the RTX 3090.

Cost & Power There was a debate regarding the economics of running this locally versus using cheap APIs.

  • Energy: While esquire_900 calculated it might be cheaper than APIs over time, lvntysvn reminded the thread to factor in the 300W+ power draw of a 3090 running for hours to generate a single report.
  • Utilization: The author noted that due to the I/O bottleneck, the GPU isn't actually hitting full TDP (power limit), so electricity costs might be lower than expected.

zclaw: personal AI assistant in under 888 KB, running on an ESP32

Submission URL | 230 points | by tosh | 125 comments

zclaw: an 888 KiB AI assistant firmware for ESP32

  • What it is: A tiny C-based “agent” for ESP32 boards that turns a microcontroller into a natural-language assistant. It handles schedules (cron-style), GPIO control with guardrails, persistent memory, and user-defined tools. Chat via Telegram or a hosted web relay. Persona options include neutral, friendly, technical, and witty.

  • How it works: Runs fully on-device as an orchestrator with Wi‑Fi, TLS, and certs, but uses cloud LLMs (Anthropic, OpenAI, OpenRouter) for reasoning. Includes provisioning, rate limits (default 100/hour, 1000/day), and optional encrypted credentials in flash.

  • Footprint bragging rights: All-in firmware cap of 888 KiB, including ESP-IDF/FreeRTOS, networking, TLS/crypto, and cert bundle. Current build: ~869,952 bytes. App logic alone is ~35 KiB (~4%); the bulk is networking/TLS/runtime.

  • Hardware and dev: Tested on ESP32-C3/S3/C6 (recommended: Seeed XIAO ESP32-C3). QEMU profile available. One-line bootstrap, secure flash, provisioning, relay/serial benchmarking, and a web relay with mobile chat UI.

  • Why it’s interesting: It shows how much “agent” capability you can pack into a sub‑1 MB firmware on a $5 microcontroller—no local LLM, but solid tool composition, scheduling, and state, all in C.

  • License and repo: MIT. GitHub: https://github.com/tnm/zclaw — Docs: https://zclaw.dev

Notes:

  • Cloud LLM required (not on-device inference).
  • Guardrails for GPIO (including bulk reads).
  • Scripts cover flashing, provisioning, Telegram backlog clearing, emulation, and latency benchmarking.

Here is the summary of the discussion on Hacker News:

zclaw: an 888 KiB AI assistant firmware for ESP32

The comment section explores the utility of running AI agents on bare-metal microcontrollers versus full operating systems, alongside skepticism regarding the "agent" hype cycle.

  • ESP32 vs. Linux for Agents: umairnadeem123 argues that the primary appeal of zclaw is the "zero-maintenance" aspect of the ESP32; unlike a Linux box which requires updates and suffers from OOM kills, an ESP32 provides a simpler, predictable failure mode for always-on orchestration. However, hsbvhbzb counters that this approach introduces new points of failure—specifically reliance on cloud APIs, Wi-Fi stability, and the internet—suggesting that swapping an OS for a microcontroller doesn't inherently solve reliability problems.
  • Tamagotchis and Use Cases: GTP proposed building an "intelligent Tamagotchi" using this stack. tempaccount5050 shared their experience attempting this, noting that an LLM alone isn't enough; the project still requires a state machine to define constants (like "hunger") to prevent the AI from getting stuck in a loop. Others, like post_below, discussed more complex home automation, such as a self-hosted agent that manages grocery lists via Signal and automatically populates browser-based shopping carts.
  • The "Claw" Ecosystem & Protocols: There is confusion regarding the "OpenClaw" ecosystem compared to zclaw. blnsr compared OpenClaw to ROS (Robot Operating System) for distributed nodes, but TheDong quipped that the only real protocol here is English, stating we are in a "post-API world" where natural language turns into bash or browser tool invocations.
  • Security and Hype: The discussion veered into the risks of IoT agents. dlt713705 jokingly envisioned a future where vacuum cleaners declare war on refrigerators via Discord. On a serious note, h4ch1 criticized the "ostrich-head-in-the-sand" enthusiasm for agent frameworks, warning that giving unfettered API and tool access to unverified dependencies (likened to eating "cake made of plastic") is a disaster waiting to happen.
  • Technical Implementation: Dr_Birdbrain and others dismissed the project as merely a "tiny LLM power agent wrapper" connected to the internet, though some appreciated the engineering effort required to fit the TLS stack and runtime into less than 1 MB of flash.

AI uBlock Blacklist

Submission URL | 265 points | by rdmuser | 114 comments

AI uBlock Origin Blacklist: A crowdsourced filter list to hide AI-generated “content farm” sites. GitHub user alvi-se maintains a manually curated uBlock Origin list (and a uBlacklist version for search engines) that blocks domains and specific paths churning out SEO’d, low-value, ad/affiliate-heavy AI articles. Installation is via a one-click subscription link or by adding the raw list URL as a 3rd-party filter in uBlock. The author argues automated detection is unreliable, so entries are added by hand and guided by telltale signs: fluffy/baroque intros, “Comprehensive/Ultimate Guide” titles, few outbound links or sources, and aggressive referral links. Contributors are encouraged to file issues or PRs; the repo avoids blanket-banning platforms like Medium/dev.to by targeting offending blogs/paths only. Despite being personal and somewhat Italy-biased, the maintainer says the list is effective because the same spammy sites recur across searches. As of now: ~213 commits, ~349 stars.

Based on the discussion, here is a summary of the comments:

Concerns Regarding Maintenance and False Positives A significant portion of the discussion focuses on the risks associate with personal, manually curated blacklists. Several users criticize the specific maintainer of this list, describing them as having a "suspicious attitude" and believing themselves to be "infallible."

  • Lack of Recourse: Examples were shared of personal websites being blocked by similar lists on PiHole or uBlock; users noted that requests to be unblocked often go unanswered or are ignored entirely.
  • Domain Churn: Users pointed out that static blacklists fail to account for domain ownership changes. A domain currently hosting AI spam might later be purchased by a legitimate owner, but it remains in a "reputational blackhole" with no easy mechanism for removal.
  • Comparison to Anti-Cheat: The situation was likened to "VAC bans" in gaming, where false positives occur, but the system is treated as absolute.

The State of Search and "AI Slop" Despite the concerns about the list's implementation, many commenters expressed a desperate need for tools to filter AI-generated noise.

  • Search Quality: Users described the current search experience as being drowned in "slop," making it difficult to find human-created content (specifically on platforms like Reddit).
  • "Hater" Lists vs. Utility: There was debate regarding alternative lists (such as the "HUGE AI Blocklist"). Some argued these are merely "hater lists" that block sites for tangentially related reasons (like having an AI widget or unrelated grievances), while others defended aggressive blocking as the only way to improve the user experience.

Side Discussion: AI in the Workplace A tangible sub-thread emerged regarding the use of AI for writing text (emails, reports) in professional settings.

  • "Cosmetic Surgery" Analogy: One user described a coworker who uses Copilot to generate 20-paragraph emails as having "extraordinarily bad cosmetic surgery"—it looks polished at a glance but is fundamentally uncanny and distinct from human communication.
  • Skill vs. Laziness: Commenters debated whether this usage covers for "functional illiteracy" and language barriers, or if it simply encourages laziness and results in "mediocre crap" that colleagues are forced to read.

Alternatives and Technical Solutions Users shared various alternatives to the submitted list, including:

  • uBlacklist: Specifically mentioned as a tool to remove specific domains from search engine results pages (SERPs).
  • AdGuard/PiHole: Discussed as broader network-level solutions, though they suffer from the same false-positive risks if the underlying lists are poor.
  • Other Repos: Links to other GitHub repositories and Gists were shared for those looking for different filtering criteria.

Cord: Coordinating Trees of AI Agents

Submission URL | 151 points | by gfortaine | 75 comments

The pitch

  • Most multi‑agent frameworks make developers predefine roles, graphs, and handoffs. Cord flips this: you give a goal, and the agent plans, decomposes, parallelizes, blocks on dependencies, and asks humans when needed.

What’s different

  • Runtime decomposition: The agent decides the workflow as it goes, not from a static graph or role roster.
  • Spawn vs. Fork: Two context-flow primitives.
    • Spawn: clean slate; only the prompt plus explicit dependencies. Good for independent subtasks.
    • Fork: inherits all completed sibling results. Good for synthesis and analysis.
  • Explicit dependencies and blocking: Tasks can wait on others and on human answers, enabling predictable parallelism.

Example

  • Given “Should we migrate from REST to GraphQL?”, Cord:
    • Spawns parallel research and API audit
    • Asks a human about traffic scale, blocked on the audit
    • Forks a comparative analysis that inherits prior results
    • Writes a tailored recommendation after dependencies resolve

Why it matters

  • Moves from developer-scripted workflows to agent-discovered structure, matching how strong models plan and reason today.
  • Introduces simple, learnable context flow so agents can parallelize without losing necessary shared knowledge.

Under the hood

  • Each agent is a Claude Code CLI process with MCP tools, coordinated via a shared SQLite DB.
  • Minimal API: spawn, fork, ask, complete, read_tree.
  • Roadmap idea: first-class context_query to distill and pass only relevant context to children via a compaction subagent.

The Debate: Dynamic Planning vs. Deterministic Control The discussion centered on a fundamental divide in agentic engineering:

  • Reliability vs. autonomy: Some users argued that strict, discrete topologies (static DAGs) form the only viable path for reliable systems, warning that unconstrained agent planning compounds probabilistic errors.
  • Obsolescence of hardcoding: Counter-arguments suggested that modern models (like Claude 3.5 Sonnet) are now sufficiently capable of planning and decomposition that hardcoding task graphs is becoming obsolete.

Key Technical Feedback

  • Context flow primitives: The community reacted positively to the "Spawn" (clean state) vs. "Fork" (inherited context) distinction, viewing it as a clever strategy to manage context window pollution.
  • Feature Suggestion: One commenter proposed adding a distinct context_query primitive—a mechanism where a subagent requests specific data via natural language query rather than receiving a raw dump of the parent’s context, effectively acting as "context compression."
  • Comparisons: Users drew parallels to Anthropic’s internal tooling (Agent Tums) and Claude Code’s existing capabilities. The OP clarified that while Claude can spawn subagents, Cord aims to enable deeper, recursive trees where subagents can spawn their own sub-subagents.

Framework Fatigue & Skepticism

  • "Not another framework": Several commenters expressed fatigue with the proliferation of orchestrator tools (referencing LangGraph), with some preferring simple shell scripts or "roll-your-own" solutions over adopting new protocols.
  • AI-generated content: A meta-discussion emerged regarding the blog post's writing style; multiple users felt the prose was obviously AI-generated, which they argued detracted from the message reliability, though the OP acknowledged this and promised follow-up data.

Large Language Model Reasoning Failures

Submission URL | 40 points | by T-A | 80 comments

Large Language Model Reasoning Failures (Song, Han, Goodman) — a TMLR 2026 survey with Survey Certification — maps where LLMs still stumble, even on “simple” problems, and organizes a scattered literature into a single playbook.

What’s new

  • A two-axis taxonomy:
    • Types of reasoning: embodied vs. non-embodied; the latter split into informal (intuitive) vs. formal (logical).
    • Types of failures: fundamental (architecture-level), application-specific (domain-bound), and robustness issues (brittleness to small prompt/task variations).
  • For each failure class: clear definitions, evidence from prior studies, suspected root causes, and mitigation strategies collected from the literature.
  • A curated GitHub repository aggregating papers on LLM reasoning failures for quick entry into the area.

Why it matters

  • Gives researchers and product teams a shared vocabulary to diagnose errors, design evaluations across reasoning modes, and choose mitigations.
  • Highlights that many “reasoning” wins remain fragile, with inconsistent behavior under minor changes.

Links

  • Paper: arXiv:2602.06176 (with arXiv-issued DOI)
  • Repository: included via the paper’s GitHub link

The Discussion

The Hacker News discussion focuses heavily on whether the paper’s claims about "fundamental" failures hold up against the most recent state-of-the-art models, alongside a broader debate about the nature of machine intelligence.

Arithmetic, Tools, and "Cheating" A contentious debate erupted over the paper's assertion that LLMs fail at basic arithmetic (specifically large-number multiplication).

  • User smnwrds attempted to "falsify" the paper's claims by testing 20-digit multiplication on GPT-o1 Pro, which solved the problems correctly.
  • Others, notably rybswrld and chcknmprnt, countered that this doesn't prove the LLM can reason; rather, it highlights how frontier models increasingly rely on hidden tools. They argued that models often offload math to internal Python interpreters or obscure "Chain of Thought" processes effectively "faking" the alignment.
  • When chcknmprnt tested a local model (Mistral) where tool-use was explicitly disabled, the model hallucinated the answer, supporting the paper's thesis. smnwrds dismissed this as picking on "the worst model," while others maintained that even closed-source models likely rely on undocumented internal subsystems (like specific Rust optimizations) to patch these fundamental architectural weaknesses.

Anthropomorphism vs. Architecture Top-level commenter srgmtt welcomed the paper as a necessary check against anthropomorphism.

  • They argued that the identified failures—such as inability to count like a toddler or handle object permanence—stem from the nature of "next-token predictors" being fundamentally different from human general intelligence.
  • lnsbr and mttmg added that unlike humans, who evolve and maintain long-term dynamic memories, LLMs rely on frozen weights, making the comparison to human reasoning inherently flawed.
  • otabdeveloper4 cynically noted that these systems are sold as AGI primarily to sustain stock market narrratives.

Social and Moral Fragility Finally, Lapel2742 highlighted the paper's points on social reasoning failures.

  • The commenter ridiculed the idea that models are ready for ethical decision-making, noting they struggle with social norms and cultural context.
  • They joked that the industry has successfully created AI in the image of "Techbro-CEOs" rather than a system capable of broadly congruent human values. rnlszlrn agreed, suggesting current models embed values that are incompatible with large percentages of the global population.

Show HN: AI writes code – humans fix it

Submission URL | 5 points | by stasman | 3 comments

Humans-on-demand for broken AI code: a 24-hour bug-fix marketplace

A new service targets the growing “AI wrote it, now it’s broken” gap. You post a bug, set a price (from $49), and a vetted human developer delivers a fix within 24 hours—no meetings, no chat, just a PR.

Key details:

  • Workflow: Post task with context/screenshots → set your price → a verified dev gets read-only repo access → they propose a fix and submit a delivery → on approval, you receive a pull request.
  • Pricing: You choose the bounty (min $49) + 10% platform fee. Payment is charged when a dev accepts, held in escrow, released on your approval. If no one picks it up in 24 hours, the hold auto-expires. Cancel anytime before acceptance.
  • Quality/safety: Developers are manually vetted via LinkedIn/GitHub. You get 1 free revision if the first attempt misses. If deadlines slip or no one picks it up, you’re refunded.
  • Positioning: “Introvert-friendly” debugging—no calls, fast turnaround—aimed at users of tools like Bolt, Replit, Cursor, Claude Code, Windsurf, and Base44.

Why it matters: As AI code-gen accelerates, this is a lightweight, SLA-backed alternative to hiring a freelancer or slogging through fixes yourself—human-in-the-loop debugging as a service.

Humans-on-demand for broken AI code: a 24-hour bug-fix marketplace A new service proposes a bounty-based marketplace (minimum $49) where vetted developers fix broken, AI-generated code within 24 hours via pull request, functioning as a "human-in-the-loop" layer for tools like Replit or Cursor.

Discussion:

  • The Model: One commenter pointed out that this approach is backed by research suggesting human-AI pairs consistently outperform AI working autonomously.
  • Technical Glitches: Early feedback included a bug report regarding the onboarding process, with a user noting that Stripe incorrectly flagged their location as the Netherlands. They noted the idea was "cool" despite needing to contact the developer to resolve the payment issue.
  • Developer Experience: Sentiment regarding the work itself was mixed, with one user remarking that the prospect of fixing broken AI code "sounds miserable."

Why is Claude an Electron app?

Submission URL | 395 points | by dbreunig | 410 comments

Why Claude (and so many others) still ship as Electron apps, even in the agent era

  • The pitch: If coding agents can turn a spec and test suite into cross-platform code, why not ship snappy native apps per OS instead of bundling a browser with Electron?
  • The reality: Agents excel at the first 90%, but the last 10%—edge cases, real‑world quirks, regressions, and ongoing support—is where costs explode. Maintaining three native codebases (Mac/Win/Linux) triples the surface area for bugs and support.
  • Case in point: Anthropic’s much‑touted agent swarm spent ~$20k building a Rust‑based C compiler that flew through early tests but hit a wall on stability and completeness—impressive, yet largely unusable without heavy human cleanup.
  • Why Electron wins today: One codebase, familiar web stack, and instant cross‑platform reach outweigh bloat, lag, and weaker OS integration for most teams. The incentives favor shipping once over hand‑holding agents to production‑ready parity across three native apps.
  • Bottom line: Spec‑driven, agent‑powered native builds are promising, but the last mile and ongoing maintenance keep Electron in the lead—for now—even for AI leaders like Anthropic.

Based on the discussion, here is a summary of the comments:

The Insider Perspective A commenter identifying as an engineer on the project noted that the team had previous experience with Electron and preferred building non-natively to share code between web and desktop. However, they acknowledged that engineering tradeoffs might change in the future.

User Experience: Terminal vs. Desktop Users drew a sharp distinction between Anthropic's tools.

  • Claude Code (CLI): Was described as "magical" and highly effective, even on single terminals.
  • Claude Desktop (Electron): Received significant criticism for poor performance. Users reported it turning laptops into "toasters," causing fans to run wildly, and suffering from lag/freezing (one user noted delays of multiple seconds when switching tasks).
  • Workarounds: Some users resort to "disposable conversations" or stick strictly to the terminal interface to avoid the resource heaviness of the desktop app.

The "Coding is Solved" Irony A major theme of the discussion was the perceived contradiction between Anthropic’s marketing and their tech stack choices.

  • The Paradox: Commenters questioned why, if Claude is capable of "solving coding" or effortlessly porting code between languages, Anthropic cannot use their own agent to maintain three native codebases (Mac/Windows/Linux) instead of relying on Electron.
  • The Rebuttal: Others argued that "coding" isn't the bottleneck—maintenance is. Even if AI generates the code, maintaining three separate stateful architectures is a logistical nightmare compared to deploying a single web-stack application.

The Broader Electron Debate The thread evolved into a classic debate over the viability of Electron:

  • Defenders: Argued that performance complaints are often hyperbole. They cited VS Code and Gmail as examples of complex, successful web-stack applications. Some argued that "native app development is dead" outside of gaming and walled gardens (iOS), and that the browser is the only runtime that matters.
  • Detractors: Countered that VS Code is an outlier that relies heavily on native modules (Rust/C++) and WebGL optimizations to function well, implying standard Electron apps remain "junk." Users pointed to native alternatives (like Neovim or Thunderbird) as proof of the superior efficiency and speed of native code compared to web technologies.

How an inference provider can prove they're not serving a quantized model

Submission URL | 67 points | by FrasiertheLion | 48 comments

Tinfoil’s “Modelwrap” aims to solve a long‑standing gripe with inference APIs: you can ask for a specific model, but you can’t really know what you got. Providers can silently swap in different quantizations, tweak context windows under load, or drift over time—something users have observed across vendors and even within the same vendor.

What they built: verifiable inference that binds an API call to an exact set of model weights at runtime, without changing app code.

How it works

  • Public commitment to weights: Tinfoil publishes a single Merkle-tree root hash for the model’s weight files (e.g., 140 GB split into 4 KB blocks).
  • Enclave attestation, extended to data: They use secure enclaves, but go beyond “what binary booted” by attesting two things at launch: the committed root hash and the presence of an enforcement mechanism.
  • Kernel-enforced verification on every read: dm-verity in the Linux kernel checks each disk block read against the Merkle tree; if any byte doesn’t match the committed root, the read fails with an I/O error. Apps like vLLM don’t need modifications and can’t accidentally read uncommitted bytes.
  • Client-side verification: On each request, clients can verify the enclave’s attestation report contains the expected root hash and dm-verity configuration, tying the running server to the public commitment.
  • Analogy: This is the same mechanism behind Android Verified Boot (root hash + kernel-enforced Merkle checks), repurposed for model weights.

Why it matters

  • Proves you’re hitting the exact weights you pinned (no silent quantization or model swaps).
  • Stabilizes evals and regression tracking across time/providers.
  • Works for closed-source models too: you can’t see the weights, but you can verify you’re getting the same committed bits every time.

Caveats and open questions

  • Scope: This guarantees the bytes read from disk match the commitment; it doesn’t by itself prove anything about post-load transformations or runtime configuration unless those are also covered by attestation.
  • Trust base: You’re trusting the enclave/CPU vendor’s attestation and kernel integrity.
  • Practicalities: Update/rollout mechanics, performance overhead of dm-verity, and how broader server config (e.g., context window, KV cache policies) is pinned weren’t detailed here.

Bottom line: Modelwrap turns “trust us” into “verify us,” giving API users a cryptographic handle (a root hash) they can pin to—and a kernel-enforced path that makes serving anything else fail fast.

The discussion revolves around the technical limitations of "black box" verification (checking outputs) versus the cryptographic verification proposed by Tinfoil, with the author (FrasiertheLion) answering questions about the specific security architecture.

The feasibility of "Output Checking" The thread began with users questioning why complex attestation is necessary when users could simply check for deterministic outputs using a fixed seed.

  • The Consensus: Commenters (including trpplyns, jshlm, and msrblfnc) argued that checking outputs is unreliable. Floating-point math is not truly associative, meaning the order of operations matters.
  • Hardware Variance: At scale, providers split models across different GPU/CPU combinations and use optimizing compilers that change instruction scheduling. This results in slight numerical differences that break strict determinism, making it impossible to distinguish between a benign hardware change and a malicious model swap based on output alone.
  • Benchmarking issues: Aurornis noted that while external benchmarking sites exist, they are expensive to maintain and often produce noisy data rather than definitive proof of model degradation.

Attestation Mechanics & Trust A significant portion of the discussion focused on how the client effectively trusts the server.

  • The Mechanism: FrasiertheLion explained that the system relies on hardware-backed enclaves (Intel TDX, AMD SEV-SNP, Nvidia Confidential Computing).
  • Preventing "Replay" Attacks: Users (rbls, vrptr) asked how a client knows the provider isn't faking the attestation report. The author clarified:
    1. The enclave generates an ephemeral key pair at boot.
    2. The public key is embedded in the hardware-signed attestation report.
    3. The client encrypts their request using that public key.
    4. Only the specific, verified enclave instance can decrypt and process the request, preventing Man-in-the-Middle attacks or spoofed reports.
  • Trust Anchor: jlsdrn observed that this technology effectively shifts the "root of trust" from the API provider (who might cut corners) to the hardware manufacturer (Intel/AMD), who certifies the chip state.

Other Notes

  • Apple: There was interest in Apple’s similar approach with "Private Cloud Compute," which users felt offered strong integrity guarantees due to Apple's control over the entire hardware/software stack.
  • Quantization: rbrnd noted that quantization isn't inherently bad—users often want the trade-off of 99% quality for 50% cost—but the implication is that transparency about which version is running remains the key issue.

AI Submissions for Fri Feb 20 2026

Ggml.ai joins Hugging Face to ensure the long-term progress of Local AI

Submission URL | 786 points | by lairv | 206 comments

Headline: ggml.ai (team behind llama.cpp) joins Hugging Face to power the next phase of Local AI

What happened:

  • Georgi Gerganov and the ggml.ai team are joining Hugging Face while continuing to lead and maintain ggml and llama.cpp full-time.
  • The projects remain 100% open source and community-driven; technical decisions stay with the existing maintainers.
  • Hugging Face will provide long-term resources to sustain and grow the ecosystem.

Why it matters:

  • llama.cpp and the GGUF ecosystem have become a backbone for running powerful models locally on consumer hardware. This move shores up sustainability and accelerates support for new models and quantizations.
  • It formalizes a productive collaboration: HF engineers have already contributed core features, multimodal support, an inference server/UI, additional architectures, GGUF compatibility, and integrations with HF Inference Endpoints.

What to expect next:

  • Tighter, near “single-click” integration between transformers and ggml/llama.cpp for broader model coverage and easier validation.
  • Better packaging and UX to make local model deployment simpler and more ubiquitous.
  • Faster turnaround for new model support and quant releases.

Bigger picture:

  • This is a clear bet on local inference as a serious alternative to cloud AI, aiming to build an efficient, open inference stack across devices—framed by both teams as a step toward widely accessible, open-source “superintelligence.”

Impact for developers and users:

  • Smoother pipelines from transformers to GGUF/llama.cpp.
  • Improved tooling and installers for desktop/server setups.
  • Continued openness and community autonomy, with added stability and resourcing from HF.

Open questions to watch:

  • Details of governance and roadmap prioritization under the new arrangement.
  • How quickly new architectures and multimodal features flow through transformers → GGUF → llama.cpp.

Here is the daily digest summary for this submission and the accompanying discussion.

Top Story: ggml.ai (llama.cpp) joins Hugging Face

The Gist Georgi Gerganov and the team behind ggml.ai (the creators of llama.cpp and the GGUF file format) differ joining Hugging Face. The team will continue to work on their projects full-time with a commitment to keeping them 100% open-source and maintaining independent decision-making. Hugging Face is stepping in to provide the long-term resources and infrastructure needed to sustain the ecosystem.

Why it Matters This consolidates the "local AI" stack. llama.cpp has become the standard for running LLMs on consumer hardware (MacBooks, gaming PCs, etc.). By officially aligning with Hugging Face, users can expect a seamless pipeline where new models uploaded to HF are immediately compatible with local inference tools (GGUF), along with faster support for new architectures and easier "one-click" deployment options.

Hacker News Discussion Summary

The discussion on Hacker News focused heavily on the logistics of local AI, the sustainability of Hugging Face, and the technical nuance of quantization.

  • Hugging Face as the "Real" OpenAI: Commenters praised Hugging Face as the true champion of open-source AI, noting the immense, likely expensive service they provide by hosting petabytes of model weights for free. Several users expressed hope that HF has a sustainable business model so this resource doesn't disappear.
  • The "Data Cap" Bottleneck: A significant portion of the thread revolved around the practical struggles of local AI enthusiasts. Users reported downloading terabytes of models per week, triggering data caps from residential ISPs (like Comcast and AT&T). This led to a debate on why Hugging Face doesn't utilize BitTorrent to offload bandwidth costs. The counter-argument was that HF likely avoids torrents to maintain accurate download metrics (vanity metrics) and to manage access control for "gated" models (like Llama 3).
  • Quantization Quality Control: There was a technical exchange (involving Daniel Han of Unsloth) regarding the reliability of quantized models. Users expressed concern that aggressive quantization "lobotomizes" models in ways that are hard to detect with standard benchmarks (like perplexity). The consensus was that running comprehensive benchmarks on every quantization format is currently too expensive ($1k–$100k) for most open-source maintainers.
  • East vs. West Open Weights: A debate emerged regarding the quality of Western open models (like Mistral) versus Chinese open models (DeepSeek, Qwen, Kimi). Some users argued that Chinese models are currently winning on efficiency and reasoning benchmarks, while others countered that they still lack Western cultural knowledge and nuance required for tasks like creative writing or roleplay.

Every company building your AI assistant is now an ad company

Submission URL | 258 points | by ajuhasz | 135 comments

Juno opens pre-orders for a local, always-on AI assistant—and argues ads make rival assistants a surveillance risk

  • The pitch: Adam Juhasz says the next wave of assistants must be “always on”—hearing and seeing context across rooms, wearables, and time—to be truly proactive. Wake words are a dead end because the most valuable context happens in natural, unprompted conversations.

  • The clash: He argues nearly every major assistant effort is now ad-funded, creating structural incentives to capture and monetize continuous audio/visual data. Policies can change; architectures can’t. “Policy is a promise. Architecture is a guarantee.”

  • The remedy: Keep inference local. If models run on-device/in-home, there’s no API to hit, no telemetry to harvest, no data to subpoena. He cites Amazon’s Alexa/Ring histories as cautionary tales and notes OpenAI’s recent move to ads as a sign of where cloud-first models trend.

  • Tech claim: The edge stack is “ready now.” A small, fanless box can run real-time STT, memory, reasoning, and TTS with acceptable quality for home tasks. Remaining issues are framed as memory architecture and context handling, not model size.

  • The product: Juno’s Pioneer Edition—positioned as a local-first, always-listening home assistant—opens for pre-orders. The post doubles as a manifesto for on-device AI over ad-backed cloud assistants.

Why it matters: If “ambient AI” is inevitable, the business model and deployment architecture will decide who controls the feed from your life—users or advertisers. This piece argues a new category—local AI appliances—may be the only trustworthy path.

The Discussion The comment thread centers on the tension between Juno’s privacy promises and the inherent risks of "always-on" recording, with the creator (jhsz) actively addressing technical concerns.

  • Legal & Physical Risks: Users argued that "local-only" does not equal "immune to law enforcement." Commenters like zmmmmm and pxys noted that unless the device is legally and technically resistant to compelled decryption (e.g., warrants), a box full of intimate family conversations is a liability. There were also concerns about what happens if the hardware is stolen or if Juno is acquired by a data-hungry conglomerate.
  • The Creator's Defense: jhsz engaged with critics, explaining that the system is designed to minimize long-term raw data storage (tuning memory to extract facts and "forget" the audio quickly) and relies on hardware-encrypted storage (Nvidia Jetson). They argued that while no solution is magical, data "inside the walls" is fundamentally safer than data in a corporate cloud.
  • Target Audience Mismatch: Several users (BoxFour, bndrm) pointed out a contraction in the pitch: the specific demographic that cares enough about privacy to buy a local AI appliance is often the same demographic that refuses to have any always-on microphones in their home, regardless of architecture.
  • Philosophical Pushback: The debate extended to the utility of "perfect memory." Citing Borges’ Funes the Memorious, some users questioned whether an AI that remembers every interaction is a helpful tool or a dystopian burden, suggesting that human forgetfulness is a feature, not a bug.

The path to ubiquitous AI (17k tokens/sec)

Submission URL | 791 points | by sidnarsipur | 429 comments

Taalas: turning models into chips for 10x faster, cheaper inference; launches hard‑wired Llama 3.1 8B at 17K tok/s

What’s new

  • Taalas claims it can convert any AI model into custom silicon in ~2 months, producing “Hardcore Models” that are ~10x faster, ~20x cheaper to build, and ~10x lower power than GPU-based inference.
  • First product: HC1 chip hard‑wiring Llama 3.1 8B, offered as a chatbot and inference API. Claimed performance: 17,000 tokens/sec per user (1k/1k input/output), outpacing Nvidia H200/B200 baselines and specialist stacks (Groq, SambaNova, Cerebras) in their chart.
  • Design philosophy:
    • Total specialization: per‑model ASICs for maximum efficiency.
    • Merge storage and compute: single chip at DRAM‑level density to eliminate HBM, massive I/O, advanced packaging, liquid cooling.
    • Radical simplification: simpler systems = much lower total cost.
  • Flexibility: configurable context window; supports fine‑tuning via LoRA. Gen‑1 uses aggressive 3‑/6‑bit quantization (some quality hit). Gen‑2 (HC2) moves to standard 4‑bit floating point with higher density/speed.
  • Roadmap: mid‑sized reasoning LLM on HC1 this spring; “frontier” LLM on HC2 targeted for winter.

Why it matters

  • If these numbers hold up, inference latency and cost could drop by an order of magnitude—key for real‑time agents and large‑scale deployment without ballooning data centers.
  • It’s a bolder bet than general accelerators (e.g., Groq, Cerebras): a per‑model ASIC pipeline that trades flexibility for speed/efficiency, mitigated by a fast 2‑month turnaround and LoRA support.

Caveats/questions

  • “Compute + DRAM‑density storage on one chip without exotic tech” is a big claim; details on process, concurrency, and memory architecture are sparse.
  • Hard‑wiring weights risks rapid obsolescence as models evolve; economics hinge on how often new ASICs are spun and how broadly each model is adopted.
  • Benchmark framing (tokens/sec per user, aggressive quantization) may mask throughput/quality trade‑offs; quality metrics vs GPU baselines aren’t shown.

How to try

  • A beta chatbot demo and inference API for the HC1 Llama 3.1 8B are live.

Based on the discussion, here is a summary of the comments:

Technical Speculation & Credibility

  • Commenters analyzed the likely architecture, theorizing that Taalas is using specialized "mask ROM" fabrication where weights are stored via physical transistor traits (e.g., drive strength) rather than traditional memory cells. This enables single-transistor, 4-bit multiplication and massive density on mature nodes like TSMC 6nm.
  • While some were skeptical of the claimed 2-month "tape-out" turnaround, others noted the founders’ pedigree (veterans from Nvidia, AMD, and Tenstorrent), suggesting they have the specific expertise and connections to pull off such complex VLSI feats.

Performance vs. The Status Quo

  • Users distinguished Taalas’s metrics from Nvidia’s. While an H200 serves high throughput via massive batching (high latency per user), Taalas appears to offer massive throughput at very low latency (milliseconds) for single users.
  • The 17k tokens/second speed was described as a legitimate "quantitative change leading to a qualitative change," enabling real-time voice, video generation, and agentic workflows that current GPUs cannot handle efficiently.

Strategic Use Cases: Speculative Decoding

  • A significant portion of the debate focused on using these chips for speculative decoding. Instead of just being a standalone chatbot, the Taalas chip could serve as a "draft model" to rapidly generate tokens that are then verified by a larger frontier model, significantly speeding up the total inference of massive models.
  • Caveats were raised regarding tokenizer compatibility and whether the rigid nature of hard-wired chips allows them to pair effectively with evolving frontier models in this way.

Risks & Environmental Concerns

  • Obsolescence: The primary economic concern is whether a model remains relevant long enough to justify a dedicated hardware run. However, some argued that specific domains (robotics, control systems, basic coding) are stable enough to benefit from frozen, highly efficient models.
  • E-waste: Critics worried about the environmental impact of manufacturing chips that become useless once the model weights are outdated, comparing them to single-purpose ASIC miners or disposable tech.

Pi for Excel: AI sidebar add-in for Excel

Submission URL | 104 points | by rahimnathwani | 29 comments

Pi for Excel: an open-source, multi-model AI sidebar that can read and edit your spreadsheets

What it is

  • A Microsoft Excel add-in that embeds an AI agent directly in a sidebar. It can read your workbook, make changes, explain formulas, search the web, and run task-specific “skills.”
  • Bring-your-own model: works with Anthropic (Claude), OpenAI, Google Gemini, GitHub Copilot, or any OpenAI-compatible endpoint. You can switch models mid-conversation.
  • MIT-licensed. Repo: https://github.com/tmustier/pi-for-excel — Demo/site: https://pi-for-excel.vercel.app

Why it’s interesting

  • Deep Excel tooling: 16 built-in actions the AI can call, including read/write ranges, fill formulas, trace dependencies, explain formulas, search across sheets, modify structure, apply formatting, manage comments, and even restore from automatic backups.
  • Auto-context: before each turn the model gets a blueprint of your workbook, your current selection, and recent edits—so you don’t have to describe what you’re looking at.
  • Safety and control: write operations have overwrite protection and auto-verification, with one-click revert via checkpoints.
  • Extensible: install sandboxed sidebar extensions the AI can generate for you; integrates optional web search (Serper/Tavily/Brave) and an MCP gateway to connect custom tools.
  • Power-user extras (behind /experimental): tmux bridge for local terminal control, Python/LibreOffice bridge, external skills discovery, stricter extension permissions.

How to use

  • Install by sideloading the manifest (macOS and Windows instructions in the README). Click “Open Pi” in Excel’s ribbon.
  • Connect a provider via API key or OAuth, or point it at a custom OpenAI-compatible gateway.
  • Try prompts like “What sheets do I have?” or “Summarize my current selection,” then ask it to fill formulas or format ranges per your house style.

Dev setup

  • Node 20 + mkcert for local HTTPS (Office.js requirement), Vite dev server, sideload the dev manifest. Quick-start steps are in the repo.

Caveats

  • As with any LLM-powered Excel agent, workbook data you share may be sent to your chosen model/provider. Good fit if you want Copilot-like assistance but prefer open-source and BYO keys.

Here is a summary of the discussion:

  • Origins and Inspiration: The author (tmstr) joined the thread to explain that the project was built to bring the spirit of the open-source coding agent Pia to Excel. They noted that the tool uses a virtual filesystem and wraps specific "skills" to interact with the spreadsheet.
  • Web Compatibility: Users confirmed the add-in works on the web version of Excel (allowing use on Linux), provided the manifest is sideloaded via localhost. However, users noted that sideloaded dev-mode add-ins typically disappear after a week.
  • Technical Constraints: There was discussion regarding context window limits when handling large datasets. The author acknowledged that large tables currently overflow the context when passing results to the LLM, but mentioned optimizations and a potential Python bridge to handle larger sheets in the future.
  • Deployment Headaches: While users praised the modern Office JS API for functionality, several lamented the distribution process, describing the MS App Store and side-loading requirements as significant hurdles compared to the actual coding.
  • Action vs. Text: Users verified that the agent can perform functional tasks—such as creating charts and formatting tables—rather than just answering questions, with one user comparing the desired utility to an AI graphic designer that performs steps in Photoshop.
  • Risks: One user warned about API bans, claiming their Google account was banned after 48 hours of heavy usage with a similar tool, highlighting the risks of hitting consumer AI endpoints with automated tasks.

Consistency diffusion language models: Up to 14x faster, no quality loss

Submission URL | 215 points | by zagwdt | 96 comments

Consistency Diffusion Language Models (CDLM) promise big speed gains for diffusion LMs without hurting quality. The team from SNU, UC Berkeley, and Together AI reports up to 14.5x lower latency on math and coding tasks by making diffusion decoding both more parallel and cache-friendly.

Why this matters

  • Diffusion LMs refine masked text over multiple steps and can finalize multiple tokens per iteration, but in practice they’re slow: full bidirectional attention blocks KV caching, and many refinement steps are needed for quality.
  • CDLM tackles both, pushing diffusion LMs closer to practical, high-throughput deployment while keeping their advantages (bidirectional context, infilling, refinement).

What’s new

  • Exact block-wise KV caching for diffusion LMs: Train a student with a block-causal mask (attends to prompt, prior blocks, and current block) distilled from a fully bidirectional teacher. This preserves quality while enabling standard KV reuse for finished blocks.
  • Reliable multi-token finalization: Within each block, the model confidently finalizes several tokens in parallel, reducing the number of refinement steps without degrading output.

How it works (post-training recipe)

  • Trajectory collection: Run a strong bidirectional DLM as teacher to record token-by-token refinement trajectories and hidden states under a conservative setup (e.g., 256-token generations in blocks of 32, one token finalized per step) to capture high-quality signals.
  • Train a block-causal student with three losses:
    • Distillation on newly unmasked tokens (match teacher’s reconstructed distributions).
    • Consistency on still-masked tokens (align intermediate predictions with block-complete predictions via stop-gradient).
    • Auxiliary masked-denoising (retain general masked-token and reasoning ability).
  • Inference: Decode block-wise autoregressively with exact KV reuse for prompt and finished blocks; within a block, finalize tokens in parallel using a confidence threshold; early stop on EOS.

Results

  • On Dream-7B-Instruct (a diffusion LM), CDLM achieves up to 14.5x latency speedups on math/coding benchmarks with comparable quality, using far fewer refinement steps.
  • No extra heuristic knobs required; the gains come from the caching-compatible mask plus consistency-based multi-token finalization.

Takeaway CDLM is a clean, post-training path to make diffusion language models fast enough to compete with autoregressive LMs in interactive settings—keeping diffusion’s bidirectional strengths while unlocking KV caching and parallel token finalization. Caveats to watch: the need for offline teacher trajectories and how well gains generalize beyond math/coding and the tested model size.

Here is a summary of the discussion:

Technical Mechanisms and "Drafting" vs. Autoregression Users discussed the fundamental differences between Consistency Diffusion Language Models (CDLM) and standard Autoregressive (AR) models. bpp and others used analogies (referencing "British munchkin cats" and "inserting a refrigerator into a kitchen") to illustrate how diffusion models handle global context. Unlike AR models that generate tokens linearly (left-to-right), diffusion models treat text generation more like editing a draft—modeling the probability of the entire output structure simultaneously. This allows for infilling and correcting "invalid" structures that AR models struggle with. There was brief debate regarding hybrid AR+Diffusion models (citing LLaDA), with some analyzing whether combining them risks losing the reasoning benefits of pure diffusion.

Model Sizing, Efficiency, and Data Quality The conversation shifted significantly when MASNeo expressed a desire for researchers to build larger models. wngrs and others countered that frontier models likely haven't grown substantially in parameter size over the last year or two. Instead, the industry focus has shifted to efficiency and data quality. mgclhpp cited technical reports from Qwen, noting that cleaner training data and longer pre-training allow smaller models (e.g., Qwen 2.5/3) to rival or outperform older, larger models. Users suggested the industry is moving toward "density" and "singularity"—highly efficient, smaller models correcting each other in parallel, rather than monolithic giants.

Pricing Conspiracies and "Enshittification" A thread of skepticism emerged regarding the business practices of major AI labs (OpenAI, Anthropic, Google). Users speculated that the lack of public parameter counts allows companies to maintain high pricing tiers ($20-$200/month) while actually serving cheaper, optimized models. rthmsthms and bdbdbdb argued that speed improvements often look suspiciously like "rebranded" models or silent downgrades to cut compute costs (enshittification), effectively keeping margins high while technological costs drop. Specific grievances were aired regarding the pricing and performance confusing between versions like Claude Sonnet 3.5 vs. 3.6 and Opus.

Nvidia and OpenAI abandon unfinished $100B deal in favour of $30B investment

Submission URL | 294 points | by zerosizedweasle | 325 comments

Nvidia and OpenAI reportedly ditch $100B mega-deal, pivot to $30B investment

What happened

  • The Financial Times reports that Nvidia and OpenAI have abandoned an unfinished deal said to be worth around $100 billion, opting instead for a roughly $30 billion investment. The detailed terms aren’t public.

Why it matters

  • A move from a single, gigantic commitment to a smaller one suggests both sides prefer staged, flexible financing and capacity build-out over an all-in mega arrangement.
  • It could temper near-term expectations for OpenAI’s dedicated compute scale-up and keep Nvidia’s options open on who gets priority access to its next-gen GPUs.
  • The shift may also reflect practical constraints (supply chains, regulatory scrutiny, financing costs) and a desire to avoid heavy concentration risk.

What to watch

  • How the $30B is structured (equity stake, JV, supply pre-pays, or a mix).
  • Any exclusivity or priority-allocation clauses for Nvidia hardware.
  • Knock-on effects for OpenAI’s cloud partnerships and training timelines.
  • Whether other financiers or hyperscalers step in to fill the gap.

Based on the discussion, here is a summary of the community's reaction:

OpenAI’s "WeWork Moment" and the IPO Rush Much of the conversation draws a sharp comparison between OpenAI and WeWork. Commenters suggest OpenAI (and Anthropic) may be rushing toward an IPO to capitalize on "unlimited AI hype" before the market scrutinizes their lack of a clear path to profitability. Users expressed skepticism about the company's cost structure, viewing the pivot to a smaller investment deal as a potential signal that the "blind check" era of funding is ending.

Nvidia: Enron, Cisco, or Just a Cyclical Bust? Speculation regarding Nvidia’s future was polarized. Some questioned if Nvidia could become "Enron 2.0" (referencing fraud), though this was largely dismissed because Nvidia sells tangible, high-demand hardware. A more popular comparison was Cisco or Sun Microsystems during the Dotcom boom—highly profitable companies that faced a massive correction when the bubble burst. There was debate over whether Nvidia could easily pivot back to selling GPUs to gamers if enterprise demand dries up, with some arguing that retreating from high-margin datacenter chips would be a painful restructuring.

Tangible Assets vs. Model Weights A sub-thread contrasted OpenAI with SpaceX. Participants noted that SpaceX has tangible assets (rockets, Starlink satellites, launch infrastructure) and a clear moat, whereas OpenAI faces high operating costs with a product (LLMs) that some view as lacking the same defensive "hard tech" barriers.

Hype Fatigue and Skepticism Calculated skepticism regarding AGI (Artificial General Intelligence) is growing. Commenters noted that aggressive predictions about AI replacing white-collar work or programmers by 2024/2025 are already "aging like milk." Citations regarding Ed Zitron (a tech critic who predicted investment scale-backs) were shared, suggesting that while the technology is useful, the financial expectations attached to it may be decoupling from reality.

'A Big Fuck You to Big Tech': New Jersey Residents Defeat AI Data Center

Submission URL | 53 points | by abdelhousni | 18 comments

Headline: New Jersey city kills AI data center, chooses public park instead

Summary: The New Brunswick, NJ City Council voted to cancel a planned 27,000-square-foot AI data center at 100 Jersey Ave and will add a public park to a redevelopment that already includes 600 apartments (10% affordable) and small-business warehouses. Hundreds packed the meeting to oppose the project, citing fears of higher electricity and water bills and environmental impacts. Local groups, including the NAACP, argued the facility would drain community resources. After the vote, residents celebrated; organizers framed the decision as prioritizing neighborhoods over Big Tech.

Why it matters:

  • Signals rising local pushback to AI infrastructure over energy and water use, grid strain, and limited community benefits.
  • May push data center developers toward stronger community benefit agreements, better transparency on resource use, and siting in areas with surplus power/water or on-site renewables.
  • Highlights tension between tech-driven development, housing affordability, and public amenities.

What to watch:

  • Whether developers appeal or relocate, and if future proposals include stronger environmental mitigations or utility guarantees.
  • If other municipalities adopt stricter zoning or disclosure requirements for AI/data center projects.
  • How New Brunswick funds and executes the park addition alongside the broader redevelopment.

Source: Common Dreams (Brett Wilkins), Feb 20, 2026.

The Discussion

The Hacker News discussion moves beyond New Brunswick's specific decision to a broader critique of the technology sector's relationship with the public and the power grid.

  • Tech Fatigue: Several users expressed that "Big Tech" has lost the "cool" factor it possessed during the 90s and 2000s (likening the old vibe to "sk8erboi culture"). Commenters described current tech giants as "soulless," "sanitized," and overly integrated into daily life, viewing them now as "the establishment" rather than exciting innovators.
  • Data Center Value: Participants debated the local value of data centers, describing them as "big ugly boxes" that employ few people compared to the space they occupy. One user contrasted them with nuclear plants which, despite risks, historically evoked a sense of national industrial pride that server farms lack.
  • The Energy Dilemma: A significant portion of the thread focused on whether cities could leverage data center demand to fund new power infrastructure (solar, battery, or nuclear) rather than banning them.
    • Skepticism of Commitments: Critics argued that even if cities demand new power generation as a prerequisite, "energy is fungible" and corporations will use legal loopholes, subsidiaries, or financial derivatives to avoid delivering actual local capacity while still causing grid congestion.
    • Implementation Issues: Others noted that solar requires expensive battery storage to be viable for 24/7 uptime and nuclear takes 15+ years to build, leading to fears that companies will simply revert to gas turbines or result in costs being "dumped onto regular people."
    • Regulatory Solutions: Ideas were floated to strictly regulate business power rates or use performance bonds (seized if timelines aren't met) to enforce infrastructure promises, though skeptics maintained that a "blanket refusal" is often the only safe move for a municipality.

AI Submissions for Thu Feb 19 2026

AI is not a coworker, it's an exoskeleton

Submission URL | 416 points | by benbeingbin | 412 comments

Core idea: Treat AI as an “exoskeleton” that amplifies human judgment, not as an autonomous coworker. Framed this way, AI succeeds by handling scale and surfacing insights while humans make the calls. The post contrasts this with “agentic AI,” which often disappoints because it lacks the implicit context people carry.

Evidence via real exoskeletons:

  • Manufacturing: Ford’s EksoVest cut injuries 83%; BMW reports 30–40% reduced effort with Levitate vests; German Bionic’s Cray X supports up to 66 lbs and customers saw 25% fewer sick days.
  • Military: Sarcos Guardian XO Max delivers 20:1 strength amplification; Lockheed’s HULC supports ~200 lbs at ~7 mph sustained.
  • Rehab: Meta-analysis found 76% of spinal cord injury patients could walk with powered exoskeletons (with crutches/walkers for balance).
  • Running: Stanford showed a 15% energy cost reduction with an ankle exo; Harvard’s soft exosuit cut running metabolic cost by 5.4%.

Product application: Kasava positions its platform as an exoskeleton for product teams—deep commit analysis to spot technical debt and risks; large-scale transcript analysis to extract themes and pain points—while leaving prioritization and decisions to humans. The piece argues autonomous agents fail without organizational “connective tissue” (hinting at a “product graph” to encode that context).

Why it matters:

  • Practical takeaway: capture and wire up org context; design AI to surface options, not make judgment calls.
  • Counterpoint: agents can work in tightly scoped, well-instrumented domains, but broad autonomy still hits context walls.
  • HN angle: a clear framing for building AI features users trust—and a reminder that “human-in-the-loop” is a product choice, not a fallback.

The Debate: Stochastic Parrots vs. Useful Architects The discussion focused heavily on the definition of "reasoning" and the practical application of AI in software engineering, moving beyond the article's specific "exoskeleton" metaphor into the mechanics of how AI assists—or hinders—technical work.

  • The "Architect" vs. The "Team": One prominent perspective challenged the need for human teams, arguing that human collaboration has high "synchronization costs." This user suggested AI facilitates a shift where a single human "architect" with good taste can direct an "army of agents" or utilize AI for error correction and delegation, effectively acting as a massive lever for individual output.
  • The Competence Barrier: Countering the "easy automation" narrative, others likened AI coding to distinct skill levels. One commenter compared unskilled AI usage to a toddler with Play-Doh—making a mess without fundamental structural knowledge—working only when the user is a "competent sculptor" (skilled programmer) who can mold the raw material correctly.
  • Reasoning vs. Autocomplete: A heated debate emerged regarding whether LLMs possess true logic or are simply "text predictors."
    • Skeptics noted that models often hallucinate (citing an anecdote about a model reading the clipboard and inventing a scenario) and make "non-human" mistakes, arguing they lack the context to find actual flaws.
    • Defenders argued that if a model successfully navigates messy log files to produce working code fixes, the distinction between "reasoning" and "prediction" is effectively meaningless. One user cited mechanistic interpretability research (specifically regarding Anthropic’s models) to demonstrate that LLMs have specific internal feature representations for "bugs" and "unsafe code," suggesting a deeper internal logic than simple pattern matching.
  • Human Reliability: Several comments pushed back against the demand for deterministic AI, pointing out that human reasoning is also probabilistic and unreliable. The argument was made that AI doesn't need to be perfect, but simply needs to outperform the average human's consistency to be valuable, drawing parallels to safety thresholds for self-driving cars.

Gemini 3.1 Pro

Submission URL | 902 points | by MallocVoidstar | 871 comments

Google announces Gemini 3.1 Pro, pushing “core reasoning” and agentic workflows

  • What’s new: An upgraded core model meant for complex, multi-step tasks, rolling out across Google’s consumer and developer products.
  • Headline claim: Scores 77.1% (verified) on ARC-AGI-2, more than double Gemini 3 Pro’s reasoning performance, per Google.
  • Demos/use cases:
    • Generates website-ready animated SVGs from text.
    • Builds a live ISS telemetry dashboard by wiring a public API to a visualization.
    • Codes an interactive 3D starling murmuration with hand tracking and a generative score.
    • “Creative coding” that translates literary themes (e.g., Wuthering Heights) into site design and behavior.
  • Availability:
    • Developers (preview) via Gemini API in AI Studio, Gemini CLI, Google Antigravity, and Android Studio.
    • Enterprises via Vertex AI and Gemini Enterprise.
    • Consumers via the Gemini app and NotebookLM (Pro/Ultra; higher limits in the app, NotebookLM access limited to Pro/Ultra).
  • Context: Follows last week’s “Gemini 3 Deep Think” update; 3.1 Pro is in preview as Google iterates on more ambitious agentic workflows before general availability.

Here is a summary of the discussion:

  • Claude Remains the Coding Favorite: A significant portion of the discussion contrasts Gemini unfavorably with Anthropic’s Claude. While users acknowledge Gemini’s raw reasoning speed, developers report getting "stuck in loops" and struggling with poor tooling integration. One former Googler described their current workflow as "Plan in Gemini, Execute in Claude," implying Google has captured the reasoning aspect but Anthropic currently owns the developer experience.
  • Strategic Focus (OpenAI vs. Anthropic): Commenters suggest that while Google is busy fighting an existential war against OpenAI to protect Search, they are inadvertently leaving the high-value "enterprise and coding" door open for Anthropic. However, defenders note that Google’s "messy" product launches are typical for the company, and its vertical integration (owning the TPUs and data centers) gives it a long-term economic advantage that competitors reliant on Nvidia chips lack.
  • Performance & Hallucinations: User reports on the new model's accuracy are mixed. Some anecdotal evidence suggests the new preview hallucinates more than Gemini 3.0, while others cite benchmarks (like the AA-Omniscience Index) to argue that 3.1 is technically an improvement, leading to a debate on how these benchmarks apply to real-world usage.
  • Safety & Refusals: There is conflicting feedback regarding safety filters. While some users complain about Google’s strict refusal to answer certain prompts (e.g., SSH keys), others note that they have recently found ChatGPT to be more puritanical, occasionally switching to Gemini specifically because it was willing to answer questions OpenAI refused.

Measuring AI agent autonomy in practice

Submission URL | 109 points | by jbredeche | 49 comments

Measuring AI agent autonomy in practice (Anthropic): Anthropic analyzed millions of interactions across Claude Code and its public API to see how autonomously agents run in the wild, how humans oversee them, and where they’re used.

  • Autonomy is rising: In the longest Claude Code sessions, uninterrupted work time nearly doubled in three months—from under 25 minutes to over 45—an increase that appears driven by usage patterns/UX as much as model upgrades.
  • Oversight is shifting: Experienced users enable full auto-approve in 40%+ of sessions (vs ~20% for new users) yet interrupt more often—letting agents run, then stepping in when it matters. On complex tasks, Claude Code pauses to ask for clarification more than twice as often as humans interrupt it.
  • Where agents act: Most API-side actions are low-risk and reversible. About half of agentic activity is software engineering, with early but growing use in healthcare, finance, and cybersecurity.
  • How they measured: API traffic was analyzed at the tool-call level (no session stitching), while Claude Code provided end-to-end session timelines; autonomy was proxied by turn duration, with caveats.

Why it matters: As agent autonomy grows, Anthropic argues for new post-deployment monitoring and human–AI interaction paradigms so people and agents can manage autonomy and risk together.

Based on the discussion, here is a summary of the comments on Hacker News:

Critique of Metrics and Methodology The most prominent criticism focuses on Anthropic's decision to use "turn duration" (specifically the increase from 25 to 45 minutes at the 99.9th percentile) as a proxy for autonomy. Users argued this metric is heavily flawed because:

  • Hardware dependency: Duration correlates with compute speed; the same task takes vastly different amounts of time on a Raspberry Pi versus a Groq chip, making time a "gibberish measurement" without controlling for token speed or output quality.
  • Cherry-picking: Critics noted that highlighting the 99.9th percentile (the extreme tail) looks like cherry-picking outliers rather than representing the average user experience.
  • Better alternatives: Commenters suggested that true autonomy should be measured by "authorization scope" (success within permission boundaries) or the complexity of tasks handled without human intervention, rather than raw time.

Privacy and Telemetry Concerns There is significant skepticism regarding user privacy. Several commenters believe the push for "agentic" coding via Claude Code is primarily a method to harvest telemetry for training data. Users expressed distrust in "privacy-preserving" features (like the mentioned specific research tool "Clio"), arguing that even if personal speech is removed or summarized, the storage of that derived data still poses a security risk and benefits the company more than the user.

Model Consistency and "Overthinking" Users reported frustrated experiences with recent model behavior, suspecting silent backend changes to balance costs or game benchmarks.

  • Some noted that models seem to vary wildly in quality from day to day or hour to hour.
  • Others mentioned that "thinking" models (like Opus) sometimes overthink simple tasks, burning tokens unnecessarily, leading some users to switch back to previous versions or disable thinking modes for efficiency.

The "Dead Internet" Meta-Discussion A subset of the discussion focused on the comment thread itself, with users claiming to spot "green-named" (new) or previously dormant accounts posting generic, bot-like responses. This fueled a sentiment that AI agents are already "clogging up" the commons and poisoning the very platforms discussing their regulation and monitoring.

Anthropic officially bans using subscription auth for third party use

Submission URL | 641 points | by theahura | 765 comments

Anthropic clarifies Claude Code legal, auth, and compliance rules

What’s new

  • Terms: Enterprise/API users fall under Commercial Terms; Free/Pro/Max under Consumer Terms. Existing commercial agreements apply whether using Claude directly or via AWS Bedrock/Google Vertex.
  • Healthcare: A Business Associate Agreement automatically covers Claude Code if the customer has an executed BAA and Zero Data Retention (ZDR) is enabled; applies to that customer’s API traffic through Claude Code.
  • Usage: Subject to Anthropic’s AUP. Pro/Max usage limits assume ordinary, individual use across Claude Code and the Agent SDK.
  • Authentication: Consumer OAuth tokens (Free/Pro/Max) are only for Claude Code and Claude.ai. Using them in other products/services—including the Agent SDK—or proxying user traffic through Claude.ai credentials is prohibited. Developers must use API keys from Claude Console or supported cloud providers. Anthropic may enforce without notice.
  • Security: Trust Center and Transparency Hub are the references; vulnerabilities go through HackerOne.

Why it matters

  • Shuts down gray-market “log in with Claude.ai” or credential-proxy tools.
  • Provides a clear compliance path for healthcare workloads (BAA + ZDR).
  • Reduces legal ambiguity for third-party builders on the Agent SDK and cloud marketplaces.

Action items for devs and teams

  • If you’re integrating Claude or the Agent SDK, provision API keys—don’t rely on user OAuth.
  • Audit any tool that asks users to sign in with Claude.ai; it likely violates the ToS.
  • For HIPAA use, confirm BAA is executed and enable ZDR before routing through Claude Code.
  • Point security teams to the Trust Center/Transparency Hub and set up HackerOne reporting.

Here is a summary of the discussion:

Strategic Lock-in vs. Developer Freedom The discussion focused heavily on the business signals behind Anthropic’s technical restrictions. Many users view the move as an attempt to "capture value" and create platform path dependency—comparing the strategy to Apple’s walled garden or Microsoft’s OS dominance—rather than purely technical necessity. Commenters argued that by coupling the frontend (Claude Code) tightly with the model usage, Anthropic is trying to prevent their backend models from becoming a fungible commodity.

The "Bloat" vs. DIY Debate A significant portion of the conversation revolved around the efficiency of the Claude Code tool itself.

  • Performance: Several developers critiqued the official CLI as "slow" and "bloated."
  • Simplicity: Users shared anecdotes of building their own minimal coding agents (e.g., "100 lines of code and tmux controls") that they feel outperform the official harness.
  • The "Bitter Lesson": Some invoked Rich Sutton's "Bitter Lesson," suggesting that complex, hand-engineered software harnesses (like Claude Code) will eventually lose out to general, compute-driven approaches.

Risk of Accelerating Competitors Commenters warned that strictly enforcing these rules might backfire.

  • Fungibility: There is a sentiment that LLMs are becoming interchangeable artifacts. If Anthropic makes access too difficult or expensive (forcing API keys over OAuth convenience), developers may simply switch to competitors or open-source alternatives that offer more flexibility.
  • Innovation: Users argued that banning "gray market" integrations (using web-auth for third-party tools) stifles the ecosystem and "burns developer loyalty" in favor of short-term corporate control.

AI makes you boring

Submission URL | 660 points | by speckx | 359 comments

AI makes you boring (Opinion) — Posted 2026-02-19

  • Thesis: The surge of AI-assisted “vibe-coded” Show HN projects has boosted volume but hollowed out depth. Pre-AI, demos sparked rich conversations because creators had wrestled with problems; now many feel shallow and interchangeable.
  • Core argument: Offloading thinking to LLMs dampens originality. Models are good at smoothing ideas, not generating novel ones; relying on them yields surface-level outputs that feel smart but aren’t. Using AI to explore unfamiliar ground can help, but it’s a “fatal flaw” for original work like posts or products.
  • Rebuttal to “human-in-the-loop”: Original ideas emerge from doing the hard, messy work AI replaces. Steering an LLM doesn’t restore depth; it nudges human thought toward AI’s averaged patterns.
  • Process matters: Articulation—writing, teaching, grinding through a problem—refines ideas. Prompting isn’t articulation; the output is disposable without the thinking behind it. Metaphor: you don’t build muscle by having an excavator lift your weights; you don’t get interesting thoughts by letting a GPU think for you.
  • Why it matters: Beyond HN, AI risks homogenizing programming discourse and product ideas, trading hard-won insight for quick, forgettable demos.

Here is a summary of the discussion:

The Maintenance Debt of "Vibe Coding" The discussion centered on the distinction between running code and reading it. While some users argued that code is primarily an executable tool and that AI is excellent for removing the drudgery of "boring" implementations, others countered that code must eventually be read to be maintained. User fhd2 and others pointed out that if a creator doesn't have the time to write the code, they likely won't have the understanding required to fix it when it breaks. This led to comparisons between AI-generated code and "minified" or closed-source code—functional black boxes that are hostile to debugging.

Reliability vs. "Good Enough" A debate emerged regarding the stakes of software development. User zm argued that 90% of software (e.g., CRUD apps) is not life-critical, contrasting it with the infamous Therac-25 incidents, and suggested that strict reliability matters less than shipping. bstmm and others pushed back, noting that reliability is often the difference between high-value software (banking) and cheap disposable tools. They warned that AI introduces subtle logic errors (like using floating-point math for currency) that "vibe coders" might miss until it's too late.

The Productivity Trap Several commenters shared anecdotes about AI in professional settings. User rwn noted that coworkers using AI generators doubled their feature output but increased the volume of code (line changes) by 10x, resulting in a team that was "less able to work" or self-correct because they didn't understand the bloat they were committing. The consensus among skeptics was that AI encourages treating software like "cheap, Chinese-made widgets"—fine for disposable one-offs, but a liability for professional-grade engineering.

AI made coding more enjoyable

Submission URL | 95 points | by domysee | 90 comments

AI as the bore-b-gone for coding grunt work

  • A developer describes offloading the “typing exercise” parts of software engineering to LLMs: error handling, input validation, plumbing properties through layers, and handling many type-specific branches.
  • Tests are the key workflow: they design for testability, write one exemplar test to set style and expectations, then have the AI generate the rest of the cases.
  • One red line: they don’t trust LLMs for literal copy-paste edits, fearing subtle, undetectable drift when the model “re-creates” code instead of pasting it verbatim.
  • Net result: major reduction in tedium and faster iteration; AI shines at boilerplate and repetitive patterns while the human focuses on design and tricky paths.
  • Implicit takeaway: use AI where a clear pattern/spec exists (tests as oracle), and reach for deterministic tools (codemods, refactors, patches) when exactness matters.

While the submission praises AI for removing tedium, the discussion highlights a divide between developers who prioritize "shipping" and those who prioritize "craft," with significant concerns regarding long-term code comprehension.

Key themes in the discussion:

  • The "Video Game" Metaphor: Several users argued that writing the "boring" parts (tests, boilerplate) is the primary way developers build a mental model of the system. Offloading this to AI was likened to "watching a video game instead of playing it"—you get to the end, but you miss the experience and understanding gained through the struggle.
  • Code Review vs. Code Generation: A recurring sentiment is that reviewing code is cognitively harder than writing it. Users fear that AI optimizes the "cheap" task (generating lines of code) while exponentially increasing the "expensive" task (debugging and reviewing logic you didn't write). There is concern that this leads to "functioning but not understood" technical debt.
  • Product vs. Craft: The thread split into two philosophical camps:
    • The Pragmatists: Argue that business value is tied to the product, not the code. If AI accelerates the result, "losing" the coding process is acceptable. This group included time-poor developers (e.g., new parents) who found AI essential for actually finishing personal projects.
    • The Craftsmen: Argue that if you enjoy the act of coding, AI removes the fun. One user noted an ironic inversion: they love writing small, logical methods and hate high-level architecture; AI forces them to do the architecture while robbing them of the logic puzzles they enjoy.
  • Skill Atrophy: Commenters expressed concern that skipping manual implementation (like setting up Linux servers or writing basic algorithms) stops the deepening of technical knowledge, turning developers into "no-code" platform operators.

Step 3.5 Flash – Open-source foundation model, supports deep reasoning at speed

Submission URL | 222 points | by kristianp | 88 comments

StepFun drops Step 3.5 Flash: an open-source MoE that aims for frontier reasoning at real-time speeds

Highlights

  • Architecture and speed: 196B-parameter sparse MoE with only 11B active per token (“intelligence density”). Uses 3-way Multi-Token Prediction to hit 100–300 tok/s in typical use (peaks ~350 tok/s on single-stream coding).
  • Long context: 256K window via a 3:1 Sliding Window Attention to full-attention ratio, targeting lower compute while keeping quality on long codebases/docs.
  • Agent/coding focus: RL-trained for stability on long-horizon tasks; scores 74.4% on SWE-bench Verified and 51.0 on Terminal-Bench 2.0; 86.4 on LiveCodeBench-V6.
  • Reasoning: Near-SOTA on math olympiad style tasks—AIME 2025 at 97.3 (99.9 with PaCoRe/parallel thinking) and HMMT 2025 at 96.2; 85.4 on IMOAnswerBench.
  • Agents and browsing: 88.2 on τ²-Bench; 69.0 on BrowseComp (with Context Manager).
  • Overall standing: Average score across eight benchmarks is 81.0—competitive with top proprietary models (e.g., Gemini 3.0 Pro 80.7, Claude Opus 4.5 80.6) and just behind GPT-5.2 xhigh at 82.2. It trails some closed models on specific tasks (e.g., Claude on Terminal-Bench).
  • Local deployment: Positioned for high-end consumer hardware (e.g., Mac Studio M4 Max, NVIDIA DGX Spark) with code/weights slated across GitHub, HuggingFace, and ModelScope; tech report and OpenClaw Guidance mentioned.

Why it matters

  • Pushes open-source models closer to “agent-ready” territory: fast generation, long context, and robust coding/terminal competence.
  • MoE + MTP-3 delivers a rare combo of throughput and reasoning depth, making local private agents more practical.

Caveats

  • Vendor-reported results; averages exclude xbench-DeepSearch where it lags (56.3 vs ChatGPT-5-Pro 75.0).
  • “PaCoRe/parallel thinking” gains come with extra test-time compute.
  • Real-world latency/throughput depend heavily on user hardware and deployment stack.

StepFun drops Step 3.5 Flash: an open-source MoE that aims for frontier reasoning at real-time speeds https://huggingface.co/StepFun/Step-3.5-Flash-GGUF

StepFun has released Step 3.5 Flash, a 196B-parameter sparse Mixture-of-Experts (MoE) model with 11B active parameters per token. Aiming for "intelligence density," the model utilizes 3-way Multi-Token Prediction to achieve generation speeds of 100–300 tokens per second. It features a 256K context window using a sliding window attention mechanism to maintain performance on long documents and codebases. The model is positioned as an agent-ready solution, scoring competitively on coding benchmarks like SWE-bench Verified (74.4%) and reasoning tasks like AIME 2025 (97.3%). It is designed for local deployment on high-end hardware, with weights available on HuggingFace and GitHub.

Discussion Summary

The discussion focuses heavily on the practicalities of running such a large model locally, specifically regarding hardware requirements, recurring bugs, and the economics of self-hosting versus using APIs.

  • Local Hardware and Performance: Users confirm the model runs well on high-end Apple Silicon systems (M1 Ultra, M3 Max) with 128GB+ RAM using 4-bit quantization. Reported speeds are around 36 tokens per second on an M1 Ultra, with users praising the "intelligence density" compared to other local models like Minimax 2.5 and GLM-4.
  • Infinite Loop Bug: A significant portion of the feedback concerns an "infinite reasoning loop" issue. While initially suspected to be a llama.cpp engine bug (potentially mishandling format tags or "thinking" tokens), some users reported reproducing the issue in the official chat UI, suggesting it may be an intrinsic flaw in the model weights.
  • Agentic Capabilities and Comparisons:
    • Users debating its coding utility found it handles local CLI harnesses better than Qwen 2.5 Coder in some instances, though others noted it can struggle with explicit tool-calling instructions compared to models like Nemotron.
    • While reasoning is praised in some contexts (solving riddles), others found it prone to hallucinations on niche logic queries (e.g., Pokémon deck mechanics) where models like Claude Opus or DeepSeek succeed.
  • Benchmark Skepticism: There is debate regarding the Terminal-Bench 2.0 score (51.0); while some users felt this was low, others contextualized it as valid near-SOTA performance given the difficulty of the benchmark. However, critics argued that Terminal-Bench often tests syntax memorization rather than true agentic reasoning.
  • Economics of Self-Hosting: A sidebar discussion analyzed the cost-benefit of purchasing $10,000+ hardware (e.g., M3 Ultra with 512GB RAM) versus paying for APIs. The consensus leaned toward APIs being cheaper for most users unless privacy or 24/7 heavy utilization are the primary factors.

How AI is affecting productivity and jobs in Europe

Submission URL | 169 points | by pseudolus | 131 comments

Europe’s AI paradox: lagging invention, solid deployment — and a measurable payoff

  • What’s new: A study of 12,000+ EU firms (EIBIS + Orbis) claims the first causal estimate of AI’s impact on firm performance in Europe. Using a Rajan–Zingales-style instrument, the authors match each EU firm to similar US firms and use US AI adoption as an exogenous shock to identify effects.

  • Key findings:

    • AI lifts labour productivity by about 4% on average — economically meaningful, but far from “instant productivity boom” territory.
    • No evidence of job losses from adoption in the short run, suggesting AI is acting as a complement rather than a direct substitute for labour.
    • Adoption parity masks big divides:
      • Financially developed EU countries (e.g., Sweden, Netherlands) match US adoption (~36% of firms in 2024).
      • Less financially developed EU economies (e.g., Romania, Bulgaria) lag (~28%).
      • Big-firm edge: 45% of large firms use AI vs 24% of small firms.
    • Adopters already look different: they invest more, innovate more, and face tighter skill bottlenecks — hence the need for causal methods.
  • Why it matters: AI is delivering real, near-term efficiency gains in Europe without immediate employment cuts, but benefits concentrate in larger firms and financially stronger countries — a recipe for widening gaps unless policy focuses on diffusion (SMEs), skills, and access to finance/infrastructure.

  • Caveats: Effects are on labour productivity levels (short-run) rather than long-run TFP growth; adoption is partly survey-based; heterogeneity across countries and sectors remains large.

While the submission focuses on the economic and productivity impact of AI adoption in Europe, the comment text pivots to a debate on the reliability of the metrics used and whether AI’s perceived utility is actually a symptom of failing legacy tools.

Critique of Innovation Metrics Discussion opens with skepticism regarding the study's use of patent data to measure AI specialization. Commmenter gyl argues that comparing EU and US patent numbers is flawed because EU law has historically been much stricter regarding software patents. Others suggest that a surge in patents often signals patent trolling or speculative investment rather than genuine innovation, with cor_NEEL_ius warning that verifying AI-generated output (which often hallucinates) can actually create new productivity costs.

The "Search is Broken" Thesis The majority of the thread argues that AI feels productive primarily because traditional web search has degraded.

  • Decline of Search: Users like m463, mnkpt, and adrian_b expres deep frustration with modern search engines (specifically Google) for ignoring Boolean operators, fuzzy-matching specific terms (returning "Hoodoo" for "Xoodoo"), and prioritizing SEO-spam and ads over direct answers.
  • AI as a Refuge: In this context, AI is viewed as a "cleaner" alternative for cutting through "corporate obfuscation" to find simple facts (e.g., car pricing) without clicking through ad-laden pages.

Skepticism and Future Outlook Despite the current utility of AI over search, the mood remains cynical:

  • Hallucination vs. Noise: CamelCaseCondo counters that while search is noisy, AI is dangerous due to confident hallucinations, citing difficulties using LLMs for technical schematics where "close enough" is useless.
  • The Cycle of "Enshittification": Wobbles42 and luke5441 predict that the productivity boost is temporary. They anticipate that LLMs will eventually succumb to the same forces that ruined search: sponsored content, "SEO" manipulation of training data/context windows, and a flood of AI-generated spam, ultimately negating the efficiency gains currently being celebrated.

I traced 3,177 API calls to see what 4 AI coding tools put in the context window

Submission URL | 31 points | by theredbeard | 3 comments

What happened

  • Frustrated by ballooning token bills, the author built Context Lens, a tracer that records exactly what LLM coding tools put in their context window each turn.
  • He planted a small bug in Express.js (res.send(null) returning the string "null" with JSON content-type instead of an empty body) and asked four tools to find/fix it and run tests.
  • All four fixed the bug and passed 1,246 tests—but their context strategies and token usage were dramatically different.

Who and how

  • Models tested via CLI: Claude Code Opus, Claude Code Sonnet, “Codex (GPT-5.3)” and Gemini 2.5 Pro.
  • Context Lens broke down tokens by category: tool definitions, tool results, system prompt, and user conversation; it also tracked per-turn totals.

Key results

  • Total tokens (mean; range across runs):
    • Opus: ~27K (23–35K), tight and consistent.
    • Sonnet: ~50K (43–70K), broader reads.
    • Codex: ~35K (29–47K), moderate variance.
    • Gemini: ~258K (179–351K), huge variance with no convergence.
  • Peak-context composition:
    • Opus: 69% tool definitions (16.4K); only 1.5K in actual tool results. “Architectural tax” from re-sending tool definitions every turn. Caching helps (≈95% hits after the first turn).
    • Sonnet: Similar Claude overhead (18.4K tool defs, 43%) plus heavy reads (16.9K tool results, 40%); even read an entire 15.5K test file.
    • Codex: Minimal tool-def overhead (6%); most tokens go to tool results (72%).
    • Gemini: 0% tool-def overhead (server-side tools) but extremely aggressive reads—96% tool results. One turn pulled 118.5K tokens by dumping the full git history of a file.
  • Gemini ran the same number of API calls in low and high runs but chose very different, data-heavy paths each time. Its 1M context window and cheaper tokens are caveats, but the strategy still burns a lot of context.

Why it matters

  • Same quality, very different costs: tool design choices (prompted tool definitions vs. server-side tools, caching, retrieval breadth) dominate token spend.
  • Big context windows invite sloppy retrieval; developers pay for it.
  • Instrumentation like Context Lens surfaces hidden overheads (e.g., repeated tool definitions, unbounded diffs/histories) so teams can rein them in.

Notable footnotes

  • Claude uses Haiku subagents for small tasks; they don’t share cache with the main Opus calls.
  • The author observed no “settling” pattern in Gemini’s runs—each took a fresh, often heavier route.

Takeaways for practitioners

  • Measure per-turn context composition; set budgets and guardrails for tools (cap diffs/histories, prefer targeted reads).
  • Exploit caching where available; avoid re-sending large static tool definitions.
  • Bigger, cheaper context is not a free lunch—strategy beats window size.

Discussion Summary

Commenters praised the analysis and the name of the author's tool, "Context Lens." The limited discussion focused on technical specifics, including:

  • Model Behaviors: One user noted that Opus and Codex are particularly strong at reading code and diagnosing bugs, observing that telling Opus a feature "worked recently" effectively prompts it to inspect git history.
  • Version Confusion: There was some confusion regarding the specific model versions and release timelines being compared, likely due to the fast pace of recent LLM updates (e.g., Gemini 2.5 vs. recent Claude releases).

Submission URL | 56 points | by latexr | 33 comments

404 Media reports that xAI’s Grok surfaced porn performer Siri Dahl’s full legal name and birthdate to users without being prompted, blowing up a decade of separation between her stage identity and private life. Within hours, impersonators spun up Facebook accounts in her legal name and leak sites relabeled stolen clips, escalating harassment. Dahl told 404 Media she chose to include her legal name in the piece to try to reclaim control, but the episode underscores mounting concerns about Grok’s privacy and safety guardrails—especially around doxxing and sexualized content—and raises fresh questions about how chatbots source, filter, and gatekeep sensitive personal data.

Here is a summary of the discussion on Hacker News:

Public Data vs. Accessibility A significant portion of the debate centered on whether Grok actually "leaked" private information or simply surfaced data that was technically already public. Users did their own digging, noting that the information appeared to be available through decades-old social media posts archived on the Wayback Machine or via USPTO trademark filings. However, counter-arguments emphasized the concept of "security by obscurity." Commenters argued that there is a functional difference between data buried in a "faraway corner of the internet" that takes humans hours to find, and data served instantly by a chatbot. They contended that LLMs remove the friction that previously acted as a privacy shield, turning difficult-to-find information into immediate doxxing material.

Data Scraping and "Grokipedia" Participants discussed "Grokipedia," an information retrieval component of xAI, noting that it lists sources directly. Some users expressed skepticism about the claim that the data was private to begin with, suggesting Grok likely crawled public web data or data broker aggregations. This led to a broader discussion on how LLMs are trained; several users pointed out that once an LLM ingests personal data from the internet archive, it is difficult to "scrub" or make the model forget it. Comparisons were made to Google and OpenAI, with users noting that established tech giants generally have legal processes for removing such data that xAI may lack.

Harassment and Subject Identity The conversation touched on the specific vulnerability of the subject as a performer. While some comments were dismissive ("played the world's smallest violin"), others pushed back strongly, arguing that the dismissal of the harm was rooted in misogyny or bias against sex workers. These users highlighted that the technology powers abuse at scale, and that declaring data "public" forces victims into a binary where they cannot complain about harassment if a machine aggregates their data.

Data Removal Services There was a side discussion regarding the efficacy of data removal services. Users questioned if paying these services actually works long-term, with some noting that while they can enforce removals, sketchier sites often ignore them or reappear, and LLM training sets present a new, harder-to-clean repository of personal history.

Sam Altman (OpenAI) and Dario Amodei (Anthropic) Refuse to Hold Hands

Submission URL | 57 points | by doener | 21 comments

Top story: Modi’s “hands up” photo op with AI chiefs goes viral at India AI Impact Summit 2026

  • What happened: At the India AI Impact Summit in New Delhi, PM Narendra Modi posed with global AI leaders including Sundar Pichai (Google/Alphabet), Sam Altman (OpenAI), Dario Amodei (Anthropic), and Alexandr Wang (Scale AI). An ANI/DD News clip of a group hand-raise moment took off across X.
  • The clip: Modi prompts a linked hand-raise; several executives look hesitant, with Altman and Amodei’s awkward half-gestures becoming the meme focus.
  • Why it matters: Beyond the optics, the lineup underscores India’s bid to be a convening power in AI policy, investment, and infrastructure—positioning itself as more than a market and leaning into standard-setting and talent pipelines.

What people are saying

  • Symbolism vs PR: Supporters hailed “collaboration” and India’s “civilizational confidence”; critics called it a political-style photo op overshadowing substance.
  • Representation questions: Observers noted few Indian tech operators on stage besides Modi; others asked about notable absences (e.g., Meta’s AI leadership).
  • Real-world friction: Delhiites complained about Bharat Mandapam logistics, with long walks and traffic disruptions during the event.

Notes and nitpicks

  • The viral captioning sparked corrections about attendee titles/affiliations, adding to the meme storm.
  • Rivals standing shoulder-to-shoulder (Altman/Amodei) drew jokes—and curiosity about how aligned the “AI Avengers” really are.

What to watch next

  • Concrete outcomes: Any announcements on compute access, data center investments, skilling programs, or safety/standards workstreams tied to India.
  • Policy influence: Whether India translates convening power into credible guardrails and incentives that attract frontier AI R&D—not just cloud and services.
  • Industry impact: If India’s IT majors pivot toward AI productization fast enough to keep pace with the Anthropic/OpenAI era.

The Awkward Photo Op Discussion turned immediately to the visible tension on stage, with the top comment comparing the scene to a satirical moment from Mike Judge’s Silicon Valley, imagining the PM forcing tech rivals into a “Thank You” tableau. Others joked that the tension between the executives felt like a remake of the movie Challengers.

Bad Blood and Social Contracts A significant portion of the thread criticized the executives—specifically Sam Altman and Dario Amodei—for their reluctance to participate in the gesture.

  • "Divorced Parents": User jobs_throwaway labeled the display "shameful," comparing the CEOs to divorced parents who can’t pretend to get along for the sake of their children for ten seconds.
  • Social Intelligence: In a subhead regarding social norms, users argued that while such conventions (holding hands) are arbitrary, failing to follow them signals a lack of social adaptability. malux85 noted that good judgment involves putting "petty rivalries" aside for the broader group, arguing that a photo op doesn't materially change the business competition between OpenAI and Anthropic, but refusing it makes them look immature.

The "AI Avengers" History Commenters provided context for the friction, noting the specific history between the companies. hnkly pointed out that Anthropic was founded by disillusioned former OpenAI employees, comparing it to gaming industry schisms (like Runic Games spinning out of Blizzard).

Tech Celebrity Status There was a brief debate regarding the longevity of these figures. While one user dismissed them as "ephemeral celebs," others pushed back strongly (one switching to Spanish to emphasize the point), arguing that dismissing the architects of the current AI boom as fleeting figures is a miscalculation of how the next decade will play out.