Hacker News
Daily AI Digest

Welcome to the Hacker News Daily AI Digest, where you will find a daily summary of the latest and most intriguing artificial intelligence news, projects, and discussions among the Hacker News community. Subscribe now and join a growing network of AI enthusiasts, professionals, and researchers who are shaping the future of technology.

Brought to you by Philipp Burckhardt

AI Submissions for Sun Feb 22 2026

Google restricting Google AI Pro/Ultra subscribers for using OpenClaw

Submission URL | 738 points | by srigi | 634 comments

Google AI Ultra/“Antigravity” users report sudden account bans after third‑party OAuth

  • Multiple paying subscribers say their AI Ultra/Antigravity access was abruptly restricted (403 “service disabled”), often right after connecting Gemini via third‑party tools like OpenClaw/OpenCode. No warning or clear violation notice preceded the lockouts.
  • Support has been described as unresponsive or circular: users were bounced between Google Cloud and Google One, with some saying they’ve waited days or weeks without resolution.
  • One user shared a formal response from Google stating an internal investigation found use of credentials in the third‑party “open claw” tool violated Terms of Service by “using Antigravity servers to power a non‑Antigravity product.” Google called it a zero‑tolerance issue and said suspensions won’t be reversed.
  • Frustration is high among annual prepay customers; several report canceling other Google services, considering chargebacks, or migrating to alternatives (e.g., Claude Code). Others suggest creating a new account as a workaround.
  • A recurring pain point: the in‑app “Report Issue” path isn’t usable once you’re locked out.

Takeaway: Third‑party OAuth into paid AI accounts appears risky under Google’s ToS enforcement; users are calling for clearer rules, pre‑ban warnings, and a working appeal path before permanent suspensions.

Here is a summary of the discussion:

  • Exploit vs. Legitimate Use: A contentious debate emerged regarding the nature of the third-party tools (like "OpenClaw"). Some commenters viewed the usage as a clear "exploit" or "script kiddie" behavior—likening it to sharing a parking lot access code with the entire internet until the lot jams—arguing that handing OAuth tokens to third-party apps is a major security lapse. Conversely, others argued these are technically paying customers trying to utilize a product they purchased, and that Google unilaterally changed the Terms of Service to punish legitimate demand that their official apps didn't support.
  • The "Digital Death Penalty": The strongest criticism focused on the severity of the punishment. Users argued that permanently banning an entire Google Workspace or personal account (cutting off Gmail, Drive, and GCP) for a violation in a specific AI service introduces a "novel business risk." Commenters described the fear of accidentally violating obscure rules and losing their entire digital life as "insane," with some comparing it to a disproportionate "video game ban" applied to critical infrastructure.
  • Google's Response & Infrastructure: A comment linked to a Google employee’s statement claiming the bans were triggered because the "massive increase in malicious usage" was degrading service quality for everyone. However, critics countered that this reflects a failure in Google's quota management; rather than banning paying customers ($200+/month), the system should simply enforce rate limits, API caps, or "backpressure" to manage load without nuking accounts.
  • Market Implications: The incident is driving sentiment toward diversifying away from relying on a single "megacorp" for all digital services. Users noted this situation serves as a strong advertisement for self-hosted/local LLMs, as the risk of arbitrary lockouts makes proprietary cloud dependencies increasingly unattractive for business-critical workflows.
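
The "rate limits or backpressure instead of bans" argument from the thread can be sketched as a token bucket. This is a minimal illustration, not Google's quota system; the class name, parameters, and injectable clock are assumptions made for the example.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the behaviour can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True if the request may proceed, False if it should be throttled."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A throttled caller would typically get a "try again later" response (for example HTTP 429) rather than losing the account, which is the trade-off the critics were asking for.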

We hid backdoors in ~40MB binaries and asked AI + Ghidra to find them

Submission URL | 234 points | by jakozaur | 92 comments

AI + Ghidra vs. backdoored binaries: promising, but not production-ready

  • What they did: A team hid backdoors in compiled executables (around 40 MB) and asked AI agents, wired into Ghidra and standard RE tooling, to find them—no source code allowed. They’ve released an open benchmark and tasks as BinaryAudit (github.com/quesmaOrg/BinaryAudit), with a results dashboard covering false positives, tool proficiency, and a Pareto view of cost-effectiveness.

  • Why it matters: Real-world attacks increasingly swap or taint binaries and firmware (e.g., recent NPM supply-chain malware, the Notepad++ hijack, and findings in trains/solar inverters). Many targets are closed-source; binary analysis is the only line of defense.

  • How hard is this? Compilers strip structure and symbols, then optimize aggressively, making reverse engineering rely on disassembly and decompilation (e.g., Ghidra) back to pseudo-C. The post walks through an example that ultimately funnels user-controlled bytes into a system() call.

  • Key results:

    • Best model (Claude Opus 4.6) caught “relatively obvious” backdoors in small/mid-size binaries only 49% of the time.
    • Most models showed high false-positive rates, flagging clean binaries.
    • Conclusion: Today’s AI agents can sometimes spot real red flags but are far from reliable for standalone binary vetting.
  • Takeaway: Treat LLMs as noisy triage helpers alongside traditional RE tools and human experts; don’t rely on them for final judgments on shipped binaries or firmware.
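
To make the two headline metrics concrete, here is a small sketch of how a detection rate and a false-positive rate could be computed from per-binary verdicts. The function name and the sample data are hypothetical; BinaryAudit's own scoring may differ.

```python
def detection_metrics(results):
    """results: list of (has_backdoor, flagged) booleans, one pair per binary."""
    tp = sum(1 for has, flagged in results if has and flagged)
    fn = sum(1 for has, flagged in results if has and not flagged)
    fp = sum(1 for has, flagged in results if not has and flagged)
    tn = sum(1 for has, flagged in results if not has and not flagged)
    # Share of backdoored binaries that were caught.
    detection_rate = tp / (tp + fn) if tp + fn else 0.0
    # Share of clean binaries that were wrongly flagged.
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    return detection_rate, false_positive_rate


# Hypothetical run: 4 backdoored binaries (2 caught), 4 clean (1 flagged).
sample = [(True, True), (True, True), (True, False), (True, False),
          (False, True), (False, False), (False, False), (False, False)]
print(detection_metrics(sample))  # (0.5, 0.25)
```

Reporting both numbers matters because a model that flags everything reaches a 100% detection rate while being useless for triage.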

Links: BinaryAudit results and benchmark details on the project site; tasks are open source at github.com/quesmaOrg/BinaryAudit.

Based on the discussion, users analyzed the effectiveness of combining LLMs with reverse engineering (RE) tools like Ghidra. While skeptics noted that current models struggle with complex logic and obfuscation, others shared specific workflows and tools that have proven successful for tasks like file format parsing and basic cracking.

Methodology and Context

Much of the debate focused on the "fairness" and realism of the benchmark tasks.

  • Documentation vs. Autonomy: Several users argued that restricting AI from accessing tool documentation (to test "autonomy") is unrealistic. Users btsrs and nmxs suggested that just as human specialists use manuals, AI performance improves significantly when the context window is "stuffed" with Ghidra tutorials and API docs.
  • Obfuscation: Commenter 7777332215 noted that while simple string obfuscation lowers success rates, LLMs excel at detecting pattern-based anomalies. kslv added that asking a model to RE obfuscated code causes it to "spin in circles," but instructing it to explicitly identify obfuscation works better.

Benchmark Critique: The Dropbear Task

User cmx performed a deep dive into one of the benchmark tasks (a backdoored Dropbear SSH server).

  • Heuristics vs. Understanding: cmx observed that Claude identified the correct function (svr_auth_password) but likely did so based on heuristics (it is a standard target for backdoors) rather than successfully analyzing the assembly.
  • Human vs. AI: Interestingly, cmx admitted to initially failing the same task manually by analyzing the wrong function, highlighting that while the AI might be guessing, the task itself is difficult for humans without recognized patterns.

Tooling and Workflows

  • Ghidra-CLI: User kslv shared their tool ghidra-cli, a REPL interface designed for LLMs, claiming it was "insanely effective" for reverse engineering the Altium file format (Delphi). They argued models are particularly good at writing parsers from scratch.
  • The "Swiss Army Knife" Approach: User btxpldr described using agents not for final judgments, but to automate high-level grunt work—like mapping attack surfaces or generating architecture diagrams—allowing the human to focus on deep investigation. They warned of the "productivity trap" where one spends more time prompting the AI than doing the work manually.
  • Cracks vs. Backdoors: User hereme888 claimed success using Claude Opus and Ghidra plugins to fully reverse engineer software cracks, though they acknowledged this is different from detecting state-level hidden backdoors.

Concerns

  • Training Data: Users questioned whether models were simply recalling solutions to known "crackmes" from their training data. However, kslv noted that performance remains consistent even on challenges released days or weeks ago.
  • Productivity: jkzr noted that some Python bindings (PyGhidra) are too slow, making CLI approaches more viable for agent loops.

Show HN: TLA+ Workbench skill for coding agents (compat. with Vercel skills CLI)

Submission URL | 40 points | by youio | 4 comments

agent-skills (GitHub) — A brand-new repo from younes-io popped up on HN. From the snippet we only see the GitHub chrome (6 stars, 0 forks) and no README details, so specifics are unclear. Judging by the name, it may be a collection of reusable “skills” for AI agents, but consider this a placeholder to watch—if you’re tracking agent tooling, bookmark it and check back as the project fleshes out.

agent-skills The creator (y) clarified the project’s purpose in the comments, describing it as a suite of skills for coding-agent workflows. The repository currently features a tlaplus-workbench skill designed to help agents convert natural language designs into TLA+ configuration files, run the TLC model checker, and summarize counterexamples. The author provided npx commands for users to try the tool and requested feedback on its utility for protocol and state-machine modeling. Discussion briefly touched on whether the tool references official language grammar for PlusCal and the potential for using formal TLA+ specifications alongside real code to improve LLM reasoning.
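
For readers unfamiliar with what "run the model checker and summarize counterexamples" involves, here is a toy explicit-state checker in Python. It only illustrates the idea behind tools like TLC (explore reachable states, report the shortest trace that breaks an invariant); it is not TLA+, and every name in it is made up for the example.

```python
from collections import deque


def check_invariant(initial_states, next_states, invariant):
    """Breadth-first search over reachable states.

    Returns None if the invariant holds in every reachable state, otherwise
    the shortest trace (list of states) from an initial state to a violation.
    States must be hashable.
    """
    parent = {s: None for s in initial_states}
    queue = deque(initial_states)
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in next_states(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


# Toy spec: a counter that steps by 1 or 2 and must never equal 5.
def counter_next(n):
    return [n + 1, n + 2] if n < 6 else []


print(check_invariant([0], counter_next, lambda n: n != 5))  # [0, 1, 3, 5]
```

The returned trace plays the role of a counterexample: a concrete sequence of steps that a human (or an agent) can read to see how the design goes wrong.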

How I use Claude Code: Separation of planning and execution

Submission URL | 932 points | by vinhnx | 568 comments

TL;DR: After 9 months using Claude Code as a primary dev tool, the author’s winning tactic is strict separation of planning and execution. Never let the model write code until you’ve reviewed and approved a written plan. This human-in-the-loop workflow reduces wasted effort, preserves architectural control, and outperforms prompt-fix-repeat and agent loops—often with fewer tokens.

How it works:

  • Phase 1 — Research: Force a deep read of the relevant code, then require a persistent artifact (research.md). Use loaded language (“deeply,” “intricacies,” “go through everything”) so the model doesn’t skim. This surfaces misunderstandings early and prevents the costliest failure mode: correct code that violates the surrounding system (caches, ORM conventions, duplicated logic, etc.).
  • Phase 2 — Plan: Ask for plan.md with real file paths, concrete code snippets, approach trade-offs, and references to actual source. Ignore built-in plan modes; a markdown file is editable, reviewable, and part of the repo.
  • Reference-first: When possible, supply a high-quality OSS implementation as a template. The model is dramatically better adapting a concrete reference than inventing from scratch.
  • Annotation cycle: You edit plan.md inline—adding corrections, constraints, domain knowledge—then send it back for updates. Repeat until satisfied. Short notes (“not optional”) or longer business-context blocks both work.
  • Then and only then: Generate a focused TODO, implement against the approved plan, and iterate with feedback.
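
The "no code until the plan is approved" rule can be pictured as a small gate object. This is a hypothetical sketch of the workflow's logic, not part of Claude Code or the author's setup; the class and method names are invented for illustration.

```python
class PlanGate:
    """Hypothetical gate: implementation is refused until the plan is approved."""

    def __init__(self):
        self.notes = []  # reviewer annotations not yet addressed
        self.approved = False

    def annotate(self, note):
        """Reviewer adds a correction; any open note revokes approval."""
        self.notes.append(note)
        self.approved = False

    def revise(self):
        """Model reports that all open notes were folded back into plan.md."""
        addressed, self.notes = self.notes, []
        return addressed

    def approve(self):
        if self.notes:
            raise RuntimeError("open annotations must be addressed first")
        self.approved = True

    def implement(self, run):
        """Call `run` (the code-generation step) only after approval."""
        if not self.approved:
            raise PermissionError("plan.md has not been approved")
        return run()
```

In practice the gate is the human reviewer rather than code, but writing it down this way shows why the annotation cycle can repeat any number of times while implementation starts exactly once.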

Why it wins:

  • Prevents garbage-in/garbage-out mistakes
  • Keeps you in charge of architecture and trade-offs
  • Produces more reliable changes with less churn and lower token spend

If you’ve found AI codegen flaky on non-trivial tasks, this plan-first, artifact-driven loop is the fix.

Based on the discussion, here is a summary of the comments:

Validation of the "Plan-First" Approach

Many users validated the author's central thesis: that LLMs are "assumption engines" that tend to fill gaps with industry standards which may not fit specific project needs.

  • Commenters agreed that LLMs rarely fail on simple syntax, but frequently fail on "invisible assumptions," architectural constraints, and system invariants.
  • One user described the written plan not just as documentation, but as a "test harness" for constraints (latency, concurrency, memory budgets) that helps catch architecture-level mistakes before code is generated.
  • The consensus was that forcing a plan effectively stops the model from "reverting to the mean" and brings hidden assumptions to the surface.

Debate: "Magic Words" vs. Architecture

A significant portion of the discussion focused on the author's advice to put "loaded language" (e.g., "deeply," "intricacies") into prompts to improve performance.

  • The Skeptics: Some users dismissed this as "magical thinking" or "superstition," comparing it to performing rituals for a "random word machine." They argued that unless there are rigorous statistics, this is just anthropomorphizing the model.
  • The Theorists: Others offered technical explanations for why this works. One theory is that these words trigger specific weights in the Attention mechanism, associating the prompt with high-quality training data (like detailed StackOverflow explanations or expert tutorials).
  • The MoE Theory: Several users debated whether this forces Mixture of Experts (MoE) models to route the query to a "smarter" expert path, though others argued that MoE routing is based on token type rather than semantic complexity in that specific way.
  • Research: One user pointed to academic papers regarding "emotional stimuli" in prompts (e.g., telling the model a task is vital) as proof that phrasing impacts output quality.

Workflow and Agents

There was technical discussion on how to implement this loop:

  • Users debated the specific benefit of sequential prompts vs. "agents." The consensus leaned toward sequential steps to avoid "context pollution"—where a long-running agent session gets confused by potential hallucinations or previous step details.
  • One user warned against building "black box" agent swarms, advocating instead for a single-agent orchestrator with strict logging and human-reviewed "pull requests" or checkpoints.

Counterpoints

  • Directly contradicting the author's experience, one user shared a horror story where Claude Code burned $20 in 30 minutes looping on a simple Rust syntax/API hallucination, suggesting that LLMs can and do still fail on basic implementation details.

Met police using AI tools supplied by Palantir to flag officer misconduct

Submission URL | 37 points | by helsinkiandrew | 6 comments

The UK’s Metropolitan Police is piloting Palantir’s AI to sift internal HR-style signals—sickness, absences, overtime—in order to flag potential misconduct patterns among its 46,000 staff. The Met says the system only surfaces patterns and humans make the calls; the Police Federation calls it “automated suspicion,” warning workload or illness could be misread as wrongdoing. The move lands amid Palantir’s expanding UK public-sector footprint (NHS data platform, MoD deal) and political scrutiny over transparency and influence, prompting an MP to ask, “Who is watching Palantir?” Labour’s recent policing paper backs rapid, “responsible” AI rollout across all 43 forces with £115m over three years, signaling this kind of tooling could scale beyond the Met. Palantir says its software is improving public services; critics see a fresh layer of opaque workplace surveillance in a force already under fire for cultural failings.

Discussion Summary:

Commenters focus heavily on the irony of the Police Federation’s complaints, pointing out that while the union decries "automated suspicion" and opaque tools when applied to officers, police departments rarely hesitate to deploy similar surveillance against the general public. One user draws a parallel to the anime Ghost in the Shell: Stand Alone Complex, speculating that the Met might eventually find itself investigating Palantir's own interests. Others note a perceived recent increase in positive PR stories surrounding Palantir, viewing them with skepticism, while some readers report hitting a paywall.

Amazon, Meta, Alphabet report plunging tax bills thanks to AI and tax changes

Submission URL | 44 points | by epistasis | 40 comments

Big Tech’s 2025 US tax bills tumble on AI buildout and new expensing rules

  • What happened: Amazon, Meta, and Alphabet reported sharply lower 2025 US tax bills, citing last year’s pro-business tax changes in Trump’s “One Big Beautiful Bill” plus massive AI/data center investments.
  • The numbers:
    • Amazon: ~$9B (2024) → $1.2B (2025) federal tax; total payments this year $2.75B. Domestic profit ~$90B (+40%+).
    • Meta: ~$9.6B → $2.8B federal tax. Domestic profit $79.6B (+20%).
    • Alphabet: $21.1B → $13.8B combined federal+state tax. Domestic profit $143.6B (+32%).
  • Why taxes fell: New deductions/credits for depreciation, capital investment, R&D, interest; most notably 100% expensing for new/updated factories. Much of the benefit is timing—big deferrals now, higher taxes later.
    • Deferred taxes: Amazon >$11B; Meta >$18B; Alphabet ~$8B.
  • Company stance: “We’re following the rules.” Amazon says it invested $340B in the US in 2025 (including AI). Meta’s CFO flagged “substantial cash tax savings.”
  • Criticism: ITEP estimates AMZN/META/GOOG plus Tesla “avoided” nearly $50B versus the 21% statutory rate; Tesla paid zero federal tax for 2025. More disclosures from large firms still to come.
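
As a rough check on the figures above, the implied effective rates and the gap against the 21% statutory rate can be worked out directly. This is back-of-the-envelope arithmetic on the numbers quoted in this summary, not ITEP's methodology, and Amazon's profit is only given approximately.

```python
STATUTORY_RATE = 0.21

# (2025 tax in $B, domestic profit in $B), as quoted in the summary above.
# Alphabet's figure combines federal and state tax, so it is not strictly
# comparable with the federal-only figures for Amazon and Meta.
reported = {
    "Amazon": (1.2, 90.0),
    "Meta": (2.8, 79.6),
    "Alphabet": (13.8, 143.6),
}


def effective_rate(tax, profit):
    """Tax paid as a fraction of profit."""
    return tax / profit


def gap_vs_statutory(tax, profit, rate=STATUTORY_RATE):
    """Difference between tax at the statutory rate and tax reported ($B)."""
    return profit * rate - tax


for name, (tax, profit) in reported.items():
    print(f"{name}: {effective_rate(tax, profit):.1%} effective, "
          f"${gap_vs_statutory(tax, profit):.1f}B below 21%")

total_gap = sum(gap_vs_statutory(t, p) for t, p in reported.values())
```

On these inputs the effective rates come out at about 1.3%, 3.5%, and 9.6%, and the combined gap for the three companies is about $48B, which is in the same range as the "nearly $50B" estimate that also includes Tesla.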

Why it matters

  • Near-term boost to earnings and cash flow could fuel more AI capex and shareholder returns; some of it reverses as deferrals unwind.
  • Strong incentives for US-based data center and factory buildouts likely pull AI infrastructure timelines forward.
  • Optics risk: plunging taxes amid soaring profits may invite policy backlash and future rule changes.

Discussion Summary:

The comment section evolved into a broad debate covering tax mechanics, wealth inequality, and the efficiency of government spending.

  • Wealth Inequality vs. Incentives: A heated philosophical dispute emerged regarding wealth accumulation. Radical suggestions were made to cap personal wealth at specific limits (ranging from $200k to $1M) to solve inequality, though these were met with skepticism regarding their economic feasibility, the definition of "luxury," and the destruction of incentives.
  • Tax Burden Realities: Users corrected the misconception that large corporations fund the majority of the government. Commenters pointed out that individual income taxes and payroll taxes make up the vast majority of federal revenue, while corporate taxes constitute a much smaller fraction (roughly 10%).
  • Accounting Mechanics: There was a specific discussion regarding the rules of writing off expenses. Users clarified that taxes are levied on profit rather than revenue, and noted recent changes to Section 174 which require software R&D expenses to be amortized over years rather than immediately expensed (though the summaries in the article highlight capital expensing for physical infrastructure like data centers).
  • The California Debate: The conversation drifted into a debate about California as a case study for high taxation. While some users criticized the state for squandering tax revenue on inefficient programs, others defended the cost as the price for labor rights, environmental protections, and a higher quality of life, attributing high costs to restrictive zoning laws rather than taxes alone.

AI Submissions for Sat Feb 21 2026

How Taalas “prints” LLM onto a chip?

Submission URL | 306 points | by beAroundHere | 167 comments

Taalas “prints” Llama 3.1 8B onto an ASIC, claims 17,000 tokens/sec and 10x gains in cost and power

TL;DR: A 2.5-year-old startup, Taalas, built a fixed-function ASIC that hardwires Llama 3.1 8B’s weights into silicon, reportedly hitting ~17k tokens/sec with 3/6-bit quantization, while being ~10x cheaper to run and ~10x more energy-efficient than GPU inference.

How it works

  • No HBM/DRAM loop: Instead of shuttling weights over a memory bus each step, the model’s 32 layers are physically laid out on-chip. Inputs stream through layer-by-layer logic with pipeline registers; activations don’t round-trip to external memory.
  • Weights in silicon: The weights are “engraved” as transistors; Taalas hints at a “magic multiplier” that can store 4-bit data and perform its multiply in what they describe as a single-transistor element, enabling dense, low-power compute-in-memory–style MACs.
  • Minimal SRAM: On-chip SRAM is used for KV cache and to host LoRA adapters; there’s no external DRAM/HBM.
  • One model per chip: It’s a fixed-function device (think cartridge/CD-ROM). To target a new model, they customize only the top metal layers over a generic base fabric, which they say let them map Llama 3.1 8B in ~2 months.

Why it matters

  • Smashes the memory wall: By eliminating weight fetches over a memory bus, the design attacks the core bandwidth/latency bottleneck in today’s GPU LLM inference.
  • Throughput and efficiency: If the 17k tok/s and 10x cost/power claims hold, inference economics—especially at the edge or at massive scale—could shift sharply away from general-purpose GPUs for stable, high-volume models.
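
The memory-wall point can be made concrete with a quick estimate of the weight traffic a conventional design would need at the claimed speed. The sketch assumes one sequence at a time (no batching), every weight read once per token, and an average of 4 bits per weight; the 4-bit figure is an assumption, since the post mentions a 3/6-bit mix.

```python
def required_bandwidth_tb_s(params, bits_per_weight, tokens_per_s):
    """Weight bytes that must be read per second, in TB/s, at batch size 1."""
    bytes_per_token = params * bits_per_weight / 8
    return bytes_per_token * tokens_per_s / 1e12


# 8B parameters, assumed 4 bits each, 17,000 tokens/s as claimed.
print(required_bandwidth_tb_s(8e9, 4, 17_000))  # 68.0
```

About 68 TB/s is far beyond what a single GPU's memory system delivers, so GPUs rely on batching many requests to amortize each weight read. Wiring the weights next to the arithmetic removes that traffic instead of amortizing it.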

Caveats and open questions

  • Flexibility: It’s essentially one-model-per-chip; updating architectures or sizes requires a respin.
  • Quality trade-offs: Real-world accuracy with 3/6-bit quantization isn’t detailed; effects across tasks and long contexts remain to be seen.
  • Practical limits: KV cache size, max context length, batching, sampling features, and how the “single-transistor multiplier” works (analog vs. digital, precision, variability) are not fully explained.
  • Manufacturing/yield: Customizing top metal layers is faster than a full new chip, but still slower and riskier than software updates.

Here is a summary of the discussion:

Feasibility and quantization trade-offs

Commenters crunched the numbers on the claim of packing ~8B coefficients into 53B transistors, concluding the math theoretically holds up if the device relies on aggressive quantization (likely 3-bit or "double FP4"). While some users were excited by the prospect of "model-to-VHDL" synthesis, others worried that hardwiring such strong quantization into silicon would permanently degrade model quality, making the chip useless for tasks requiring higher precision.
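
The commenters' arithmetic is easy to reproduce. The sketch below only divides the two quoted totals; it ignores that part of the transistor budget goes to SRAM, control logic, and I/O, so the real per-weight budget is smaller.

```python
def transistors_per_weight(transistors, weights):
    """Average transistor budget available per stored coefficient."""
    return transistors / weights


per_weight = transistors_per_weight(53e9, 8e9)
per_bit_at_3_bits = per_weight / 3
print(per_weight, round(per_bit_at_3_bits, 2))  # 6.625 2.21
```

Roughly 6.6 transistors per coefficient is why the thread concluded the design only works with very low-bit weights and an unusually dense multiply element.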

The inevitable hardware cycle

Many users viewed this as a predictable evolution of computing, drawing parallels to the transition from CPU to GPU to ASIC in Bitcoin mining, or the move from software rendering to hardware acceleration in 3D graphics. While some suggested FPGAs as a middle ground, others argued FPGAs lack the efficiency/scaling needed to compete with GPUs or ASICs in this specific domain.

The "Inflexibility" bottleneck

The primary skepticism revolved around the risk of obsolescence. With LLM architectures and weights changing almost daily, users noted that a fixed-function chip could become e-waste before it hits the market. Big tech companies likely haven't pursued this yet because they are constrained by fab capacity and cannot afford to bet on a model that might be outdated in six months.

Killer use-case: Edge and Latency

Despite the flexibility concerns, users identified a strong niche for this tech: local inference.

  • Latency: Eliminating the 50-200ms network overhead of the cloud allows for sub-100ms response times, enabling real-time voice and video agents that current GPUs can't serve efficiently over the web.
  • Stable Appliances: It was suggested these chips are perfect for "frozen" models running on drones, phones, or appliances (e.g., a smart fridge) where the model doesn't need to be State-of-the-Art, just functional and offline.

Show HN: Llama 3.1 70B on a single RTX 3090 via NVMe-to-GPU bypassing the CPU

Submission URL | 321 points | by xaskasdf | 82 comments

NTransformer: runs Llama 70B on a single RTX 3090 by streaming layers over PCIe

What’s new

  • A C++/CUDA LLM inference engine that keeps only a subset of layers in VRAM and streams the rest from RAM/NVMe, enabling 70B models on a 24GB GPU. No PyTorch or cuBLAS; GGUF models with multiple quantizations supported.

How it works

  • 3-tier adaptive caching: VRAM-resident layers (no I/O), pinned RAM (H2D only), and NVMe/mmap fallback, auto-sized from your hardware.
  • NVMe direct I/O: a userspace driver reads weights straight into GPU-accessible pinned memory, overlapping disk, PCIe DMA, and compute (SLEP streaming).
  • Layer skipping: cosine-similarity–based calibration can skip ~20 of 80 layers per token at a 0.98 threshold, with minimal quality loss.
  • Self-speculative decoding: uses resident layers as a draft model; no second model required.
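
The layer-skipping idea can be sketched in a few lines. The post only says calibration is based on cosine similarity; comparing each layer's input with its output, as done here, is an assumption about how that works, and the function names are invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def skippable_layers(hidden_states, threshold=0.98):
    """hidden_states[i] is the activation entering layer i; the last entry is
    the final output. Layer i is a skip candidate when its output points in
    nearly the same direction as its input."""
    return [
        i
        for i in range(len(hidden_states) - 1)
        if cosine_similarity(hidden_states[i], hidden_states[i + 1]) >= threshold
    ]


# Toy activations for a 3-layer model: layers 0 and 2 barely change the vector.
states = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0]]
print(skippable_layers(states))  # [0, 2]
```

A layer that leaves the hidden state almost unchanged contributes little to the result, so skipping it saves one layer's worth of weight streaming at a small accuracy cost.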

Performance highlights (author’s tests, RTX 3090 + 48GB RAM)

  • Llama 3.1 8B Q8_0 (resident): ~48.9 tokens/s using ~10GB VRAM.
  • Llama 3.1 70B:
    • Q6_K tiered: ~0.2 tok/s at ~23.1GB VRAM (26 layers in VRAM, rest in RAM).
    • Q4_K_M tiered: ~0.3 tok/s at ~22.9GB VRAM (36 layers in VRAM).
    • Q4_K_M + layer skip: ~0.5 tok/s (fastest reported).
  • Claims up to 83x speedup over naive mmap streaming; bottleneck is PCIe H2D bandwidth (Gen3 x8 ~6.5 GB/s).
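
The reported 70B numbers are consistent with a simple bandwidth-bound estimate: each token has to pull every non-resident layer across PCIe once. The sketch assumes GGUF file sizes of roughly 42.5 GB (Q4_K_M) and 57.9 GB (Q6_K), equal-sized layers, and that skipped layers are all streamed ones; none of these figures come from the post.

```python
def streamed_tokens_per_s(model_gb, total_layers, resident_layers,
                          link_gb_s, skipped_layers=0):
    """Upper bound on tokens/s when the PCIe link is the only bottleneck."""
    streamed = max(total_layers - resident_layers - skipped_layers, 0)
    if streamed == 0:
        return float("inf")
    gb_per_token = model_gb * streamed / total_layers
    return link_gb_s / gb_per_token


LINK = 6.5  # GB/s, the PCIe Gen3 x8 figure quoted above

q6 = streamed_tokens_per_s(57.9, 80, 26, LINK)          # about 0.17
q4 = streamed_tokens_per_s(42.5, 80, 36, LINK)          # about 0.28
q4_skip = streamed_tokens_per_s(42.5, 80, 36, LINK, 20)  # about 0.51
```

These estimates land close to the reported ~0.2, ~0.3, and ~0.5 tok/s, which supports the claim that host-to-device bandwidth, not compute, sets the pace.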

Caveats and setup

  • Linux + CUDA 13.1, gcc-14, CC 8.0+ GPU (3090 tested). Optional NVMe on a separate PCIe slot for best results.
  • For NVMe-direct mode, the setup script performs invasive system changes: disables IOMMU, patches NVIDIA DKMS for recent kernels, tweaks CUDA headers, and binds NVMe via VFIO with “unsafe noiommu” mode. Not recommended on multi-tenant/production systems; missteps can break your GPU driver.

Why it matters

  • A clever, low-level approach that makes 70B models usable on consumer GPUs by trading speed for capacity and I/O orchestration. Great for experimentation and edge cases where VRAM is the limiting factor—just be mindful of the heavy-duty system tweaks and modest 70B throughput.

Based on the discussion, here is a summary of the community's reaction:

Performance vs. Practicality

Discussion focused heavily on whether 0.2–0.5 tokens/second is usable.

  • Chat vs. Batch: Most users agreed this is too slow for interactive chat, but several (like umairnadeem123) noted it is viable for automated background tasks (batch processing) where latency doesn't matter, offering a private, fixed-cost alternative to APIs.
  • Better Alternatives: Users like flrdtn pointed out that standard CPU offloading (system RAM + GPU) is currently faster than this method, citing ~1.5 t/s on a Ryzen 7950X + 3090.
  • Small Models: Some argued that for interactive use, a high-quality 8B model entirely in VRAM offers a better experience than a crippled 70B model.

Hardware Bottlenecks & Apple Comparisons

  • The Apple Factor: MarcLore and others drew comparisons to Apple’s M-series chips (Unified Memory), which handle 70B models natively with much higher throughput, though at a higher hardware entry price.
  • Author’s Constraints: The author (xsksdf) clarified that their benchmarks are severely bottlenecked by their specific hardware setup—a B450 motherboard limiting the GPU to PCIe 3.0 x8 speeds. A modern PCIe 4.0/5.0 x16 setup would likely yield significantly higher throughput.

The "Why" (PlayStation 2 Origins)

In a surprising reveal, the author explained that this project stems from their background in retro-gaming development. They previously built a transformer engine for the PlayStation 2 (PS2-LLM), where the console's tiny 32MB RAM and 4MB VRAM forced them to master DMA (Direct Memory Access) and layer streaming. They simply applied the same "extreme constraint" logic to the RTX 3090.

Cost & Power

There was a debate regarding the economics of running this locally versus using cheap APIs.

  • Energy: While esquire_900 calculated it might be cheaper than APIs over time, lvntysvn reminded the thread to factor in the 300W+ power draw of a 3090 running for hours to generate a single report.
  • Utilization: The author noted that due to the I/O bottleneck, the GPU isn't actually hitting full TDP (power limit), so electricity costs might be lower than expected.
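
The energy side of that debate reduces to one formula. The sketch below uses the 300 W figure from the thread as a worst case; the 1,000-token report length is a hypothetical input, and the author's point about not reaching full TDP would lower the result.

```python
def energy_kwh(tokens, tokens_per_s, watts):
    """Energy used to generate `tokens` at a constant power draw."""
    seconds = tokens / tokens_per_s
    return watts * seconds / 3_600_000  # watt-seconds to kWh


# Hypothetical 1,000-token report at 0.3 tok/s and a 300 W draw.
report_kwh = energy_kwh(1000, 0.3, 300)
print(round(report_kwh, 2))  # 0.28
```

Multiplying the result by a local electricity price gives a per-report cost that can be compared with an API bill.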

zclaw: personal AI assistant in under 888 KB, running on an ESP32

Submission URL | 230 points | by tosh | 125 comments

zclaw: an 888 KiB AI assistant firmware for ESP32

  • What it is: A tiny C-based “agent” for ESP32 boards that turns a microcontroller into a natural-language assistant. It handles schedules (cron-style), GPIO control with guardrails, persistent memory, and user-defined tools. Chat via Telegram or a hosted web relay. Persona options include neutral, friendly, technical, and witty.

  • How it works: Runs fully on-device as an orchestrator with Wi‑Fi, TLS, and certs, but uses cloud LLMs (Anthropic, OpenAI, OpenRouter) for reasoning. Includes provisioning, rate limits (default 100/hour, 1000/day), and optional encrypted credentials in flash.

  • Footprint bragging rights: All-in firmware cap of 888 KiB, including ESP-IDF/FreeRTOS, networking, TLS/crypto, and cert bundle. Current build: ~869,952 bytes. App logic alone is ~35 KiB (~4%); the bulk is networking/TLS/runtime.

  • Hardware and dev: Tested on ESP32-C3/S3/C6 (recommended: Seeed XIAO ESP32-C3). QEMU profile available. One-line bootstrap, secure flash, provisioning, relay/serial benchmarking, and a web relay with mobile chat UI.

  • Why it’s interesting: It shows how much “agent” capability you can pack into a sub‑1 MB firmware on a $5 microcontroller—no local LLM, but solid tool composition, scheduling, and state, all in C.

  • License and repo: MIT. GitHub: https://github.com/tnm/zclaw — Docs: https://zclaw.dev
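
The default limits mentioned above (100 calls per hour, 1,000 per day) can be illustrated with a two-window quota counter. zclaw itself is written in C and its actual limiter is not described here, so this Python version is only a sketch of the general technique with invented names.

```python
import time


class QuotaLimiter:
    """Reject calls once either the hourly or the daily quota is used up."""

    def __init__(self, per_hour=100, per_day=1000, clock=time.time):
        self.limits = {3600: per_hour, 86400: per_day}
        self.clock = clock
        # window length -> (index of current window, calls used in it)
        self.usage = {window: (None, 0) for window in self.limits}

    def allow(self):
        now = self.clock()
        current = {}
        for window, limit in self.limits.items():
            index = int(now // window)
            last_index, used = self.usage[window]
            if index != last_index:
                used = 0  # a new window started, so the counter resets
            if used >= limit:
                return False
            current[window] = (index, used)
        # Count the call only once every window has accepted it.
        for window, (index, used) in current.items():
            self.usage[window] = (index, used + 1)
        return True
```

Fixed windows like these need only two counters and two timestamps, which suits a device where memory is the main constraint.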

Notes:

  • Cloud LLM required (not on-device inference).
  • Guardrails for GPIO (including bulk reads).
  • Scripts cover flashing, provisioning, Telegram backlog clearing, emulation, and latency benchmarking.

Here is the summary of the discussion on Hacker News:

The comment section explores the utility of running AI agents on bare-metal microcontrollers versus full operating systems, alongside skepticism regarding the "agent" hype cycle.

  • ESP32 vs. Linux for Agents: umairnadeem123 argues that the primary appeal of zclaw is the "zero-maintenance" aspect of the ESP32; unlike a Linux box which requires updates and suffers from OOM kills, an ESP32 provides a simpler, predictable failure mode for always-on orchestration. However, hsbvhbzb counters that this approach introduces new points of failure—specifically reliance on cloud APIs, Wi-Fi stability, and the internet—suggesting that swapping an OS for a microcontroller doesn't inherently solve reliability problems.
  • Tamagotchis and Use Cases: GTP proposed building an "intelligent Tamagotchi" using this stack. tempaccount5050 shared their experience attempting this, noting that an LLM alone isn't enough; the project still requires a state machine to define constants (like "hunger") to prevent the AI from getting stuck in a loop. Others, like post_below, discussed more complex home automation, such as a self-hosted agent that manages grocery lists via Signal and automatically populates browser-based shopping carts.
  • The "Claw" Ecosystem & Protocols: There is confusion regarding the "OpenClaw" ecosystem compared to zclaw. blnsr compared OpenClaw to ROS (Robot Operating System) for distributed nodes, but TheDong quipped that the only real protocol here is English, stating we are in a "post-API world" where natural language turns into bash or browser tool invocations.
  • Security and Hype: The discussion veered into the risks of IoT agents. dlt713705 jokingly envisioned a future where vacuum cleaners declare war on refrigerators via Discord. On a serious note, h4ch1 criticized the "ostrich-head-in-the-sand" enthusiasm for agent frameworks, warning that giving unfettered API and tool access to unverified dependencies (likened to eating "cake made of plastic") is a disaster waiting to happen.
  • Technical Implementation: Dr_Birdbrain and others dismissed the project as merely a "tiny LLM power agent wrapper" connected to the internet, though some appreciated the engineering effort required to fit the TLS stack and runtime into less than 1 MB of flash.
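
The state-machine point from the Tamagotchi bullet above can be sketched roughly as follows. This is a minimal illustration, not tempaccount5050's code: the `Pet` class, the `HUNGRY_AT` threshold, and the `fake_llm` stand-in are all hypothetical names. The idea is that deterministic code owns the state (such as hunger) and the model is only asked to phrase a reply.

```python
from dataclasses import dataclass
from typing import Callable

HUNGRY_AT = 6  # hypothetical threshold, chosen only for illustration


@dataclass
class Pet:
    hunger: int = 0          # 0 = full, 10 = starving
    mood: str = "content"    # derived from hunger, never set by the LLM


def tick(pet: Pet) -> None:
    """Advance time deterministically: hunger rises and mood follows from it."""
    pet.hunger = min(10, pet.hunger + 1)
    pet.mood = "hungry" if pet.hunger >= HUNGRY_AT else "content"


def feed(pet: Pet) -> None:
    """Reset the state explicitly; the model cannot do this on its own."""
    pet.hunger = 0
    pet.mood = "content"


def speak(pet: Pet, llm: Callable[[str], str]) -> str:
    """Ask the model only to phrase a line for the state we hand it."""
    prompt = (
        f"You are a virtual pet. Mood: {pet.mood}. "
        f"Hunger: {pet.hunger}/10. Say one short line."
    )
    return llm(prompt)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption: any text-in, text-out API)."""
    return "Feed me!" if "Mood: hungry" in prompt else "All good."


pet = Pet()
for _ in range(6):
    tick(pet)
print(pet.mood, "->", speak(pet, fake_llm))
```

Because the state transitions live outside the model, the pet cannot talk itself into or out of being hungry, which is one way to avoid the looping behavior described above.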

AI uBlock Blacklist

Submission URL | 265 points | by rdmuser | 114 comments

AI uBlock Origin Blacklist: A crowdsourced filter list to hide AI-generated “content farm” sites. GitHub user alvi-se maintains a manually curated uBlock Origin list (and a uBlacklist version for search engines) that blocks domains and specific paths churning out SEO’d, low-value, ad/affiliate-heavy AI articles. Installation is via a one-click subscription link or by adding the raw list URL as a 3rd-party filter in uBlock. The author argues automated detection is unreliable, so entries are added by hand and guided by telltale signs: fluffy/baroque intros, “Comprehensive/Ultimate Guide” titles, few outbound links or sources, and aggressive referral links. Contributors are encouraged to file issues or PRs; the repo avoids blanket-banning platforms like Medium/dev.to by targeting offending blogs/paths only. Despite being personal and somewhat Italy-biased, the maintainer says the list is effective because the same spammy sites recur across searches. As of now: ~213 commits, ~349 stars.

Based on the discussion, here is a summary of the comments:

Concerns Regarding Maintenance and False Positives A significant portion of the discussion focuses on the risks associated with personal, manually curated blacklists. Several users criticize the specific maintainer of this list, describing them as having a "suspicious attitude" and believing themselves to be "infallible."

  • Lack of Recourse: Examples were shared of personal websites being blocked by similar lists on PiHole or uBlock; users noted that requests to be unblocked often go unanswered or are ignored entirely.
  • Domain Churn: Users pointed out that static blacklists fail to account for domain ownership changes. A domain currently hosting AI spam might later be purchased by a legitimate owner, but it remains in a "reputational blackhole" with no easy mechanism for removal.
  • Comparison to Anti-Cheat: The situation was likened to "VAC bans" in gaming, where false positives occur, but the system is treated as absolute.

The State of Search and "AI Slop" Despite the concerns about the list's implementation, many commenters expressed a desperate need for tools to filter AI-generated noise.

  • Search Quality: Users described the current search experience as being drowned in "slop," making it difficult to find human-created content (specifically on platforms like Reddit).
  • "Hater" Lists vs. Utility: There was debate regarding alternative lists (such as the "HUGE AI Blocklist"). Some argued these are merely "hater lists" that block sites for tangentially related reasons (like having an AI widget or unrelated grievances), while others defended aggressive blocking as the only way to improve the user experience.

Side Discussion: AI in the Workplace A tangential sub-thread emerged regarding the use of AI for writing text (emails, reports) in professional settings.

  • "Cosmetic Surgery" Analogy: One user described a coworker who uses Copilot to generate 20-paragraph emails as having "extraordinarily bad cosmetic surgery"—it looks polished at a glance but is fundamentally uncanny and distinct from human communication.
  • Skill vs. Laziness: Commenters debated whether this usage covers for "functional illiteracy" and language barriers, or if it simply encourages laziness and results in "mediocre crap" that colleagues are forced to read.

Alternatives and Technical Solutions Users shared various alternatives to the submitted list, including:

  • uBlacklist: Specifically mentioned as a tool to remove specific domains from search engine results pages (SERPs).
  • AdGuard/PiHole: Discussed as broader network-level solutions, though they suffer from the same false-positive risks if the underlying lists are poor.
  • Other Repos: Links to other GitHub repositories and Gists were shared for those looking for different filtering criteria.

Cord: Coordinating Trees of AI Agents

Submission URL | 151 points | by gfortaine | 75 comments

The pitch

  • Most multi‑agent frameworks make developers predefine roles, graphs, and handoffs. Cord flips this: you give a goal, and the agent plans, decomposes, parallelizes, blocks on dependencies, and asks humans when needed.

What’s different

  • Runtime decomposition: The agent decides the workflow as it goes, not from a static graph or role roster.
  • Spawn vs. Fork: Two context-flow primitives.
    • Spawn: clean slate; only the prompt plus explicit dependencies. Good for independent subtasks.
    • Fork: inherits all completed sibling results. Good for synthesis and analysis.
  • Explicit dependencies and blocking: Tasks can wait on others and on human answers, enabling predictable parallelism.
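
A rough way to picture the Spawn/Fork distinction is as two different rules for assembling a child's context. The sketch below is a toy in-memory model, not Cord's implementation (which, per the post, uses Claude Code processes and a shared SQLite DB); the `Task` class and the function signatures are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class Task:
    name: str
    context: str                      # everything this task gets to see
    result: Optional[str] = None      # filled in when the task completes
    children: List["Task"] = field(default_factory=list)


def _build_context(prompt: str, sources: Sequence[Task]) -> str:
    """Prompt first, then one line per result the child is allowed to see."""
    return "\n".join([prompt] + [f"[{t.name}] {t.result}" for t in sources])


def spawn(parent: Task, name: str, prompt: str, deps: Sequence[Task] = ()) -> Task:
    """Clean slate: only the prompt plus explicitly listed dependencies."""
    child = Task(name, _build_context(prompt, deps))
    parent.children.append(child)
    return child


def fork(parent: Task, name: str, prompt: str) -> Task:
    """Inherit the results of every sibling that has already completed."""
    done = [c for c in parent.children if c.result is not None]
    child = Task(name, _build_context(prompt, done))
    parent.children.append(child)
    return child


root = Task("root", "Should we migrate from REST to GraphQL?")
research = spawn(root, "research", "Research GraphQL trade-offs")
research.result = "research notes"
audit = spawn(root, "audit", "Audit the existing API")
audit.result = "audit notes"
analysis = fork(root, "analysis", "Compare the options")
print(analysis.context)
```

In this toy version, a spawned task never sees sibling output unless it is passed in `deps`, while a forked task automatically picks up whatever its siblings have finished so far.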

Example

  • Given “Should we migrate from REST to GraphQL?”, Cord:
    • Spawns parallel research and API audit
    • Asks a human about traffic scale, blocked on the audit
    • Forks a comparative analysis that inherits prior results
    • Writes a tailored recommendation after dependencies resolve

Why it matters

  • Moves from developer-scripted workflows to agent-discovered structure, matching how strong models plan and reason today.
  • Introduces simple, learnable context flow so agents can parallelize without losing necessary shared knowledge.

Under the hood

  • Each agent is a Claude Code CLI process with MCP tools, coordinated via a shared SQLite DB.
  • Minimal API: spawn, fork, ask, complete, read_tree.
  • Roadmap idea: first-class context_query to distill and pass only relevant context to children via a compaction subagent.

The Debate: Dynamic Planning vs. Deterministic Control The discussion centered on a fundamental divide in agentic engineering:

  • Reliability vs. autonomy: Some users argued that strict, discrete topologies (static DAGs) form the only viable path for reliable systems, warning that unconstrained agent planning compounds probabilistic errors.
  • Obsolescence of hardcoding: Counter-arguments suggested that modern models (like Claude 3.5 Sonnet) are now sufficiently capable of planning and decomposition that hardcoding task graphs is becoming obsolete.

Key Technical Feedback

  • Context flow primitives: The community reacted positively to the "Spawn" (clean state) vs. "Fork" (inherited context) distinction, viewing it as a clever strategy to manage context window pollution.
  • Feature Suggestion: One commenter proposed adding a distinct context_query primitive—a mechanism where a subagent requests specific data via natural language query rather than receiving a raw dump of the parent’s context, effectively acting as "context compression."
  • Comparisons: Users drew parallels to Anthropic’s internal tooling (Agent Teams) and Claude Code’s existing capabilities. The OP clarified that while Claude can spawn subagents, Cord aims to enable deeper, recursive trees where subagents can spawn their own sub-subagents.

Framework Fatigue & Skepticism

  • "Not another framework": Several commenters expressed fatigue with the proliferation of orchestrator tools (referencing LangGraph), with some preferring simple shell scripts or "roll-your-own" solutions over adopting new protocols.
  • AI-generated content: A meta-discussion emerged regarding the blog post's writing style; multiple users felt the prose was obviously AI-generated, which they argued detracted from the message's credibility. The OP acknowledged this and promised follow-up data.

Large Language Model Reasoning Failures

Submission URL | 40 points | by T-A | 80 comments

Large Language Model Reasoning Failures (Song, Han, Goodman) — a TMLR 2026 survey with Survey Certification — maps where LLMs still stumble, even on “simple” problems, and organizes a scattered literature into a single playbook.

What’s new

  • A two-axis taxonomy:
    • Types of reasoning: embodied vs. non-embodied; the latter split into informal (intuitive) vs. formal (logical).
    • Types of failures: fundamental (architecture-level), application-specific (domain-bound), and robustness issues (brittleness to small prompt/task variations).
  • For each failure class: clear definitions, evidence from prior studies, suspected root causes, and mitigation strategies collected from the literature.
  • A curated GitHub repository aggregating papers on LLM reasoning failures for quick entry into the area.

Why it matters

  • Gives researchers and product teams a shared vocabulary to diagnose errors, design evaluations across reasoning modes, and choose mitigations.
  • Highlights that many “reasoning” wins remain fragile, with inconsistent behavior under minor changes.

Links

  • Paper: arXiv:2602.06176 (with arXiv-issued DOI)
  • Repository: included via the paper’s GitHub link

The Discussion

The Hacker News discussion focuses heavily on whether the paper’s claims about "fundamental" failures hold up against the most recent state-of-the-art models, alongside a broader debate about the nature of machine intelligence.

Arithmetic, Tools, and "Cheating" A contentious debate erupted over the paper's assertion that LLMs fail at basic arithmetic (specifically large-number multiplication).

  • User smnwrds attempted to "falsify" the paper's claims by testing 20-digit multiplication on GPT-o1 Pro, which solved the problems correctly.
  • Others, notably rybswrld and chcknmprnt, countered that this doesn't prove the LLM can reason; rather, it highlights how frontier models increasingly rely on hidden tools. They argued that models often offload math to internal Python interpreters or obscure "Chain of Thought" processes, effectively "faking" the alignment.
  • When chcknmprnt tested a local model (Mistral) where tool-use was explicitly disabled, the model hallucinated the answer, supporting the paper's thesis. smnwrds dismissed this as picking on "the worst model," while others maintained that even closed-source models likely rely on undocumented internal subsystems (like specific Rust optimizations) to patch these fundamental architectural weaknesses.

Anthropomorphism vs. Architecture Top-level commenter srgmtt welcomed the paper as a necessary check against anthropomorphism.

  • They argued that the identified failures—such as inability to count like a toddler or handle object permanence—stem from the nature of "next-token predictors" being fundamentally different from human general intelligence.
  • lnsbr and mttmg added that unlike humans, who evolve and maintain long-term dynamic memories, LLMs rely on frozen weights, making the comparison to human reasoning inherently flawed.
  • otabdeveloper4 cynically noted that these systems are sold as AGI primarily to sustain stock market narratives.

Social and Moral Fragility Finally, Lapel2742 highlighted the paper's points on social reasoning failures.

  • The commenter ridiculed the idea that models are ready for ethical decision-making, noting they struggle with social norms and cultural context.
  • They joked that the industry has successfully created AI in the image of "Techbro-CEOs" rather than a system congruent with broadly shared human values. rnlszlrn agreed, suggesting current models embed values that are incompatible with large percentages of the global population.

Show HN: AI writes code – humans fix it

Submission URL | 5 points | by stasman | 3 comments

Humans-on-demand for broken AI code: a 24-hour bug-fix marketplace

A new service targets the growing “AI wrote it, now it’s broken” gap. You post a bug, set a price (from $49), and a vetted human developer delivers a fix within 24 hours—no meetings, no chat, just a PR.

Key details:

  • Workflow: Post task with context/screenshots → set your price → a verified dev gets read-only repo access → they propose a fix and submit a delivery → on approval, you receive a pull request.
  • Pricing: You choose the bounty (min $49) + 10% platform fee. Payment is charged when a dev accepts, held in escrow, released on your approval. If no one picks it up in 24 hours, the hold auto-expires. Cancel anytime before acceptance.
  • Quality/safety: Developers are manually vetted via LinkedIn/GitHub. You get 1 free revision if the first attempt misses. If deadlines slip or no one picks it up, you’re refunded.
  • Positioning: “Introvert-friendly” debugging—no calls, fast turnaround—aimed at users of tools like Bolt, Replit, Cursor, Claude Code, Windsurf, and Base44.

Why it matters: As AI code-gen accelerates, this is a lightweight, SLA-backed alternative to hiring a freelancer or slogging through fixes yourself—human-in-the-loop debugging as a service.

A new service proposes a bounty-based marketplace (minimum $49) where vetted developers fix broken, AI-generated code within 24 hours via pull request, functioning as a "human-in-the-loop" layer for tools like Replit or Cursor.

Discussion:

  • The Model: One commenter pointed out that this approach is backed by research suggesting human-AI pairs consistently outperform AI working autonomously.
  • Technical Glitches: Early feedback included a bug report regarding the onboarding process, with a user noting that Stripe incorrectly flagged their location as the Netherlands. They noted the idea was "cool" despite needing to contact the developer to resolve the payment issue.
  • Developer Experience: Sentiment regarding the work itself was mixed, with one user remarking that the prospect of fixing broken AI code "sounds miserable."

Why is Claude an Electron app?

Submission URL | 395 points | by dbreunig | 410 comments

Why Claude (and so many others) still ship as Electron apps, even in the agent era

  • The pitch: If coding agents can turn a spec and test suite into cross-platform code, why not ship snappy native apps per OS instead of bundling a browser with Electron?
  • The reality: Agents excel at the first 90%, but the last 10%—edge cases, real‑world quirks, regressions, and ongoing support—is where costs explode. Maintaining three native codebases (Mac/Win/Linux) triples the surface area for bugs and support.
  • Case in point: Anthropic’s much‑touted agent swarm spent ~$20k building a Rust‑based C compiler that flew through early tests but hit a wall on stability and completeness—impressive, yet largely unusable without heavy human cleanup.
  • Why Electron wins today: One codebase, familiar web stack, and instant cross‑platform reach outweigh bloat, lag, and weaker OS integration for most teams. The incentives favor shipping once over hand‑holding agents to production‑ready parity across three native apps.
  • Bottom line: Spec‑driven, agent‑powered native builds are promising, but the last mile and ongoing maintenance keep Electron in the lead—for now—even for AI leaders like Anthropic.

Based on the discussion, here is a summary of the comments:

The Insider Perspective A commenter identifying as an engineer on the project noted that the team had previous experience with Electron and preferred building non-natively to share code between web and desktop. However, they acknowledged that engineering tradeoffs might change in the future.

User Experience: Terminal vs. Desktop Users drew a sharp distinction between Anthropic's tools.

  • Claude Code (CLI): Was described as "magical" and highly effective, even on single terminals.
  • Claude Desktop (Electron): Received significant criticism for poor performance. Users reported it turning laptops into "toasters," causing fans to run wildly, and suffering from lag/freezing (one user noted delays of multiple seconds when switching tasks).
  • Workarounds: Some users resort to "disposable conversations" or stick strictly to the terminal interface to avoid the resource heaviness of the desktop app.

The "Coding is Solved" Irony A major theme of the discussion was the perceived contradiction between Anthropic’s marketing and their tech stack choices.

  • The Paradox: Commenters questioned why, if Claude is capable of "solving coding" or effortlessly porting code between languages, Anthropic cannot use their own agent to maintain three native codebases (Mac/Windows/Linux) instead of relying on Electron.
  • The Rebuttal: Others argued that "coding" isn't the bottleneck—maintenance is. Even if AI generates the code, maintaining three separate stateful architectures is a logistical nightmare compared to deploying a single web-stack application.

The Broader Electron Debate The thread evolved into a classic debate over the viability of Electron:

  • Defenders: Argued that performance complaints are often hyperbole. They cited VS Code and Gmail as examples of complex, successful web-stack applications. Some argued that "native app development is dead" outside of gaming and walled gardens (iOS), and that the browser is the only runtime that matters.
  • Detractors: Countered that VS Code is an outlier that relies heavily on native modules (Rust/C++) and WebGL optimizations to function well, implying standard Electron apps remain "junk." Users pointed to native alternatives (like Neovim or Thunderbird) as proof of the superior efficiency and speed of native code compared to web technologies.

How an inference provider can prove they're not serving a quantized model

Submission URL | 67 points | by FrasiertheLion | 48 comments

Tinfoil’s “Modelwrap” aims to solve a long‑standing gripe with inference APIs: you can ask for a specific model, but you can’t really know what you got. Providers can silently swap in different quantizations, tweak context windows under load, or drift over time—something users have observed across vendors and even within the same vendor.

What they built: verifiable inference that binds an API call to an exact set of model weights at runtime, without changing app code.

How it works

  • Public commitment to weights: Tinfoil publishes a single Merkle-tree root hash for the model’s weight files (e.g., 140 GB split into 4 KB blocks).
  • Enclave attestation, extended to data: They use secure enclaves, but go beyond “what binary booted” by attesting two things at launch: the committed root hash and the presence of an enforcement mechanism.
  • Kernel-enforced verification on every read: dm-verity in the Linux kernel checks each disk block read against the Merkle tree; if any byte doesn’t match the committed root, the read fails with an I/O error. Apps like vLLM don’t need modifications and can’t accidentally read uncommitted bytes.
  • Client-side verification: On each request, clients can verify the enclave’s attestation report contains the expected root hash and dm-verity configuration, tying the running server to the public commitment.
  • Analogy: This is the same mechanism behind Android Verified Boot (root hash + kernel-enforced Merkle checks), repurposed for model weights.

Why it matters

  • Proves you’re hitting the exact weights you pinned (no silent quantization or model swaps).
  • Stabilizes evals and regression tracking across time/providers.
  • Works for closed-source models too: you can’t see the weights, but you can verify you’re getting the same committed bits every time.

Caveats and open questions

  • Scope: This guarantees the bytes read from disk match the commitment; it doesn’t by itself prove anything about post-load transformations or runtime configuration unless those are also covered by attestation.
  • Trust base: You’re trusting the enclave/CPU vendor’s attestation and kernel integrity.
  • Practicalities: Update/rollout mechanics, performance overhead of dm-verity, and how broader server config (e.g., context window, KV cache policies) is pinned weren’t detailed here.

Bottom line: Modelwrap turns “trust us” into “verify us,” giving API users a cryptographic handle (a root hash) they can pin to—and a kernel-enforced path that makes serving anything else fail fast.

The discussion revolves around the technical limitations of "black box" verification (checking outputs) versus the cryptographic verification proposed by Tinfoil, with the author (FrasiertheLion) answering questions about the specific security architecture.

The feasibility of "Output Checking" The thread began with users questioning why complex attestation is necessary when users could simply check for deterministic outputs using a fixed seed.

  • The Consensus: Commenters (including trpplyns, jshlm, and msrblfnc) argued that checking outputs is unreliable. Floating-point math is not truly associative, meaning the order of operations matters.
  • Hardware Variance: At scale, providers split models across different GPU/CPU combinations and use optimizing compilers that change instruction scheduling. This results in slight numerical differences that break strict determinism, making it impossible to distinguish between a benign hardware change and a malicious model swap based on output alone.
  • Benchmarking issues: Aurornis noted that while external benchmarking sites exist, they are expensive to maintain and often produce noisy data rather than definitive proof of model degradation.
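
The non-associativity point is easy to reproduce with ordinary double-precision floats. The snippet below is a small standalone demonstration; the specific values are chosen only to make the effect visible and have nothing to do with any particular model or provider.

```python
import math


def naive_sum(values):
    """Plain left-to-right accumulation, the way a simple reduction loop would do it."""
    total = 0.0
    for v in values:
        total += v
    return total


# Same three numbers, different grouping, different result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)              # False
print(math.isclose(left, right))  # True: the difference is tiny, but it is not zero

# Same four numbers, different order: small terms get absorbed by large ones.
print(naive_sum([1e16, 1.0, -1e16, 1.0]))   # 1.0
print(naive_sum([1e16, -1e16, 1.0, 1.0]))   # 2.0
```

A parallel reduction on different hardware effectively picks a different grouping and ordering, which is why bit-exact output comparison is a weak signal for detecting a swapped model.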

Attestation Mechanics & Trust A significant portion of the discussion focused on how the client effectively trusts the server.

  • The Mechanism: FrasiertheLion explained that the system relies on hardware-backed enclaves (Intel TDX, AMD SEV-SNP, Nvidia Confidential Computing).
  • Preventing "Replay" Attacks: Users (rbls, vrptr) asked how a client knows the provider isn't faking the attestation report. The author clarified:
    1. The enclave generates an ephemeral key pair at boot.
    2. The public key is embedded in the hardware-signed attestation report.
    3. The client encrypts their request using that public key.
    4. Only the specific, verified enclave instance can decrypt and process the request, preventing Man-in-the-Middle attacks or spoofed reports.
  • Trust Anchor: jlsdrn observed that this technology effectively shifts the "root of trust" from the API provider (who might cut corners) to the hardware manufacturer (Intel/AMD), who certifies the chip state.

Other Notes

  • Apple: There was interest in Apple’s similar approach with "Private Cloud Compute," which users felt offered strong integrity guarantees due to Apple's control over the entire hardware/software stack.
  • Quantization: rbrnd noted that quantization isn't inherently bad—users often want the trade-off of 99% quality for 50% cost—but the implication is that transparency about which version is running remains the key issue.

AI Submissions for Fri Feb 20 2026

Ggml.ai joins Hugging Face to ensure the long-term progress of Local AI

Submission URL | 786 points | by lairv | 206 comments

Headline: ggml.ai (team behind llama.cpp) joins Hugging Face to power the next phase of Local AI

What happened:

  • Georgi Gerganov and the ggml.ai team are joining Hugging Face while continuing to lead and maintain ggml and llama.cpp full-time.
  • The projects remain 100% open source and community-driven; technical decisions stay with the existing maintainers.
  • Hugging Face will provide long-term resources to sustain and grow the ecosystem.

Why it matters:

  • llama.cpp and the GGUF ecosystem have become a backbone for running powerful models locally on consumer hardware. This move shores up sustainability and accelerates support for new models and quantizations.
  • It formalizes a productive collaboration: HF engineers have already contributed core features, multimodal support, an inference server/UI, additional architectures, GGUF compatibility, and integrations with HF Inference Endpoints.

What to expect next:

  • Tighter, near “single-click” integration between transformers and ggml/llama.cpp for broader model coverage and easier validation.
  • Better packaging and UX to make local model deployment simpler and more ubiquitous.
  • Faster turnaround for new model support and quant releases.

Bigger picture:

  • This is a clear bet on local inference as a serious alternative to cloud AI, aiming to build an efficient, open inference stack across devices—framed by both teams as a step toward widely accessible, open-source “superintelligence.”

Impact for developers and users:

  • Smoother pipelines from transformers to GGUF/llama.cpp.
  • Improved tooling and installers for desktop/server setups.
  • Continued openness and community autonomy, with added stability and resourcing from HF.

Open questions to watch:

  • Details of governance and roadmap prioritization under the new arrangement.
  • How quickly new architectures and multimodal features flow through transformers → GGUF → llama.cpp.


Top Story: ggml.ai (llama.cpp) joins Hugging Face

The Gist Georgi Gerganov and the team behind ggml.ai (the creators of llama.cpp and the GGUF file format) are joining Hugging Face. The team will continue to work on their projects full-time with a commitment to keeping them 100% open-source and maintaining independent decision-making. Hugging Face is stepping in to provide the long-term resources and infrastructure needed to sustain the ecosystem.

Why it Matters This consolidates the "local AI" stack. llama.cpp has become the standard for running LLMs on consumer hardware (MacBooks, gaming PCs, etc.). By officially aligning with Hugging Face, users can expect a seamless pipeline where new models uploaded to HF are immediately compatible with local inference tools (GGUF), along with faster support for new architectures and easier "one-click" deployment options.

Hacker News Discussion Summary

The discussion on Hacker News focused heavily on the logistics of local AI, the sustainability of Hugging Face, and the technical nuance of quantization.

  • Hugging Face as the "Real" OpenAI: Commenters praised Hugging Face as the true champion of open-source AI, noting the immense, likely expensive service they provide by hosting petabytes of model weights for free. Several users expressed hope that HF has a sustainable business model so this resource doesn't disappear.
  • The "Data Cap" Bottleneck: A significant portion of the thread revolved around the practical struggles of local AI enthusiasts. Users reported downloading terabytes of models per week, triggering data caps from residential ISPs (like Comcast and AT&T). This led to a debate on why Hugging Face doesn't utilize BitTorrent to offload bandwidth costs. The counter-argument was that HF likely avoids torrents to maintain accurate download metrics (vanity metrics) and to manage access control for "gated" models (like Llama 3).
  • Quantization Quality Control: There was a technical exchange (involving Daniel Han of Unsloth) regarding the reliability of quantized models. Users expressed concern that aggressive quantization "lobotomizes" models in ways that are hard to detect with standard benchmarks (like perplexity). The consensus was that running comprehensive benchmarks on every quantization format is currently too expensive ($1k–$100k) for most open-source maintainers.
  • East vs. West Open Weights: A debate emerged regarding the quality of Western open models (like Mistral) versus Chinese open models (DeepSeek, Qwen, Kimi). Some users argued that Chinese models are currently winning on efficiency and reasoning benchmarks, while others countered that they still lack Western cultural knowledge and nuance required for tasks like creative writing or roleplay.
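
For readers unfamiliar with why lower-bit quantization costs accuracy, the sketch below shows a deliberately simple symmetric round-to-nearest scheme with a single scale per tensor. This is an assumption-laden toy: real GGUF quantization formats use block-wise scales and more elaborate layouts, and the weight values here are made up for illustration.

```python
from typing import List


def quantize(weights: List[float], bits: int) -> List[float]:
    """Round each weight to the nearest point on a symmetric grid with 2**(bits-1)-1 positive steps."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        return list(weights)
    return [round(w / scale) * scale for w in weights]


def max_abs_error(weights: List[float], bits: int) -> float:
    """Worst-case difference between an original weight and its quantized value."""
    return max(abs(w - q) for w, q in zip(weights, quantize(weights, bits)))


weights = [0.013, -0.42, 0.27, 0.9, -0.731, 0.05]  # illustrative values only
for bits in (8, 4, 3):
    print(bits, "bits -> max error", max_abs_error(weights, bits))
```

The per-weight error grows as the bit width shrinks, but a number like this says little about downstream task quality, which is the gap the commenters were pointing at when they said damage is hard to detect with standard benchmarks.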

Every company building your AI assistant is now an ad company

Submission URL | 258 points | by ajuhasz | 135 comments

Juno opens pre-orders for a local, always-on AI assistant—and argues ads make rival assistants a surveillance risk

  • The pitch: Adam Juhasz says the next wave of assistants must be “always on”—hearing and seeing context across rooms, wearables, and time—to be truly proactive. Wake words are a dead end because the most valuable context happens in natural, unprompted conversations.

  • The clash: He argues nearly every major assistant effort is now ad-funded, creating structural incentives to capture and monetize continuous audio/visual data. Policies can change; architectures can’t. “Policy is a promise. Architecture is a guarantee.”

  • The remedy: Keep inference local. If models run on-device/in-home, there’s no API to hit, no telemetry to harvest, no data to subpoena. He cites Amazon’s Alexa/Ring histories as cautionary tales and notes OpenAI’s recent move to ads as a sign of where cloud-first models trend.

  • Tech claim: The edge stack is “ready now.” A small, fanless box can run real-time STT, memory, reasoning, and TTS with acceptable quality for home tasks. Remaining issues are framed as memory architecture and context handling, not model size.

  • The product: Juno’s Pioneer Edition—positioned as a local-first, always-listening home assistant—opens for pre-orders. The post doubles as a manifesto for on-device AI over ad-backed cloud assistants.

Why it matters: If “ambient AI” is inevitable, the business model and deployment architecture will decide who controls the feed from your life—users or advertisers. This piece argues a new category—local AI appliances—may be the only trustworthy path.

The Discussion The comment thread centers on the tension between Juno’s privacy promises and the inherent risks of "always-on" recording, with the creator (jhsz) actively addressing technical concerns.

  • Legal & Physical Risks: Users argued that "local-only" does not equal "immune to law enforcement." Commenters like zmmmmm and pxys noted that unless the device is legally and technically resistant to compelled decryption (e.g., warrants), a box full of intimate family conversations is a liability. There were also concerns about what happens if the hardware is stolen or if Juno is acquired by a data-hungry conglomerate.
  • The Creator's Defense: jhsz engaged with critics, explaining that the system is designed to minimize long-term raw data storage (tuning memory to extract facts and "forget" the audio quickly) and relies on hardware-encrypted storage (Nvidia Jetson). They argued that while no solution is magical, data "inside the walls" is fundamentally safer than data in a corporate cloud.
  • Target Audience Mismatch: Several users (BoxFour, bndrm) pointed out a contradiction in the pitch: the specific demographic that cares enough about privacy to buy a local AI appliance is often the same demographic that refuses to have any always-on microphones in their home, regardless of architecture.
  • Philosophical Pushback: The debate extended to the utility of "perfect memory." Citing Borges’ Funes the Memorious, some users questioned whether an AI that remembers every interaction is a helpful tool or a dystopian burden, suggesting that human forgetfulness is a feature, not a bug.

The path to ubiquitous AI (17k tokens/sec)

Submission URL | 791 points | by sidnarsipur | 429 comments

Taalas: turning models into chips for 10x faster, cheaper inference; launches hard‑wired Llama 3.1 8B at 17K tok/s

What’s new

  • Taalas claims it can convert any AI model into custom silicon in ~2 months, producing “Hardcore Models” that are ~10x faster, ~20x cheaper to build, and ~10x lower power than GPU-based inference.
  • First product: HC1 chip hard‑wiring Llama 3.1 8B, offered as a chatbot and inference API. Claimed performance: 17,000 tokens/sec per user (1k/1k input/output), outpacing Nvidia H200/B200 baselines and specialist stacks (Groq, SambaNova, Cerebras) in their chart.
  • Design philosophy:
    • Total specialization: per‑model ASICs for maximum efficiency.
    • Merge storage and compute: single chip at DRAM‑level density to eliminate HBM, massive I/O, advanced packaging, liquid cooling.
    • Radical simplification: simpler systems = much lower total cost.
  • Flexibility: configurable context window; supports fine‑tuning via LoRA. Gen‑1 uses aggressive 3‑/6‑bit quantization (some quality hit). Gen‑2 (HC2) moves to standard 4‑bit floating point with higher density/speed.
  • Roadmap: mid‑sized reasoning LLM on HC1 this spring; “frontier” LLM on HC2 targeted for winter.
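
To make the "some quality hit" from low-bit quantization concrete, here is a minimal sketch of symmetric uniform quantization at 3 and 6 bits. It does not reproduce Taalas's actual scheme, which the announcement does not describe; the weight values and the `quantize` helper are illustrative assumptions.

```python
def quantize(weights, bits):
    """Symmetric uniform quantization: snap each weight to the nearest
    of the signed integer levels representable in `bits` bits."""
    levels = 2 ** (bits - 1) - 1              # 3 levels per sign at 3-bit, 31 at 6-bit
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]


def mean_abs_error(original, approx):
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


# Made-up weights, for illustration only.
weights = [0.82, -0.41, 0.13, -0.97, 0.05, 0.66, -0.28, 0.39]

err3 = mean_abs_error(weights, quantize(weights, 3))
err6 = mean_abs_error(weights, quantize(weights, 6))
print(f"3-bit mean abs error: {err3:.4f}")
print(f"6-bit mean abs error: {err6:.4f}")
```

Fewer bits mean coarser levels and a larger rounding error per weight, which is the trade-off behind the quality caveat.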

Why it matters

  • If these numbers hold up, inference latency and cost could drop by an order of magnitude—key for real‑time agents and large‑scale deployment without ballooning data centers.
  • It’s a bolder bet than general accelerators (e.g., Groq, Cerebras): a per‑model ASIC pipeline that trades flexibility for speed/efficiency, mitigated by a fast 2‑month turnaround and LoRA support.

Caveats/questions

  • “Compute + DRAM‑density storage on one chip without exotic tech” is a big claim; details on process, concurrency, and memory architecture are sparse.
  • Hard‑wiring weights risks rapid obsolescence as models evolve; economics hinge on how often new ASICs are spun and how broadly each model is adopted.
  • Benchmark framing (tokens/sec per user, aggressive quantization) may mask throughput/quality trade‑offs; quality metrics vs GPU baselines aren’t shown.

How to try

  • A beta chatbot demo and inference API for the HC1 Llama 3.1 8B are live.

Based on the discussion, here is a summary of the comments:

Technical Speculation & Credibility

  • Commenters analyzed the likely architecture, theorizing that Taalas is using specialized "mask ROM" fabrication where weights are stored via physical transistor traits (e.g., drive strength) rather than traditional memory cells. If so, this would enable single-transistor, 4-bit multiplication and massive density on mature nodes like TSMC 6nm.
  • While some were skeptical of the claimed 2-month "tape-out" turnaround, others noted the founders’ pedigree (veterans from Nvidia, AMD, and Tenstorrent), suggesting they have the specific expertise and connections to pull off such complex VLSI feats.

Performance vs. The Status Quo

  • Users distinguished Taalas’s metrics from Nvidia’s. While an H200 achieves high aggregate throughput via massive batching (at the cost of high latency per user), Taalas appears to offer massive throughput at very low latency (milliseconds) for single users.
  • The 17k tokens/second speed was described as a legitimate "quantitative change leading to a qualitative change," enabling real-time voice, video generation, and agentic workflows that current GPUs cannot handle efficiently.
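
The per-user framing can be checked with simple arithmetic. This sketch converts a per-user decode rate into the time needed to emit a 1,000-token reply. The 17,000 tokens/sec figure is Taalas's claim; the baseline rate is a hypothetical placeholder, not a measured GPU number, and prompt processing time is ignored.

```python
def generation_time_ms(output_tokens, tokens_per_sec):
    """Time to emit `output_tokens` at a steady per-user decode rate."""
    return 1000.0 * output_tokens / tokens_per_sec


CLAIMED_HC1_RATE = 17_000    # tokens/sec per user, as claimed by Taalas
HYPOTHETICAL_BASELINE = 150  # tokens/sec per user, placeholder for comparison only

hc1_ms = generation_time_ms(1000, CLAIMED_HC1_RATE)
baseline_ms = generation_time_ms(1000, HYPOTHETICAL_BASELINE)

print(f"1,000 tokens at claimed HC1 rate:  {hc1_ms:.1f} ms")
print(f"1,000 tokens at placeholder rate:  {baseline_ms:.1f} ms")
```

At the claimed rate a full 1,000-token answer takes under 60 ms, which is why commenters treated the speed as a change in kind rather than degree.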

Strategic Use Cases: Speculative Decoding

  • A significant portion of the debate focused on using these chips for speculative decoding. Instead of just being a standalone chatbot, the Taalas chip could serve as a "draft model" to rapidly generate tokens that are then verified by a larger frontier model, significantly speeding up the total inference of massive models.
  • Caveats were raised regarding tokenizer compatibility and whether the rigid nature of hard-wired chips allows them to pair effectively with evolving frontier models in this way.
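
The draft-and-verify loop described above can be sketched in its simplest form, greedy verification (production systems typically use a sampling-based acceptance rule instead). Both "models" below are toy stand-in functions, not real LLMs, and token ids are compared directly, which assumes the two models share a tokenizer, the compatibility caveat raised in the thread.

```python
def greedy_decode(next_token, prompt, max_new):
    """Reference: decode with a single model, one token per call."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(target_next, draft_next, prompt, max_new, k=4):
    """Greedy speculative decoding. Returns (tokens, verification_rounds).
    The output is identical to greedy decoding with the target alone."""
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < max_new:
        # 1. The fast draft model proposes k tokens.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target checks the proposal. A real system scores all k
        #    positions in one batched forward pass; we count one round.
        rounds += 1
        accepted = []
        for tok in proposal:
            expected = target_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # target's token replaces the first miss
                break
        else:
            # Every draft token matched; the same pass yields one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new], rounds


# Toy stand-ins: the target counts upward mod 10; the draft errs after a 2.
def target_next(ctx):
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


out, rounds = speculative_decode(target_next, draft_next, [0], max_new=12)
reference = greedy_decode(target_next, [0], max_new=12)
print(out == reference, rounds)
```

The speedup comes from the number of target rounds being smaller than the number of tokens produced; how much smaller depends on how often the draft agrees with the target.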

Risks & Environmental Concerns

  • Obsolescence: The primary economic concern is whether a model remains relevant long enough to justify a dedicated hardware run. However, some argued that specific domains (robotics, control systems, basic coding) are stable enough to benefit from frozen, highly efficient models.
  • E-waste: Critics worried about the environmental impact of manufacturing chips that become useless once the model weights are outdated, comparing them to single-purpose ASIC miners or disposable tech.

Pi for Excel: AI sidebar add-in for Excel

Submission URL | 104 points | by rahimnathwani | 29 comments

Pi for Excel: an open-source, multi-model AI sidebar that can read and edit your spreadsheets

What it is

  • A Microsoft Excel add-in that embeds an AI agent directly in a sidebar. It can read your workbook, make changes, explain formulas, search the web, and run task-specific “skills.”
  • Bring-your-own model: works with Anthropic (Claude), OpenAI, Google Gemini, GitHub Copilot, or any OpenAI-compatible endpoint. You can switch models mid-conversation.
  • MIT-licensed. Repo: https://github.com/tmustier/pi-for-excel — Demo/site: https://pi-for-excel.vercel.app

Why it’s interesting

  • Deep Excel tooling: 16 built-in actions the AI can call, including read/write ranges, fill formulas, trace dependencies, explain formulas, search across sheets, modify structure, apply formatting, manage comments, and even restore from automatic backups.
  • Auto-context: before each turn the model gets a blueprint of your workbook, your current selection, and recent edits—so you don’t have to describe what you’re looking at.
  • Safety and control: write operations have overwrite protection and auto-verification, with one-click revert via checkpoints.
  • Extensible: install sandboxed sidebar extensions the AI can generate for you; integrates optional web search (Serper/Tavily/Brave) and an MCP gateway to connect custom tools.
  • Power-user extras (behind /experimental): tmux bridge for local terminal control, Python/LibreOffice bridge, external skills discovery, stricter extension permissions.

How to use

  • Install by sideloading the manifest (macOS and Windows instructions in the README). Click “Open Pi” in Excel’s ribbon.
  • Connect a provider via API key or OAuth, or point it at a custom OpenAI-compatible gateway.
  • Try prompts like “What sheets do I have?” or “Summarize my current selection,” then ask it to fill formulas or format ranges per your house style.

Dev setup

  • Node 20 + mkcert for local HTTPS (Office.js requirement), Vite dev server, sideload the dev manifest. Quick-start steps are in the repo.

Caveats

  • As with any LLM-powered Excel agent, workbook data you share may be sent to your chosen model/provider. Good fit if you want Copilot-like assistance but prefer open-source and BYO keys.

Here is a summary of the discussion:

  • Origins and Inspiration: The author (tmstr) joined the thread to explain that the project was built to bring the spirit of the open-source coding agent Pi to Excel. They noted that the tool uses a virtual filesystem and wraps specific "skills" to interact with the spreadsheet.
  • Web Compatibility: Users confirmed the add-in works on the web version of Excel (allowing use on Linux), provided the manifest is sideloaded via localhost. However, users noted that sideloaded dev-mode add-ins typically disappear after a week.
  • Technical Constraints: There was discussion regarding context window limits when handling large datasets. The author acknowledged that large tables currently overflow the context when passing results to the LLM, but mentioned optimizations and a potential Python bridge to handle larger sheets in the future.
  • Deployment Headaches: While users praised the modern Office JS API for functionality, several lamented the distribution process, describing the MS App Store and side-loading requirements as significant hurdles compared to the actual coding.
  • Action vs. Text: Users verified that the agent can perform functional tasks—such as creating charts and formatting tables—rather than just answering questions, with one user comparing the desired utility to an AI graphic designer that performs steps in Photoshop.
  • Risks: One user warned about API bans, claiming their Google account was banned after 48 hours of heavy usage with a similar tool, highlighting the risks of hitting consumer AI endpoints with automated tasks.
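
The context-overflow problem noted under Technical Constraints is commonly handled by trimming tool results to a budget before they reach the model. The sketch below shows one such mitigation; it is not taken from the Pi for Excel codebase, and the `fit_rows_to_budget` helper and the four-characters-per-token estimate are assumptions made for illustration.

```python
def fit_rows_to_budget(header, rows, max_tokens, chars_per_token=4):
    """Keep the leading rows that fit a rough token budget and append a
    note saying how many rows were left out. Returns (text, omitted)."""
    def cost(cells):
        line = "\t".join(str(c) for c in cells)
        return len(line) // chars_per_token + 1   # crude token estimate

    used = cost(header)
    kept = []
    for row in rows:
        row_cost = cost(row)
        if used + row_cost > max_tokens:
            break
        kept.append(row)
        used += row_cost

    omitted = len(rows) - len(kept)
    lines = ["\t".join(str(c) for c in header)]
    lines += ["\t".join(str(c) for c in row) for row in kept]
    if omitted:
        lines.append(f"... {omitted} more rows omitted")
    return "\n".join(lines), omitted


rows = [[i, i * i] for i in range(1000)]
text, omitted = fit_rows_to_budget(["n", "n_squared"], rows, max_tokens=50)
print(text.splitlines()[-1])
```

Telling the model how many rows were dropped lets it ask for a narrower range instead of silently reasoning over partial data.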

Consistency diffusion language models: Up to 14x faster, no quality loss

Submission URL | 215 points | by zagwdt | 96 comments

Consistency Diffusion Language Models (CDLM) promise big speed gains for diffusion LMs without hurting quality. The team from SNU, UC Berkeley, and Together AI reports up to 14.5x lower latency on math and coding tasks by making diffusion decoding both more parallel and cache-friendly.

Why this matters

  • Diffusion LMs refine masked text over multiple steps and can finalize multiple tokens per iteration, but in practice they’re slow: full bidirectional attention blocks KV caching, and many refinement steps are needed for quality.
  • CDLM tackles both, pushing diffusion LMs closer to practical, high-throughput deployment while keeping their advantages (bidirectional context, infilling, refinement).

What’s new

  • Exact block-wise KV caching for diffusion LMs: Train a student with a block-causal mask (attends to prompt, prior blocks, and current block) distilled from a fully bidirectional teacher. This preserves quality while enabling standard KV reuse for finished blocks.
  • Reliable multi-token finalization: Within each block, the model confidently finalizes several tokens in parallel, reducing the number of refinement steps without degrading output.
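
A block-causal mask of the kind described above can be sketched as follows. The layout is an assumption for illustration: the prompt is treated as one block that attends only to itself, and the paper's implementation may differ in detail.

```python
def block_causal_mask(prompt_len, gen_len, block_size):
    """mask[i][j] is True when position i may attend to position j.
    Each position sees the prompt, all earlier blocks, and its own block."""
    total = prompt_len + gen_len

    def block_id(pos):
        # The prompt is block 0; generated blocks are numbered 1, 2, ...
        if pos < prompt_len:
            return 0
        return 1 + (pos - prompt_len) // block_size

    return [[block_id(j) <= block_id(i) for j in range(total)]
            for i in range(total)]


mask = block_causal_mask(prompt_len=2, gen_len=4, block_size=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Because no position attends to a later block, the keys and values of a finished block never change, which is what makes exact KV reuse possible.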

How it works (post-training recipe)

  • Trajectory collection: Run a strong bidirectional DLM as teacher to record token-by-token refinement trajectories and hidden states under a conservative setup (e.g., 256-token generations in blocks of 32, one token finalized per step) to capture high-quality signals.
  • Train a block-causal student with three losses:
    • Distillation on newly unmasked tokens (match teacher’s reconstructed distributions).
    • Consistency on still-masked tokens (align intermediate predictions with block-complete predictions via stop-gradient).
    • Auxiliary masked-denoising (retain general masked-token and reasoning ability).
  • Inference: Decode block-wise autoregressively with exact KV reuse for prompt and finished blocks; within a block, finalize tokens in parallel using a confidence threshold; early stop on EOS.
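
The within-block step of the inference recipe, finalizing every masked token whose confidence clears a threshold, can be sketched like this. The predictor is a toy stand-in for the student model, the fallback of finalizing the single most confident token when none clears the threshold is an assumption added so the loop always terminates, and early stopping on EOS is omitted.

```python
MASK = None  # placeholder for a not-yet-finalized position


def decode_block(predict, block_size, threshold=0.9):
    """Unmask one block. `predict(block)` returns a (token, confidence)
    pair per position. Returns (block, refinement_steps)."""
    block = [MASK] * block_size
    steps = 0
    while MASK in block:
        preds = predict(block)
        steps += 1
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        confident = [i for i in masked if preds[i][1] >= threshold]
        if not confident:
            # Assumed fallback: finalize the single best guess.
            confident = [max(masked, key=lambda i: preds[i][1])]
        for i in confident:
            block[i] = preds[i][0]
    return block, steps


TARGET = list("abcdefgh")


def toy_predict(block):
    """Toy model: confident about a position once its left context is known."""
    preds = []
    for i in range(len(block)):
        if i == 0 or block[i - 1] is not MASK:
            conf = 0.95
        elif i == 1 or block[i - 2] is not MASK:
            conf = 0.92
        else:
            conf = 0.30
        preds.append((TARGET[i], conf))
    return preds


parallel_block, parallel_steps = decode_block(toy_predict, 8, threshold=0.9)
serial_block, serial_steps = decode_block(toy_predict, 8, threshold=0.99)
print(parallel_steps, serial_steps)
```

With the lower threshold the toy model finalizes two tokens per step; with the stricter one it falls back to one token per step, which mirrors the conservative teacher setup.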

Results

  • On Dream-7B-Instruct (a diffusion LM), CDLM achieves up to 14.5x latency speedups on math/coding benchmarks with comparable quality, using far fewer refinement steps.
  • No extra heuristic knobs required; the gains come from the caching-compatible mask plus consistency-based multi-token finalization.

Takeaway: CDLM is a clean, post-training path to make diffusion language models fast enough to compete with autoregressive LMs in interactive settings, keeping diffusion’s bidirectional strengths while unlocking KV caching and parallel token finalization. Caveats to watch: the need for offline teacher trajectories and how well gains generalize beyond math/coding and the tested model size.

Here is a summary of the discussion:

Technical Mechanisms and "Drafting" vs. Autoregression

Users discussed the fundamental differences between Consistency Diffusion Language Models (CDLM) and standard Autoregressive (AR) models. bpp and others used analogies (referencing "British munchkin cats" and "inserting a refrigerator into a kitchen") to illustrate how diffusion models handle global context. Unlike AR models that generate tokens linearly (left-to-right), diffusion models treat text generation more like editing a draft, modeling the probability of the entire output structure simultaneously. This allows for infilling and correcting "invalid" structures that AR models struggle with. There was brief debate regarding hybrid AR+Diffusion models (citing LLaDA), with some analyzing whether combining them risks losing the reasoning benefits of pure diffusion.

Model Sizing, Efficiency, and Data Quality

The conversation shifted significantly when MASNeo expressed a desire for researchers to build larger models. wngrs and others countered that frontier models likely haven't grown substantially in parameter size over the last year or two. Instead, the industry focus has shifted to efficiency and data quality. mgclhpp cited technical reports from Qwen, noting that cleaner training data and longer pre-training allow smaller models (e.g., Qwen 2.5/3) to rival or outperform older, larger models. Users suggested the industry is moving toward "density" and "singularity": highly efficient, smaller models correcting each other in parallel, rather than monolithic giants.

Pricing Conspiracies and "Enshittification"

A thread of skepticism emerged regarding the business practices of major AI labs (OpenAI, Anthropic, Google). Users speculated that the lack of public parameter counts allows companies to maintain high pricing tiers ($20-$200/month) while actually serving cheaper, optimized models. rthmsthms and bdbdbdb argued that speed improvements often look suspiciously like "rebranded" models or silent downgrades to cut compute costs (enshittification), effectively keeping margins high while technological costs drop. Specific grievances were aired about confusing pricing and performance differences between versions such as Claude Sonnet 3.5 vs. 3.6 and Opus.

Nvidia and OpenAI abandon unfinished $100B deal in favour of $30B investment

Submission URL | 294 points | by zerosizedweasle | 325 comments

Nvidia and OpenAI reportedly ditch $100B mega-deal, pivot to $30B investment

What happened

  • The Financial Times reports that Nvidia and OpenAI have abandoned an unfinished deal said to be worth around $100 billion, opting instead for a roughly $30 billion investment. The detailed terms aren’t public.

Why it matters

  • A move from a single, gigantic commitment to a smaller one suggests both sides prefer staged, flexible financing and capacity build-out over an all-in mega arrangement.
  • It could temper near-term expectations for OpenAI’s dedicated compute scale-up and keep Nvidia’s options open on who gets priority access to its next-gen GPUs.
  • The shift may also reflect practical constraints (supply chains, regulatory scrutiny, financing costs) and a desire to avoid heavy concentration risk.

What to watch

  • How the $30B is structured (equity stake, JV, supply pre-pays, or a mix).
  • Any exclusivity or priority-allocation clauses for Nvidia hardware.
  • Knock-on effects for OpenAI’s cloud partnerships and training timelines.
  • Whether other financiers or hyperscalers step in to fill the gap.

Based on the discussion, here is a summary of the community's reaction:

OpenAI’s "WeWork Moment" and the IPO Rush

Much of the conversation draws a sharp comparison between OpenAI and WeWork. Commenters suggest OpenAI (and Anthropic) may be rushing toward an IPO to capitalize on "unlimited AI hype" before the market scrutinizes their lack of a clear path to profitability. Users expressed skepticism about the company's cost structure, viewing the pivot to a smaller investment deal as a potential signal that the "blank check" era of funding is ending.

Nvidia: Enron, Cisco, or Just a Cyclical Bust?

Speculation regarding Nvidia’s future was polarized. Some questioned if Nvidia could become "Enron 2.0" (referencing fraud), though this was largely dismissed because Nvidia sells tangible, high-demand hardware. A more popular comparison was Cisco or Sun Microsystems during the Dotcom boom: highly profitable companies that faced a massive correction when the bubble burst. There was debate over whether Nvidia could easily pivot back to selling GPUs to gamers if enterprise demand dries up, with some arguing that retreating from high-margin datacenter chips would be a painful restructuring.

Tangible Assets vs. Model Weights

A sub-thread contrasted OpenAI with SpaceX. Participants noted that SpaceX has tangible assets (rockets, Starlink satellites, launch infrastructure) and a clear moat, whereas OpenAI faces high operating costs with a product (LLMs) that some view as lacking the same defensive "hard tech" barriers.

Hype Fatigue and Skepticism

Calculated skepticism regarding AGI (Artificial General Intelligence) is growing. Commenters noted that aggressive predictions about AI replacing white-collar work or programmers by 2024/2025 are already "aging like milk." References to Ed Zitron (a tech critic who predicted investment scale-backs) were shared, suggesting that while the technology is useful, the financial expectations attached to it may be decoupling from reality.

'A Big Fuck You to Big Tech': New Jersey Residents Defeat AI Data Center

Submission URL | 53 points | by abdelhousni | 18 comments

Headline: New Jersey city kills AI data center, chooses public park instead

Summary: The New Brunswick, NJ City Council voted to cancel a planned 27,000-square-foot AI data center at 100 Jersey Ave and will add a public park to a redevelopment that already includes 600 apartments (10% affordable) and small-business warehouses. Hundreds packed the meeting to oppose the project, citing fears of higher electricity and water bills and environmental impacts. Local groups, including the NAACP, argued the facility would drain community resources. After the vote, residents celebrated; organizers framed the decision as prioritizing neighborhoods over Big Tech.

Why it matters:

  • Signals rising local pushback to AI infrastructure over energy and water use, grid strain, and limited community benefits.
  • May push data center developers toward stronger community benefit agreements, better transparency on resource use, and siting in areas with surplus power/water or on-site renewables.
  • Highlights tension between tech-driven development, housing affordability, and public amenities.

What to watch:

  • Whether developers appeal or relocate, and if future proposals include stronger environmental mitigations or utility guarantees.
  • If other municipalities adopt stricter zoning or disclosure requirements for AI/data center projects.
  • How New Brunswick funds and executes the park addition alongside the broader redevelopment.

Source: Common Dreams (Brett Wilkins), Feb 20, 2026.

The Discussion

The Hacker News discussion moves beyond New Brunswick's specific decision to a broader critique of the technology sector's relationship with the public and the power grid.

  • Tech Fatigue: Several users expressed that "Big Tech" has lost the "cool" factor it possessed during the 90s and 2000s (likening the old vibe to "sk8erboi culture"). Commenters described current tech giants as "soulless," "sanitized," and overly integrated into daily life, viewing them now as "the establishment" rather than exciting innovators.
  • Data Center Value: Participants debated the local value of data centers, describing them as "big ugly boxes" that employ few people compared to the space they occupy. One user contrasted them with nuclear plants which, despite risks, historically evoked a sense of national industrial pride that server farms lack.
  • The Energy Dilemma: A significant portion of the thread focused on whether cities could leverage data center demand to fund new power infrastructure (solar, battery, or nuclear) rather than banning them.
    • Skepticism of Commitments: Critics argued that even if cities demand new power generation as a prerequisite, "energy is fungible" and corporations will use legal loopholes, subsidiaries, or financial derivatives to avoid delivering actual local capacity while still causing grid congestion.
    • Implementation Issues: Others noted that solar requires expensive battery storage to be viable for 24/7 uptime and nuclear takes 15+ years to build, leading to fears that companies will simply revert to gas turbines or that costs will be "dumped onto regular people."
    • Regulatory Solutions: Ideas were floated to strictly regulate business power rates or use performance bonds (seized if timelines aren't met) to enforce infrastructure promises, though skeptics maintained that a "blanket refusal" is often the only safe move for a municipality.