ChatGPT Containers can now run bash, pip/npm install packages and download files
ChatGPT’s code sandbox just got a quiet but major upgrade, per Simon Willison’s testing
- What’s new: Containers can now run Bash directly, not just Python. They also execute Node.js/JavaScript and successfully ran “hello world” in Ruby, Perl, PHP, Go, Java, Swift, Kotlin, C, and C++ (no Rust yet).
- Package installs: pip install and npm install now work via a custom proxy, despite the container lacking general outbound network access (a quick probe sketch follows this list).
- Fetching files: A new container.download tool saves files from the web into the sandboxed filesystem. It only downloads URLs that were explicitly surfaced in the chat (e.g., via web.run), reducing prompt-injection exfiltration risks.
- Safety notes: Attempts to craft arbitrary URLs with embedded secrets were blocked—container.download requires the URL to have been “viewed in conversation,” and web.run filters long/constructed query strings. Downloads appear to come from an Azure IP with a ChatGPT-User/1.0 UA.
- Availability: Willison says these capabilities show up even on free accounts; documentation and release notes are lagging.
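A minimal probe you could paste into the sandbox to check this yourself. It assumes only a working Python interpreter; the tool names mirror the list above, and anything not found is simply reported as missing rather than treated as an error:

```python
# Hedged sketch: probe which toolchains the ChatGPT container exposes on PATH.
# Assumes only a Python interpreter; tool names mirror the languages listed above.
import shutil
import subprocess

for tool in ["bash", "node", "ruby", "perl", "php", "go", "java",
             "swift", "kotlinc", "gcc", "g++", "rustc"]:
    print(f"{tool:8} -> {shutil.which(tool) or 'not found'}")

# Package installs reportedly route through a proxy despite no general egress.
subprocess.run(["pip", "install", "--quiet", "cowsay"], check=False)
print(shutil.which("cowsay") or "pip install did not land on PATH")
```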
Why it matters: This turns ChatGPT’s sandbox into a far more capable dev environment—able to script with Bash, pull data files, and install ecosystem packages—making it much better at real-world coding, data wrangling, and agentic workflows, while keeping some guardrails against data exfiltration.
The Discussion
- Expanded Capabilities & Language Support: Users confirmed the sandbox is available to free accounts and validated support for languages like D (via the DMD compiler), though C# appears mostly absent due to .NET framework constraints. One commenter described the experience of watching the AI use a computer to answer questions as having a "Pro Star Trek" feel, contrasting it with the loops and errors often encountered with Gemini.
- CLI Abstraction vs. Mastery: A debate emerged over the efficiency of using LLMs for standard *nix tasks. While some purists argued that using an LLM to inspect file headers (magic bytes) is wasteful compared to plain CLI commands like file or curl (a magic-bytes sketch follows this list), Simon Willison and others countered that this democratizes powerful tools (like ffmpeg or ImageMagick) for the 800 million users who aren't comfortable in a terminal, while also saving experienced developers from memorizing complex flag syntax.
- The Return of the "Mainframe": The upgrade sparked speculation about the shift toward persistent, virtual development environments. Commenters noted that tool-calling is moving off-platform, with some likening the trend to "renting time on a mainframe" to avoid local hardware maintenance. Willison noted that ephemeral environments are particularly well-suited for coding agents, as they mitigate security risks—if an environment is trashed or leaked, it can be discarded and restarted.
- Unreleased Features: Sharp-eyed users noticed "Context: Gmail" and "read-only" tags in the application logs, leading to speculation (confirmed by news leaks) that deep integration with Gmail and Calendar is currently in testing.
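On the "just use the CLI" side of that debate, checking magic bytes without an LLM really is a few lines. The signature table below is illustrative and far from exhaustive (the real file command knows thousands of formats):

```python
# Minimal magic-byte sniffing, the kind of check `file` does far more thoroughly.
# The signature table is illustrative, not exhaustive.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xff\xd8\xff": "JPEG image",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive (also docx/xlsx/jar)",
    b"\x7fELF": "ELF executable",
}

def sniff(path: str) -> str:
    with open(path, "rb") as f:
        header = f.read(16)
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return "unknown (fall back to `file` or python-magic)"

print(sniff("example.bin"))  # hypothetical path
```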
There is an AI code review bubble
What’s new: Greptile’s Daksh Gupta argues we’re in the “hard seltzer era” of AI code review—everyone’s shipping one (OpenAI, Anthropic, Cursor, Augment, Cognition, Linear; plus pure-plays like Greptile, CodeRabbit, Macroscope). Rather than claim benchmark superiority, Greptile stakes its bet on a philosophy built around:
- Independence: The reviewer should be different from the coder. Greptile refuses to ship codegen—separation of duties so the agent that writes code isn’t the one approving it.
- Autonomy: Code validation (review, tests, QA) should be fully automated with minimal human touch. Greptile positions itself as “pipes,” not a UI.
- Feedback loops: Coding agent writes, validation agent critiques and blocks, coding agent fixes, repeat until the reviewer approves and merges. Their Claude Code plugin can auto-apply Greptile’s comments and iterate.
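Stripped down, that feedback loop is a bounded iterate-until-approved protocol. Here is a self-contained sketch with stub agents; none of this is Greptile's or Claude Code's actual API, just the shape of the loop:

```python
# Illustrative write -> review -> fix loop; the "agents" are stubs, not real APIs.
from dataclasses import dataclass, field

@dataclass
class Verdict:
    approved: bool
    comments: list[str] = field(default_factory=list)

def coding_agent(task: str, feedback: list[str]) -> str:
    # Stub: a real coding agent would produce a patch addressing the feedback.
    return f"patch for {task!r} after {len(feedback)} review comments"

def review_agent(patch: str, round_no: int) -> Verdict:
    # Stub: a real reviewer would run tests/QA and block until the patch is clean.
    if round_no < 2:
        return Verdict(approved=False, comments=["add tests", "tighten error handling"])
    return Verdict(approved=True)

def review_loop(task: str, max_rounds: int = 5) -> str:
    feedback: list[str] = []
    for round_no in range(max_rounds):
        patch = coding_agent(task, feedback)     # coding agent writes
        verdict = review_agent(patch, round_no)  # independent reviewer critiques
        if verdict.approved:
            return patch                         # reviewer approves, merge proceeds
        feedback = verdict.comments              # blocking comments go back to the coder
    raise RuntimeError("review loop did not converge; escalate to a human")

print(review_loop("add rate limiting to the API"))
```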
Why it matters:
- If agents start auto-approving most changes, separation of duties becomes a compliance and safety necessity.
- Review tools could shift from “assistive UIs” to background infrastructure with audit trails and merge authority.
- Switching costs are high for review systems, so vendor choice may be sticky.
Claims and context:
- Greptile cites enterprise traction (including two “Mag7” customers) and recent updates: Greptile v3 (agentic workflow, higher acceptance rates and better signal ratios), long‑term memory, MCP integrations, scoped rules, and lower pricing.
- Caveat: performance claims are vendor-reported; teams may weigh the trade-off between independence and an integrated, single-agent DX.
If you’re evaluating:
- Can the reviewer run tests/QA autonomously and block/merge with guardrails?
- Is there strict separation from your codegen agent?
- Does it support closed-loop remediation, audit logs, and blast-radius controls (permissions, rollout, revert)?
- How painful is it to rip out later (repo coverage, CI/CD hooks, policy/rule migration)?
Based on the discussion, here is a summary of the comments regarding AI code review tools and code review philosophy:
Prompt Engineering & Implementation Strategies
User jv22222 shared detailed insights from building an internal AI review tool, suggesting specific prompting strategies to improve results (a prompt-assembly sketch follows this list):
- Context is King: The AI needs the full file contents of changed files plus "1-level-deep imports" to understand how changed code interacts with the codebase.
- Diff-First, Context-Second: Prompts should explicitly mark diffs as "REVIEW THIS" and surrounding files as "UNDERSTANDING ONLY" to prevent hallucinations or false positives on unchanged code.
- Strict Constraints: Use negative constraints to reduce noise. Explicitly tell the AI not to flag formatting (let Prettier handle it), TypeScript errors (let the IDE handle it), or guess line numbers.
- Structure: Force structured output (e.g., emoji-prefixed bullet lists) categorized by severity (Critical, Major, Minor).
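A rough sketch of how those four rules might come together in a prompt builder; the section markers and wording are illustrative, not jv22222's actual prompts:

```python
# Illustrative prompt builder applying the four rules above; markers and wording
# are assumptions, not the commenter's actual prompts.
def build_review_prompt(diff: str,
                        changed_files: dict[str, str],
                        one_level_imports: dict[str, str]) -> str:
    context = "\n\n".join(
        f"--- {path} (UNDERSTANDING ONLY, do not review) ---\n{source}"
        for path, source in {**changed_files, **one_level_imports}.items()
    )
    return "\n\n".join([
        "You are reviewing a pull request.",
        f"=== REVIEW THIS DIFF ===\n{diff}",
        f"=== CONTEXT, FOR UNDERSTANDING ONLY ===\n{context}",
        "Constraints: do NOT flag formatting (Prettier handles it), "
        "do NOT flag TypeScript errors (the IDE handles it), "
        "and do NOT guess line numbers.",
        "Output: emoji-prefixed bullet lists grouped by severity "
        "(Critical, Major, Minor). Skip empty sections.",
    ])

# Example with hypothetical inputs:
prompt = build_review_prompt(
    diff="- return items.length\n+ return items.length - 1",
    changed_files={"src/cart.ts": "export function count(items) { /* ... */ }"},
    one_level_imports={"src/types.ts": "export interface Item { /* ... */ }"},
)
print(prompt[:200])
```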
The Signal-to-Noise Problem
Several users expressed skepticism about the current state of AI reviewers, citing a poor signal-to-noise ratio.
- One user estimated their experience as "80% noise."
- The danger is that if the AI floods the PR with speculative or trivial comments, humans will tune it out, resulting in the "20%" of critical bugs slipping through because the human bottleneck is attention.
- The consensus among skeptics is that AI tools are currently useful as a "second set of eyes" but cannot yet be trusted as a default gatekeeper or autonomous agent.
Edit-Based vs. Comment-Based Reviews
A significant portion of the thread digressed into a discussion on Jane Street’s internal review system (Iron), sparked by a comment about how reviewers should fix trivial issues rather than commenting on them.
- The Philosophy: Instead of leaving nitpick comments (which can feel insulting and slow down the loop), reviewers should simply apply the changes directly.
- The Friction: Users noted that standard GitHub workflows make this difficult (checkout, edit, push, context switch), whereas internal tools like Iron or specific VS Code plugins streamline "server-side" editing.
- Consensus: "Ping-ponging" comments over variable names or minor logic is a massive productivity killer; fixing it directly is often preferred by senior engineers.
The "Bikeshedding" & Naming Debate
The thread debated whether comments on variable names are valuable or a waste of time ("bikeshedding").
- Pro-Naming Comments: If a reviewer finds a name confusing, the code is confusing. The mental model gap must be bridged. One user suggested the default response to a naming suggestion should always be "Done" unless there is a specific, strong reason not to.
- Anti-Naming Comments: Others countered that debating itemCount vs. numberOfItems for 30 minutes is a waste of expensive engineering time. While clarity matters, hyper-focusing on conventions (especially in test function names) often yields diminishing returns.
Porting 100k lines from TypeScript to Rust using Claude Code in a month
- Vjeux (ex-Facebook/Meta engineer) set out to port the open‑source Pokémon Showdown battle engine (JS/TS) to Rust to enable fast AI training loops, using Claude Code as the primary driver.
- The “agentic” setup required a surprising amount of ops hackery: he proxied git push via a local HTTP server (since the sandbox blocked SSH), compiled and ran inside Docker to avoid antivirus prompts, and used automation (keypress/paste loops and an auto-clicker) to keep Claude running unattended for hours.
- Early naïve conversion looked impressive (thousands of compiling lines) but hid bad abstractions: duplicated types across files, “simplified” logic where code was hard, and hardcoded patches that didn’t integrate.
- The fix was to impose structure and determinism: generate a script that enumerates every JS file/method and mirrors them in Rust, with source references inline (a toy version of this scaffold follows this section). This grounded the port and reduced hallucinations and mistranslations.
- Key lesson: AI can churn volume, but you must tightly constrain it with deterministic scaffolding, clear mappings, and checks; otherwise it drifts into inconsistent designs.
- Reliability matters for long runs: he hit intermittent failures overnight and notes that agent platforms still need better stability and permission models for unattended work.
- Big picture: with enough guardrails and orchestration, one engineer can push a six‑figure‑line port in weeks—but the real work is designing the rails that keep the model honest and productive.
Takeaway: AI was the tireless junior dev; success came from turning the task into a scripted, verifiable pipeline rather than a free‑form translation exercise.
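A toy version of that deterministic scaffold, purely to show the shape of the approach: walk the TS sources, enumerate function declarations, and emit Rust stubs whose comments point back at the originals. The regex, naming convention, and paths are assumptions, not vjeux's actual script:

```python
# Toy scaffold generator: enumerate JS/TS functions and mirror them as Rust stubs
# with inline source references. Regex, layout, and paths are illustrative.
import re
from pathlib import Path

FN_RE = re.compile(r"^\s*(?:export\s+)?(?:async\s+)?function\s+([A-Za-z_]\w*)", re.M)

def camel_to_snake(name: str) -> str:
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

def emit_rust_stubs(ts_root: str, out_file: str) -> None:
    lines = ["// AUTO-GENERATED mirror of the TS sources; keep 1:1 with the originals.\n"]
    for path in sorted(Path(ts_root).rglob("*.ts")):
        source = path.read_text(encoding="utf-8")
        for match in FN_RE.finditer(source):
            name = match.group(1)
            line_no = source.count("\n", 0, match.start()) + 1
            lines.append(f"/// Mirrors `{name}` from {path}:{line_no}")
            lines.append(f"pub fn {camel_to_snake(name)}() {{ todo!(\"port {name}\") }}\n")
    Path(out_file).write_text("\n".join(lines), encoding="utf-8")

emit_rust_stubs("pokemon-showdown", "stubs.rs")  # hypothetical paths
```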
Here is a summary of the discussion:
- The "Improvement" Trap: Commenters shared similar war stories, notably a failed attempt to port an Android/libgdx game to WASM. The "agentic" failure mode was identical: the AI refused to do a dumb, line-by-line port and insisted on "improving" the code (simplifying methods, splitting generic
start functions), which immediately introduced layout and physics bugs. The consensus is that legacy code is battle-tested; AI attempts to apply "clean code" principles during a port usually destroy subtle, necessary logic.
- Hidden Prompts vs. User Intent: A key insight proposed is that IDE plugins and agent frameworks often inject system prompts instructing the model to act as an "expert engineer" using "best practices." This creates a conflict: the user wants a literal translation, but the hidden prompt forces the AI into a refactoring mode.
- Anthropomorphism and Consistency: A sub-thread debated the terminology of calling the AI "arrogant." While technically just token prediction based on training data (which includes post-mortems suggesting code cleanup), users noted that treating the model as an inconsistent reasoning engine is dangerous. The "reasoning" output (Chain of Thought) is often just a hallucinated prediction of what an explanation should look like, not a true window into the model's logic.
- Alternative Strategies:
- Port Tests First: Write or port the test suite before the application code; if the AI can make the green lights turn on, the implementation details matter less.
- Source-to-Source Compilers: For mechanical translations, it may be better to have the AI write a chaotic regex/parsing script to do the bulk work deterministically, rather than relying on inference for every line.
- Language Proximity: One user noted a successful 10k line C++ to Rust port using Gemini, largely because the memory models and structures allowed for a straighter translation than the paradigm shift required for TS to Rust.
AI code and software craft
- The author frames today’s AI “slop” (low-effort audio/video/text) through Jacques Ellul’s “technique”: optimizing for measurable outcomes over craft. When engagement and efficiency are the ends, quality and delight erode.
- Music as parable: Bandcamp’s album-centric, human curation model nurtured indie craft; Spotify’s playlist-driven optimization yields bland, metrics-first tracks. In such spaces, AI thrives because “good enough” at scale beats artistry. The author says Bandcamp is hostile to AI, even banning it.
- Software’s parallel: Big Tech work has become “plumbing”—bloated systems, weak documentation, enshittified platforms, and narrow roles that atrophy broad engineering skill. Cites Jonathan Blow’s “Preventing the Collapse of [Software] Civilization”: industry has forgotten how to do things well.
- Two consequences: (1) AI agents threaten rote, narrow coding jobs—where “good enough” code meets business needs. (2) Overgeneralized claims that AI can do “most” software reduce software to mere output, ignoring design, taste, and higher aims.
- Practical verdict on agents: useful for well-scoped, often-solved tasks (tests, simple DB functions). But they hallucinate, lack understanding, and produce brittle or monstrous code when asked to generalize or “vibe code.”
- Implied takeaway: If platforms value the domain (music, software) over metrics, craft can flourish. Cultivate broad, thoughtful engineers and resist collapsing creative work into pure optimization. AI is a tool for the routine, not a replacement for judgment or taste.
Summary of Discussion:
The discussion threads initially pivoted from the article's critique of software to a specific debate on why implementation quality varies so drastically between markets:
- The Enterprise Software Trap: Commenters argued that enterprise software is universally poor due to a principal-agent problem: the purchaser (managers) prioritizes compliance, reporting, and "required fields," while the actual user is ignored. Conversely, consumer software is polished because it must woo individuals, though users noted it is often optimized for engagement rather than actual utility.
- AI Code: Orchestration vs. Pollution: A heated debate emerged regarding the practical output of AI agents.
- The Proponent View: Some argued AI is a tool change—like moving from a handsaw to a chainsaw. The work shifts from manual coding to "orchestration, planning, and validating."
- The Skeptic View: Critics countered that AI-generated code is often "orders of magnitude worse" and brittle. A significant complaint was review fatigue: developers expressed frustration with colleagues dumping 1,000-line, zero-effort AI diffs that are exhausting to verify and debug.
- The Amplification Theory: One user suggested AI simply amplifies existing traits: it makes lazy developers lazier (and dangerous), while potentially helping skilled architects move faster, provided they have the taste to judge the output.
- The Philosophy of Efficiency: Expanding on the article's reference to Jacques Ellul, commenters noted that "efficiency" in the corporate sense is usually a proxy for shareholder return, which fundamentally trades off system adaptability and resilience. One user recommended Turing's Cathedral as a look back at an era of "real engineering" craft that the industry has since forgotten.
Show HN: TetrisBench – Gemini Flash reaches 66% win rate on Tetris against Opus
AI Model Tetris Performance Comparison: A new site pits AI models against each other in head-to-head Tetris and tracks results by wins, losses, and draws with a public leaderboard. It’s a community-driven benchmark—there’s no data yet, and users are invited to run AI-vs-AI matches to populate the stats. Beyond the leaderboard, there’s a “Tetris Battle” mode to watch or play, making it a playful way to compare agents under identical conditions.
Discussion Summary:
The discussion focused heavily on the specific mechanics of the implementation, the suitability of LLMs for the task, and legal concerns. The creator (ykhl) clarified the technical architecture: rather than relying on visual reasoning (where LLMs struggle), the system treats Tetris as a coding optimization problem. The models receive the board state as a JSON structure and generate code to evaluate moves. The creator noted that models often "over-optimize" for immediate rewards (clearing lines), creating brittle game states that lead to failure, rather than prioritizing long-term survival heuristics like board smoothness.
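To make that trade-off concrete, here is a toy evaluator of the kind the models are asked to write; the grid format and weights are assumptions, not TetrisBench's actual schema:

```python
# Toy Tetris board evaluator: rewards cleared lines but penalizes holes and
# bumpiness (the "long-term survival" signals the creator says models neglect).
# The grid format and the weights are assumptions, not TetrisBench's schema.
def evaluate(board: list[list[int]]) -> float:
    width = len(board[0])
    heights = [0] * width
    holes = 0
    for x in range(width):
        seen_block = False
        for y, row in enumerate(board):          # row 0 is the top of the board
            if row[x]:
                if not seen_block:
                    heights[x] = len(board) - y  # column height from first filled cell
                    seen_block = True
            elif seen_block:
                holes += 1                       # empty cell underneath a filled one
    complete_lines = sum(all(row) for row in board)
    bumpiness = sum(abs(heights[x] - heights[x + 1]) for x in range(width - 1))
    return 10.0 * complete_lines - 4.0 * holes - 1.0 * bumpiness - 0.5 * sum(heights)

empty = [[0] * 10 for _ in range(20)]
print(evaluate(empty))  # 0.0 for an empty board
```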
Key points from the thread include:
- Tetris Mechanics: Experienced players pointed out flaws in the game engine, specifically regarding the randomization system (suggesting a standard "7-bag" system to prevent piece starvation; a minimal version follows this list) and the rotation physics, which users felt were biased to the left or lacked standard "SRS" (Super Rotation System) behaviors.
- LLM Utility vs. Traditional Bots: Several users debated the efficiency of using billion-dollar GPU clusters to play a game that simple algorithms on 40-year-old CPUs can master. Critics argued that Reinforcement Learning (e.g., AlphaZero) is a more appropriate architecture than LLMs for this task. However, others defended the project as a benchmark for reasoning and coding ability rather than raw gameplay dominance.
- Model Specifics: There was a brief comparison of model performance, with Gemini Flash highlighted as a cost-effective "workhorse" compared to the more expensive Claude Opus.
- Legal Warnings: Multiple users warned the creator about "Tetris Holdings" and their history of aggressive trademark enforcement against fan projects.
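For reference, the 7-bag randomizer commenters asked for is only a few lines: shuffle one copy of each of the seven tetrominoes, deal the bag out, and reshuffle when it is empty, which bounds how long any piece can be starved. A minimal sketch:

```python
# Standard 7-bag randomizer: each batch of seven pieces contains every tetromino
# exactly once, so at most 12 other pieces can appear between two copies of the same piece.
import random
from collections.abc import Iterator

def seven_bag(seed: int | None = None) -> Iterator[str]:
    rng = random.Random(seed)
    while True:
        bag = list("IJLOSTZ")
        rng.shuffle(bag)
        yield from bag

gen = seven_bag(seed=42)
print([next(gen) for _ in range(14)])  # two full bags, no piece starved
```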
Show HN: Only 1 LLM can fly a drone
SnapBench: a Pokémon Snap–style spatial reasoning test for LLMs
- What it is: A small benchmark where a vision-language model pilots a drone in a voxel world to find and identify three ground-dwelling creatures (cat/dog/pig/sheep). Architecture: Rust controller (orchestration), Zig/raylib simulator (physics, terrain, creatures, UDP 9999 command API), and a VLM via OpenRouter. The controller feeds screenshots + state to the model; the model returns movement, identify, and screenshot commands. Identification succeeds within 5 units. (A controller-loop sketch follows this list.)
- Headline result: Out of seven frontier models tested with the same prompt, seeds, and a 50-iteration cap, only Gemini 3 Flash consistently managed to descend to ground level and successfully identify creatures. Others:
  - GPT-5.2-chat: Gets close horizontally but rarely lowers altitude.
  - Claude Opus 4.5: Spams identification (160+ attempts) but approaches at bad angles, never succeeds.
  - The rest: Wander or get stuck.
- Key insight: Altitude/approach control, not abstract reasoning, was the bottleneck. The cheapest model beat pricier ones, suggesting:
  - Spatial/embodied control may not scale with model size (yet).
  - Training differences (e.g., robotics/embodied data) could matter more.
  - Smaller models may follow literal instructions (“go down”) more directly.
- Other observations:
  - Two-creature “wins” happened when spawns were close; the 50-step limit often ends runs early in a big world.
  - High-contrast creatures (gray sheep, pink pigs) are easier to spot; visibility normalization is a possible future tweak.
- Caveats: Side project, not a rigorous benchmark; one blanket prompt for all models; basic feedback loop; iteration cap may disadvantage slower-but-capable agents.
- Prior attempt IRL: A DJI Tello test ended with ceiling bumps and donuts; new hardware planned now that Flash shows promise in sim.
- Try it: GitHub kxzk/snapbench. Requires Zig ≥0.15.2, Rust (2024 edition), Python ≥3.11, uv, and an OpenRouter API key.
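To picture the controller loop, here is a hedged sketch of a client for the simulator's UDP command API; the command strings are invented for illustration, since the real wire format is defined in the snapbench repo, not here:

```python
# Hedged sketch of a controller talking to the simulator's UDP command API on
# port 9999. The command strings are illustrative; the actual protocol is
# defined by the snapbench simulator, not by this snippet.
import socket

SIM_ADDR = ("127.0.0.1", 9999)

def send_command(sock: socket.socket, command: str) -> str:
    sock.sendto(command.encode(), SIM_ADDR)
    reply, _ = sock.recvfrom(65536)   # e.g., updated drone/creature state
    return reply.decode(errors="replace")

with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
    sock.settimeout(2.0)
    for cmd in ["screenshot", "move 0 -1 0", "identify cat"]:  # hypothetical commands
        try:
            print(cmd, "->", send_command(sock, cmd)[:80])
        except socket.timeout:
            print(cmd, "-> no reply (simulator not running)")
```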
Here is a summary of the discussion:
Critique: Wrong Tool for the Job
A significant portion of the discussion centered on whether LLMs should handle low-level control. Critics argued that LLMs are inefficient text generators ill-suited for real-time physics, noting that latency and token costs destroy the simple economics of flight. Several users likened the approach to "using a power drill for roofing nails," suggesting that standard control loops (PID) or simpler neural networks (LSTMs) are far superior for keeping a drone airborne.
Alternative Architectures
Commenters suggested looking beyond standard VLMs:
- Vision-Language-Action (VLA) Models: Users noted that specific VLA models (like those intended for robotics) are better suited for embodied control than general-purpose chat models.
- Qwen3VL: One user observed that Qwen models often possess better spatial grounding (encoding pixel coordinates in tokens) compared to larger, abstract reasoning models.
- Pipeline Flaws: One commenter critiqued the project's architecture (Scene → Text Labels → Action), arguing that converting a 3D simulation into discrete text labels causes a loss of relative geometry and depth. They suggested that without an explicit world model, errors compound over time because the agent is "stateless" and lacks temporal consistency.
Defense: High-Level Reasoning vs. Piloting
Defenders of the approach clarified that the goal isn't to replace the flight controller (autopilot), but to build an agent capable of high-level semantic tasks (e.g., "fly to the garden and take a picture of the flowers"). They argued that while an LLM shouldn't manage rotor speeds, it is currently the best tool for interpreting natural language instructions and visual context to direct a traditional control system.
Google AI Overviews cite YouTube more than any medical site for health queries
Google AI Overviews lean on YouTube over medical sites for health answers, study finds
- The study: SE Ranking analyzed 50,807 German-language health queries (Dec 2025) and 465,823 citations. AI Overviews appeared on 82% of health searches.
- Top source: YouTube was the single most cited domain at 4.43% (20,621 citations) — more than any hospital network, government health portal, medical association, or academic institution. Next: NDR.de (3.04%), MSD Manuals (2.08%), NetDoktor (1.61%), Praktischarzt (1.53%).
- Why it matters: Researchers say relying on a general-purpose video platform (where anyone can upload) signals a structural design risk—prioritizing visibility/popularity over medical reliability—in a feature seen by 2B people monthly.
- Google’s response: Says AI Overviews surface high-quality content regardless of format; many cited YouTube videos come from hospitals/clinics and licensed professionals; cautions against generalizing beyond Germany; claims most cited domains are reputable. Notes 96% of the 25 most-cited YouTube videos were from medical channels.
- Context: Follows a Guardian probe showing dangerous health misinfo in some AI Overviews (e.g., misleading liver test guidance). Google has since removed AI Overviews for some, but not all, medical searches.
- Caveats: One-time snapshot in Germany; results can vary by region, time, and query phrasing. Researchers chose Germany for its tightly regulated healthcare environment to test whether reliance on non-authoritative sources persists.
Based on the discussion, users expressed concern that Google’s AI is creating a "closed loop" of misinformation by citing AI-generated YouTube videos as primary sources.
The "Dead Internet" and Feedback Loops
Participants described an "Ouroboros" effect where AI models validate themselves by citing other AI-generated content. Several users invoked the "Dead Internet Theory," noting that YouTube is exploding with low-quality, AI-generated videos that exist solely to capture search traffic.
- One user recounted finding deepfaked "science" videos, such as AI-generated scripts using Richard Feynman’s voice to read content he never wrote.
- Others noted that "grifters" are using AI to scale scams, such as fake video game cheat detection software.
The Crisis of Trust
A debate emerged regarding the fundamental reliability of AI search:
- The skeptical view: Relying on AI that cites other AI destroys "shared reality." If the technology cannot filter out its own "slop," it cannot be a trusted intermediary for truth.
- The counter-argument: Some argued that human sources (propaganda, biased media) are also unreliable, suggesting that the issue is not unique to AI but rather a general problem of verifying information sources.
Economic Incentives and Alternatives
Commenters suggested that Google's business model limits its ability to fix this.
- Users praised Kagi (a paid search engine) for its "niche" ability to use agentic loops to verify primary sources.
- The consensus was that Google’s unit economics (serving billions of free queries) force it to rely on cheaper, single-pass retrieval methods, which lack the depth to filter out reputable-looking but hallucinated video content.
HN: llms.py v3 ships a big extensibility overhaul, 530+ models via models.dev, and “computer use” automation
llms.py just released v3, reframing itself as a highly extensible LLM workbench. The headline change is a switch to the models.dev catalog, unlocking 530+ models across 24 providers, paired with a redesigned Model Selector and a plugin-style extensions system. It also adds desktop automation (“computer use”), Gemini File Search–based RAG, MCP tool connections, and a raft of built‑in UIs (calculator, code execution, media generation).
Highlights
- 530+ models, 24 providers: Now inherits the models.dev open catalog; enable providers with a simple "enabled": true in llms.json. Daily auto-updates via llms --update-providers. Extra providers can be merged via providers-extra.json.
- New Model Selector: Fast search, filtering by provider/modalities, sorting (knowledge cutoff, release date, last updated, context), favorites, and rich model cards. Quick toggles to enable/disable providers.
- Extensions-first architecture: UI and server features implemented as plugins using public client/server APIs; built-ins are just extensions.
- RAG with Gemini File Search: Manage file stores and document uploads for retrieval workflows.
- Tooling: First-class Python function calling; MCP support to connect to Model Context Protocol servers for extended tools.
- Computer Use: Desktop automation (mouse, keyboard, screenshots) for agentic workflows.
- Built-in UIs:
- Calculator (Python math with a friendly UI)
- Run Code (execute Python/JS/TS/C# in CodeMirror)
- KaTeX math rendering
- Media generation: image (Google, OpenAI, OpenRouter, Chutes, Nvidia) and TTS (Gemini 2.5 Flash/Pro Preview), plus a media gallery
- Storage and performance: Server-side SQLite replaces IndexedDB for robust persistence and concurrent use; persistent asset caching with metadata.
- Provider nuances: Non-OpenAI-compatible providers handled via a providers extension; supports Anthropic’s Messages API “Interleaved Thinking” to improve reasoning between tool calls (applies to Claude and MiniMax).
Why it matters
- One pane of glass for LLM experimentation: broad model coverage with consistent UX.
- Batteries included: from RAG to tool use to desktop control and media generation, with minimal setup.
- Extensible by design: encourages custom providers, tools, and UI add-ons as plugins.
Getting started
- pip install llms-py (or pip install llms-py --upgrade)
- Configure providers by enabling them in llms.json; update catalogs with llms --update-providers (a minimal sketch follows)
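If you want to script that toggle, a hedged sketch follows; it assumes only what the release notes state (provider entries in llms.json carry an "enabled" flag), so the surrounding keys and the provider name are illustrative rather than the project's documented schema:

```python
# Hedged sketch: flip a provider's "enabled" flag in llms.json, then refresh the
# catalog with `llms --update-providers`. The "providers" key and the provider
# name are assumptions; only the "enabled": true convention comes from the notes.
import json
import subprocess
from pathlib import Path

config_path = Path("llms.json")
config = json.loads(config_path.read_text())

provider = config.get("providers", {}).get("openrouter")  # illustrative lookup
if provider is not None:
    provider["enabled"] = True
    config_path.write_text(json.dumps(config, indent=2))

subprocess.run(["llms", "--update-providers"], check=False)
```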
Caveats
- Powerful features like code execution and desktop control have security implications; use with care.
- You’ll still need API keys and to mind provider quotas/costs.
Discussion Summary
The discussion explores the open-source philosophy behind the project, its technical implementation of agents, and the challenges of gaining visibility on Hacker News.
- Licensing and OpenWebUI: A significant portion of the conversation contrasts llms.py with OpenWebUI. Users expressed frustration with OpenWebUI’s recent move to a more restrictive license and branding lock-in. The creator of llms.py (mythz) positioned v3 as a direct response to this, emphasizing a permissive open-source approach and an architecture designed to prevent monopoly on components by making everything modular and replaceable.
- Technical Implementation (Agents & MCP):
- Orchestration: When asked about managing state in long-running loops (specifically regarding LangGraph), the creator clarified that llms.py uses a custom state machine modeled after Anthropic’s "computer use" reference implementation, encapsulating the agent loop in a single message thread.
- Model Context Protocol (MCP): The creator noted that while MCP support is available (via the fast_mcp extension), it introduces noticeable latency compared to native tools. However, it is useful for enabling capabilities like image generation on models that don't natively support it.
- Deployment and Auth: Users inquired about multi-user scenarios. The system currently supports GitHub OAuth for authentication, saving content per user. While some users felt the features were "gatekept" or not ready for enterprise deployment compared to other tools, the creator emphasized that the project's goal is simplicity and functionality for individuals or internal teams, rather than heavy enterprise feature sets.
- Ranking Mechanics: There was a meta-discussion regarding the difficulty of getting independent projects to the front page. The creator noted they had posted the project multiple times over the week before it finally gained traction, leading to speculation by other users about how quickly "new" queue submissions are buried by downvotes or flagged compared to company-backed posts.
- Naming: A user briefly pointed out the potential confusion with Simon Willison’s popular llm CLI tool.
OracleGPT: Thought Experiment on an AI Powered Executive
SenTeGuard launches a blog focused on “cognitive security,” but it’s not live yet. The landing page promises team-authored, moderated longform updates, research notes, and security insights, but currently shows “No posts yet” and even an admin prompt to add the first post—suggesting a freshly set-up placeholder. One to bookmark if you’re tracking cognitive security, once content arrives.
The discussion focuses on the implications of AI in management and governance, rather than the lack of content on the blog noted above.
The conversation covers three main themes:
- Automation and Accountability: Users debate the feasibility of "expert systems" running companies. While some cite algorithmic trading and autopilot as proof that computers already make high-stakes decisions, others argue that humans (executives) are structurally necessary to provide legal accountability ("you can't prosecute code"). The short story Manna by Marshall Brain is cited as a relevant prediction of algorithmic management.
- Government by Algorithm: Participants speculate on using a "special government LLM" to synthesize citizen preferences for direct democracy. Skeptics counter that current LLMs hallucinate (referencing AI-generated fake legal briefs) and that such a system would likely be manipulated by those in charge of the training data or infrastructure.
- Commercial Intent: One commenter critiques the submission as a commercial vehicle to sell "cognitive security" services, arguing this commercial framing undermines the philosophical discussion. The apparent author (djwd) acknowledges the commercial intent but argues the engineering and political questions raised are still worth discussing.
Minor tangents include:
- Comparisons of AI hype to failed technologies like Theranos versus successful infrastructure like aviation autoland.
- A debate regarding the US President’s practical vs. theoretical access to classified information and chain-of-command issues.
Clawdbot - open source personal AI assistant
Moltbot (aka Clawdbot): an open‑source, self‑hosted personal AI assistant for your existing chat apps
What’s new
- A single assistant that runs on your own devices and replies wherever you already are: WhatsApp, Telegram, Slack, Discord, Google Chat, Signal, iMessage/BlueBubbles, Microsoft Teams, Matrix, Zalo (incl. Personal), and WebChat. It can also speak/listen on macOS/iOS/Android and render a live, controllable Canvas.
- Easy onboarding: a CLI wizard (moltbot onboard) sets up the gateway, channels, and skills; installs a user‑level daemon (launchd/systemd) so it stays always on. Cross‑platform (macOS, Linux, Windows via WSL2). Requires Node ≥22; install via npm/pnpm/bun.
- Models and auth: works with multiple providers via OAuth or API keys; built‑in model selection, profile rotation, and failover. The maintainers recommend Anthropic Pro/Max (100/200) and Opus 4.5 for long‑context and better prompt‑injection resistance.
- Developer/ops friendly: stable/beta/dev release channels, Docker and Nix options, pnpm build flow, and a compatibility shim for the older clawdbot command.
- Security defaults for real messaging surfaces: unknown DMs get a pairing code and aren’t processed until you approve, reducing risk from untrusted input.
Why it matters
- Brings the “always‑on, everywhere” assistant experience without locking you into a hosted SaaS front end.
- Bridges consumer and workplace chat apps, making agents genuinely useful where people already collaborate.
- Thoughtful guardrails (DM pairing) and model failover are practical touches for a bot that lives in your actual message streams.
Quick try
- Install: npm install -g moltbot@latest, then moltbot onboard --install-daemon
- Run the gateway: moltbot gateway --port 18789 --verbose
- Talk to it: moltbot agent --message "Ship checklist" --thinking high
Notes and caveats
- You’re still trusting whichever model/provider you use; “feels local” doesn’t mean the LLM runs locally by default.
- Needs Node 22 and a background service; connecting to many chat platforms may have ToS and security considerations.
- MIT licensed; the repo shows strong community interest (tens of thousands of stars/forks).
Discussion Summary
The discussion centers on the economics, security, and utility of self-hosted AI agents:
- The Cost of "Always On": Users warned that running agents via metered APIs (like Claude Opus) can get expensive quickly. One user reported spending over $300 in two days on "fairly basic tasks," while another noted that a single complex calendar optimization task involving reasoning cost $29. This led to suggestions for using smaller, specialized local models to handle routine logic.
- The "Private Secretary" Dream: There is distinct enthusiasm for a "grown-up" version of Siri—dubbed "Nagatha Christy" or "Jarbis" by one commenter—that can handle messy personal contexts (kids' birthday parties, dentist reminders) alongside work integrations (Jira, Trello, Telegram) without monetizing user data. Several users expressed deep dissatisfaction with current hosted options and a willingness to pay a premium for a private, reliable alternative.
- Security Concerns: The repository came under scrutiny for hardcoded OAuth credentials (client secrets), which some argued is common in open-source desktop apps but poses risks if the "box" is compromised. Others found the concept of "directory sandboxing" insufficient, expressing fear about granting an AI agent permission to modify code and files on their primary machine.
- Self-Repairing Agents: One contributor shared an anecdote about the "lightbulb moment" of working with an agent: when the bot stopped responding on Slack, they used the AI to debug the issue, review the code, and help submit a Pull Request to fix itself.
AI Lazyslop and Personal Responsibility
A developer recounts a painful code review: a coworker shipped a 1,600-line, AI-generated PR with no tests, demanded instant approval, and later “sneak-merged” changes after pushback. The author doesn’t blame the individual so much as incentives that reward speed over stewardship—and coins “AI Lazyslop”: AI output the author hasn’t actually read, pushing the burden onto reviewers.
Proposed anti–AI Lazyslop norms:
- Own the code you accept from an LLM.
- Disclose when and how you used AI; include key prompts/plans in the PR.
- Personally read and test everything; add self-review comments explaining your thinking.
- Use AI to assist review, then summarize what you fixed and why.
- Be able to explain the logic and design without referring back to the AI.
- Write meaningful tests (not trivial ones).
The post notes a cultural shift: projects like Ghostty ask contributors to disclose AI use, and even Linus Torvalds has experimented with “vibe-coding” via AI. The gray area persists: the coworker evolved to “semi-lazy-slop,” piping reviewer comments straight into an LLM—maybe better, maybe not.
In a nice touch of dogfooding, the author discloses using Claude for copy edits and lists the concrete fixes it suggested. The core message: don’t shame AI—set expectations that keep quality and responsibility with the human who ships the code.
Based on the discussion, here is a summary of the points raised by Hacker News commenters:
Trust and Professionalism
The most heated point of discussion was the "sneak-merge" (merging code after a review without approval). Commenters almost universally agreed that this violates the fundamental trust required for collaborative development. While the author focused on systemic incentives, many users argued that "Mike" bears personal responsibility. One user compared sneaking unreviewed code to a chef spitting in food—a deliberate, unethical action rather than just a process error.
The "Blameless" Culture Debate
Several users pushed back against the author's attempt to "blame the incentives" rather than the individual.
- Commenters warned that "blameless culture" can swing too far, protecting toxic behavior and forcing managers to silently manage poor performers out while publicly maintaining a positive façade.
- One user argued that "bending backwards" to avoid blaming an individual for intentional actions creates a low-trust environment where high-quality software cannot be built.
The Reviewer’s Burden and "Prisoner's Dilemma"
A recurring theme was the asymmetry of effort. AI allows developers to generate code faster than seniors can review it.
- The Prisoner’s Dilemma: One commenter described a situation where diligent reviewers spend all their time fixing "AI slop" from others, consequently missing their own deadlines. Meanwhile, the "slop-coders" appear productive due to high velocity and get promoted, punishing those who maintain quality.
- Scale vs. Existence: While huge PRs existed before AI (e.g., refactoring or Java boilerplate), users noted that AI changes the frequency. Instead of one massive PR every few weeks, it becomes a daily occurrence, overwhelming the review pipeline.
Proposed Solutions and Nuance
- Policy: Some pointed to the LLVM project's AI policy as a gold standard: AI is a tool, but the human must own the code and ensure it is not just "extractive" (wasting reviewer time).
- Reviewing Prompts: There was a debate on the author's suggestion to include prompts in PRs. Some argued that prompts represent the "ground truth" and reveal assumptions, making them valuable to review. Others felt that only the resulting code matters and reviewing prompts is unnecessary overhead.
- Author’s Context: The author (presumably 'bdsctrcl') chimed in to clarify the technical context of the 1,600-line PR, noting it was largely Unreal Engine UI boilerplate (flags and saved states). While it "worked," it bypassed specific logic (ignoring flag checks) in favor of direct struct configuration, highlighting how AI code can be functional but architecturally incorrect.
When AI 'builds a browser,' check the repo before believing the hype
Top story: The Register calls out Cursor’s “AI-built browser” as mostly hype
- What was claimed: Cursor’s CEO touted that GPT‑5.2 agents “built a browser” in a week—3M+ lines, Rust rendering engine “from scratch,” custom JS VM—adding it “kind of works.”
- What devs found: Cloning the repo showed a project that rarely compiles, fails CI on main, and runs poorly when manually patched (reports of ~1 minute page loads). Reviewers also spotted reliance on existing projects (e.g., Servo-like pieces and QuickJS) despite “from scratch” messaging.
- Pushback from maintainers: Servo maintainer Gregory Terzian described the code as “a tangle of spaghetti” with a “uniquely bad design” unlikely to ever support a real web engine.
- Cursor’s defense: Engineer Wilson Lin said the JS VM was a vendored version of his own parser project, not merely wiring dependencies—but that undercuts the “from scratch” and “AI-built” framing.
- Scale vs. results: The Register cites estimates that the autonomous run may have burned through vast token counts at significant cost, yet still didn’t yield a reproducible, functional browser.
- Bigger picture: The piece argues this is emblematic of agentic-AI hype—exciting demos without CI, reproducible builds, or credible benchmarks. AI coding tools are useful as assistive “autocomplete/refactor” layers, but claims that agents can ship complex software are outpacing reality.
Why it matters for HN:
- Shipping > demo: Repos, CI status, and benchmarks remain the truth serum. If the code doesn’t build, the press release doesn’t matter.
- Agent limits: Autonomous agents can generate mountains of code, but architecture, integration, and correctness still demand human engineering rigor.
- Practical adoption: The Register urges proof first—working software and measurable ROI—before buying into “agents will write 90% of code” narratives.
Bottom line: Before believing “AI built X,” check the repo.
Based on the discussion, here is a summary of the comments on Hacker News:
The "Novel-Shaped Object"
Commenters were largely dismissive of the project's technical merit, characterizing it as a marketing stunt rather than an engineering breakthrough. One user likened the browser to a "novel-shaped object"—akin to a key made of hollow mud that looks correct but shatters upon use. Others described the code not as a functional engine, but as a "tangle of spaghetti" that poorly copied existing implementations (like Servo) rather than genuinely building "from scratch," resulting in a design unlikely to ever support a real-world engine.
Debating "From Scratch" and Dependencies
A significant portion of the thread involved users decompiling the "from scratch" claim.
- Hidden Dependencies: Users pointed out that the AI did not write a rendering engine from nothing; it hallucinated or "vendored" (copied) code from existing projects like Servo, Taffy, and QuickJS.
- The "TurboTax" Analogy: One commenter compared the CEO’s claim to saying you did your taxes "manually" while actually filling out TurboTax forms—technically you typed the numbers, but the heavy lifting was pre-existing logic.
- Legal Definitions: There was a brief debate over whether the "from scratch" claim constituted "fraudulent misrepresentation" under UK law, though others argued that software engineering terms are too subjective for such a legal standard.
Metacommentary on Tech Journalism
Simon Willison (simonw), who interviewed the creators for a related piece, was active in the thread and faced criticism for "access journalism."
- The Critique: Users argued Willison should have pushed back harder against the CEO's "from scratch" hype during the interview, accusing him of enabling a marketing narrative rather than exposing the project's lack of rigor.
- The Defense: Willison defended his approach, stating his goal was to understand how the system was built rather than to grill the CEO on Twitter phrasing. While he conceded that "from scratch" was a misrepresentation due to the vendored dependencies, he argued the system did still perform complex tasks (writing 1M+ lines of code) that were worth investigating, even if the end result was flawed.
Cost vs. Output
Critiques also focused on the economics of the experiment. Users noted that spending vast amounts of resources (potentially $100k+ in compute/tokens) to generate a browser that "kind of works" is not impressive. They argued that the inability to compile or pass CI makes the project less of a demo of AI capability and more of a cautionary tale about the inefficiency of current agentic workflows.
Georgia leads push to ban datacenters used to power America's AI boom
Georgia lawmakers have introduced what could become the first statewide moratorium on building new datacenters, aiming to pause projects until March 2027 to set rules around facilities that guzzle energy and water. Similar statewide bills surfaced in Maryland and Oklahoma, while a wave of local moratoriums has already spread across Georgia and at least 14 other states. The push comes as Atlanta leads U.S. datacenter construction and regulators approved a massive, mostly fossil-fueled power expansion to meet tech demand.
Key points
- HB 1012 would halt new datacenters statewide to give state and local officials time to craft zoning and regulatory policies. A Republican co-sponsor says the pause is about planning, not opposing datacenters, which bring jobs and tax revenue.
- Georgia’s Public Service Commission just greenlit 10 GW of additional power—enough for ~8.3m homes—largely to serve datacenters, with most supply from fossil fuels.
- At least 10 Georgia municipalities (including Roswell) have enacted their own moratoriums; Atlanta led the nation in datacenter builds in 2024.
- Critics cite rising utility bills, water use, and tax breaks: Georgia Power profits from new capital projects, which advocates say drives rate hikes (up roughly a third in recent years) while dulling incentives to improve grid efficiency.
- Proposals in the legislature span ending datacenter tax breaks, protecting consumers from bill spikes, and requiring annual disclosure of energy and water use.
- National momentum: Bernie Sanders floated a federal moratorium; advocacy groups say communities want time to weigh harms and costs.
Politics to watch
- Bill sponsor Ruwa Romman, a Democrat running for governor, ties the pause to upcoming elections for Georgia’s utility regulator. Voters recently ended the PSC’s all-Republican control by electing two Democrats; another seat is up this year. Supporters hope a new majority will scrutinize utility requests tied to datacenter growth.
Source: The Guardian (Timothy Pratt)
Here is a summary of the discussion on Hacker News:
Regulation vs. Prohibition
Rather than a blanket moratorium, several commenters suggested Georgia should implement strict zoning and operational requirements. Proposals included mandating "zero net water" usage (forcing the use of recycled or "purple pipe" water), setting strict decibel limits at property boundaries to mitigate noise, and requiring facilities to secure their own renewable energy sources.
Grid Strain and Economic Risk
A significant portion of the debate focused on the mismatch between data center construction speeds and power plant construction timelines. Users highlighted the economic risk to local ratepayers: if utilities build expensive capacity for data centers that later scale back or move, residents could be left paying for the overbuilt infrastructure. Some noted that power funding models are shifting to make data centers liable for these costs, but skepticism remains about whether consumers are truly shielded from rate hikes.
Environmental and Local Impact
The "NIMBY" (Not In My Backyard) aspect was heavily discussed. While some users argued that data centers are clean compared to factories, others pointed out that on-site backup generators (gas/diesel turbines) do produce exhaust, and constant cooling noise is a nuisance. There is also frustration that these facilities consume massive resources (water/power) while providing very few local jobs compared to their footprint.
Georgia’s Energy Mix
Commenters debated whether Georgia’s political leaning would hinder renewable energy adoption. However, data was cited showing Georgia is actually a leader in solar capacity (ranked around 7th in the U.S.), suggesting that solar adoption in sunny states is driven more by economics than political rhetoric.
Clarifying the Scope
There was some confusion regarding federal preemption of "AI regulation." Other users clarified that this bill specifically targets physical land use, zoning, and utility consumption—areas traditionally under state and local control—rather than the regulation of AI software or algorithms.
AI Was Supposed to "Revolutionize" Work. In Many Offices, It's Creating Chaos
Alison Green’s latest “Direct Report” rounds up real workplace stories showing how generative AI is backfiring in mundane but costly ways—less “revolution,” more chaos.
Highlights:
- Invented wins: An employee’s AI-written LinkedIn post falsely claimed CDC collaborations and community health initiatives—spawning shares, confusion, and award nominations based on fiction.
- Hype that lies: An exec let AI “punch up” a morale email; it announced a coveted program that didn’t exist.
- Privacy faceplants: AI note-takers auto-emailed interview feedback to the entire company and to candidates; another recorded union grievance meetings and blasted recaps to all calendar invitees.
- Hollow comms: Students and job candidates lean on LLMs for networking and interviews, producing buzzwordy, substance-free answers that erode trust.
The throughline: People over-trust polished outputs, underestimate hallucinations, and don’t grasp default-sharing risks. The result is reputational damage, legal exposure, and worse teamwork.
Takeaways for teams:
- Default to human review; never let AI invent accomplishments or announce decisions.
- Lock down AI transcription/sharing settings; avoid them in sensitive meetings.
- Set clear policies on disclosure and acceptable use; train for verification and privacy.
- In hiring and outreach, authenticity beats LLM gloss—interviewers can tell.
The Discussion
In a concise thread, commenters drew parallels between these workplace AI failures and the privacy controversies surrounding Windows 11 (likely referencing the Recall feature’s data scraping). Conversation also touched on the disparity between the promised "revolution" and the actual user experience, with users briefly debating timelines—years versus quarters—for the technology to mature or for the hype to settle.