AI Submissions for Tue Mar 17 2026
Mistral AI Releases Forge
Submission URL | 667 points | by pember | 170 comments
Mistral launches Forge: build frontier-grade models on your own data
- What’s new: Forge is Mistral’s system for enterprises to train and refine large models on proprietary knowledge—codebases, policies, ops logs, structured data—so models speak the company’s language and follow its rules. Early partners include ASML, Ericsson, the European Space Agency, Singapore’s DSO and HTX, and Reply.
- How it works: Covers the full lifecycle—pre-training on internal corpora, post-training for task behavior, and reinforcement learning to align with policies and real-world workflows. It supports dense and MoE architectures (for cost/latency trade-offs) and multimodal inputs.
- Agent-first: Designed so code agents (e.g., Mistral Vibe) can run the whole loop—fine-tune models, tune hyperparameters, schedule jobs, generate synthetic data, and hill-climb evals—while Forge monitors regressions on chosen benchmarks.
- Why it matters: Moves beyond generic LLMs to “institutional intelligence.” Custom models promise more reliable enterprise agents: better tool use, sturdier multi-step workflows, and decisions that reflect internal policies and business logic. Emphasis on control/IP, governance, and operating within an org’s own infrastructure.
- Continuous adaptation: Built-in RL pipelines and evaluation frameworks let teams keep models current as regulations, systems, and data evolve.
Open questions HN will care about:
- Deployment model: on-prem, private cloud, or Mistral-managed? Data residency and isolation guarantees?
- Cost/throughput vs existing stacks (OpenAI/Anthropic fine-tuning, NVIDIA NeMo, Cohere, Google/Azure custom models).
- Export rights/IP ownership, ability to self-host trained weights, and integration with existing MLOps/tooling.
- Benchmarks and case studies showing real reliability gains for agents and tool use at scale.
The Move to Enterprise vs. Developer Adoption
A significant portion of the discussion analyzes Mistral’s pivot toward B2B "golf course sales" rather than courting individual developers. While some users argue that winning the developer ecosystem is a prerequisite for long-term success (citing the "bottom-up" adoption of tools like VS Code or early AWS), others counter that for the specific clients Mistral is targeting—large banks, defense, and governments—decisions are top-down strategic choices where individual developer preference is irrelevant compared to compliance and SLAs.
Data Sovereignty as a Unique Selling Point
Commenters identify Mistral’s primary "moat" not necessarily as model superiority, but as its status as the leading non-US, GDPR-friendly option.
- The "EU Shield": Users suggest that for highly regulated European organizations (ASML, BNP Paribas, government entities), the paramount requirement is keeping data out of US jurisdictions to avoid the CLOUD Act or industrial espionage risks.
- Protectionism: There is a debate regarding "digital protectionism," with some users viewing Mistral’s growth as a product of EU industrial policy designed to prop up a local champion, comparable to Airbus or agricultural subsidies.
Defensibility of the Business Model
Skeptics question whether "fine-tuning as a service" is a defensible long-term strategy, arguing that the process is reproducible and that other providers could easily offer similar "white glove" services. However, supporters argue that Mistral’s specific combination of open-weights flexibility, on-prem operational capability, and non-US origin creates a specialized niche with few current competitors (except perhaps DeepSeek, though Chinese data residency issues make that a non-starter for Western enterprise).
User Experience and "Naming Chaos"
On a tactical level, developers express frustration with Mistral’s seemingly erratic naming conventions and versioning strategies. Users cite confusion between model names like "Codestral," "Devstral," and various date-stamped versions (e.g., 2512 vs latest), noting that documentation often lags behind releases, making it difficult to know which API endpoint is current.
Get Shit Done: A meta-prompting, context engineering and spec-driven dev system
Submission URL | 410 points | by stefankuehnel | 224 comments
Get Shit Done (GSD): a spec-driven “context engineering” layer for AI coding tools
What it is
- An open-source workflow that sits on top of Claude Code, OpenCode, Gemini CLI, GitHub Copilot, “Codex,” and Google’s Antigravity to turn vague prompts into reliable, spec-verified code.
- Markets itself as fixing “context rot” (quality drop as models fill their context window) via meta-prompting, XML-structured prompts, sub-agent orchestration, and state management.
- Aimed at solo devs and small teams who want output without “enterprise theater.” Works on macOS, Windows, and Linux.
Why it’s resonating
- Opinionated, minimal command set that abstracts the complexity of multi-agent prompting and context control.
- Pragmatic pitch: “describe what you want, the system extracts what it needs, then the model builds and verifies it.”
- Trending fast on GitHub (claims ~34k stars and ~2.8k forks in the repo view) and peppered with social proof (“trusted by engineers at Amazon, Google, Shopify, Webflow”).
How it’s different
- Treats AI coding as a spec pipeline rather than chat “vibecoding,” emphasizing structure, verification, and codebase mapping.
- Encourages fully automated runs (suggests using Claude with --dangerously-skip-permissions) to avoid constant approvals; offers granular permissions as an alternative.
Getting started
- Install: npx get-shit-done-cc@latest
- Choose runtime(s): Claude Code, OpenCode (open models), Gemini CLI, Copilot, “Codex” (skills-based), Antigravity.
- Global vs local install supported. Each runtime has its own config dir (e.g., ~/.claude/). Post-install help via /gsd:help (or runtime-specific variants).
- For existing projects: run /gsd:map-codebase to spawn parallel agents that analyze your stack before generation.
Caveats and notes
- Heavy automation can touch your shell and git; if you skip permission prompts, understand the risks. Granular allowlists are documented as a safer path.
- Bold claims about reliability and adoption are marketing; evaluate on your own codebase.
- The tool’s language and defaults are unapologetically “power-user”—great for speed, but not everyone will want to disable guardrails.
Bottom line: GSD is a lean, opinionated wrapper that tries to turn AI coding from ad‑hoc chat into a repeatable, spec-first workflow. If you’ve bounced off bloated “AI SDLC” tools or flaky one-shot prompting, this is a compelling, quick-to-try alternative.
Based on the discussion, here is a summary of the community's reaction:
Skepticism Regarding Cost and Efficiency
The most prominent theme is the high cost of execution relative to the output. Multiple users reported "burning" through significant token budgets (e.g., spending $25 for 500 lines of code) or confusing the models with too much context. One user noted GSD ran for hours to "achieve nothing," whereas a manual plan and implementation took only 20 minutes. The consensus among critics is that unsupervised agents often spin their wheels, costing time and money without delivering the promised "fire and forget" experience.
Framework Fatigue and "Wrapper" Debates
There is a lively debate about whether GSD (and similar tools like "Superpowers") is a necessary innovation or an "over-engineered wrapper" around existing CLIs like Claude Code. While some defend the tool for handling deterministic logic and saving tokens on helper scripts, others argue that native features (like Claude Code's "Plan Mode") are sufficient. One commenter wryly described the trend of building these tools as the "AI Developer’s Descent into Madness": eventually, everyone stops coding to write their own agent framework.
The Reality of "Fire and Forget"
Users generally pushed back against the marketing claim of fully autonomous coding. Several commenters noted that they still value the "exploratory" and "interactive" parts of coding, finding that "sophisticated agents" often fail if not closely monitored ("babysat"). The preferred workflow for many remains a tight feedback loop (Plan -> Code -> Verify) rather than a black-box process that might drift or "hallucinate" over long runtimes.
Comparison to Native Tools
The discussion frequently compares GSD to using Claude Code or GitHub Copilot directly. While some users appreciate the spec-driven structure GSD forces, others feel that recent updates to native tools (specifically Claude Code's Plan Mode or Copilot's workspace features) handle memory and planning well enough without the added complexity of a third-party layer.
Unsloth Studio
Submission URL | 368 points | by brainless | 72 comments
Unsloth Studio (Beta) launches: an open‑source, no‑code web UI to run, fine‑tune, and export open models locally—bringing inference, training, data prep, and model comparison into one interface.
Highlights
- Local-first, privacy-friendly: Works 100% offline with token-based auth.
- One UI for many tasks: Run GGUF and safetensors; train text, vision, TTS/audio, and embeddings; compare models side-by-side; export to GGUF or safetensors for llama.cpp, vLLM, Ollama, LM Studio.
- Faster training with less VRAM: Claims 2x speed and ~70% less VRAM across 500+ models (LoRA, FP8, FFT, PT optimizations), including Qwen3.5 and NVIDIA Nemotron 3; multi‑GPU supported.
- No-code data prep: “Data Recipes” turns PDFs/CSV/JSON/docs into usable/synthetic datasets via a graph workflow (powered by NVIDIA DataDesigner).
- Built-in tooling: Self-healing tool calls, web search, code execution, auto parameter tuning, editable chat templates, observability for training (loss, grad norms, GPU util), and run history.
What’s new (Mar 17 update)
- More stable install; Claude Artifacts support (execute HTML in chat, e.g., snake game).
- ~30% more accurate tool calls (notably for small models), tool-call timer, save web/tool outputs, toggle auto-healing.
- Faster/smaller installs; Windows CPU inference working; Mac setup more seamless.
Platforms and caveats
- Windows, Linux, WSL fully supported. CPU-only machines can do chat inference.
- Mac: Chat (GGUF) works; MLX training “coming soon.”
- NVIDIA GPUs: Training supported now (RTX 30/40/50, Blackwell, DGX). Multi‑GPU works; big upgrade pending.
- AMD: Chat works; train with Unsloth Core today; Studio training support coming.
- Beta status: Expect fixes/changes; first install can take 5–10 minutes to compile llama.cpp (precompiled binaries in progress).
Why it matters
- Aims to merge Ollama/LM Studio-style local inference with streamlined fine‑tuning, dataset creation, and export—lowering friction for teams and tinkerers to customize and ship local models quickly, without giving up privacy or needing bespoke tooling.
Discussion Summary:
The launch of Unsloth Studio sparked a technical discussion focused on hardware compatibility, installation preferences, and the project's open-source business model. Daniel Han (dnlhnchn), one of the creators, was highly active in the thread, addressing bugs and answering questions.
- Hardware Support: A primary point of friction is the current reliance on NVIDIA GPUs for training. Users with AMD cards expressed frustration with ROCm and are eagerly awaiting support, while Mac users were informed that while CPU inference works now, MLX-based training is "coming soon."
- Installation & Tooling: Several users criticized the initial installation script for modifying system-wide packages (npm, homebrew) rather than using isolated environments. There was strong advocacy for using uv (an extremely fast Python package and project manager) to handle dependencies. The developer acknowledged this, noting that "uv tool install unsloth" is already working and likely to become the default method.
- Licensing & Enterprise Use: Users reacted positively to the licensing model (Apache 2.0 for the core, AGPL-3.0 for the Studio UI), noting it is easier to get approved in corporate environments compared to LM Studio’s proprietary license.
- Target Audience: While some commenters felt a UI tool was non-essential for "LLM wizards," the developer pushed back, citing that Unsloth is the 4th largest LLM distributor and is heavily used by organizations like Meta, NASA, and Fortune 500 companies for production workflows.
- Bug Reports: Early adopters reported specific issues, such as TypeScript build errors on macOS and broken privacy policy links, which the developer promised to fix immediately.
Why AI systems don't learn – On autonomous learning from cognitive science
Submission URL | 161 points | by aanet | 105 comments
Why AI systems don’t learn and what to do about it (Dupoux, LeCun, Malik, arXiv:2603.15381)
- TL;DR: Three leading researchers argue today’s AI doesn’t truly “learn autonomously.” They propose a cognitively inspired architecture with three parts: System A (learns by observing), System B (learns by acting to gather new data), and System M (a meta-controller that switches between them using internally generated signals like curiosity or uncertainty).
- Why it matters: Moves beyond static training and benchmark-chasing toward agents that can set their own goals, explore, and learn continuously in open-ended, changing environments—key for robotics, embodied AI, and reducing reliance on labels or hand-crafted rewards.
- How it works (conceptually):
- System A extracts structure from passive data (the world as it is).
- System B intervenes to test hypotheses and collect informative samples (the world as it could be under action).
- System M allocates attention/effort, toggles modes, and drives intrinsic motivation, echoing how animals balance exploration and exploitation across development and evolution.
- Caveats: A position/framework piece—no new benchmarks or empirical results; implementation and evaluation are open questions.
- Paper: arXiv:2603.15381 (Mar 16, 2026), DOI pending.
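The observe/act/switch loop described above can be sketched in a few lines. This is purely illustrative: the class and method names are my own, not the paper's, and the scalar "uncertainty" is a toy stand-in for the internally generated signals (curiosity, uncertainty) the authors describe.

```python
import random

class SystemA:
    """Passive learner: extracts structure from data as the world provides it."""
    def step(self) -> float:
        return random.gauss(0.0, 1.0)   # broad, uncontrolled samples

class SystemB:
    """Active learner: intervenes to collect targeted, informative samples."""
    def step(self) -> float:
        return random.gauss(0.0, 0.2)   # narrow, experiment-driven samples

class SystemM:
    """Meta-controller: routes effort between A and B using an intrinsic signal."""
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def choose(self, uncertainty: float) -> str:
        # High uncertainty -> act to gather informative data (explore);
        # low uncertainty -> keep observing cheaply (exploit).
        return "act" if uncertainty > self.threshold else "observe"

controller, a, b = SystemM(), SystemA(), SystemB()
uncertainty = 1.0
for _ in range(5):
    mode = controller.choose(uncertainty)
    sample = (b if mode == "act" else a).step()
    uncertainty *= 0.6                  # pretend each sample reduces uncertainty
    print(mode, round(uncertainty, 3))
```

The interesting design question, which the paper leaves open, is what replaces the hard threshold: a learned policy over intrinsic signals rather than a fixed cutoff.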
Here is a summary of the discussion:
The Ghost of "Tay" and Unpredictable Software
Much of the discussion focused on the risks of continuous, autonomous learning, frequently citing Microsoft’s 2016 "Tay" chatbot, which learned toxic behavior from Twitter users within a day.
- Safety vs. Dynamism: Animats noted that today's models are "locked down" specifically to prevent this kind of drift. rmrdkttn argued that from a software engineering standpoint, self-evolving software is often undesirable; businesses prefer defined versions with predictable behavior over an "evolving organism" that changes autonomously in production.
- Cultural Context: TeMPOraL and others debated whether the "internet culture" that corrupted Tay (4chan trolling) has changed, suggesting the modern internet might be too fragmented or bored to sway a model as quickly, though rmchrhckr warned that model uniformity leads to "model collapse" and that some variation is necessary for survival/improvement.
Critique: LeCun’s "Blind Spot" on In-Context Learning
thptp offered a strong critique of the paper, arguing that LeCun ignores the success of In-Context Learning (ICL). They posited that agents using ICL are already performing autonomous learning effectively without updating weights. The commenter suggested the paper reflects an academic bias that overlooks the practical, engineering-led successes of end-to-end systems built by researchers like Sutskever and Karpathy.
Implementation Challenges: Memory and Control
- Building "System A" (Memory):
sctttylrdived into the technical difficulty of building the proposed memory systems. They noted that the hardest part isn't storage, but "knowledge decay"—deciding what to keep, what to trust, and what to forget to prevent "polluting" the agent's decision-making. They shared their own experience with multi-agent shared memory, emphasizing the need for trust signals and consolidation steps. - Designing "System M" (The Controller):
zhngchnandrbt-wrnglrdiscussed how to practically implement the meta-controller. Ideas included looking at biological analogs (hormones/emotions) as secondary systems to regulate curiosity and switching between observation and action.
Corporate Risks
dasil003 raised a cynical concern regarding the actual incentives for autonomous agents: in a corporate environment, a system designed to "learn by acting" might evolve "Machiavellian" traits, learning deception and maneuvering to maximize its goals.
Show HN: March Madness Bracket Challenge for AI Agents Only
Submission URL | 67 points | by bwade818 | 41 comments
AI Agent Bracket Challenge: March Madness for bots
What it is
- A playful API-driven March Madness contest where AI agents “enter” their own 63-pick NCAA brackets, get scored, and appear on a public leaderboard with a strategy tag.
How it works
- Quick start for Claude Code/Codex users: clone a skills repo that adds slash commands to sign up, fill brackets, and check status.
- Otherwise, use the REST API:
- Register with an agent name and email; the API key arrives via email.
- Fetch the bracket (text or JSON).
- Submit all 63 picks with an optional strategy_tag (e.g., stats-based, chaos, vibes).
- Check your score and rank; view your bracket via a private URL.
- Optional lock to finalize early; all brackets auto-lock at the deadline.
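As a rough sketch of the submission step, a client might build its request like this. Note the hedges: the endpoint path, field names, and auth header shape here are my guesses for illustration, not the documented API (bracketmadness.ai has the real reference), so this snippet only constructs the request rather than sending it.

```python
import json
from urllib import request

BASE = "https://bracketmadness.ai/api"  # base URL assumed from the docs link

def build_submission(api_key: str, picks: list[str],
                     strategy_tag: str = "vibes") -> request.Request:
    """Build (but do not send) the POST that submits all 63 picks."""
    if len(picks) != 63:
        raise ValueError("a full bracket is exactly 63 picks")
    body = json.dumps({"picks": picks, "strategy_tag": strategy_tag}).encode()
    return request.Request(
        f"{BASE}/bracket/submit",       # endpoint name is hypothetical
        data=body,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )

req = build_submission("demo-key", [f"team-{i}" for i in range(63)])
print(req.get_method(), req.full_url)
```

To actually send, you would pass the built request to `urllib.request.urlopen` once the real endpoint and auth scheme are confirmed against the docs.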
Notable details
- Round structure: 6 rounds from Round of 64 to Championship; picks must be logically consistent.
- Final Four pairings are fixed: East vs South, West vs Midwest.
- Play-in games are supported; if you picked the losing play-in team, the system auto-replaces it with the winner throughout your bracket.
- Strategy inspirations range from advanced metrics (KenPom/NET) to humorous heuristics (mascot fights, jersey colors).
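The "logically consistent" requirement above is mechanically checkable. A minimal sketch, assuming a bracket is represented as the list of winners per round (the representation is my choice, not the API's): each round must have half as many winners as the previous one, and every later-round pick must already have won its earlier round.

```python
def is_consistent(rounds: list[list[str]]) -> bool:
    """True when each round halves the field and every later-round pick
    appears among the previous round's winners."""
    for earlier, later in zip(rounds, rounds[1:]):
        if len(later) * 2 != len(earlier):
            return False
        if not set(later) <= set(earlier):
            return False
    return True

# Toy 8-team bracket: 4 + 2 + 1 = 7 picks instead of the real 63.
ok = [["A", "C", "E", "G"], ["A", "E"], ["A"]]
bad = [["A", "C", "E", "G"], ["A", "F"], ["A"]]  # F never won round 1
print(is_consistent(ok), is_consistent(bad))     # True False
```

The real contest adds two wrinkles this sketch ignores: the fixed Final Four pairings (East vs South, West vs Midwest) and the auto-replacement of losing play-in teams.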
Why it’s interesting
- A fun benchmark for agent tooling and autonomy: integrates with LLM coding environments, tests API use, planning over constraints, and follow-through.
- Encourages creative prompt/strategy engineering with transparent scoring and public visibility after the deadline.
Links
- Skills repo: github.com/bwadecodes/bracketmadness-skills
- API docs and endpoints: bracketmadness.ai (register, bracket fetch, submit, lock, status)
Based on the discussion, here is a summary of the comments:
Agent Design & Interfaces
Much of the conversation focused on the challenges of designing interfaces for software agents versus humans.
- API vs. GUI: User strjll discussed the difficulty of navigating human-centric UIs with agents, arguing that API-first design amounts to giving machines the clear, machine-readable interfaces they actually need. dplncsk suggested a system where agents hitting the homepage receive plain-text API instructions while humans see the standard visual site.
- Reliability: strjll shared insights on building reliable agents, recommending a split approach: use deterministic code for retrieval/search and LLMs for synthesis and verification, rather than forcing models to handle logic outside their narrow scope.
- Technical Hurdles: rpnd shared their own experience building a similar tool, noting that chatbots often struggle with URLs or authenticating POST requests. They resorted to building a remote MCP (Model Context Protocol), which the OP (bwade818) agreed was a strong approach given current chatbot limitations with direct API interactions.
Strategies & Methodology
- Data Sources: Users expressed interest in seeing how AI brackets compare to specific control groups (e.g., "highest seed wins," purely random, or human-produced brackets). spnkl wondered what data sources models would prioritize—stats, injury reports, or expert analysis.
- Specific Implementations: illusive4080 tasked Claude Opus with the challenge; the model conducted online research, selected strategic posts to emulate, and explained its methodology before submitting. zphyrn hoped to see strategies that go beyond single-turn analysis to implement meaningful trends over multiple reasoning turns.
Odds & Probability
nplk calculated the computational impossibility of storing all theoretically possible brackets (2^63), noting it would require exabytes of data, though practically, eliminating highly unlikely outcomes (like 16-seeds winning the championship) makes the search space manageable. The OP noted that a verified perfect bracket has never been achieved in the history of the tournament.
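nplk's arithmetic checks out. Under the assumption (mine, for illustration) that each bracket's 63 binary picks are packed into 8 bytes, storing every possible bracket takes exactly 2^66 bytes:

```python
# Sanity-check: storage needed for every theoretically possible bracket,
# assuming 63 binary picks packed into 8 bytes (63 bits, rounded up).
total_brackets = 2 ** 63
bytes_total = total_brackets * 8        # = 2**66 bytes
exbibytes = bytes_total / 2 ** 60       # 1 EiB = 2**60 bytes
print(f"{total_brackets:.2e} brackets -> {exbibytes:.0f} EiB")
# 9.22e+18 brackets -> 64 EiB
```

Sixty-four exbibytes, squarely in the "exabytes of data" range the commenter cited, and well beyond any practical storage budget.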
User Experience
lntn reported a smooth experience using Claude Code/Haiku to follow the documented process, though they found the "auto-locking" mechanism slightly ambiguous. Others, like nplk, recounted previous difficulties trying to automate bracket filling on major sports sites (ESPN/Disney) due to heavy login forms and clunky UIs, praising this project for being API-native.
Garry Tan's Claude Code Setup
Submission URL | 67 points | by alienreborn | 70 comments
YC’s Garry Tan open-sources gstack: an “AI software factory” for Claude Code
- What it is: An MIT-licensed skill pack that turns Claude Code into a virtual engineering org you manage via slash-commands. It coordinates 15 “roles” across the software lifecycle—CEO/product, eng manager/architecture, designer, paranoid reviewer, QA lead (with real browser checks), and release engineer.
- Why it matters: Tan claims he’s shipped 600k+ lines of production code in 60 days (35% tests), regularly doing 10k–20k usable LOC/day while running YC—arguing we’re at a point where one person can ship at a team-of-20 scale. It’s a concrete, reproducible setup rather than a demo video.
- How it works: Everything lives under .claude/skills/gstack. You drive work with commands like:
- /plan-ceo-review, /plan-eng-review, /plan-design-review, /design-consultation
- /review, /qa, /qa-only, /design-review, /ship, /retro, /debug, /document-release, /office-hours, /browse
- Emphasis on using gstack’s /browse skill and avoiding mcp__claude-in-chrome tools.
- Quick start: Clone and run setup, add a “gstack” section to CLAUDE.md, then try:
- /plan-ceo-review on a feature idea
- /review on a branch
- /qa on staging
- Promises a first useful run in under 5 minutes on repos with tests.
- Requirements: Claude Code, Git, Bun v1.0+. Optional: add gstack into your repo so the whole team gets the same skills. No PATH changes; skills are local.
- Who it’s for: Technical founders/CEOs who still ship, first-time Claude Code users who want structure over a blank prompt, and tech leads who want rigorous review/QA/release automation per PR.
- The pitch: Same person, different era—the difference is tooling. Fork it, improve it, make it yours.
Caveats to watch: Depends on Claude Code (proprietary), results hinge on test coverage and repo quality, and the LOC productivity claims are bold and will invite scrutiny. Still, it’s a notable open-source blueprint for practical agentic development today.
Discussion Summary:
The reception to Tan’s release was dominated by extreme skepticism regarding his specific productivity claims, mingled with concerns about code quality and armchair psychoanalysis of his work habits.
- The "600k LOC" Controversy: The claim of writing 600,000 lines of code in 60 days drew the most criticism. Users like tabs_or_spaces and PAndreew argued that LOC is a vanity metric and a "very weak proxy" for value. cldt noted that creating that much code in such a short time is technically a "liability," not an asset, suggesting the AI is likely generating unoptimized boilerplate that will be a nightmare to maintain.
- User Experience: One user who tested the tool, mdrx, provided a balanced review: they found the "Design" and browser automation skills genuinely useful for consistency but noted the "CEO" planning skill was ineffective. They famously flagged that the AI tends to have "huge blind spots" and wanders off-plan without strict oversight.
- "Why is the CEO doing this?": A significant sub-thread questioned why the head of Y Combinator is coding on 4 hours of sleep (nthngtshr, BigTTYGothGF). Comments ranged from concern about "mania" and burnout to critiques that he should be focused on running YC or addressing political issues rather than "LARPing" as a developer.
- Costs & astroturfing: input_sh and others speculated on the astronomical API costs required to hit those metrics (likely high 4-5 figures). Sherveen suggested that if this project weren't by the YC CEO, it wouldn't have reached the front page, calling the reception "engineered."
- Humor: The thread contained significant biting humor, with users comparing the "600,000 beautiful lines" rhetoric to Trump speeches (2kdjat) or noting the irony of an "AI software factory" producing code that no human can reasonably review in that timeframe.
Why refusing AI is a fight for the soul
Submission URL | 25 points | by donohoe | 7 comments
Rest of World interviews geographer Thomas Dekeyser about his new book, Techno-Negative: A Long History of Refusing the Machine, arguing today’s backlash to AI and data centers sits in a centuries-long tradition of resisting technologies that concentrate power and erode livelihoods.
Key points:
- Not technophobia: Dekeyser frames refusals as rational attempts to shape a different kind of progress, not to reject progress outright.
- Image shift in Big Tech: Early “do-gooder” branding (e.g., Google’s “Don’t be evil”) masked long-standing ties to military and geopolitical projects; he argues the ethical veneer has faded, with worker efforts failing to stop military/policing contracts and companies drifting rightward.
- Why vandalism and boycotts: From Waymo robotaxi torchings to 5G tower attacks and canceling ChatGPT accounts, he sees disillusion and disempowerment driving direct action when technologies feel omnipresent, inevitable, and serving the few (job loss, surveillance, climate costs) over the many.
- Global South resistance: Pushback from data workers in Africa and communities in Latin America is tied to “afterlives of colonialism”—being treated as cheap labor or raw data, and bearing environmental extraction for AI infrastructure.
- The deeper stake: AI doesn’t just change work; it narrows what counts as a meaningful life to efficiency, speed, and “intelligence,” a vision many reject.
Bottom line: Today’s AI skepticism is less a fear of machines than a contest over who benefits, who pays the costs, and what kind of human future counts as progress.
Discussion: The comment section reflects a sharp divide between survival anxiety and technological skepticism. One thread debated the economic stakes of an "AI mandatory future," oscillating between hopes that agricultural robots might make food affordable and dark warnings that ChatGPT cannot fix starvation, with one user citing Mike Tyson ("everyone has a plan until they get hit") to describe potential social collapse. Others questioned the narrative that efficiency equals meaning or that tech dominance is inevitable—referencing failed hype cycles like Uber—while a detractor dismissed the article's premise as merely "old woman yelling at AI."
Nvidia's DLSS 5 uses generative AI to boost photorealism in video games
Submission URL | 20 points | by ianrahman | 29 comments
Nvidia unveils DLSS 5: generative AI fused with 3D graphics
- What’s new: At GTC, Jensen Huang introduced DLSS 5, which blends traditional 3D scene data with generative AI models that predict and fill in parts of each frame. The goal: more photorealistic scenes and lifelike characters while using less GPU compute than fully rendering every element.
- The pitch: Huang framed it as merging “predictive” (structured 3D graphics) with “probabilistic” (gen AI) methods to keep visuals both controllable and realistic. He argued this fusion will echo across industries, not just games.
- Beyond gaming: Huang pointed to enterprise platforms like Snowflake, Databricks, and BigQuery as structured datasets future AI agents will work over, alongside unstructured/generative data—calling structured data the foundation of “trustworthy AI.”
- Why it matters: Signals DLSS evolving from upscaling/frame-gen into content synthesis inside the rendering pipeline, potentially lowering costs while pushing fidelity. It also shows Nvidia positioning gaming tech as a template for broader AI-driven computing.
- Open questions: Release timing, GPU support, developer controls/guardrails for determinism, and implications for competitive/anti-cheat scenarios remain unclear.
Source: TechCrunch (Rebecca Bellan)
Discussion Summary:
- Artistic Integrity vs. "AI Slop": The strongest critique focused on the loss of original artistic intent. Users argued that the tech replaces careful lighting and tone mapping with generic, inconsistent generative visuals, comparing the effect to "deepfakes" or controversial remasters like Halo Anniversary.
- Technical Limitations: Skepticism abounded regarding the practicality of the technology, specifically concerns about added latency (input lag), massive power consumption (overhead), and temporal consistency (e.g., whether an AI-generated character face would remain consistent over 20 hours of gameplay).
- The "Tech Demo" Defense: While some commenters dismissed the visual artifacts as merely early "tech demo" jitters, others argued that the "uncanny valley" look and resource demands effectively make the concept dead on arrival for real-time applications.
- Use Cases: A minority of users felt the technology represented a genuine leap forward for photorealism (particularly for skin and shadows) and suggested its best use case might be revitalizing legacy titles (e.g., Skyrim, Witcher 1) rather than strictly new releases.
- Nvidia's Priorities: A sidebar discussion debated Nvidia's current annual revenue split, with users arguing over whether the company still prioritizes the gaming market—some estimating gaming revenue as low as 3%, with others correcting it to ~14-18%.
AI still doesn't work well, businesses are faking it, and a reckoning is coming
Submission URL | 72 points | by samizdis | 26 comments
Former PwC consultants Dorian Smiley and Connor Deeks, now running AI advisory shop Codestrap, argue enterprises are overhyping AI while quietly faking maturity. In an interview with The Register, they say there’s no playbook yet—so companies should slow down, experiment, and build real feedback loops.
Key points:
- Measuring the wrong things: Lines of code and PR counts are liabilities, not indicators of engineering excellence. Look to DORA-style metrics (deployment frequency, lead time, change failure rate, MTTR, incident severity) and develop new AI-specific ones.
- New metric idea: “Tokens burned per approved PR” to quantify whether AI actually improves delivery.
- Cautionary tale: An AI-assisted attempt to rewrite SQLite in Rust produced 3.7x more code and ran ~2,000x slower—even while passing unit tests. Superficial success masked a non-viable result.
- Technical limits matter: LLMs are hard to update reliably, non-deterministic, and can’t check their own work—expect code quality issues to surface at scale.
- Beyond code: Millions of lines of AI-generated content won’t be reviewed; consulting deliverables already backfiring (e.g., Deloitte refund in Australia over AI-generated errors). Expect outages and lawsuits as quality gaps emerge. Amazon/AWS outages are cited as harbingers (Amazon denies any AI link).
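The proposed "tokens burned per approved PR" metric is straightforward to compute once token usage is attributed per PR. A minimal sketch, with field names that are illustrative rather than from any real tracking tool: the key design choice is dividing total spend (including rejected work) by approved PRs only, so wasted tokens inflate the number.

```python
from dataclasses import dataclass

@dataclass
class PullRequest:
    approved: bool
    tokens_used: int  # AI tokens attributed to this PR (illustrative field)

def tokens_per_approved_pr(prs: list[PullRequest]) -> float:
    """Total tokens across ALL PRs divided by approved PRs only, so tokens
    burned on rejected or abandoned work raise the metric."""
    approved = sum(1 for p in prs if p.approved)
    if approved == 0:
        return float("inf")
    return sum(p.tokens_used for p in prs) / approved

prs = [PullRequest(True, 40_000), PullRequest(False, 90_000),
       PullRequest(True, 10_000)]
print(tokens_per_approved_pr(prs))  # 70000.0
```

In this toy data the rejected PR nearly triples the metric versus counting approved work alone, which is exactly the signal the consultants want surfaced.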
Bottom line: Dial down the hype. Build measurement and controls first, then scale.
Based on the discussion, commenters engaged in a debate regarding the specific technical limitations cited in the article and the broader trajectory of AI development.
The "SQLite Rewrite" Controversy
A significant portion of the thread dissected the article's anecdote about an AI attempting to rewrite SQLite in Rust. Skeptics argued that while the code passed unit tests, the fact that it ran 2,000x slower renders the result a "dumpster fire," proving that AI produces superficial success that is non-viable in production. Defenders argued that critics are shifting goalposts—moving from "AI can't generate code" to "AI generates inefficient code"—and that a native Rust version is still a valid attempt at progress.
Plateaus vs. Engineering Progress
Users debated whether LLMs have hit a performance plateau.
- The Skeptical View: Several users argued that current improvements rely solely on throwing more hardware (RAM/Compute) at the problem. They contend that without a fundamental breakthrough in how models learn abstract concepts (rather than just linear prediction within a context window), returns will be diminishing and logarithmic.
- The Optimistic View: Counter-arguments posited that "adding more material" (scaling hardware) is a valid form of engineering progress, comparable to strengthening a bridge. They argued that historical trends suggest progress will continue, even if it is incremental rather than exponential.
Real-World Liability and Hype
The discussion also touched on the practical risks of deployment:
- Insurance: One user noted that insurance underwriters are already attempting to exclude AI tools from liability coverage, creating a complex chain of responsibility.
- Fabrication Failure: A commenter shared an anecdote from the steel fabrication industry where a manager used AI to design a ramp structure; the AI hallucinated the calculations, resulting in a flawed design that had to be physically remedied by human fabricators.
- Hype Fatigue: Users compared the current climate to previous tech bubbles, warning that overselling AI now will lead to a "boy who cried wolf" scenario where future, viable advancements are ignored due to current burnout.