AI Submissions for Thu Jul 02 2026
The short leash AI coding method for beating Fable
Submission URL | 175 points | by Riseed | 219 comments
-
What it is: A human-in-the-loop workflow for using AI coding agents to build high-quality, security-critical software. It emphasizes tight control via permissioned diffs and frequent human intervention, aiming to outperform “frontier” models’ default output in quality and direction.
-
Problems with current approaches:
- “Vibe”/orchestrated multi-agent systems drift, produce inefficiency/ugliness, and erode the developer’s understanding of the codebase.
- Even strong models (e.g., Fable 5) often generate working but subpar code, especially in niche domains with sparse training data.
- Removing humans from the loop leads to late discovery of misdirection and higher rework.
-
How it works (Short Leash method):
- Plan first: research, define a stepwise task breakdown (e.g., via a tasks skill), track progress.
- Never use “YOLO”/skip-permissions modes; the agent must present a diff before changes.
- The developer stays present, inspects every proposed diff, and denies permission whenever direction or quality drifts.
- Use the permission diffs to both maintain situational awareness of the codebase and constrain the agent.
- Intervene frequently; do not let the agent run unattended.
- Commit at the end of each subtask to prevent regressions or accidental deletions (observed in practice with models like Opus).
- Conclude with a review phase.
-
How to do AI reviews:
- Every PR gets both AI and human review; treat the AI as a fast linter that catches common issues, while humans handle higher-level design and direction.
- Provide the AI sufficient context: the issue, PR description, codebase, and diff.
- Use the best available models for review.
- Include an “AI Disclosure” in the PR description listing exact models used during development. This informs maintainers, invites suggestions for stronger models, and signals transparency.
- The PR author must self-review line-by-line as if reviewing someone else’s code, then explicitly approve before requesting maintainer review.
-
Why it matters:
- Maintains codebase comprehension and control while leveraging AI for speed.
- Reduces off-rails changes, enforces incremental quality, and yields more reliable outcomes than hands-off or mass-orchestration setups.
-
Caveats:
- Intended for professional developers who can out-reason the model in their domain.
- Requires discipline and time-on-task; it is not a “set and forget” automation approach.
- Claims are based on the author’s practice, including custom review tools and a maintained fork of an agent (Crush).
-
Skepticism of the "short leash" method: Several commenters argued that heavy hand-holding is a crutch for insufficient initial prompting. They suggested that micromanaging advanced models is inefficient, advocating instead for using them as sounding boards to refine system designs through iterative, high-level discussions.
-
Debate over AI reasoning and "nuanced discussions": A major point of contention was whether users can genuinely discuss code decisions with an LLM. Skeptics pointed out that when asked why a change was made, models often hallucinate plausible-sounding, post-hoc justifications rather than admitting ignorance. Defenders clarified that productive discussion involves co-designing and asking for critiques, rather than interrogating the model's past actions.
-
Comparisons to human cognition: The models' tendency to fabricate explanations sparked a philosophical tangent comparing LLMs to human psychology. Users debated whether human consciousness operates similarly by constructing acceptable post-hoc justifications for decisions made unconsciously.
-
Access to chain-of-thought: The thread touched on the technical mechanics of AI reasoning, specifically <thinking> blocks. While analyzing an LLM's trace could theoretically help developers understand its decisions, commenters noted that frontier labs often hide these traces, and the output is often an unstructured blob optimized by reinforcement learning rather than a reliable logical map.
-
Conflicting capability assessments: Anecdotal experiences with current models varied wildly. While one commenter claimed models like Fable perform better than staff engineers, others argued that without exhaustive context, the models regularly write duplicative code and fail to utilize existing codebase abstractions.
The takeaway: While the original submission advocates for tight control to rein in AI drift, the discussion highlights a clear divide between developers who treat models as unreliable junior coders requiring strict micromanagement, and those who view them as capable architecture partners that thrive on high-level iteration.
Claude-real-video - any LLM can watch a video
Submission URL | 151 points | by cortexosmain | 52 comments
-
What it is: A local pipeline that turns any online or local video into a compact, LLM-friendly bundle: key frames (JPGs), a transcript (text), and a MANIFEST.txt. No cloud upload; it runs entirely on your machine.
-
What’s different: Instead of fixed-interval frame grabs (e.g., 1 fps), it detects scene changes and enforces a minimum sampling density. It also deduplicates visually similar frames with a sliding window, so repeated shots aren’t re-sent after cutaways. Result: fewer, more meaningful frames and cheaper context.
-
How it works
- Fetch: Uses yt-dlp for URLs (supports cookies) or copies a local file.
- Extract: One ffmpeg pass selects frames at scene changes plus a floor of at least one frame every --fps-floor seconds.
- Dedup: Pixel-difference on downscaled RGB (not perceptual hash), comparing against the last --dedup-window kept frames to avoid A-B-A repeats.
- Text: Prefers existing subtitles (.srt/.vtt or embedded). Falls back to Whisper transcription only if no subtitles are present.
- Audio (optional): --keep-audio saves the full original soundtrack (audio.m4a) for models that can listen.
- Manifest: Writes MANIFEST.txt to guide the LLM across frames, transcript, and (optionally) audio.
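The sliding-window dedup step can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the tool's actual code: frames are modeled as flat lists of grayscale values (the tool compares downscaled RGB), and `changed_percent`, `dedup_frames`, and the per-pixel tolerance `pixel_tol` are hypothetical names. Only the threshold of 8 percent and the window of 4 mirror the tool's listed defaults.

```python
from collections import deque


def changed_percent(a, b, pixel_tol=10):
    """Percent of pixel positions whose values differ by more than pixel_tol.

    pixel_tol is an assumed noise tolerance, not a documented option.
    """
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and the same size")
    changed = sum(1 for x, y in zip(a, b) if abs(x - y) > pixel_tol)
    return 100.0 * changed / len(a)


def dedup_frames(frames, threshold=8.0, window=4):
    """Return the indices of frames to keep.

    A frame is kept only if it differs from every one of the last
    `window` kept frames by at least `threshold` percent of pixels.
    """
    recent = deque(maxlen=window)  # most recently kept frames
    kept = []
    for i, frame in enumerate(frames):
        if all(changed_percent(frame, k) >= threshold for k in recent):
            kept.append(i)
            recent.append(frame)
    return kept


# Two alternating shots, A-B-A-B-A (synthetic data for illustration).
A = [0] * 100
B = [255] * 100
frames = [A, B, A, B, A]

print(dedup_frames(frames, window=4))  # [0, 1]
print(dedup_frames(frames, window=1))  # [0, 1, 2, 3, 4]
```

With a window of 1, only the immediately preceding kept frame is checked, so each return to shot A after a cutaway looks new. A wider window remembers A and drops the repeat.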
-
Output: crv-out/frames/*.jpg, crv-out/transcript.txt, crv-out/MANIFEST.txt; optional audio.m4a. With --report, also keeps dropped frames and a report.html visualizing keep/drop decisions and diff percentages.
-
Usage
- crv "https://www.instagram.com/reel/XXXX/"
- crv lecture.mp4 -o out --lang en
- crv clip.mp4 --no-transcribe
- crv "https://..." --cookies cookies.txt (for login-gated sources)
- Python API: from claude_real_video import process; process("https://youtu.be/...", "out", lang="en")
-
Key options (defaults)
- --scene 0.30: Scene-change sensitivity (lower = more frames)
- --fps-floor 1.0: At least one frame every N seconds
- --max-frames 150: Hard cap on total frames
- --dedup-threshold 8: Percent of pixels that must change to count as new (higher = fewer frames)
- --dedup-window 4: Compare against last N kept frames (1 = consecutive-only)
- --lang auto: Whisper language (en, zh, auto, …)
- --report off, --no-transcribe off, --keep-audio off
-
Requirements
- Python 3.10+
- ffmpeg/ffprobe on PATH (macOS: brew install ffmpeg; Linux: apt/distro pkg; Windows: winget/choco or manual)
- For transcription: whisper CLI (openai-whisper), which also uses ffmpeg
- Works on macOS, Windows, Linux
-
Why it matters: Compared with naive fixed-interval frame sampling, this approach better captures fast cuts, collapses static slides, and reduces redundancy, producing a smaller, more informative context for LLMs.
-
Caveats
- The default --max-frames 150 may truncate very long or highly dynamic videos; tune options as needed.
- Quality of fallback transcription depends on Whisper and audio quality; supplied subtitles are preferred.
- Cookie-based fetching is supported but requires a Netscape-format cookie file.
-
Gemini as the preferred alternative: A major portion of the discussion centered on Google's Gemini, which many commenters argued is fundamentally better suited for this task. Users highlighted that Gemini natively processes video files, analyzes more than just transcripts, and is highly token-efficient (costing roughly $0.24 per hour of video with Flash Lite).
-
Disputing model limitations: Several users pushed back on the creator's premise that Claude and ChatGPT cannot internally process video files. Multiple commenters shared anecdotes of successfully uploading videos to these platforms or using agent orchestrators (like Claude Code) to get accurate frame-by-frame analyses without third-party preprocessing.
-
The limits of frame-based analysis: Commenters noted that extracting keyframes inherently strips out true motion and object permanence. Practical experiments shared in the thread showed models struggling to infer animations, scene liveliness, or specific sprite placements from contact sheets unless accompanied by plain-text descriptions.
-
Project naming: Multiple commenters suggested removing "Claude" from the tool's name. They argued that a generic name (like llm-real-video) would better reflect the project's broad utility as a preprocessor for any vision-capable LLM.
-
Privacy caveat: A few users clarified the "stays on your machine" pitch. While the extraction pipeline is local, passing the resulting frames to Anthropic's API ultimately means the data leaves the user's machine.
-
Traditional CV vs. LLMs: A user's idea to use the tool for reading battery voltage meters sparked a brief debate. Critics called it an over-engineered abandonment of basic problem-solving, suggesting that deterministic computer vision libraries remain far more efficient for reading gauges than pointing massive GPU stacks at the problem.
The takeaway: While the community praised the tool as a clever, model-agnostic way to optimize token usage and deduplicate slides, many felt the core problem is already being solved natively—and more affordably—by multimodal models like Gemini.
Spain Orders Blacklist of Palantir from Public and Private Companies
Submission URL | 702 points | by mgh2 | 283 comments
-
What happened: Spain instructed state-controlled entities to blacklist Palantir and halt future contracting, citing concerns about potential misuse of classified information and risks to national sovereignty.
-
Who’s affected: Companies overseen by SEPI, including Telefónica, Indra, and Navantia. The move has disrupted procurement, including a near-finalized Navantia project, and a planned Guardia Civil collaboration reportedly vetoed by Interior Minister Fernando Grande-Marlaska.
-
Scope and limits: The restrictions cover public and private state-controlled firms, but Palantir still holds a €16.5 million Ministry of Defense contract (signed 2023) with CIFAS that expires in November. Military leadership has urged Defense Minister Margarita Robles to renew it; a decision from Moncloa is pending.
-
European context: The action aligns with broader European pushback. France announced on June 10 it would cease working with Palantir, and German cyberdefense bodies and intelligence services are favoring European alternatives such as the French competitor ChaosVision.
-
Geopolitical angle: The blacklist coincides with tensions between Prime Minister Pedro Sánchez and the incoming U.S. administration. The report notes Palantir’s leadership has ties to Donald Trump, seen as at odds with Madrid’s diplomatic stance.
-
Domestic alternatives: Spain is accelerating investment in local platforms to preserve data sovereignty, including €115 million for Catalan firm Openchip as part of a larger €5 billion SEPI Digital–backed gigafactory initiative.
-
What to watch: Whether the Defense contract is renewed before November; potential spillover to other Spanish and EU defense/telecom contracts; pace and capability of domestic replacements.
-
Hypocrisy or pragmatism: Commenters heavily debated the logic of Spain blocking Palantir while simultaneously utilizing Huawei hardware for domestic data storage. This sparked a broader geopolitical argument weighing the espionage and political influence risks associated with the US and Israel against the threats posed by China and Russia.
-
Clarifying the Huawei contract: Several users corrected the assumption that Spain is sending state intelligence directly to Chinese servers. They noted that Spain simply purchased physical storage hardware from Huawei, which is housed domestically and managed by the Spanish Interior Ministry, though critics argued physical access and hardware-level risks remain.
-
The hurdles to data sovereignty: Participants discussed why European nations constantly default to foreign tech rather than relying on domestic equivalents like the Spanish firm Indra. Commenters pointed out that unwinding reliance on established global vendors to build and transition to home-grown infrastructure requires massive, often unpalatable economic investment.
-
International parallels: A few users pointed out similar political pushback elsewhere, noting that UK figures like the mayor of Greater Manchester have also completely avoided granting municipal contracts to Palantir.
The takeaway: While there is broad support in the thread for European technical autonomy, commenters remain highly divided on whether utilizing Chinese hardware is a safer interim step than relying on American intelligence software.
NSA tries to weaken mlkem standardisation?
Submission URL | 142 points | by SuperSandro2000 | 89 comments
The title implies an allegation that the NSA is attempting to influence or weaken the standardization of ML-KEM. At a high level, this would raise concerns about the integrity and transparency of the standardization process, the possibility of reduced security assurances in widely adopted cryptographic mechanisms, and the downstream risk to users and organizations that depend on standardized algorithms. It likely calls for closer scrutiny of decision-making, clearer rationale for design choices, and broader expert review to maintain trust in the resulting standards.
- DJB's tactics and working group drama: Commenters were sharply divided on Daniel J. Bernstein's (DJB) conduct in the IETF TLS working group. Critics accused him of driving the group into dysfunction with disruptive mailing list tactics, noting he has been moderated multiple times. Defenders argued these moderation actions are bureaucratic attempts to silence well-founded technical criticisms, framing DJB as a necessary, if combative, expert voice.
- The shadow of NSA history (NOBUS): Much of the discussion revolved around the plausibility of NSA interference. Many users pointed to the NSA's history of intentional cryptographic weakening—specifically NOBUS (Nobody But Us) and Dual_EC_DRBG—as proof that DJB's suspicions are inherently justified. Skeptics countered that there is no known NOBUS-style avenue against ML-KEM, dismissing the NSA meddling allegations as unfounded conspiracy theories.
- Pure ML-KEM vs. Hybrids: Participants debated the technical justifications for standardizing a "pure" (non-hybrid) ML-KEM specification. Proponents of the newly proposed draft clarified that it is explicitly marked "Recommended: N" and is intended strictly for constrained hardware environments that cannot support the overhead of hybrid schemes (e.g., running both SHA2 and SHA3, or carrying ECC components that might soon be vulnerable). Critics countered that publishing the standard at all legitimizes its broader use, aligning with alleged NSA procurement rules that shun hybrids.
- The risk of delaying standards: A counter-argument emerged suggesting that if a state actor already possesses advanced quantum capabilities, their primary goal would be to delay the global transition to post-quantum cryptography. Several users pointed out that intense procedural objections—like those currently stalling the working group—inadvertently achieve this delay.
The takeaway: The thread highlights a deep tension between practical engineering for constrained environments and a lingering, historically informed mistrust of state intelligence agencies, with DJB’s activism serving as a highly polarizing catalyst.
Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers
Submission URL | 177 points | by matt_d | 112 comments
-
What it is: An open-source benchmark to evaluate software agents as senior engineers by assigning realistic feature builds and bug fixes. Tasks are written as natural-language messages (not over-specified checklists) and emphasize investigation, judgement, and code quality.
-
What’s new: A validation agent uses expert-designed recipes to generate behavioral tests that adapt to each submitted solution. Scoring combines runtime correctness with code quality “taste” metrics aligned to observed codebase practices, and can verify unstated, load‑bearing conventions.
-
How it works
- Feature tasks: Natural-language instructions approximating real PM/dev messages rather than rigid specs.
- Bug tasks: Derived from PRs that required significant runtime investigation (e.g., starting services, inspecting logs, profiling, reproductions).
- Evaluation: Runtime tests plus quality metrics; verifiers/validation can check implicit codebase practices.
-
Example task (excerpt): “Add Google Books as a metadata source to BookWorm for fallback/staging imports”
- Recognize “google_books” in STAGED_SOURCES so staged metadata is processed.
- Stage URL format: http://{affiliate_server_url}/isbn/{identifier}?high_priority=true&stage_import=true
- When supplementing records, extend (not replace) existing source_records.
- Implement stage_from_google_books to fetch via Google Books API and persist to a batch (Batch.add_items).
- Affiliate server: For ISBN-13, if Amazon returns no result and both high_priority=true and stage_import=true are set, fall back to Google Books.
- If Google Books returns multiple results for a single ISBN, log a warning and skip staging.
- Parse and stage at least: isbn_10, isbn_13, title, subtitle, authors, source_records, publishers, publish_date, number_of_pages, description.
- Update promise import flow to use stage_bookworm_metadata instead of Amazon-only logic.
- New public functions: fetch_google_book (returns raw JSON) and process_google_book (normalizes to Open Library edition fields).
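To make the decision rules in this task concrete, here is a small sketch of the fallback and skip conditions only. It is an illustration under assumptions, not a reference solution: every helper name below (`is_isbn13`, `should_fall_back_to_google_books`, `pick_google_books_result`, `merge_source_records`) is hypothetical, the ISBN is a placeholder, and no real API is called.

```python
import logging

logger = logging.getLogger("staging_sketch")


def is_isbn13(identifier):
    """Crude shape check: 13 digits. Checksum validation is omitted."""
    return len(identifier) == 13 and identifier.isdigit()


def should_fall_back_to_google_books(identifier, amazon_result,
                                     high_priority, stage_import):
    """Fall back only for ISBN-13, when Amazon found nothing and both flags are set."""
    return (
        is_isbn13(identifier)
        and amazon_result is None
        and high_priority
        and stage_import
    )


def pick_google_books_result(identifier, results):
    """Return the single result to stage, or None to skip staging."""
    if len(results) > 1:
        logger.warning(
            "Google Books returned %d results for ISBN %s; skipping staging",
            len(results), identifier,
        )
        return None
    return results[0] if results else None


def merge_source_records(existing, new):
    """Extend (not replace) existing source_records, avoiding duplicates."""
    return existing + [r for r in new if r not in existing]


isbn = "9780000000002"  # placeholder identifier
print(should_fall_back_to_google_books(isbn, None, True, True))   # True
print(should_fall_back_to_google_books(isbn, None, False, True))  # False
print(pick_google_books_result(isbn, [{"title": "A"}, {"title": "B"}]))  # None
print(merge_source_records(["amazon:X"], ["google_books:" + isbn]))
```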
-
Why it matters: Moves beyond junior-style, over-specified benchmarks to assess behaviors expected of senior engineers—working from ambiguous instructions, performing runtime debugging, and producing code that fits the project’s standards, not just passing narrow tests.
-
Adversarial evaluation: Inspired by the benchmark's design, users explored pitting LLMs against each other in an Elo system where models generate tests specifically to break rival models. Commenters who have experimented with this approach noted that while intriguing, models inevitably default to "degenerate" or unsolvable tasks (like asking for the input to a SHA256 hash). Suggested mitigations included requiring the generating model to solve its own problem or anchoring question viability against human solver baselines.
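For readers unfamiliar with Elo, the rating update the commenters describe might look like the sketch below. It uses the standard Elo formulas. The K-factor of 32, the starting rating of 1500, and the `round_score` rule (void the round unless the generating model solves its own task, one of the mitigations suggested in the thread) are assumptions for illustration.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(r_a, r_b, score_a, k=32.0):
    """Return updated (r_a, r_b) after one round; score_a is 1, 0.5, or 0."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


def round_score(generator_solved_own_task, rival_solved_task):
    """Score for the task generator, or None if the round is void.

    Voiding rounds where the generator cannot solve its own task is
    meant to filter out degenerate or unsolvable problems.
    """
    if not generator_solved_own_task:
        return None
    return 0.0 if rival_solved_task else 1.0


generator, rival = 1500.0, 1500.0
score = round_score(generator_solved_own_task=True, rival_solved_task=False)
if score is not None:
    generator, rival = update_elo(generator, rival, score)
print(generator, rival)  # 1516.0 1484.0
```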
-
Data contamination risks: A few commenters questioned the benchmark's longevity since it relies on actual open-source pull requests. They pointed out that as models continuously scrape recent code, they will likely memorize these exact PRs. Attempting to rotate in fresh problems post-knowledge-cutoff would continuously break historical comparability across model updates.
-
The debate over underspecified prompts: The benchmark’s focus on ambiguous, natural-language tasks sparked a sharp debate on developer workflows. Some engineers argued that relying on vague prompts is an anti-pattern that shifts ambiguity into the model’s silent assumptions, forcing users to waste time unwinding errors. They advocated for workflows where models are prompted to interrogate the user for missing requirements. Conversely, others maintained that writing exhaustive specifications takes longer than just writing the code manually, arguing that an AI's true value lies in successfully inferring intent and filling gaps to save time.
-
Model performance anecdotes: Examining the benchmark naturally led to subjective comparisons, primarily between hypothetical/future iterations of models like "Opus 4.8" and "GPT 5.5." Consensus was split: proponents of Opus praised its ability to handle underspecified requirements and frontend design, while GPT defenders claimed it is vastly superior for strict instruction-following, code reviews, and mechanical refactoring speeds.
The takeaway: Assessing senior-level engineering capabilities in AI is widely supported, but the community is deeply divided over whether models should be judged on their ability to obediently follow highly detailed specs or their intuition in navigating ambiguous, low-effort prompts.
AI can't be listed as inventor on patent applications, Japan's top court rules
Submission URL | 387 points | by mushstory | 207 comments
-
What it is: A ruling that artificial intelligence cannot be named as an inventor on patent applications in Japan.
-
Why it matters: Confines legal inventorship to humans, shaping how AI-assisted innovations are attributed and filed.
-
Practical impact: Applicants using AI in R&D must list human inventors; applications that name AI are likely to be rejected; organizations may need processes to document human contributions when AI tools are used.
-
Open questions: How to determine inventorship when AI plays a significant role; whether and how AI use must be disclosed; any effects on ownership or enforcement are not indicated by the title.
-
The necessity of patents: The conversation was dominated by a broader debate over the economic value of intellectual property, anchored by references to the book Against Intellectual Monopoly. Some users argued there is little empirical evidence that patents actually boost innovation, welcoming the idea of phasing them out entirely as AI complicates traditional inventorship.
-
Pushback on anti-patent literature: Several commenters strongly criticized the cited anti-patent book, accusing the authors of cherry-picking historical examples (such as the development of the steam engine) and ignoring economic studies that demonstrate the beneficial impacts of the patent system.
-
The pharmaceutical R&D dilemma: A major sub-thread weighed how a patentless world would affect drug development. Defenders of the current system argued that the massive costs of late-stage clinical trials require patent exclusivity to prevent competitors from free-riding. Abolition advocates countered with historical examples (e.g., Italy and Switzerland prior to 1978) showing no drop in innovation without patents, suggesting that first-mover advantages, marketing, and the slow, expensive process of reverse-engineering provide sufficient market protection.
-
Historical intent vs. modern realities: Participants noted that the original purpose of patents was to incentivize public disclosure so that inventions wouldn't be lost as trade secrets when inventors died. While some suggested modern reverse-engineering lessens this need, others pointed out that without monopoly protections, pharma currently abandons many promising but unpatentable molecules.
The takeaway: Rather than focusing on the specific legal mechanisms of AI inventorship in Japan, the discussion pivoted into a fundamental economic debate over whether the global patent system remains a necessary incentive for expensive research or an outdated monopoly.
Is One Layer Enough? A Single Transformer Layer Matches Full-Parameter RL Train
Submission URL | 149 points | by tcp_handshaker | 40 comments
-
What it is: A layer-wise study of RL post-training for LLMs that introduces "layer contribution"—the fraction of full-parameter RL improvement recovered when only a single transformer layer is trained.
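One plausible reading of that definition, written out as code. The exact formula is an assumption inferred from the summary, and the benchmark scores below are made-up numbers, not results from the paper.

```python
def layer_contribution(base_score, single_layer_score, full_score):
    """Fraction of the full-parameter RL gain recovered by training one layer.

    Values near 1.0 mean the single layer recovers almost all of the gain;
    values above 1.0 mean it surpasses full-parameter training.
    """
    full_gain = full_score - base_score
    if full_gain == 0:
        raise ValueError("full-parameter RL produced no gain to compare against")
    return (single_layer_score - base_score) / full_gain


# Hypothetical scores: base model, full-parameter RL, and three
# runs that each trained only one layer (keyed by layer index).
base, full = 40.0, 60.0
single_layer_scores = {4: 45.0, 14: 58.0, 27: 43.0}

contributions = {
    layer: layer_contribution(base, score, full)
    for layer, score in single_layer_scores.items()
}
ranked = sorted(contributions, key=contributions.get, reverse=True)
print(contributions)
print(ranked)  # [14, 4, 27]
```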
-
Key findings:
- Training just one transformer layer recovers most of the gains from full-parameter RL, and can sometimes surpass it.
- High-contribution layers consistently cluster in the middle of the stack; layers near the input and output contribute much less.
- Layer contribution rankings are strongly correlated across datasets, tasks, model families, and RL algorithms.
-
Setup: Evaluated across seven models in two families (Qwen3, Qwen2.5), three RL algorithms (GRPO, GiGPO, Dr. GRPO), and multiple domains (mathematical reasoning, code generation, agentic decision-making).
-
Why it matters: Indicates RL adaptations are highly localized, suggesting parameter- and memory-efficient RL post-training may be achievable by targeting a small subset (even a single) of layers, and offering guidance on where to focus updates.
-
Caveats: Results are reported for specific model families and RL methods; details on selection strategies for the high-contribution layer(s) and generalization to other architectures are not provided in the abstract.
-
Intuitive layer roles: Commenters broadly agreed the results align with the mental model of transformer mechanics: early layers parse syntax, late layers manage vocabulary and grammatical flow, and the middle layers execute the abstract reasoning and concept manipulation that reinforcement learning typically targets. One user hypothesized this middle-layer dominance might not hold for basic instruction-tuning, which focuses heavily on surface-level text phrasing.
-
Theoretical explanations: A technical debate emerged over whether transformers function as "autoencoders on steroids." Some users argued that because middle layers represent the expanded data manifold, a single layer or function pass in the middle is inherently sufficient to redirect the model's output. Others pushed back, pointing out technical distinctions between decoder-only latent representations and traditional autoencoder compression.
-
Practical challenges: RL practitioners cautioned that because RL post-training is notoriously fragile—prone to reward hacking, KL collapse, and out-of-distribution rollouts—introducing the variable of selecting a specific layer could make debugging significantly harder compared to using established parameter-efficient methods like LoRA.
-
Connections to other interventions: Readers linked the findings to similar experiments manipulating middle layers, including the "Repeat Yourself" technique (looping inner layers to simulate reasoning), models computing entirely in latent space ("neuralese"), and a recent winning Kaggle strategy that relied on splicing and duplicating middle layers.
The takeaway: While strictly freezing all but one layer may introduce debugging complexities in already fragile RL pipelines, the consensus is that fine-tuning budgets and learning rates should naturally focus on the middle of the stack where conceptual manipulation occurs.
Show HN: CLI tool for detecting non-exact code duplication with embedding models
Submission URL | 89 points | by rkochanowski | 48 comments
-
What it is: A lightweight CLI that finds non-exact code duplication by embedding code units and surfacing clusters of similar snippets that are often far apart in a codebase.
-
How it works:
- Computes an embedding for each code unit and searches for pairs with close embeddings (cosine similarity).
- Forms clusters ranked by similarity and by distance in the codebase (farther-apart duplicates are boosted).
- Outputs clusters as candidates; similar code isn’t always a real duplicate, and code that’s functionally identical but implemented very differently won’t be detected.
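The pairing and ranking idea above can be sketched as follows. This is an illustration only, not Slopo's implementation: it ranks pairs rather than forming clusters, the embeddings are toy 3-dimensional vectors, and the distance measure (`path_distance`) and additive boost (`boost_per_step`) are hypothetical stand-ins for whatever the tool actually uses.

```python
import math
from itertools import combinations
from pathlib import PurePosixPath


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def path_distance(p, q):
    """Assumed distance: 0 = same file, 1 = same directory, 2 = elsewhere."""
    if p == q:
        return 0
    if PurePosixPath(p).parent == PurePosixPath(q).parent:
        return 1
    return 2


def candidate_pairs(units, similarity_threshold=0.9, boost_per_step=0.02):
    """Return (score, name_a, name_b) tuples, highest score first.

    Pairs below the cosine threshold are dropped; survivors get a small
    bonus the farther apart they sit in the codebase.
    """
    results = []
    for (name_a, path_a, vec_a), (name_b, path_b, vec_b) in combinations(units, 2):
        sim = cosine(vec_a, vec_b)
        if sim < similarity_threshold:
            continue
        score = sim + boost_per_step * path_distance(path_a, path_b)
        results.append((score, name_a, name_b))
    return sorted(results, reverse=True)


# Toy code units: (name, file path, embedding).
units = [
    ("parse_py", "parsers/python.py", [1.0, 0.0, 0.10]),
    ("parse_js", "parsers/javascript.py", [1.0, 0.0, 0.12]),
    ("load_cfg", "config/load.py", [0.0, 1.0, 0.0]),
    ("parse_go", "vendor/go.py", [1.0, 0.0, 0.10]),
]

for score, a, b in candidate_pairs(units):
    print(f"{score:.4f}  {a} <-> {b}")
```

The cross-directory pairs rank above the same-directory pair even though their raw similarities are nearly equal, which is the effect the distance boost is meant to produce.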
-
Supported languages: Python, TypeScript, JavaScript, Java, Kotlin, C#, Go, Rust, PHP, Elixir.
-
Embedding model:
- Uses external providers via LiteLLM-compatible APIs (e.g., code-focused models like Voyage AI; lower dimensions like 512 are acceptable).
- API key can be supplied via SLOPO_EMBEDDING_API_KEY or a .env file.
-
Workflow:
- Incremental re-indexing (only changed files).
- Review clusters and add their hashes to slopo.ignore.txt to suppress them in future runs.
- Designed to pair with an AI coding agent to filter false positives and assist with refactoring.
- Commit ignore/config files (omit API key); don’t commit slopo.db.
-
Usage:
- Install: uv tool install slopo (or uv tool upgrade slopo).
- Initialize config: slopo init.
- Validate/tune config: slopo show-config.
- Run analysis: slopo index; slopo embed; slopo analyze.
- Reports are written to report_dir (e.g., index.md plus per-cluster details).
-
Config highlights:
- Paths: source_dir, source_dir_exclude, db_file, report_dir, ignore_file.
- Embeddings: embedding_model (LiteLLM name), embedding_dimensions, embedding_batch_size/embedding_batch_chars.
- Thresholds: similarity_threshold (minimum cosine similarity), rerank_threshold (after distance-based boost).
- Some parameters (source_dir, embedding_model, embedding_dimensions, body_node_count_threshold) require deleting slopo.db to change after first indexing.
-
Caveats:
- Targets non-exact, structurally similar code; exact copy-paste is better handled by other tools.
- Functionally equivalent code with very different structure is unlikely to be flagged.
- Relies on external embedding APIs; batching is used for performance/cost control.
-
Example: An example report (doc/example-report) from the tool’s own codebase shows duplicated/similar language parsers, informing refactoring needs.
-
Unit granularity: Commenters suggested analyzing sub-function logical blocks (like individual conditional branches) and analyzing docstrings as a secondary signal of duplication. The author confirmed that Slopo currently chunks at the whole-function level but plans to introduce more granular sub-function extraction in future updates.
-
Handling false positives: Users noted that pure cosine similarity predictably flags non-duplicate functions that merely share semantic structures. The author agreed, emphasizing a "pragmatic" philosophy where the tool surfaces distant candidates and defers the actual validation to the developer or an LLM coding agent.
-
Algorithmic scaling: A participant cautioned that exact brute-force similarity search ($O(n^2)$) could face memory limitations in large monorepos. The author defended the approach, explaining that batching operations with NumPy blocks manages memory effectively and that code chunking/extraction is the actual performance bottleneck.
-
Comparisons to deterministic tools: In response to queries about AST edit distances, jscpd, and BM25, the author clarified that Slopo intentionally avoids deterministic passes to fill the distinct gap of finding non-exact, structurally analogous code that evades exact-match tools.
-
Evaluating dependencies: When asked if function embeddings include mean-pooled representations of the helper functions they call, the author noted Slopo only embeds the immediate function body and does not resolve dependencies.
-
Feature requests: Spurred by direct requests in the discussion, the author rapidly pushed updates to support PHP and Elixir via Tree-sitter.
The takeaway: Engineers see strong potential in using embeddings to uncover semantic, structural duplication that traditional AST tools miss, though the community consensus is that sub-function chunking and tight AI integration will be necessary to manage the resulting false positives.
Kimi K2.7 Code is generally available in GitHub Copilot
Submission URL | 415 points | by unliftedq | 172 comments
-
What it is: Kimi K2.7 Code is an open-weight coding model, now selectable in the Copilot model picker. It’s the first open-weight option in Copilot and is hosted by GitHub on Microsoft Azure.
-
What’s new: General availability in Copilot with a lower-cost option for coding workflows (per GitHub). Selection is via the model picker.
-
Availability: Rolling out now to Copilot Pro, Pro+, and Max. Expansion to Copilot Business, Enterprise, and additional surfaces is planned over the coming weeks.
-
Where you can use it:
- Visual Studio Code v1.127.0+
- Visual Studio v17.14.6+
- Copilot CLI
- GitHub Copilot cloud agent
- GitHub Copilot App
- github.com
- GitHub Mobile (iOS and Android)
- JetBrains v1.9.1-251+
- Xcode
- Eclipse
-
Billing: Charged at provider list pricing under usage-based billing; see Copilot pricing for details.
-
Admin controls (Business/Enterprise): Off by default; org admins must enable the “Kimi K2.7 Code” policy in Copilot settings. GitHub recommends reviewing open-weight models against security, compliance, and data-governance requirements before enabling.
-
Caveats: Gradual rollout; quality and performance are being monitored. If you don’t see it yet, check back as availability expands.
-
Cloud AI fatigue: Several commenters expressed exhaustion with hosted AI products, citing unannounced performance regressions ("nerfs"), price hikes, and shifting features. This has driven many to prioritize self-hosted models that guarantee stability and workflow control.
-
Local model alternatives: Instead of using cloud offerings, users heavily advocated for running local models such as Qwen 3.6 (27B dense or 35B MoE) and Gemma 4 31B. Participants noted that 4-bit quantized versions perform remarkably well, sharing success stories of running them on restricted setups or even older hardware like a GTX 1060.
-
Hardware sweet spots: A major debate centered on the best hardware for local inference. Mac Minis with 32GB or 64GB of unified memory were highly recommended for fitting larger context windows, while others promoted consumer GPUs (like RTX 3090s) and emerging unified-memory APUs like AMD's Strix Halo. The general consensus was that 64GB of RAM or VRAM is currently the ideal target for maximizing the capability of ~30B parameter open-weight models.
-
OS and inference troubleshooting: Users traded technical tips on avoiding memory bottlenecks, including sharing specific `llama.cpp` configurations. A sub-thread warned that running models via WSL on Windows can lead to hard system crashes without strict `.wslconfig` memory limits, prompting some recommendations to use Linux natively.
The takeaway: Rather than discussing Copilot's new model integration, the thread served almost entirely as a collaborative guide for abandoning cloud AI in favor of self-hosting, focusing on the hardware specs and quantization tools required to run Qwen locally.
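To make the quantization and memory discussion concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone occupy at different precisions. It is an approximation under stated assumptions: the helper `weight_memory_gb` is hypothetical, it counts exactly the nominal bits per weight (real quantization formats typically store extra scale data, so the effective figure is somewhat higher), and it ignores the KV cache and runtime overhead, which grow with context length.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate size of the model weights alone, in decimal gigabytes."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Parameter counts taken from the model sizes mentioned in the thread.
for name, params in [("27B dense", 27), ("31B", 31), ("35B MoE", 35)]:
    fp16 = weight_memory_gb(params, 16)
    q4 = weight_memory_gb(params, 4)
    print(f"{name}: ~{fp16:.1f} GB at 16-bit, ~{q4:.1f} GB at 4-bit")
```

For a mixture-of-experts model the total parameter count is used, on the assumption that all experts are kept resident in memory even though only some are active per token. The gap between these weight-only figures and a 64GB target is what leaves room for longer context windows.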
Show HN: I built an open-source alternative to Claude Cowork
Submission URL | 35 points | by wayneshng | 8 comments
-
What it is: An open-source, security-first agent platform for “coworker” tasks. Agents interact with 100+ business and productivity apps and can be driven via chat or automated multi-step workflows.
-
Why it exists: Addresses security gaps observed in OpenClaw-style assistants, where credentials can leak into model prompts/memory. Designed specifically for production work rather than personal assistance.
-
How it works: Agent runtimes run in isolated Docker containers with their own file systems. They cannot call third-party APIs directly; instead, they issue proxy requests to the host with a credential ID. The host performs the actual API/LLM call and returns JSON. You can even disable the agent container’s internet and still operate through the host proxy.
-
Security model:
- No agent access to raw API credentials or host files.
- All external calls (including to LLM providers) go through the host proxy.
- Per-agent credential scoping is enforced at the code level.
- Tools and credentials can be restricted per workflow step.
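The credential-ID proxy pattern described above might look something like the sketch below. It is an illustration only, not the project's code: `ProxyRequest`, `HostProxy`, the field names, and the stubbed upstream call are all hypothetical, and no real network request is made.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyRequest:
    agent_id: str
    credential_id: str  # an opaque reference, never the secret itself
    url: str
    payload: dict


class HostProxy:
    """Runs on the host; agents in containers only ever send ProxyRequests."""

    def __init__(self, secrets, scopes):
        self._secrets = secrets  # credential_id -> raw secret (host only)
        self._scopes = scopes    # agent_id -> set of allowed credential_ids

    def handle(self, req):
        # Per-agent credential scoping is checked before any secret is read.
        if req.credential_id not in self._scopes.get(req.agent_id, set()):
            return json.dumps({"ok": False, "error": "credential not in agent scope"})
        secret = self._secrets[req.credential_id]
        data = self._call_upstream(req.url, secret, req.payload)
        return json.dumps({"ok": True, "data": data})

    def _call_upstream(self, url, secret, payload):
        # Stub standing in for the real API/LLM call made by the host.
        # The secret is used here and never copied into the response.
        return {"url": url, "echo": payload, "authenticated": bool(secret)}


proxy = HostProxy(
    secrets={"cred-slack": "dummy-secret-value"},
    scopes={"agent-a": {"cred-slack"}, "agent-b": set()},
)
request_a = ProxyRequest("agent-a", "cred-slack", "https://example.invalid/post", {"text": "hi"})
request_b = ProxyRequest("agent-b", "cred-slack", "https://example.invalid/post", {"text": "hi"})
allowed = json.loads(proxy.handle(request_a))
denied = json.loads(proxy.handle(request_b))
print(allowed["ok"], denied["ok"])
```

The design point is that the agent only ever holds a credential ID, so nothing secret can end up in a prompt or in agent memory, and the scope check gives the host a single place to enforce per-agent restrictions.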
-
Integrations: 100+ supported, including Google Workspace, Slack, Notion, HubSpot, Salesforce, and Figma (see the integrations folder in the repo).
-
Workflows/automation:
- Build multi-step workflows on a canvas or ask the agent to generate them from a description.
- Triggers: cron, webhooks, and app events (e.g., new email, form submissions).
- Control flow: conditions (smart/NL or strict/programmatic), loops.
- Define output schemas per step for mapping and downstream use.
-
Multi-agent orchestration: Create a fleet of agents with distinct credentials, skills, and knowledge bases. Assign different LLMs per agent for cost/priority tuning. Some agents can act as “team leads” to coordinate others under human oversight.
-
Memory: Cross-session memory with four categories—episodic (events), semantic (facts), procedural (rules/constraints), and working (short-lived)—and automatic memory writing for useful discoveries.
-
Deployment: Dockerized with docker-compose configurations included. Agents are isolated from each other and from the host by default.
-
Zero-trust approach to LLMs: The author highlighted a design philosophy of offloading rigorous tasks to deterministic tools to avoid AI hallucinations. As a proof of concept, the platform uses a lightweight chess engine tool to generate valid chess moves rather than relying on the LLM's text output—an architecture planned for future calculation, data analysis, and deep research features.
-
Comparisons to n8n: When asked how the system differs from traditional automation platforms like n8n, the creator explained that workflow execution is "AI-native." Condition nodes use AI to dynamically evaluate true/false states instead of comparing fixed values, and workflows can be constructed conversationally via a chat interface.
-
Current limitations and roadmap: Responding to user inquiries, the author confirmed that multi-user support is not yet live, though the system is architected for role-based permissions, and team-wide credential sharing is currently in development.
The takeaway: The discussion centered on the creator's philosophy of combining flexible AI-native workflow routing with strict, deterministic tool execution to prevent the hallucinations that typically plague agentic systems.
Show HN: ctx – Search the coding agent history already on your machine
Submission URL | 36 points | by luca-ctx | 15 comments
-
What it is: An open-source Rust CLI that ingests your local coding agent transcripts/logs into a structured SQLite database and provides fast, ranked text search across past sessions. It gives agents and humans a way to recover prior discussions, decisions, failed attempts, commands, and test results without any hosted memory service.
-
Why it matters: Coding agents often start from zero context and repeat past mistakes. By searching prior sessions, they can find root causes, rejected approaches, and runbooks before re-debugging. The project claims up to 50x better token efficiency than dumping raw transcripts, by returning ranked, cited matches tied to sessions/events.
-
How it works: Discovers supported local history sources, normalizes them into sessions, events, and touched-file metadata, and indexes them in a local SQLite DB. Searches return snippets with stable ctx IDs so agents can fetch just the relevant window or reconstruct a compact transcript.
-
Key commands
- Index/setup: `ctx setup`
- Natural-language search: `ctx search "failed migration"`
- File-scoped search: `ctx search --file path/to/file.rs`
- Multi-term: `ctx search --term "failed migration" --term rollback`
- Inspect raw index (read-only SQL): `ctx sql "SELECT provider, COUNT(*) AS sessions FROM ctx_sessions GROUP BY provider"`
- Show matching transcript window: `ctx show event <ctx-event-id> --window 3`
- Export compact session transcript: `ctx show session <ctx-session-id>`
- Built-in docs: `ctx docs search "upgrade"`, `ctx docs show cli-reference`, `ctx docs man --print`
- Install: `curl -fsSL https://ctx.rs/install | sh`
- Optional agent skill: `npx skills add ctxrs/ctx`
- Upgrades (installer-managed builds): `ctx upgrade check`
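To show what the read-only SQL query in the command list returns, the sketch below runs the same `GROUP BY` against a throwaway in-memory SQLite database. Only the table name `ctx_sessions` and the `provider` column come from that query; the `id` column, the sample rows, and the provider labels are assumptions, and the real ctx schema is likely richer. An `ORDER BY` is added here purely to make the output deterministic.

```python
import sqlite3

# In-memory stand-in for the local index; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ctx_sessions (id TEXT PRIMARY KEY, provider TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO ctx_sessions (id, provider) VALUES (?, ?)",
    [("s1", "claude-code"), ("s2", "claude-code"), ("s3", "codex")],
)

# Same aggregation as the `ctx sql` example, plus ORDER BY for stable output.
rows = conn.execute(
    "SELECT provider, COUNT(*) AS sessions FROM ctx_sessions "
    "GROUP BY provider ORDER BY provider"
).fetchall()
print(rows)  # [('claude-code', 2), ('codex', 1)]
conn.close()
```

A small aggregate like this is the kind of compact answer an agent can consume without loading whole transcripts into its context window.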
-
Supported sources: Claude Code, Codex, Cursor, Pi, OpenCode, Antigravity/Gemini CLI, Factory AI Droid, Copilot CLI. Use `ctx sources --json` to see what’s importable locally.
-
Use cases
- Prevent re-debugging known failures (e.g., matching a test failure to an earlier “disk full” incident and surfacing the cleanup runbook).
- Generate cleaner, shareable transcripts by excluding noisy intermediate messages (e.g., attach to PRs so reviewers and their agents can see provenance).
- Give agents a pre-task “history brief” via an “Agent History Research” subagent; mine past sessions to find SDLC bottlenecks.
-
Caveats
- Fully local: no cloud calls, model APIs, or API keys; no writes to your repositories; no background service required.
- Privacy: transcript text is preserved verbatim (local paths and secret-shaped strings are not scrubbed). Review output before sharing externally.
- Self-upgrades apply only to installer-managed binaries; package-manager/source builds are unmanaged.
-
Native search vs. token efficiency: Commenters noted that agents like Claude Code can already use native tools like `jq` and `grep` to parse local logs, pushing back on the premise that agents always "start from zero." The creator clarified that the project's main benefit isn't raw speed but token efficiency: giving models a structured SQL interface prevents them from flooding their context windows with noisy intermediate messages.
-
Tool proficiency: Some users warned that because models are heavily fine-tuned to use standard shell tools, introducing a custom local search tool might actually degrade agent performance. The author countered that models are equally well-trained on standard SQL, which makes the tool's interface familiar.
-
Crowdsourcing training data: One user suggested creating a platform to anonymously upload chat logs to help train open-source models, citing the immense value of coding data. The creator noted they are currently focusing on a secure cloud version aimed at enterprise team sharing rather than public data donation.
-
A common developer itch: Another developer joked that building agent transcript loggers is becoming the "todo list demo of the LLM era," as many engineers are building custom solutions for this exact problem. The creator agreed, suggesting the ecosystem is mature enough to need a standard specification for agent transcripts and runtime logs.
The takeaway: While agents can technically string-match their own history via standard shell tools, developers are increasingly building local, SQL-backed loggers to save context tokens and give agents a cleaner, structured memory bank.
Comparing Fable and 10 other LLMs on refactoring a LangGraph god node
Submission URL | 47 points | by Korridzy | 20 comments
-
What it is: A head-to-head experiment where 11 LLMs (5 US-based, 6 China-based), including Fable, are asked to refactor a “god node” in a real LangGraph agent. Each model first proposes a reorganization, then critiques others’ proposals; the author then applies multiple methods to decide which models to trust for generation and for evaluation.
-
Why it matters: A single overgrown node hides orchestration logic, so the graph no longer represents the system. That impedes explanation, debugging, testing, and safe change. The goal isn’t just to split a big function but to lift control flow into the graph so behavior is explicit and composable.
-
Original god node responsibilities: The central plan node concealed ~350 lines of control and routing logic, including:
- Iteration/bookkeeping: loop/iteration control, abort/max-iter checks, transient flags.
- Bootstrap questions: forced user prompts for core.region and core.currency.
- Decomposition and tasking: dynamic decompositions, assembling acquisition “recipes,” schema prep/merging components.
- Limits and recovery: calculator-attempt caps, handling blocked-calculator scenarios and fallbacks.
- Deterministic routing: fast-pass task selection without the LLM; auto-finish when inputs are complete.
- LLM planning: building prompt context, structured call for a decision, post-LLM decomposition if needed.
- Decision normalization: redirecting/rewriting choices (e.g., derived fields, ask_user→search), retries/limits per field, correcting premature finish to calculation.
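As a rough illustration of what "lifting control flow into the graph" means for responsibilities like these, the sketch below splits three of them into separate nodes with explicit routing functions. It deliberately uses plain Python rather than the LangGraph API, and it is not taken from the article: the node names, the `run` helper, and the state keys other than `core.region` and `core.currency` are hypothetical.

```python
def bootstrap(state):
    # Collect required fields that are still missing from the state.
    missing = [k for k in ("core.region", "core.currency") if k not in state]
    return {**state, "ask_user": missing}


def check_limits(state):
    # Iteration bookkeeping and the max-iteration abort check.
    iteration = state.get("iteration", 0) + 1
    return {**state, "iteration": iteration, "abort": iteration > state.get("max_iter", 3)}


def llm_plan(state):
    # Stand-in for the structured LLM planning call.
    return {**state, "decision": "finish"}


# Routing lives in small functions on the edges, not inside one big node.
def route_after_bootstrap(state):
    return "ask_user" if state["ask_user"] else "check_limits"


def route_after_limits(state):
    return "END" if state["abort"] else "llm_plan"


NODES = {"bootstrap": bootstrap, "check_limits": check_limits, "llm_plan": llm_plan}
EDGES = {
    "bootstrap": route_after_bootstrap,
    "check_limits": route_after_limits,
    "llm_plan": lambda state: "END",
}


def run(state, entry="bootstrap"):
    """Walk the graph until a name with no node (a terminal) is reached."""
    current, trace = entry, []
    while current in NODES:
        trace.append(current)
        state = NODES[current](state)
        current = EDGES[current](state)
    trace.append(current)
    return state, trace


_, complete_trace = run({"core.region": "EU", "core.currency": "EUR"})
_, missing_trace = run({})
print(complete_trace)  # ['bootstrap', 'check_limits', 'llm_plan', 'END']
print(missing_trace)   # ['bootstrap', 'ask_user']
```

Because each routing decision is its own function, the path taken is visible in the trace and each branch can be tested in isolation, which is the explainability benefit the article is after.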
-
Experiment design:
- Stage 1 (generation): Each model proposes how to untangle the node and lift its logic into the graph.
- Stage 2 (peer review): Models evaluate and critique each other’s proposals.
- Stage 3 (analysis): Multiple selection methods are applied to identify the best proposal and the most reliable evaluator.
-
Evaluation methods:
- Agreement-based: Compare scoring consistency across models to pick a top proposal.
- Thesis-based: Decompose reviews into concrete theses and compare their support to pick the most accurate analyst.
- Opinion-center/medoid: Find the central reviewer relative to others to select a robust evaluator.
- “Deus ex machina”: A further tie-break/confirmation step to re-pick the best analyst.
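The opinion-center idea can be illustrated with a small medoid computation: the reviewer whose scores are, in total, closest to everyone else's is treated as the central evaluator. This is a sketch under assumptions, not the article's ranking script: the function name, the L1 distance, the alphabetical tie-break, and the sample scores are all invented for the example.

```python
def medoid_reviewer(scores):
    """Return the reviewer with the smallest total distance to all others.

    `scores` maps reviewer name -> list of scores for the same proposals,
    given in the same order for every reviewer.
    """
    def distance(a, b):
        # L1 distance between two score vectors (an assumed metric).
        return sum(abs(x - y) for x, y in zip(a, b))

    totals = {
        reviewer: sum(
            distance(vector, other_vector)
            for other, other_vector in scores.items()
            if other != reviewer
        )
        for reviewer, vector in scores.items()
    }
    # Sorting first makes ties resolve alphabetically.
    return min(sorted(totals), key=totals.get)


# Hypothetical scores from three reviewers for three proposals.
sample_scores = {
    "model_a": [8, 6, 3],
    "model_b": [7, 6, 4],
    "model_c": [2, 9, 9],
}
print(medoid_reviewer(sample_scores))  # model_b
```

Here `model_b` wins because its total distance to the other two (2 + 13 = 15) is lower than `model_a`'s (17) or `model_c`'s (28); an outlier reviewer like `model_c` can never be the medoid, which is what makes the choice robust.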
-
Materials: All proposals, cross-reviews, thesis runs, and the ranking script are published to enable inspection and reproduction.
-
Caveats: This is a single detailed experiment on one agent and one large node; conclusions about model reliability and role fit (generator vs evaluator) may not generalize without further tasks and domains.
-
Frustrations with Anthropic's Fable: Multiple users reported that Fable's aggressive safety filters make it difficult to use for development. Commenters shared experiences of innocuous prompts (like React Native edits or local security audits) triggering policy flags mid-execution, which resulted in silent downgrades to Opus, broken code generation, and wasted usage limits.
-
Causes for strict filtering: The degraded Fable experience sparked debate over the root causes. Users variously blamed recent US export controls, corporate sabotage, and Anthropic's own alarmist safety marketing, which they said invited heavy regulatory scrutiny.
-
Domain flagging: A brief thread discussed the author's blog being blocked by an opt-in UK Protective DNS service. Others clarified this was an automated, overly cautious flag caused by the domain being brand new, rather than a genuine security threat.
-
Future experiments: Commenters expressed interest in seeing the benchmark re-run with newer iterations of Opus, GLM, and Kimi, which the author said is likely to happen on future work tasks.
The takeaway: While the article explores Fable's theoretical capability as an evaluator and generator, commenters focused heavily on the model's poor practical usability, emphasizing that strict safety filters and forced downgrades currently render it unreliable for complex coding agents.
Weird Al Yankovic Pulled Out of AI Ad Deal: 'I Can't Be the Poster Boy for AI'
Submission URL | 71 points | by fortran77 | 43 comments
-
What happened: In a Syracuse.com interview, Yankovic said he backed out of a lucrative commercial for business productivity software after learning a week before the shoot that it would involve AI. He described himself as “not a fan of AI,” added “I can’t be the poster boy for AI,” and said he felt bad about pulling out at the last minute despite the “nice pile of money” offered.
-
Context: The move aligns with other public pushbacks from creatives. “Backrooms” director Kane Parsons has called AI “genuinely harmful,” Emma Thompson said it induces “intense irritation” in her creative process, and Madonna argued AI/algorithms are the opposite of taking risks (though her “Confessions II” short film used multiple AI artists). Yankovic also acknowledged seeing “Weird AI” jokes about him online.
-
Why it matters: It’s a high-profile example of a mainstream artist rejecting an AI-branded endorsement despite financial incentive, reflecting ongoing reputational and creative concerns around AI in entertainment and advertising.
-
Caveats: The company and specific AI usage weren’t named; no contract or legal details were disclosed.
-
Tech worker cynicism vs. enthusiasm: Commenters strongly related to Yankovic's resistance, noting a prevalent sentiment among experienced tech workers who are increasingly wary of AI and "smart" appliance integrations. A prominent viewpoint cited Cory Doctorow’s framing to explain the fatigue, noting that AI currently feels like it is doing things to the populace rather than for them.
-
The duality of generated content: A recurring observation was that AI tools are highly beneficial or fun for the creator, but often feel "terrible" to receive as a consumer. Users debated the public pushback; some dismissed it as standard historical FUD toward new technology, while others argued AI is unique because it is being pushed on the public via corporate coercion rather than immediate, obvious consumer pull (like the early internet).
-
"Al" vs. "A.I." font confusion: A lighter sub-thread focused on the visual ambiguity of sans-serif fonts, with several users sharing anecdotes of younger generations genuinely misinterpreting "Weird Al" as "Weird A.I." or wondering why Paul Simon's song is titled "You Can Call Me A.I."
-
Admiration for Yankovic's integrity: The majority of the thread praised Yankovic for consistently prioritizing his morals over financial gain, pointing out that he has managed to avoid controversy for 45 years and famously turns down alcohol sponsorships as well. Only a slight minority argued he should "take the money while it's still there."
The takeaway: The comments largely reflected a deep industry fatigue with AI, viewing Yankovic's decision to turn down the money not just as an artistic stance, but as a highly relatable rejection of user-hostile technological trends.
OpenAI ‘in early talks to give 5% stake to US government’
Submission URL | 133 points | by tosh | 141 comments
-
What it is: OpenAI is in early, conceptual talks to give a 5% equity stake to the US government, according to the Financial Times, as part of a broader idea to share AI-driven gains with the public.
-
What’s proposed: Altman has floated that each major US AI developer could contribute 5% of equity to an investment vehicle modeled on the Alaska Permanent Fund, which could distribute dividends to citizens. It’s unclear if other companies (e.g., Anthropic, Google, Meta) would participate.
-
Why it matters: The move is framed as a way to share AI wealth with the public and improve relations with the Trump administration amid growing federal scrutiny of AI firms.
-
Who’s involved: Altman has reportedly discussed public ownership with Donald Trump, Commerce Secretary Howard Lutnick, and Treasury Secretary Scott Bessent, and has also spoken with Sen. Bernie Sanders, who backs a sovereign wealth fund financed by a one-time 50% tax on the stock of the biggest AI companies.
-
Context: Federal pressure has intensified; Anthropic recently paused a new model after a government order restricting access for foreign nationals, then restored access after addressing safety concerns.
-
Status and caveats: Talks are preliminary and may require an act of Congress. Participation by other firms is uncertain. OpenAI and Anthropic have previously suggested public or sovereign wealth funds in policy papers and are preparing US stock listings that some investors believe could value each at over $1tn.
-
Regulatory capture and conflict of interest: Several commenters argued that giving the government an equity stake is a deliberate move to secure favorable treatment. Users warned that government ownership creates a conflict of interest, potentially insulating OpenAI from impartial antitrust scrutiny or guaranteeing a taxpayer-funded bailout in a "too big to fail" scenario if the company later struggles. Many characterized the move as transactional, "pay-to-play" politics.
-
Equity versus taxation: A prominent debate centered on the mechanics of sharing AI gains. Critics questioned why a 5% equity stake or dividend is preferable to simply enforcing a well-proportioned corporate tax. Some argued that while taxes apply uniformly across an industry by law, a negotiated one-off equity stake functions more like a bribe.
-
Preempting harsher policies: Users noted that this proposal appears to be an attempt by Altman to front-run more severe political threats, specifically referencing Bernie Sanders' past proposals to confiscate up to half of windfall AI profits or equity.
-
State-built alternatives: A sub-thread questioned why the federal government doesn't develop its own frontier LLMs, akin to the Manhattan Project or ARPANET. Replies pointed out that the modern US government heavily relies on private contractors and cannot politically justify the specialized hardware budgets or lucrative compensation required to attract top AI talent.
The takeaway: Commenters are highly skeptical of the proposal's altruistic framing, overwhelmingly viewing the proposed equity offer as a strategic corporate maneuver to achieve regulatory capture, preempt higher taxes, and secure state backing.
No LLM Code in Dependencies
Submission URL | 118 points | by edward | 111 comments
-
What it is: A maintainer’s account of auditing and restructuring git-annex’s build to avoid any dependencies containing LLM-generated code.
-
What he did: Spent ~100 hours reviewing the entire dependency tree and reworking builds so git-annex can compile without such dependencies; plans ongoing monitoring.
-
What he found:
- Large LLM-generated changes reverted in the next release without explanation.
- An incoherent 1489-line commit message accompanying ~10,000 lines of changes to a ~26,000 LOC codebase.
- A prompt directing the model to copy code from another project, narrowly avoiding copyright issues.
-
Why it matters: The audit surfaced signals about dependency quality that will influence future choices. The author argues that casual LLM-driven commits can impose review burdens and risk legal/maintenance problems for downstream users.
-
Author’s stance: Sees this as “holding back the tide”; says Software Freedom Conservancy has “punted” on the issue and doubts the FSF will do better. He’s reconsidering community participation but continues supporting users. He urges contributors to consider broader impacts; in one case, an LLM-formatting commit led him to end further collaboration with that project.
-
Pragmatism vs. idealism: Commenters debated the long-term viability of cutting off major dependencies. Some argued that avoiding newer versions of essential tools like Git or GHC to maintain LLM purity is an untenable tradeoff for end-users, though others defended the project's historical significance and the maintainer's right to strictly curate its build tree.
-
LLM code vs. junior developers: A significant debate compared LLMs to mid-to-low-tier human developers. Some users argued LLMs are actually better at avoiding basic syntax errors, while detractors countered that human mistakes are bound to human comprehension, whereas LLM logic errors can be fundamentally harder to reason about and fix.
-
Community building and mentorship: Echoing recent statements from the Godot Foundation, several commenters noted a qualitative difference in reviewing subpar code. They argued that helping a struggling human junior developer is a form of community investment and mentorship, whereas debugging an LLM's output provides no such reciprocal benefit to the project.
-
Maintainer burden and open-source hostility: Discussions highlighted the danger of volunteer maintainers being overwhelmed by massive, unvetted LLM-generated pull requests (referred to by some as "slop"). While some supported outright bans to preserve project resources, others warned that immediate hostility toward any AI assistance assumes laziness and risks driving away genuine contributors.
-
Detection methodology: Users questioned how the LLM usage was identified, pointing out that some flagged commits in the audit were actually trivial or only caught because the original author explicitly disclosed the AI's involvement. The git-annex author (`joeyh`) joined the thread to clarify that other surrounding code churn had a high probability of being undisclosed LLM generation.
The takeaway: The discussion largely framed the rejection of LLM code not just as a debate over code quality, but as a pragmatic defense against the asymmetrical review burden it places on open-source maintainers, though opinions remain split on whether blanket bans are practically enforceable.