📐 The Big Picture
Foundation models continue their relentless march forward. New frontier model releases, capability improvements, and a growing ecosystem of tools are pushing the state of the art. AI-assisted development is becoming the new normal. From automated code generation to debugging assistants, the tools transforming how software gets built keep getting better. Taking models from notebook to production remains the industry’s central challenge. Practical patterns for inference, serving, and operationalizing AI at scale continue to evolve. Today’s 12 picks across 4 categories span language models, AI coding, and model deployment, curated for the practical builder.
ArXiv ML · RESEARCH
PROBLEM: Coding agents powered by LLMs routinely fail on real-world repositories because they lack tacit operational knowledge: which files encapsulate which subsystems, the correct test commands, and the idioms that prevent common mistakes. Manually maintained AGENTS.md files aim to fill this gap, but their usefulness is inconsistent and keeping them current takes effort.
APPROACH: Probe-and-Refine Tuning automates the generation of effective repository guides. The method first probes an agent on a curated set of tasks (e.g., historical bug fixes) and collects trajectories of its failures, such as modifying the wrong file, running incorrect tests, or misunderstanding module boundaries. It then refines a concise textual guide (similar to an AGENTS.md file) by prompting an LLM to synthesize corrective instructions from those mistakes, iterating until the task success rate plateaus. The guide is kept lightweight, focusing on high-impact heuristics rather than exhaustive documentation.
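The probe-then-refine loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `probe_and_refine`, `Failure`, `toy_run_agent`, `toy_refine`, and the task names are all hypothetical. In a real setup, `run_agent` would execute your coding agent on a task and `refine` would prompt an LLM with the collected failures; here both are replaced by toy stand-ins so the sketch runs on its own.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Failure:
    task_id: str
    kind: str    # e.g. "wrong_file", "wrong_test_command"
    detail: str  # whatever the trajectory reveals about the mistake


# Hypothetical interfaces: run_agent returns None on success, else a Failure.
RunAgent = Callable[[str, str], Optional[Failure]]
Refine = Callable[[str, List[Failure]], str]


def probe(tasks: List[str], guide: str, run_agent: RunAgent) -> Tuple[float, List[Failure]]:
    """Run the agent on every probe task and collect its failures."""
    failures = [f for t in tasks if (f := run_agent(t, guide)) is not None]
    return 1 - len(failures) / len(tasks), failures


def probe_and_refine(tasks, run_agent, refine, guide="", max_rounds=5, min_gain=0.01):
    """Refine the guide until the success rate stops improving (plateau)."""
    best_guide = guide
    best_rate, failures = probe(tasks, best_guide, run_agent)
    for _ in range(max_rounds):
        if not failures:
            break
        candidate = refine(best_guide, failures)
        rate, candidate_failures = probe(tasks, candidate, run_agent)
        if rate - best_rate < min_gain:
            break  # no meaningful gain: keep the previous guide
        best_guide, best_rate, failures = candidate, rate, candidate_failures
    return best_guide, best_rate


# --- Toy stand-ins (assumptions for illustration only) ---
REQUIRED_HINT = {
    "fix-login-form": "UI logic lives in src/ui/",
    "fix-button-style": "UI logic lives in src/ui/",
    "fix-flaky-test": "always run lint before commit",
}


def toy_run_agent(task: str, guide: str) -> Optional[Failure]:
    # The toy agent succeeds only if the guide contains the hint it needs.
    hint = REQUIRED_HINT[task]
    return None if hint in guide else Failure(task, "missing_knowledge", hint)


def toy_refine(guide: str, failures: List[Failure]) -> str:
    # Stand-in for an LLM call: add one line addressing the most common failure.
    hint, _ = Counter(f.detail for f in failures).most_common(1)[0]
    return (guide + "\n" + hint).strip()


guide, rate = probe_and_refine(list(REQUIRED_HINT), toy_run_agent, toy_refine)
print(f"success rate: {rate:.2f}")
print(guide)
```

Accepting a candidate guide only when it improves the probe success rate keeps the guide short, since instructions that do not help are never added. Because the same tasks are used for probing and for acceptance, a real setup would likely also score the guide on held-out tasks.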
KEY RESULTS: In experiments across 50 open-source Python repos and over 200 historical issues, agents using the probe-refined guide solved 44% more issues than agents with no guidance, narrowing the gap with human-written AGENTS.md files to within 6% while fully automating maintenance. The tuned guides also reduced average agent token consumption by 19% by eliminating irrelevant context exploration.
BUILDERS TAKEAWAY: Adopt an operational feedback loop: capture failure logs from your agent on representative tasks, then programmatically update your repository guidance to target those specific error modes. Treat your AGENTS.md as a tunable prompt, not static documentation; a small set of high-signal heuristics (e.g., “always run lint before commit”, “UI logic lives in src/ui/”) often beats a long, generic guide.
LIMITATIONS: The tuning process can overfit to the probe task suite, so the resulting guide may perform worse on novel issues or after significant repo refactoring and needs periodic re-tuning.
🔬 RESEARCH
LLMs deployed as agents often miss critical evidence buried in long tool traces or multimodal inputs, leading to brittle task failures. ContextRL addresses this by using RL to directly optimize for evidence identification, improving task completion rates where standard supervised fine-tuning fails.
FID scores are notoriously unstable across different training runs and even sampling seeds, yet papers routinely compare single-point estimates as if they're definitive. This paper quantifies that lottery effect, showing that rank-ordering models by FID can flip with trivial seed changes, undermining the entire evaluation protocol.
Multicalibration requires a model's predictions to be calibrated on every subgroup in a specified collection, preventing the systematic over- or under-estimation for particular groups that can lead to discriminatory outcomes. This paper provides optimal deterministic algorithms, making it feasible to enforce this property in production models without probabilistic sampling.
Coding agents fail on real-world repos because they lack tacit operational knowledge that isn't in the code, such as which files to modify for a given feature or how to invoke tests. Probe-and-Refine Tuning automatically discovers and encodes this knowledge into a lightweight guide, reducing the manual effort of writing repository-specific documentation for LLM agents.
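For the FID item above, one simple way to see how fragile a single-point comparison can be is to score each model over several seeds and count how often a one-seed-each comparison disagrees with the comparison of means. The sketch below is an illustrative check, not the paper's protocol; the function name `flip_rate` and the FID values are hypothetical placeholders, not reported results.

```python
from itertools import product
from statistics import mean, stdev


def flip_rate(scores_a, scores_b):
    """Fraction of (seed_a, seed_b) pairings whose ordering disagrees
    with the ordering of the mean scores. Lower FID is better."""
    a_better_on_mean = mean(scores_a) < mean(scores_b)
    pairs = list(product(scores_a, scores_b))
    flips = sum((a < b) != a_better_on_mean for a, b in pairs)
    return flips / len(pairs)


# Hypothetical per-seed FID values, for illustration only.
fid = {
    "model_a": [7.9, 8.4, 8.15, 8.6, 8.0],
    "model_b": [8.2, 8.1, 8.5, 8.3, 8.7],
}

for name, scores in fid.items():
    print(f"{name}: mean={mean(scores):.2f} stdev={stdev(scores):.2f}")

rate = flip_rate(fid["model_a"], fid["model_b"])
print(f"single-seed comparisons that contradict the mean ranking: {rate:.0%}")
```

A high flip rate suggests the gap between two models is within seed noise, in which case reporting the mean and spread over several seeds is more informative than a single number.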