An in-depth look at the trade-offs between Git-backed Markdown files and graph databases for building AI agent memory systems.
Code maintenance 📅 2026/04/13
#API #Developers #Documentation #GitHub #LLM-scale #Manual-trigger #Medium-risk #Reusable #Semi-automated #Context-management #Code-repository #Report #Memory-system
Garry is kinda correct here, but he's oversimplifying memory. Harrison (the author of the original article) makes a very good point, but also makes memory sound easier than it is. (Before reading this, note that I wrote down my thoughts and then passed them through Claude Code. I read every word. Read it like a coworker's Claude Code output.)

Let me start with where Garry is right, because he IS right about something important. Git-backed markdown is a memory format that is simultaneously human-readable, version-controlled, diffable, and greppable. No database gives you all four at once by default. If your agent's memory is an opaque blob in someone else's database, you have no idea what it "knows" about you. You can't correct it, can't diff it, can't even look at it. That matters. A lot.

I agree with this completely, and it's the right starting posture. But it's a storage format. It's not a memory system. And the difference matters more than most people in this debate seem to realize.

Harrison's argument is different. He says memory is tied to the harness, the harness must be open, therefore you should use their open harness. The first two points are correct; the third is iffy because it assumes you want to be responsible for memory, which is hard (probably right, but not a trivial decision). But that core insight--that the harness and memory can't be separated--is real, and more important than people give it credit for. Let me explain why, and then why everyone in this debate is still underselling the difficulty.

## The harness owns the critical moment

The most important time for memory to be created or updated is during compaction. Compaction is when the context window fills up and the agent compresses everything into a summary. Information that doesn't survive the summary is gone--not archived, gone. This is memory triage, and the harness controls it. Always. OpenCode, OpenClaw, and Hermes all handle this. OpenClaw does it by default.
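To make the compaction moment concrete, here's a minimal sketch of what harness-owned memory triage could look like. The hook name, event shape, and file path are all hypothetical--this is not any real harness's API:

```python
# Hypothetical harness hook -- the hook name and event shape are illustrative,
# not any real SDK's API.
def extract_durable_facts(messages):
    """Keep only the messages explicitly flagged as worth remembering."""
    return [m["text"] for m in messages if m.get("remember")]

def on_compaction(event, memory_path="memory.md"):
    # This is the triage moment: anything not written out here
    # is gone once the summary replaces the full transcript.
    facts = extract_durable_facts(event["messages"])
    with open(memory_path, "a", encoding="utf-8") as f:
        for fact in facts:
            f.write(f"- {fact}\n")
    # Hand the (lossy) summary back to the harness unchanged.
    return event["summary"]
```

The point isn't the ten lines of Python; it's that whoever fires this event decides what survives.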
OpenCode's SDK exposes compaction hooks--you can listen for session.compacting events and handle memory yourself. This is a great place for memory logic to live.

Now look at what Codex does: it produces an opaque, encrypted compaction summary that isn't usable outside the OpenAI ecosystem. Harrison himself flagged this in his article. This isn't just vendor lock-in; it's architectural lock-in by design. Harrison is right to be alarmed by this. Garry is right that being "above the API line" matters. But neither one grapples with what actually makes memory hard once you've decided to own it.

## Where files break down: forgetting

Garry's model inherits all the strengths of git: version history, diffs, blame, rollback. But git's greatest strength is also the core problem: nothing is ever truly forgotten. When do you choose to forget a memory? How do you know it's outdated? You changed jobs six months ago--is the memory about your old team's coding standards still valid? Your codebase migrated from REST to GraphQL--are the API-pattern memories stale, or still useful for legacy endpoints that still exist? With files, you can delete them. But you need to know they exist AND that they're stale. And you need to check this proactively, because nobody is going to tell you.

This is actually a structured problem with real solutions starting to emerge. Zep's Graphiti engine uses what they call bi-temporal knowledge graphs--every fact gets timestamps for when the system recorded it AND when it was true in the real world. Facts are invalidated, not deleted. You can query "what did I know about X on March 15th" separately from "what is currently true about X." Most memory providers are converging on some version of this. Supermemory has a graph-based system. Hydra is moving toward mixed graph/vector approaches. Mem0 added graph memory. This convergence is telling--it means the industry is collectively figuring out that flat files and pure vector search aren't enough for temporal reasoning.
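The bi-temporal idea is easy to sketch in miniature. This is a toy, not Graphiti's actual data model: each fact carries a recorded-at timestamp plus a real-world validity window, and invalidation just closes the window instead of deleting anything:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Fact:
    subject: str
    claim: str
    recorded_at: date                     # when the system learned it
    valid_from: date                      # when it became true in the world
    invalid_from: Optional[date] = None   # set on invalidation; never deleted

class BiTemporalStore:
    def __init__(self):
        self.facts = []

    def record(self, fact):
        self.facts.append(fact)

    def invalidate(self, subject, claim, as_of):
        """Mark a fact stale as of a date, keeping its history intact."""
        for f in self.facts:
            if f.subject == subject and f.claim == claim and f.invalid_from is None:
                f.invalid_from = as_of

    def known_on(self, subject, on):
        """What had the system recorded about `subject` by this date?"""
        return [f.claim for f in self.facts
                if f.subject == subject and f.recorded_at <= on]

    def true_on(self, subject, on):
        """What was actually valid in the world on this date?"""
        return [f.claim for f in self.facts
                if f.subject == subject and f.valid_from <= on
                and (f.invalid_from is None or on < f.invalid_from)]
```

`known_on` answers "what did I know about X on March 15th"; `true_on` answers "what was actually true then"--the two questions flat files can't distinguish.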
Files don't have temporal validity windows. Git has history, but history and validity are different things. Knowing a file changed on March 15th doesn't tell you whether its contents are still true today.

Then there's the injection problem. OpenClaw's memory.md is a plain file of memories, injected into context every time, updated at compaction. It's also fully observable, because it's just... a file. This was a genuine innovation and a really good idea. But my OpenClaw installation clients keep running into the same wall: not all memory needs to be in context every time, and there's a ceiling on how much fits. Claude Code caps MEMORY.md at something like 200 lines. After that, the content just doesn't get loaded at session start. You lose it.

Most memory systems solve this with a reactive search_memories tool. The agent needs something, searches for it, finds it. Fine. But what happens when the agent doesn't know it should be searching? A coding agent drifts off-track and violates a pattern your team agreed on three months ago. The memory exists. The agent didn't search for it, because it didn't know it was relevant. There was no trigger. It just... didn't know what it didn't know.

This is the proactive injection problem, and it's the hardest open question in memory right now. There IS real research on this. MemGuide ranks candidate memories by something they call "marginal slot-completion gain"--basically asking, "would injecting this memory fill a gap the agent actually needs right now?" PRIME takes a different angle, building proactive reasoning through iterative memory evolution. These are promising, but none of them are production-ready for synchronous agents where you can't afford an extra inference round on every turn.

Mesa's Saguaro is interesting here. After every agent turn, it spawns a separate LLM that reviews what the agent just did against the full codebase. If the agent is drifting, it corrects course.
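As a rough sketch of that per-turn review pattern--the prompt and the `call_llm` stand-in are mine, not Mesa's:

```python
from typing import Callable, Optional

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call -- wire up an actual client here."""
    raise NotImplementedError

def review_turn(agent_action: str, team_patterns: list,
                llm: Callable[[str], str] = call_llm) -> Optional[str]:
    """Return a one-line correction if the action violates a remembered
    pattern, else None. Costs one full extra inference per agent turn."""
    prompt = (
        "Team patterns:\n"
        + "\n".join(f"- {p}" for p in team_patterns)
        + f"\n\nThe agent just did:\n{agent_action}\n\n"
        + "If this violates a pattern, reply with a one-line correction. "
        + "Otherwise reply exactly: OK"
    )
    verdict = llm(prompt)  # the unavoidable per-turn cost
    return None if verdict.strip() == "OK" else verdict.strip()
```

Note what triggers the check: not a search the agent chose to run, but a reviewer that fires every turn whether or not anyone knew a memory was relevant.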
They kinda built a memory system without calling it one. It's just really slow, because you're doing LLM inference after every single turn. Supermemory proved the logical extreme of this in their April Fools experiment: throw enough inference at the problem (eight parallel prompt variants, a dozen model calls per query) and you beat basically every memory benchmark. 98.6% accuracy. But the per-query cost is absurd. Their actual production system--the graph-based one--scores lower on benchmarks but is, you know, usable. For async agents where latency doesn't matter, brute force actually makes sense. Not for Jarvis, my OpenClaw agent.

## Where files break down: relationships and search

If you store everything as files, there's no way to search "all people I know" or "bugs I often make in this codebase" unless the agent happens to organize memories that way. And it won't, because agents are inconsistent organizers. Concepts and relationships aren't flat. They're graphs. A person connects to a company, a project, a set of conversations. A coding pattern connects to a language, a framework, a set of past mistakes. Files can represent individual nodes, but they can't represent the edges without becoming something else entirely.

So you solve it by adding structured search over your markdown files. Oops, you've built a database! https://t.co/ICR6XoxJyn

@swyx wrote about this years ago: developers who avoid using a real database inevitably build one, badly, through incremental decisions. You start with files, add search, add indexing, add schemas, add conflict resolution, and suddenly you have Postgres, except worse. This is actually what happened with GBrain--Garry's own implementation of "memory is markdown, brain is a git repo." The files go in as markdown. But underneath? Postgres and pgvector for hybrid search. The markdown is the interface; the database is the engine. Even the strongest advocate for file-based memory needed a database to make it actually work.
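That slide from "just grep the files" into "a database in disguise" is short. A plausible first increment might look like this (the directory layout and helper names are hypothetical):

```python
import re
from pathlib import Path

def build_index(memory_dir):
    """Invert markdown files into a word -> filenames index.
    Step one of quietly rebuilding a database on top of 'just files'."""
    index = {}
    for path in Path(memory_dir).glob("*.md"):
        for word in re.findall(r"[a-z0-9_]+", path.read_text().lower()):
            index.setdefault(word, set()).add(path.name)
    return index

def search(index, query):
    """Files containing every query term -- an AND query,
    i.e. the first feature of a query planner."""
    results = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*results) if results else set()
```

One AND query in and you're already writing a query planner; add ranking, schemas, and concurrent writers, and you've re-derived the database you were trying to avoid.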
## The Composio model

Here's something I think is underexplored: portable memory across agents. The same way Composio lets you move integrations across agents, some memory providers are moving toward letting you own and share memories across Claude, ChatGPT, OpenClaw, whatever. Your memories live in a vault you control, and each agent reads from and writes to it. I'd call this the Composio model of memory. It's a good idea, and more providers should pursue it.

But then you're potentially running two memory systems--one inside the harness (memory.md, CLAUDE.md, whatever the harness does at compaction) and one external. What a mess. Hermes and OpenClaw both let the user choose their memory backend. Flexibility sounds great until you realize it means the system has to handle the possibility that memory is in two places at once, managed by two different things, with two different update cadences. I still think giving users this choice is the right call. But it is genuinely complicated.

## The cost that nobody talks about

Every sophisticated memory system costs inference tokens. Letta's self-editing model--where the agent actively decides what to remember during reasoning, via tool calls--is the most architecturally interesting approach I've seen. The agent curates its own memory as a first-class part of thinking. But every core_memory_replace call is tokens. Mesa's per-turn review is a whole extra LLM call. Supermemory's brute-force approach is a dozen.

File-based memory is effectively free. Read a file, inject it, done. The bar for beating memory.md isn't just "is it smarter?" It's "is it enough smarter to justify the cost?" And for most use cases today, the honest answer is no.

But here's something that should make people pay attention: recent benchmarks on agentic memory (AMA-Bench among others) are finding that the design of your memory system matters way more than which model you're running.
We're talking maybe an order of magnitude more variance from architecture choices than from model scaling. The architecture matters enormously. It just also costs real money, and that tension is why most production systems still use the simple thing.

## The unsolved problems

Recent research has started to identify what a memory system actually needs to do well:

- Accurate retrieval--find the right memory when asked.
- Real-time learning--update what you know from new information as it comes in.
- Long-range understanding--connect things across sessions that happened weeks apart.
- Selective forgetting--know when a memory is stale and stop using it.

No current system is good at all four. Graph-based systems handle forgetting and long-range connections better than anything else, which is probably why everyone is converging on them. Letta does well on retrieval and real-time learning. File-based systems do OK on retrieval and struggle with the rest.

Now add multi-agent coordination. Multiple agents on the same filesystem. Multiple people cooperating with agents on different projects. Who organizes the memory? Who resolves conflicts? Do we deploy an async agent to consolidate memories at compaction time? At session end? On a cron job overnight? How do we prioritize recent memories over old ones? How much control should the agent have over its own memory? How do we handle that some people want aggressive memory and some want minimal? And they might want to export it and bring it to another agent! These aren't rhetorical questions I'm asking to sound smart. I deal with these every week deploying agents for clients. Nobody has good answers.

## Benchmarks exist now. They're just not reliable.

A year ago there were no memory benchmarks worth talking about. That's changed. LOCOMO, LongMemEval, AMA-Bench, and MemoryAgentBench all exist. There's even an ICLR workshop this year dedicated to agent memory.
But here's the problem: evaluation choices that look like implementation details--the prompt you use for the judge model, the scoring methodology, the answer-generation setup--can swing accuracy by double digits. Supermemory showed this directly when they demonstrated you could score 98.6% by letting any of eight prompt variants count as correct. That's not a benchmark result. That's a configuration choice dressed up as one.

So we have benchmarks. They're just not trustworthy enough to settle any debates. If you overcomplicate your memory system, you still can't be sure it's actually outperforming a plain memory.md by anything other than vibes. Just vibes with numbers attached.

## Nobody has memory right

Not Garry, not Harrison, not OpenClaw, not Letta, not Zep, not Supermemory, not Mem0. Nobody.

Garry's instinct--keep it simple, keep it readable, keep it yours--is the right starting posture. Harrison's instinct--the harness and memory are inseparable, so own both of them--is architecturally correct. Sarah Wooders' framing--memory is context management, not a retrieval problem--is the most precise explanation of why this is so hard.

But memory.md is not the end state. It's the beginning. It's the simplest thing that works, and for most use cases today it's the right choice. Not because it's good, but because everything else is either too expensive, too complex, too slow, or too unproven to justify the leap. The gap will close. The research is real, the providers are converging on graphs, and the benchmarks are slowly forming. But if anyone tells you they've solved memory, they haven't. They've solved one of the four problems, and they're hoping you don't ask about the other three.
