The characteristics of what I think everyone means by the term “Agent OS” seem achievable through abstraction by configuring GitHub as a harness. In fact, many of the Zero Trust requirements as described by Anthropic come for free this way. And with all that’s on offer via the API, much of the observability challenge shifts to analysis of the data.
GitHub-native orchestration turns the agentic black box into a machine you can monitor and tune.
This exploration started because I wanted to use LLMs to generate questions from some complex data, but I started worrying as much about the process of building with agents as I did about what I was building. So, I separated out the repos and went down the Agent OS rabbit hole in an “ops” repo. Here’s what it has become, so far:
| Agent OS Feature | GitHub / Git Mapping | Implementation |
|---|---|---|
| System Clock & CPU Interrupts | GitHub webhooks | GitHub webhooks fire events to trigger the Hermes gateway, spawning containerized agent sessions on-demand. |
| Task Queue & Scheduler | Issues & Labels | The orchestrator schedules work by creating GitHub Issues and labeling them (e.g., agent:builder, status:to-do). |
| Memory Registers & State | Labels | Issue and PR labels (e.g., status:in-progress, status:approved, status:changes-requested) act as registers tracking the agent’s execution state machine. |
| Inter-Process Communication (IPC) | Pull Requests & Comments | Agents coordinate via GitHub’s social layer; the builder opens a PR, the reviewer reads the diff and comments, and the builder responds via follow-up commits. |
| Workspace Sandboxing | Git Branches | Builder agents are sandboxed on feature branches (e.g., agent/role/issue-num-slug) and cannot touch production directly. |
| System Gates & Security Policies | Branch Protection & PR Merging | Human orchestrators merge code only after the reviewer agent applies the status:approved label. |
| System Registry / Living Docs | Living Markdown Files | Files like CLAUDE.md and AGENTS.md serve as system configuration registers that agents automatically keep updated. |
| Audit Trail | GitHub Event Log | Every action (commit, comment, label change) is timestamped, attributed to a GitHub App, and permanently stored. |
| Privilege Isolation | GitHub App Scoped Tokens | Each agent role is a separate GitHub App with scoped API permissions, creating hard technical boundaries. |
| Prompt Injection Containment | Per-Agent Context Scope | Downstream agents operate on structured metadata rather than full context, isolating them from any prompt injection the builder may have encountered. |
| Anomaly Detection | Label State + Timestamps | A stuck label state is detectable via the API; unexpected bursts of commits or label changes enable rule-based alerting. |
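
To make the "labels as registers" row concrete, here is a minimal sketch of the status labels treated as a state machine. Only the label names come from the table; the transition table and the helper names `current_status` and `can_transition` are assumptions for illustration, not a description of the actual gateway.

```python
# Hypothetical transition table. The label names come from the table above;
# which moves are legal is an assumption made for this sketch.
ALLOWED_TRANSITIONS = {
    "status:to-do": {"status:in-progress"},
    "status:in-progress": {"status:changes-requested", "status:approved"},
    "status:changes-requested": {"status:in-progress"},
    "status:approved": set(),  # terminal: a human merges from here
}


def current_status(labels):
    """Return the single status:* label on an issue, or raise if ambiguous."""
    statuses = [name for name in labels if name.startswith("status:")]
    if len(statuses) != 1:
        raise ValueError(f"expected exactly one status label, found {statuses}")
    return statuses[0]


def can_transition(current, target):
    """True if moving from `current` to `target` is a legal state change."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```

A gateway could run a check like this before swapping labels, so that an out-of-order event is rejected instead of silently corrupting the state.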
Several interesting outcomes from this approach became apparent.
By using the Issue as the session storage record, data can be written down for later use, and GitHub’s GraphQL API makes it very easy to collect that data for post-session analysis. Having a trajectory log (which agent acted, what label triggered it, what it did, what it commented, what it committed) that is actually readable and connects outputs to the actions that produced them gives me some confidence that I’ll be able to understand problems and instruct agents on how to address them, rather than using a “Fix it” prompt that turns into a huge mess.
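
As a rough sketch of what that collection step could look like, the query below asks the GraphQL API for an issue's label events and comments, and `to_trajectory` flattens the returned nodes into a readable log. The function name, the output record shape, and the 80-character comment preview are assumptions for illustration; the HTTP call itself is left out so the example stays self-contained.

```python
# GraphQL query for one issue's timeline (label changes and comments only).
TIMELINE_QUERY = """
query($owner: String!, $name: String!, $number: Int!) {
  repository(owner: $owner, name: $name) {
    issue(number: $number) {
      timelineItems(first: 100,
                    itemTypes: [LABELED_EVENT, UNLABELED_EVENT, ISSUE_COMMENT]) {
        nodes {
          __typename
          ... on LabeledEvent { createdAt actor { login } label { name } }
          ... on UnlabeledEvent { createdAt actor { login } label { name } }
          ... on IssueComment { createdAt author { login } body }
        }
      }
    }
  }
}
"""


def to_trajectory(nodes):
    """Flatten timeline nodes into time-ordered steps (hypothetical record shape)."""
    steps = []
    for node in nodes:
        kind = node["__typename"]
        if kind in ("LabeledEvent", "UnlabeledEvent"):
            who = (node.get("actor") or {}).get("login")
            detail = node["label"]["name"]
        elif kind == "IssueComment":
            who = (node.get("author") or {}).get("login")
            detail = node["body"][:80]  # short preview, not the full comment
        else:
            continue  # ignore event types this sketch does not model
        steps.append({"at": node["createdAt"], "who": who, "kind": kind, "detail": detail})
    # ISO 8601 UTC timestamps sort correctly as plain strings.
    return sorted(steps, key=lambda step: step["at"])
```

Commits would come from the linked pull request rather than the issue timeline, so a fuller version would need a second query for those.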
The problem of starting a new agent session with all the insight it needs, without flooding it with context, becomes more manageable, which could make it possible to solve harder problems with smaller, cheaper models. Collaboration gets interesting, too. For example, Memory could be shared with your team on a granular level (per agent, per repo, per milestone), and since it’s versioned, you can look back at how it evolved if you’re questioning a change to the Memory.
Introducing some process determinism in a way that feels more reliable than an AGENTS.md instruction is a good thing, too. Using labels for state management is not going to be flawless. It’s easy for a webhook to misfire and block the whole process, and the gateway may need something more robust than a retry. Still, the workflow control you get from something as simple as a GitHub label as the trigger is much easier for a normal person to understand. Plus, in a more serious development environment you can’t rely on the output alone as proof that the system itself is functioning properly. So even if the webhook pipeline seems fragile, getting event-level data that you can track is a much more robust architecture than the “wait-for-magic” mode some agent loop systems offer.
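
One possible safety net for a misfired webhook is a watchdog that flags issues whose status label has not moved in a while. The sketch below assumes each issue has already been reduced to a small record with `number`, `status`, and `status_since` fields; that record shape, the two-hour threshold, and the function names are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(value):
    """Parse a GitHub-style UTC timestamp such as 2024-05-01T12:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def find_stuck(issues, now, max_age=timedelta(hours=2), watched=("status:in-progress",)):
    """Return issue numbers that have sat in a watched status longer than max_age."""
    stuck = []
    for issue in issues:
        age = now - parse_ts(issue["status_since"])
        if issue["status"] in watched and age > max_age:
            stuck.append(issue["number"])
    return stuck
```

Run on a schedule, a check like this turns a silently blocked pipeline into an explicit alert, which is cheaper than making the webhook path itself perfectly reliable.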
The centralized orchestration ideas in many agent loop systems have merit, but I can’t see how they scale as well as scoping and managing process on a per-issue basis. If the issue documentation is clear about scope, acceptance criteria, test cases, etc., and if comments are clear about what each agent has done, then all subsequent engagements with an issue can just act on an as-needed basis using the shared context in the issue, triggering each other via the labels.
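
A minimal sketch of label-triggered routing, assuming the gateway receives GitHub's `issues` webhook with the `labeled` action and uses the delivery ID header to ignore redeliveries. Routing on the `agent:<role>` label and the `route_event` name are assumptions; the real gateway may trigger on different labels.

```python
def route_event(event_name, delivery_id, payload, seen):
    """Return (role, issue_number) if this event should start an agent, else None.

    event_name  -- value of the X-GitHub-Event header
    delivery_id -- value of the X-GitHub-Delivery header
    seen        -- set of delivery IDs already handled (in-memory for this sketch)
    """
    if delivery_id in seen:
        return None  # redelivered webhook: never start the same session twice
    seen.add(delivery_id)

    if event_name != "issues" or payload.get("action") != "labeled":
        return None

    label = payload["label"]["name"]
    if not label.startswith("agent:"):
        return None  # status labels and others do not start a session here

    role = label.split(":", 1)[1]
    return role, payload["issue"]["number"]
```

Because each decision depends only on the event and the issue it points at, nothing here needs a central view of all work in flight.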
I haven’t run any evaluations, but it won’t be hard to add a tool that collects enough data to analyze things like task completion rate, trajectory accuracy, step success rate, etc. That won’t help me understand failures of reasoning or poor choices, but it will help me see where those issues may or may not be occurring. The data will be in there. I just need to build the tooling that gets it and makes it usable. Then I can run some of the agent process evals suggested by Cameron Wolfe.
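
For a sense of how small that tooling could start, here is a sketch that computes two of those numbers from per-issue step records. The input shape (a dict of issue number to steps, each with an `ok` flag and a `status_after` label) and the rule that a task counts as complete when it ends in `status:approved` are assumptions, not results from the actual system.

```python
def summarize(trajectories):
    """Compute simple process metrics from per-issue step lists.

    trajectories -- {issue_number: [{"ok": bool, "status_after": str}, ...]}
    """
    all_steps = [step for steps in trajectories.values() for step in steps]
    ok_steps = sum(1 for step in all_steps if step["ok"])
    completed = sum(
        1
        for steps in trajectories.values()
        if steps and steps[-1]["status_after"] == "status:approved"
    )
    return {
        "task_completion_rate": completed / len(trajectories) if trajectories else 0.0,
        "step_success_rate": ok_steps / len(all_steps) if all_steps else 0.0,
    }
```

Trajectory accuracy would need a reference trajectory per task to compare against, so it is left out of this sketch.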
The Zero Trust requirements can be managed well this way, too.
Every action is a GitHub event (comment, commit, label change), timestamped, attributed to a specific GitHub App, and permanently stored, so you get a rich audit trail for free.
Permissions can be granted very granularly and strictly controlled if each agent is a separate GitHub App with scoped tokens. The reviewer agent can be technically blocked from writing any code, because its token only has read access to repository contents. Similarly, a builder agent can be blocked from merging a PR to main. With branch protection in place, its call to the merge endpoint is simply rejected. And by working in branches, their work is safely sandboxed by default.
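
The role split can be written down as data. The sketch below models per-role permissions using GitHub App permission names (`contents`, `pull_requests`, `issues`) with `read` or `write` levels; the specific grants per role and the `allowed` helper are assumptions. Note that pushing commits and merging both appear to fall under the `contents` permission, so in this sketch the builder's merge block would have to come from branch protection on main rather than from the token alone, which is worth verifying against the GitHub docs.

```python
# Hypothetical grants per agent role; the permission names follow GitHub Apps.
ROLE_PERMISSIONS = {
    "builder": {"contents": "write", "pull_requests": "write", "issues": "write"},
    "reviewer": {"contents": "read", "pull_requests": "write", "issues": "write"},
}

LEVELS = {"read": 1, "write": 2}


def allowed(role, permission, level):
    """True if `role` holds `permission` at `level` or higher."""
    granted = ROLE_PERMISSIONS.get(role, {}).get(permission)
    return granted is not None and LEVELS[granted] >= LEVELS[level]
```

Keeping a table like this in the ops repo also gives a reviewable record of what each agent is supposed to be able to do, separate from what the App settings currently say.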
Then subsequent agents in the workflow can check for alignment with the goals of the issue. They can identify problematic out-of-scope changes without requiring the context used by the builder agent. So, as long as they operate on the metadata, they are insulated from any prompt injection other agents may have been exposed to, which makes them a much safer agent-as-judge in the workflow.
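
As an illustration of judging on metadata alone, the sketch below checks a PR's changed file paths against the scope declared on the issue and picks the resulting label. The glob-based scope format and both function names are assumptions; the point is that nothing here reads file contents, comments, or the builder's prompt.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files, allowed_globs):
    """Return the changed paths that match none of the allowed glob patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_globs)
    ]


def scope_verdict(changed_files, allowed_globs):
    """Map the scope check onto the status labels used in the workflow."""
    if out_of_scope(changed_files, allowed_globs):
        return "status:changes-requested"
    return "status:approved"
```

A real reviewer would weigh more than paths, but even this narrow check catches a builder that was talked into editing files far outside the issue.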
For now, consider this a thought experiment. The pieces are all working for the most part, and the early tests are promising. My motivation for sharing it now is to see if anyone is tackling the same problem the same way or if there are lessons already learned that would save me some time.