Concepts

How AI Agents Work — Reference material for Research Agora users

The Evolution Stack: Chat to Skills

AI tools form a complexity ladder. Each level builds on the previous one.

Level | What changes | Example
Chat | You talk to an LLM | ChatGPT, Claude.ai — Q&A, drafting, brainstorming
Context | The LLM knows your project | CLAUDE.md files, uploaded documents, conversation history
Tools | The LLM can act | Read files, run code, search the web, call APIs
Agents | The LLM pursues multi-step goals | Autonomous research pipelines, iterative debugging
Skills | Reusable compound workflows | /paper-review, /paper-references — packaged expertise

Most researchers use Chat. The jump to Tools is where AI becomes genuinely useful for research tasks. Skills are the top of the stack — they encode the workflow once and let you (and your collaborators) invoke it repeatedly without rethinking the prompt structure each time.

Key Concepts

LLM (Large Language Model)

Stateless next-token predictor. No memory, tools, or goals unless scaffolded by an agent framework. Everything it appears to "know" is a learned statistical pattern over training data.

Agent

Unlike a chatbot (stateless Q&A), an agent uses tools, maintains context across steps, and pursues multi-step goals. It reads files, runs code, searches the web, and decides what to do next. Between sessions, however, it is stateless: it retains nothing from prior conversations unless you re-provide the context.

Tool use

Agents are not limited to generating text. They can:

Capability | Example
Read and write files | Edit your LaTeX source, update a BibTeX file
Execute code | Run a Python script and interpret the output
Search the web | Look up a paper on Semantic Scholar
Call APIs | Query a database, interact with GitHub
Chain actions | Find a bug, fix it, run tests, commit — all in one go
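Mechanically, tool use is a dispatch loop: the model emits a structured tool call, the harness executes it, and the result is fed back into the context for the next decision. A minimal sketch in Python — the tool names and the scripted "model" are illustrative stand-ins, not any real agent framework's API:

```python
# Minimal tool-use loop. The "model" is scripted for illustration;
# a real agent framework would call an LLM API at that step.
TOOLS = {
    "read_file": lambda path: f"<contents of {path}>",          # stub tool
    "run_code": lambda cmd: f"<output of running {cmd!r}>",     # stub tool
}

def scripted_model(history):
    """Stand-in for the LLM: picks the next action from the transcript."""
    if not history:
        return {"tool": "read_file", "args": ["paper.tex"]}
    if len(history) == 1:
        return {"tool": "run_code", "args": ["make pdf"]}
    return {"done": "Draft checked and built."}

def agent_loop(model, tools, max_steps=10):
    history = []
    for _ in range(max_steps):
        action = model(history)
        if "done" in action:                       # model signals completion
            return action["done"], history
        result = tools[action["tool"]](*action["args"])
        history.append((action, result))           # tool output becomes context
    return "step budget exhausted", history

answer, trace = agent_loop(scripted_model, TOOLS)
print(answer)
```

The loop also shows why context windows matter: every tool result is appended to the transcript the model sees on the next step.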

Context window

The buffer of text the model can "see" at once. Everything the agent knows about your project in a given session lives here: your instructions, files it has read, conversation history, tool outputs. When the context window fills, older content gets dropped or summarized. A fuller context also costs more: every token in the window is reprocessed on each request.

Tokens

The units LLMs process text in. Roughly: 1 token ≈ 0.75 words, or ~4 characters in English. At that ratio, a 10-page paper (roughly 5,000 words) comes to about 6,500–7,000 tokens. Token count determines API cost and context window usage.
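The 4-characters-per-token ratio gives a quick budgeting heuristic. A sketch — this is a rule of thumb for English prose, not a real tokenizer (exact counts require the model's own tokenizer, e.g. tiktoken for OpenAI models):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English prose: ~4 characters per token.
    Use only for budgeting; real counts come from the model's tokenizer."""
    return max(1, round(len(text) / 4))

sample = "the quick brown fox " * 250   # ~1,000 words, 5,000 characters
print(estimate_tokens(sample))          # → 1250, consistent with ~1.33 tokens/word
```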

Skills (Custom Commands)

Reusable prompt templates invoked with a slash command. A skill encodes a complete workflow — role, objective, process, output format — so you get consistent, structured output every time without rewriting the prompt.

Skill | What it does
/commit | Reviews changes, writes a conventional commit message, commits
/paper-review | Reads a paper and generates structured review feedback
/literature-synthesizer | Searches for related work given a research question
/paper-references | Checks that all citations in a BibTeX file resolve to real papers
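A skill file is, at heart, a structured prompt. A hypothetical sketch of what a /paper-review skill might contain — the exact file format depends on your agent framework, and this layout is purely illustrative:

```markdown
# /paper-review

Role: You are a careful peer reviewer for an ML venue.
Objective: Produce structured review feedback for the attached paper.
Process:
  1. Summarize the claimed contribution in 2-3 sentences.
  2. List strengths, then weaknesses, each tied to a specific section.
  3. Flag any unsupported claims or missing baselines.
Output format: Summary / Strengths / Weaknesses / Questions / Score (1-10)
```

Because the role, process, and output format are fixed in the file, every invocation produces output with the same structure.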

Context engineering

The practice of systematically providing agents with the right information to do their job well. Think of it as writing documentation for your AI collaborator. The better the "working with me" manual — explicit instructions, project knowledge, acceptance criteria — the better every interaction. A CLAUDE.md file placed in your project root is the primary mechanism.
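A CLAUDE.md file is plain markdown that the agent reads at session start. An illustrative fragment — the project name and conventions here are hypothetical, and the section contents are whatever your project actually needs:

```markdown
# Project: robustness-benchmarks

- Python 3.11; run tests with `pytest -q` before proposing a commit.
- Paper source lives in paper/; never edit files under paper/generated/.
- Cite only papers present in refs.bib; flag missing entries rather
  than inventing them.
- Figures are regenerated from scripts/; never hand-edit the PDFs.
```

Each line saves you from re-explaining the same constraint in every session.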

What to Delegate, What to Protect

Not everything should go to an AI agent. The value of delegation depends on whether the task has verifiable outputs and whether the judgment required is yours alone.

Delegate | Protect
Code boilerplate and scaffolding | Algorithm design and architectural choices
BibTeX formatting and citation verification | Deciding which papers are actually relevant
Generating figure code from described data | Interpreting what the results mean
Drafting rebuttal structure | Deciding which reviewer concerns are legitimate
Literature search and summary | Assessing novelty and framing the contribution
Grammar, style, and consistency passes | Scientific claims and argument structure
Repetitive data cleaning scripts | Deciding what anomalies mean
Camera-ready formatting passes | Judging whether a result is publishable

The pattern: delegate generation, protect judgment. AI expands what you can do; it does not replace what you are. If you delegate problem selection, you atrophy the judgment that makes good problem selection possible.

The verification asymmetry: Survey data from 38 ML researchers found that 74% verify AI outputs for coding tasks, but only 11% verify for ideation tasks. This gap is explained entirely by tooling: citation checkers and test suites exist; ideation oracles don't. Tools enable verification. Where tools don't exist yet, the burden falls on your judgment — which means that domain is riskier to delegate, not safer.
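What makes coding and citation tasks verifiable is that checks can be mechanized. A sketch of the local half of a citation check — field presence and year plausibility only; a full checker would also resolve the DOI or title against a scholarly database such as Crossref, which is omitted here:

```python
import re

REQUIRED = {"title", "author", "year"}

def check_entry(entry: dict) -> list[str]:
    """Local sanity checks for one parsed BibTeX entry. The network half
    (resolving the DOI against an external database) is omitted."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED - entry.keys())]
    year = entry.get("year", "")
    if year and not re.fullmatch(r"(19|20)\d\d", year):
        problems.append(f"implausible year: {year!r}")
    if "doi" not in entry:
        problems.append("no DOI to verify against an external database")
    return problems

ok = {"title": "Example", "author": "Doe, J.", "year": "2024", "doi": "10.1234/x"}
bad = {"title": "Example", "year": "draft"}
print(check_entry(ok))    # → []
print(check_entry(bad))   # missing author, implausible year, no DOI
```

No comparable mechanical check exists for "is this research idea good?", which is exactly the asymmetry the survey numbers reflect.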

Where Research Agora Fits

Research Agora lives at the Skills level of the stack. It provides:

  1. A skills marketplace — a repository of community-contributed, versioned prompt workflows covering the full research lifecycle (writing, verification, review, dissemination, code quality).
  2. Verification workflows — skills that run formal checks: citation resolution against scholarly databases, code-paper consistency checking, statistical claim validation. These turn the "delegate freely, verify with tools" pattern into a one-command workflow.
  3. A benchmark agenda — standardized evaluation protocols so researchers can compare skill performance objectively rather than relying on vibes. (Currently proposed; working examples shipped for pillars 1 and 2.)

The key insight: each pillar alone is insufficient. A skills marketplace without verification lets bad outputs circulate. Verification without discovery means researchers rebuild the same checks from scratch. Benchmarks without skills have nothing to evaluate. The Agora requires all three.

For the full argument, see the position paper.

Further Reading