Skills¶
Turn local documentation folders into a portable, queryable skill bundle, then
expose it to any agent as a tool or a router→executor pipeline.
lazytools.skills indexes your docs with full BM25 and returns grounded context
briefs — accurate retrieval for technical docs, with zero dependencies beyond
the standard library.
Status & install
Status: alpha. No extras are needed; the base install is enough.
The package is lazytoolkit (PyPI); the import root is lazytools. Add
generated_skills/ to your .gitignore, because build_skill() writes there by
default.
Synopsis¶
A skill is a folder on disk that captures everything an agent needs to answer
from your documentation: the chunked text, the BM25 weights, and an LLM
instruction file. You build it once (build_skill), then either query it directly
(query_skill), hand it to an agent as a retrieval tool (skill_tools), let an
agent build skills (skill_builder_tools), or wrap it in a two-step
router→executor pipeline (skill_pipeline).
Retrieval uses full BM25 (Robertson IDF, k1=1.5, b=0.75) with IDF weights
computed once at index time — substantially more accurate than TF-only matching
for technical docs, where rare terms (class names, parameter names, error codes)
matter most. Chunking is heading-aware for Markdown/RST/AsciiDoc so each
section becomes its own semantically-bounded chunk.
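The scoring described above can be sketched in a few lines of standard-library Python. This is an illustration, not the library's code: the tokenizer, the helper names (tokenize, build_index, bm25) and the smoothed IDF form log(1 + (N - n + 0.5) / (n + 0.5)) are assumptions; only k1=1.5 and b=0.75 come from the text.

```python
import math
import re
from collections import Counter

K1, B = 1.5, 0.75  # parameters quoted in the text above


def tokenize(text):
    """Lowercase and split on non-identifier characters (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def build_index(docs):
    """Precompute per-document term counts, avgdl, and IDF once, at index time."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed Robertson-style IDF; the +1 inside the log keeps weights positive.
    idf = {t: math.log(1 + (n_docs - n + 0.5) / (n + 0.5)) for t, n in df.items()}
    return [Counter(t) for t in tokenized], [len(t) for t in tokenized], avgdl, idf


def bm25(query, index):
    """Score every document against the query; unmatched documents score 0.0."""
    counts, lengths, avgdl, idf = index
    scores = []
    for tf, dl in zip(counts, lengths):
        score = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f:
                norm = f + K1 * (1 - B + B * dl / avgdl)
                score += idf[term] * f * (K1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "retry the request after a timeout and retry again",
    "set max_retries to control how often the client will retry",
    "the auth flow issues a token after login",
]
index = build_index(docs)
scores = bm25("max_retries retry", index)
```

In this toy corpus the second document ranks first even though the first one repeats "retry": the rare identifier max_retries carries a higher IDF weight than the common term, which is the behaviour the paragraph above attributes to IDF.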
How it works¶
build_skill(source_dirs, name)              query_skill(skill_dir, task, mode)
  │ iter docs (DEFAULT_EXTENSIONS)            │ tokenize task
  │ heading-aware chunks (md/rst/adoc),       │ BM25 score every chunk vs query
  │   char-split fallback elsewhere           │ take top_k (score > 0)
  │ compute avgdl + Robertson IDF             │ resolve mode (auto detects intent)
  └─ write bundle ↓                           └─ render grounded brief + evidence
       SKILL.md       (LLM instructions)
       manifest.json  (name, description, indexed_files, avgdl, bundle version 3.0.0)
       vocab.json     (IDF weights)
       chunks.jsonl   (one DocChunk per line: path, title, heading, text, …)
- Portable bundle. The four files are self-contained — copy the folder anywhere; querying needs only the standard library.
- Heading-aware chunking. For .md/.mdx/.rst/.adoc, each section is its own chunk (large sections are sub-split by character count, tiny adjacent ones are merged). Other file types use character splitting with paragraph-boundary snapping.
- Query modes. auto detects intent from the task wording; or force answer/extract/locate/summarize. Each shapes the returned brief (e.g. locate lists relevant files, extract returns excerpts).
- Grounded by construction. When nothing scores above zero, the brief says so and instructs the model not to answer beyond the indexed evidence.
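As a rough illustration of heading-aware chunking, the sketch below splits Markdown on ATX headings (lines starting with #) and sub-splits oversized sections by character count. The function name chunk_markdown is hypothetical, and the sketch deliberately omits things the library is described as doing or may do differently: merging tiny adjacent sections, RST/AsciiDoc headings, and ignoring # lines inside fenced code.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(text, chunk_size=1800):
    """Split Markdown into (heading, body) chunks, one per section.

    Sections longer than chunk_size are sub-split by character count.
    """
    sections, heading, lines = [], "", []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section before starting a new one.
            if heading or any(ln.strip() for ln in lines):
                sections.append((heading, "\n".join(lines).strip()))
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    if heading or any(ln.strip() for ln in lines):
        sections.append((heading, "\n".join(lines).strip()))

    chunks = []
    for heading, body in sections:
        if len(body) <= chunk_size:
            chunks.append((heading, body))
        else:
            for start in range(0, len(body), chunk_size):
                chunks.append((heading, body[start:start + chunk_size]))
    return chunks


sample = "intro text\n\n# Install\npip step\n\n## Config\n" + "x" * 25
chunks = chunk_markdown(sample, chunk_size=10)
```

Each chunk keeps the heading it came from, so a retrieved excerpt can be reported together with its section title.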
Signature¶
from lazytools.skills import (
    build_skill, query_skill,
    skill_tools, skill_builder_tools, skill_pipeline,
)

build_skill(
    source_dirs,                   # list[str]: folders to index
    skill_name,                    # str: bundle folder name + title
    output_root="./generated_skills",
    description="",                # str: used in SKILL.md + tool description
    usage_notes="",                # str: extra rules appended to SKILL.md
    include_extensions=[".md", ".mdx", ".txt", ".rst", ".adoc",
                        ".py", ".json", ".yaml", ".yml", ".toml"],
    chunk_size=1800,               # int: max chars per chunk
    chunk_overlap=180,             # int: overlap between char-mode chunks
    copy_sources=False,            # bool: copy originals into bundle/sources/
    overwrite=True,                # bool: replace an existing bundle of the same name
    max_chars_per_file=200_000,    # int: safety cap on chars read per file
) -> dict  # {skill_dir, skill_name, description, indexed_files, total_chunks, avgdl}

query_skill(
    skill_dir,                     # str: path to a bundle
    task,                          # str: question / task
    mode="auto",                   # "auto" | "answer" | "extract" | "locate" | "summarize"
    top_k=8,                       # int: chunks to retrieve
    max_chars=10_000,              # int: cap on the returned brief
    include_quotes=True,           # bool: append full excerpts
) -> str

skill_tools(*, skill_dir, name=None, description=None, strict=False) -> list[Tool]

skill_builder_tools(*, name="build_doc_skill", description=..., strict=False) -> list[Tool]

skill_pipeline(*, skill_dir, provider="anthropic", router_model=None,
               executor_model=None, session=None, native_tools=None) -> Tool
build_skill parameters¶
| Parameter | Type | Default | Meaning |
|---|---|---|---|
| source_dirs | list[str] | (required) | Folders to index. Each must be a directory or FileNotFoundError is raised. |
| skill_name | str | (required) | Bundle name; slugified into the on-disk folder name and used as the title. |
| output_root | str | "./generated_skills" | Parent directory for the bundle. |
| description | str | "" | What the skill covers; surfaces in SKILL.md and the tool description. Defaults to a count-based summary. |
| usage_notes | str | "" | Extra operational rules appended to SKILL.md. |
| include_extensions | list[str] | DEFAULT_EXTENSIONS | File types to index (Markdown, RST, AsciiDoc, text, plus .py/.json/.yaml/.yml/.toml). |
| chunk_size | int | 1800 | Max characters per chunk. |
| chunk_overlap | int | 180 | Overlap between character-mode chunks. |
| copy_sources | bool | False | Copy original docs into bundle/sources/. |
| overwrite | bool | True | Replace an existing bundle of the same name (else FileExistsError). |
| max_chars_per_file | int | 200_000 | Safety cap on characters read per file. |
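To make chunk_size and chunk_overlap concrete, here is a minimal sketch of character splitting with overlap and paragraph-boundary snapping. The function name split_chars and the exact snapping rule (pull the chunk end back to the last blank line in the window) are assumptions for illustration; the library's own splitter may choose boundaries differently.

```python
def split_chars(text, chunk_size=1800, chunk_overlap=180):
    """Split text into chunks of at most chunk_size characters.

    Each chunk end is pulled back to the last blank line inside the window when
    one exists; the next chunk then starts chunk_overlap characters earlier.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Snap to a paragraph break, but only if it leaves room to advance.
            brk = text.rfind("\n\n", start + chunk_overlap + 1, end)
            if brk != -1:
                end = brk
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - chunk_overlap
    return chunks


pieces = split_chars("aaaa\n\nbbbb\n\ncccc", chunk_size=8, chunk_overlap=2)
```

The overlap means the tail of one chunk is repeated at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.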
query_skill parameters¶
| Parameter | Type | Default | Meaning |
|---|---|---|---|
| skill_dir | str | (required) | Path to a bundle from build_skill. |
| task | str | (required) | The question or task to answer from the indexed docs. |
| mode | Literal | "auto" | auto detects intent; or answer / extract / locate / summarize. |
| top_k | int | 8 | Number of chunks to retrieve. |
| max_chars | int | 10_000 | Cap on the returned context brief. |
| include_quotes | bool | True | Append full excerpts after the evidence bullets (answer mode). |
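The interaction between top_k, the score > 0 filter, and max_chars can be sketched as follows. The helper names (select_top, render_brief), the bullet layout, and the wording of the empty-result message are all hypothetical; the sketch only mirrors the behaviour described on this page.

```python
def select_top(scored_chunks, top_k=8):
    """Keep the top_k highest-scoring chunks, dropping anything scoring <= 0."""
    hits = [(score, text) for score, text in scored_chunks if score > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits[:top_k]


def render_brief(hits, max_chars=10_000):
    """Join the evidence into one brief, truncated to max_chars characters."""
    if not hits:
        # Hypothetical wording; the library's actual message is not reproduced.
        return "No relevant evidence found; do not answer beyond the indexed docs."
    lines = [f"- ({score:.2f}) {text}" for score, text in hits]
    return "\n".join(lines)[:max_chars]


scored = [(0.0, "unrelated"), (2.5, "retry config"), (1.2, "backoff")]
brief = render_brief(select_top(scored, top_k=2), max_chars=40)
```

Raising top_k only helps when more chunks actually score above zero; a zero-scoring chunk is never returned, however large top_k is.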
The five functions¶
| Function | Returns | Use it to… |
|---|---|---|
| build_skill(...) | dict metadata | Index folders into a bundle on disk. |
| query_skill(...) | str brief | Retrieve grounded context for a task (no LLM call, pure retrieval). |
| skill_tools(skill_dir=…) | list[Tool] | Hand one retrieval tool to an agent (answers only from the skill). |
| skill_builder_tools() | list[Tool] | Let an agent build skill bundles on demand. |
| skill_pipeline(skill_dir=…) | Tool | A router (sharpens the query, preserving identifiers) → executor (calls the skill, synthesises a grounded answer) chain, exposed as one tool. |
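The phrase "preserving identifiers" can be illustrated with a deterministic stand-in for the router. This is not how skill_pipeline is implemented (its router is a model call); identifiers, route, and the rewrite callable are hypothetical names, and the identifier heuristic (underscores, dots, or an internal capital) is an assumption that will produce occasional false positives such as "e.g".

```python
import re


def identifiers(text):
    """Return tokens that look like code: snake_case, dotted.names, CamelCase."""
    tokens = re.findall(r"[A-Za-z_][\w.]*\w|[A-Za-z_]", text)
    return [t for t in tokens if "_" in t or "." in t or re.search(r"[a-z][A-Z]", t)]


def route(task, rewrite):
    """Rewrite the task, then re-append any identifiers the rewrite dropped."""
    sharpened = rewrite(task)
    missing = [t for t in dict.fromkeys(identifiers(task)) if t not in sharpened]
    return " ".join([sharpened, *missing])


# The lambda stands in for a router model that paraphrased away one identifier.
query = route(
    "how do I change max_chars_per_file in build_skill?",
    rewrite=lambda task: "configure file size cap build_skill",
)
```

Because BM25 matches exact terms, putting a dropped identifier back into the query is what lets the executor's retrieval step still find the chunk that mentions it.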
When to use it¶
- Ground an agent in your own docs without a vector DB or embeddings service — stdlib BM25 is enough and ships nothing extra.
- Technical retrieval where exact identifiers matter (class/method/parameter names, error codes); BM25's IDF weighting favours those rare terms.
- Portable, reproducible skills you can commit-build, copy between machines, or ship inside a container.
- Agent-built skills: expose skill_builder_tools() so an agent can turn a folder into a queryable skill mid-task.
When NOT to use it¶
- One-off folder reads. If you just need "read these files now," use Documents — no index step.
- Semantic / fuzzy paraphrase matching at scale. BM25 is lexical; for heavy synonym/semantic recall over very large corpora, an embeddings retriever may do better.
- Constantly-changing corpora. The bundle is a point-in-time index — rebuild when the docs change.
Example¶
from lazytools.skills import build_skill, query_skill, skill_pipeline

meta = build_skill(["./docs"], "my-project")
d = meta["skill_dir"]  # path of the bundle that build_skill just wrote

pipe = skill_pipeline(skill_dir=d, provider="anthropic")
# The router sharpens the query (keeping identifiers); the executor calls the
# skill tool and answers only from the retrieved evidence, naming its sources.
# Agent is the toolkit's agent class; its import is not shown on this page.
agent = Agent("claude-opus-4-8", tools=[pipe])

# Direct retrieval, one call per mode (no LLM call involved).
query_skill(d, "where is retry configured?", mode="locate")    # → relevant files
query_skill(d, "show the backoff settings", mode="extract")    # → excerpts
query_skill(d, "overview of the auth flow", mode="summarize")  # → condensed bullets
query_skill(d, "what does k1 control?", mode="answer")         # → evidence + quotes
Bundle format¶
A bundle is a folder named after the slugified skill_name, containing:
| File | Contents |
|---|---|
| SKILL.md | LLM instructions: what the skill covers and how to use it. |
| manifest.json | Metadata: name, description, source_dirs, indexed_files, total_chunks, avgdl, extensions, and bundle-format version (3.0.0). |
| vocab.json | Robertson IDF weights, precomputed at index time. |
| chunks.jsonl | One DocChunk per line: path, title, heading, text, token data, doc_len, ordinal. |
| sources/ | (Optional, when copy_sources=True) copies of the original docs. |
The bundle-format version is independent of the lazytoolkit package version.
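Because the bundle is plain JSON and JSON Lines, it can be inspected with the standard library alone. The sketch below is an illustration under assumptions: load_bundle is a hypothetical helper, vocab.json is assumed to be a flat term-to-weight mapping, and the demo bundle is a stand-in with made-up values so the code runs without lazytools installed.

```python
import json
import tempfile
from pathlib import Path


def load_bundle(skill_dir):
    """Read manifest.json, vocab.json, and chunks.jsonl from a bundle folder."""
    root = Path(skill_dir)
    manifest = json.loads((root / "manifest.json").read_text(encoding="utf-8"))
    vocab = json.loads((root / "vocab.json").read_text(encoding="utf-8"))
    with (root / "chunks.jsonl").open(encoding="utf-8") as fh:
        chunks = [json.loads(line) for line in fh if line.strip()]
    return manifest, vocab, chunks


# Stand-in bundle with made-up values, so the sketch runs without lazytools.
demo = Path(tempfile.mkdtemp()) / "my-project"
demo.mkdir()
(demo / "manifest.json").write_text(
    json.dumps({"name": "my-project", "total_chunks": 2, "avgdl": 3.5}),
    encoding="utf-8",
)
(demo / "vocab.json").write_text(json.dumps({"retry": 0.47}), encoding="utf-8")
(demo / "chunks.jsonl").write_text(
    "\n".join(
        json.dumps({"path": "docs/a.md", "heading": h, "text": t, "ordinal": i})
        for i, (h, t) in enumerate([("Retry", "retry twice"), ("Auth", "login")])
    ),
    encoding="utf-8",
)

manifest, vocab, chunks = load_bundle(demo)
```

Pointing load_bundle at a real bundle is a quick way to check what was indexed, for example by comparing manifest["total_chunks"] with the number of lines read from chunks.jsonl.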
Troubleshooting¶
| Symptom | Cause | Fix |
|---|---|---|
| FileNotFoundError: Not a directory | A source_dirs entry isn't a folder | Pass existing directories only |
| ValueError: No indexable documentation found | No files matched include_extensions | Widen include_extensions or check the paths |
| FileExistsError: Skill already exists | Bundle exists and overwrite=False | Set overwrite=True or pick a new skill_name |
| FileNotFoundError: Skill directory not found | query_skill given a bad skill_dir | Use the skill_dir from build_skill's return dict |
| "No relevant documentation was retrieved…" | Nothing scored above zero for the task | Rephrase with the exact identifiers, raise top_k, or re-index more docs |
Pitfalls¶
- Rebuild on change. The index is a snapshot; stale docs mean stale answers.
- overwrite=True is the default: building over an existing name replaces it. Set overwrite=False to guard against accidental clobbering.
- BM25 is lexical. Use the terms that appear in the docs; pure paraphrase may under-retrieve. The skill_pipeline router exists precisely to sharpen vague queries into identifier-preserving ones.
- Per-file char cap. max_chars_per_file (200k) truncates very large files before chunking; raise it if you index big single documents.
- Add generated_skills/ to .gitignore; it is the default output_root.
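One way to act on "rebuild on change" is to compare file modification times against the bundle's manifest.json before querying. The sketch below is a hypothetical helper, not part of lazytools: is_stale, its default extension tuple, and the mtime heuristic are assumptions, and it does not notice deleted source files.

```python
import os
import tempfile
from pathlib import Path


def is_stale(skill_dir, source_dirs, extensions=(".md", ".rst", ".txt")):
    """Return True when any source file is newer than the bundle's manifest."""
    manifest = Path(skill_dir) / "manifest.json"
    if not manifest.exists():
        return True  # no bundle yet, so a build is needed
    built_at = manifest.stat().st_mtime
    for src in source_dirs:
        for path in Path(src).rglob("*"):
            if path.suffix in extensions and path.stat().st_mtime > built_at:
                return True
    return False


# Demo with a throwaway folder and explicit timestamps.
root = Path(tempfile.mkdtemp())
docs, bundle = root / "docs", root / "bundle"
docs.mkdir()
bundle.mkdir()
(docs / "a.md").write_text("# A\n", encoding="utf-8")
(bundle / "manifest.json").write_text("{}", encoding="utf-8")
os.utime(docs / "a.md", (1_000, 1_000))             # doc written first
os.utime(bundle / "manifest.json", (2_000, 2_000))  # bundle built later
fresh = is_stale(bundle, [docs])
os.utime(docs / "a.md", (3_000, 3_000))             # doc edited after the build
stale = is_stale(bundle, [docs])
```

A check like this could gate a call to build_skill in a script or CI job, so the index is rebuilt only when the docs have actually changed.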
See also¶
- Documents — read a folder once without building an index.
- Tools overview — every connector at a glance.
- Tool and Agent.chain, which skill_tools and skill_pipeline build on.