Documents

Read local documents — .txt, .md, .pdf, .docx, .html/.htm — from a single file or a whole folder, and return their text in a shape ready for LLM consumption. lazytools.documents works as a plain Python function, as a sandboxed agent tool, or as a one-shot CLI.

Status & install

Status: alpha. Plain text and Markdown need no extra dependencies. For PDF / DOCX / HTML parsing:

pip install 'lazytoolkit[docs]'   # adds pypdf, python-docx, trafilatura
The package is lazytoolkit (PyPI); the import root is lazytools. Each format reader degrades gracefully — a missing optional dependency yields a "[… unavailable — pip install …]" placeholder per file rather than an exception, so a folder of mixed types still reads what it can.
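
Before a scan it can help to know which placeholders to expect. The sketch below is not part of lazytools; it uses only the standard library to check whether the three optional packages are importable. The helper name `missing_doc_deps` is hypothetical, and the import names (`pypdf`, `docx` for python-docx, `trafilatura`) are those packages' usual top-level modules.

```python
from importlib.util import find_spec

# Import name -> pip package name. python-docx is imported as "docx".
OPTIONAL_DEPS = {
    "pypdf": "pypdf",
    "docx": "python-docx",
    "trafilatura": "trafilatura",
}


def missing_doc_deps(deps=OPTIONAL_DEPS):
    """Return the pip names of optional packages that cannot be imported."""
    return [pip_name for module, pip_name in deps.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_doc_deps()
    if missing:
        print("Missing optional readers:", ", ".join(missing))
    else:
        print("All optional readers are installed.")
```

An empty list means PDF, DOCX, and HTML files should be parsed rather than replaced by a placeholder.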

Synopsis

read_folder_docs is one function that accepts either a single file or a folder. Given a folder it scans for matching extensions (optionally recursive), reads each file with the right per-format reader, and concatenates the results into a single human/LLM-readable string — or a structured JSON object. It is built for the agent setting: when exposed as a tool the path argument is LLM-controlled, so a base_dir sandbox, per-file size caps, a file-count cap, and symlink rejection are all first-class.

Two entry points:

  • read_folder_docs(...) — the function. Call it directly from trusted code.
  • read_docs_tools(base_dir=...) — returns a list[Tool] whose path argument is sandboxed to base_dir. base_dir is required; an empty string or None raises ValueError.

How it works

read_folder_docs(path, ...)
  ├─ resolve path  ── if base_dir set: reject anything escaping the sandbox
  ├─ file?  → read that one file (extensions filter ignored)
  ├─ dir?   → glob matching extensions (recursive optional), skip symlinks,
  │           sort, cap at max_files
  └─ per file → pick reader by suffix:
        .txt/.md → plain text
        .pdf     → pypdf            (page text joined)
        .docx    → python-docx      (paragraphs + tables as " | " rows)
        .html    → trafilatura      (parsed | full | both)
     skip files over max_file_bytes; record per-file errors inline
  → output_format "text" (headered string) or "json" (records + truncation meta)
  • Optional deps, graceful degradation. pypdf, python-docx, and trafilatura are imported lazily inside their readers; if missing, that file's content becomes a placeholder string and the scan continues.
  • HTML modes. parsed (default) extracts clean body text via trafilatura — stripping nav/ads/boilerplate; full returns raw HTML; both returns the parsed body then the raw source.
  • Robust scanning. A file that vanishes between glob and stat, or is unreadable, is recorded as an error entry rather than aborting the whole scan.
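
The lazy-import and error-recording behaviour described in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the lazytools source: `read_with_optional_dep` and `extract_pdf` are hypothetical names, and the bracketed messages only approximate the library's real wording. The pypdf calls (`PdfReader`, `page.extract_text()`) are that package's public API.

```python
import importlib
from pathlib import Path


def read_with_optional_dep(path, module_name, pip_name, extract):
    """Read one file with a lazily imported parser.

    `extract(module, path)` does the real work. A missing dependency or a
    read failure becomes a bracketed note instead of an exception.
    """
    try:
        # Imported only when a file of this type is actually read.
        module = importlib.import_module(module_name)
    except ImportError:
        return f"[{module_name} unavailable - pip install {pip_name}]"
    try:
        return extract(module, Path(path))
    except Exception as exc:  # record the failure so the scan can continue
        return f"[Error reading file: {exc}]"


def extract_pdf(pypdf, path):
    """Join the text of every page with newlines."""
    reader = pypdf.PdfReader(str(path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

A call such as `read_with_optional_dep("q4.pdf", "pypdf", "pypdf", extract_pdf)` then always returns a string, so a loop over many files never has to handle a per-file exception.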

Signature

from lazytools.documents import read_folder_docs, read_docs_tools

read_folder_docs(
    path,                          # str — a single file OR a folder to scan
    extensions="txt,md,pdf,docx,html",  # str — comma-separated; folder mode only
    html_mode="parsed",            # "parsed" | "full" | "both"
    recursive=False,               # bool — recurse into subfolders (folder mode)
    output_format="text",          # "text" | "json"
    *,
    base_dir=None,                 # str | None — sandbox; reject paths escaping it
    max_file_bytes=10_000_000,     # int | None — per-file size ceiling (DEFAULT_MAX_FILE_BYTES)
    max_files=500,                 # int | None — files read per scan (DEFAULT_MAX_FILES)
) -> str


# As an agent tool — base_dir is REQUIRED.
read_docs_tools(
    *,
    base_dir,                      # str — sandbox directory (required; "" / None → ValueError)
    max_file_bytes=10_000_000,
    max_files=500,
) -> list[Tool]

read_folder_docs parameters

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| path | str | (required) | A single file (read directly, extensions ignored) or a folder (scanned). |
| extensions | str | "txt,md,pdf,docx,html" | Comma-separated extensions to include in folder mode. Selecting html also matches .htm. |
| html_mode | str | "parsed" | parsed = clean body text; full = raw HTML; both = parsed then raw. |
| recursive | bool | False | Recurse into subfolders (folder mode only). |
| output_format | str | "text" | text = headered human/LLM string; json = structured object (see below). |
| base_dir | str \| None | None | Sandbox. When set, any path resolving outside raises PermissionError. Required via read_docs_tools. |
| max_file_bytes | int \| None | 10_000_000 | Files larger than this are reported as skipped, not read. None disables the cap. |
| max_files | int \| None | 500 | Cap on files read per scan; extras are truncated and flagged. None disables the cap. |
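
To make the extensions rule concrete, here is a minimal sketch of how a comma-separated string could be turned into a set of suffixes, including the documented html to .htm expansion. `parse_extensions` is a hypothetical helper, and its tolerance for whitespace, upper case, and leading dots is an assumption rather than documented lazytools behaviour.

```python
def parse_extensions(extensions):
    """Turn "pdf, .DOCX,html" into a set of lowercase dotted suffixes."""
    suffixes = set()
    for raw in extensions.split(","):
        ext = raw.strip().lower().lstrip(".")
        if not ext:
            continue  # tolerate stray commas
        suffixes.add("." + ext)
        if ext == "html":
            suffixes.add(".htm")  # documented: selecting html also matches .htm
    return suffixes


print(sorted(parse_extensions("pdf,html")))  # ['.htm', '.html', '.pdf']
```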

output_format="json" shape

{
  "records": [
    {"filename": "q4.pdf", "relative_path": "q4.pdf", "extension": "pdf",
     "size_bytes": 81234, "char_count": 5120, "content": "…"}
  ],
  "truncated": false,
  "max_files": 500,
  "total_found": 1
}

Parse with json.loads and index ["records"]. Two cases always return a plain (non-JSON) string even when output_format="json": a path that does not exist (FileNotFoundError text) and a folder with no matching files ("[No documents found …]"). Guard your json.loads accordingly (e.g. only parse output starting with "{").
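
One way to wrap that guard is a small helper that returns None for anything that is not the documented JSON object. This is a sketch using only the standard library; `parse_docs_output` is a hypothetical name, and the sample string simply mirrors the shape shown above.

```python
import json


def parse_docs_output(out):
    """Return the decoded object, or None if `out` is a plain-string message."""
    if not out.lstrip().startswith("{"):
        return None
    try:
        data = json.loads(out)
    except json.JSONDecodeError:
        return None
    # Only accept the documented shape: a dict with a "records" list.
    if isinstance(data, dict) and isinstance(data.get("records"), list):
        return data
    return None


sample = (
    '{"records": [{"filename": "q4.pdf", "char_count": 5120}], '
    '"truncated": false, "max_files": 500, "total_found": 1}'
)
parsed = parse_docs_output(sample)
print(parsed["records"][0]["filename"])  # q4.pdf
```

Returning None instead of raising keeps the caller's control flow simple: a None result means the output should be shown to the user as a message rather than iterated as records.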

When to use it

  • "Summarise these reports / this folder." Point an agent at a directory of PDFs, Word docs, and Markdown and let it read them in one tool call.
  • RAG-lite ingestion from a trusted local corpus, without standing up a vector store.
  • Mixed-format reading where you want one consistent text surface across .pdf, .docx, .html, and plain text.

When NOT to use it

  • Untrusted, un-sandboxed file access. Always go through read_docs_tools with a base_dir when an LLM controls path — never expose read_folder_docs directly to an agent.
  • Large-scale / persistent retrieval. For thousands of docs with ranked retrieval, build a Skill (BM25 index) instead of slurping a folder each call.
  • Unsupported formats (xlsx, pptx, images). Only .txt, .md, .pdf, .docx, and .html/.htm are read; extend the reader map yourself for anything else.

Example

Direct call from trusted code:

from lazytools.documents import read_folder_docs

# Read every PDF and Word doc under /reports, recursively, as text.
text = read_folder_docs("/reports", extensions="pdf,docx", recursive=True)
print(text)

As a sandboxed agent tool:

from lazybridge import Agent
from lazytools.documents import read_docs_tools

# base_dir is REQUIRED: the agent can only read inside /safe/docs.
tools = read_docs_tools(base_dir="/safe/docs")
agent = Agent("claude-opus-4-8", tools=tools)
agent("Summarise every PDF in the reports subfolder")

Structured JSON output:

import json
from lazytools.documents import read_folder_docs

out = read_folder_docs("/reports", output_format="json")
if out.startswith("{"):                    # guard the not-found / empty cases
    data = json.loads(out)
    for rec in data["records"]:
        print(rec["filename"], rec["char_count"])
    if data["truncated"]:
        print(f"capped at {data['max_files']} of {data['total_found']} files")

From the command line:

python -m lazytools.documents /path/to/folder
python -m lazytools.documents /path/to/file.pdf
python -m lazytools.documents /path/to/folder --extensions pdf,docx --recursive
python -m lazytools.documents /path/to/folder --html-mode both --format json

Security & safety

  • base_dir sandbox. With base_dir set, a path that resolves outside it raises PermissionError. read_docs_tools requires it because the tool's path is LLM-controlled — without a sandbox an agent could read /etc/passwd, SSH keys, or .env files. Call read_folder_docs directly only from trusted, non-LLM code if you genuinely need un-sandboxed access.
  • Symlinks are not followed. Folder scans skip symlinked files, closing symlink-loop hangs and preventing a symlink from silently widening the read surface to other directories.
  • Size and count caps. max_file_bytes (default 10 MB) bounds per-file memory; max_files (default 500) bounds how many files a single call can read. Note that a small, heavily compressed file (e.g. a PDF) can still expand far beyond its on-disk size when its text is extracted.
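
The three protections above can be sketched with pathlib alone. This is an illustration under assumptions, not the lazytools implementation: `resolve_in_sandbox` and `list_candidates` are hypothetical names, the error text only approximates the real message, relative paths are assumed to be relative to base_dir, and the file-count cap is assumed to apply before the size check.

```python
from pathlib import Path


def resolve_in_sandbox(path, base_dir):
    """Resolve `path` and raise PermissionError if it lands outside `base_dir`."""
    base = Path(base_dir).resolve()
    candidate = Path(path)
    if not candidate.is_absolute():
        candidate = base / candidate  # assumption: relative to base_dir
    resolved = candidate.resolve()    # collapses ".." and follows symlinks
    try:
        resolved.relative_to(base)
    except ValueError:
        raise PermissionError(f"{path!r} escapes base_dir {str(base)!r}") from None
    return resolved


def list_candidates(folder, suffixes, max_files=500, max_file_bytes=10_000_000):
    """Return (files_to_read, skipped_notes, truncated) for one folder."""
    found = sorted(
        p for p in Path(folder).iterdir()
        if p.suffix.lower() in suffixes and not p.is_symlink() and p.is_file()
    )
    truncated = max_files is not None and len(found) > max_files
    if truncated:
        found = found[:max_files]
    keep, skipped = [], []
    for p in found:
        size = p.stat().st_size
        if max_file_bytes is not None and size > max_file_bytes:
            skipped.append(f"{p.name}: {size} bytes exceeds max_file_bytes={max_file_bytes}")
        else:
            keep.append(p)
    return keep, skipped, truncated
```

Resolving before comparing is the important design choice: a path such as "../secrets.txt", or a symlink that points elsewhere, is judged by where it actually lands rather than by how it is spelled.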

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| "[PDF unavailable — pip install pypdf]" (and similar) | Optional dep missing | pip install 'lazytoolkit[docs]' |
| PermissionError: … escapes base_dir … | path resolved outside the sandbox | Pass a path inside base_dir, or widen base_dir for trusted code |
| ValueError: read_docs_tools(base_dir=...) is required | Built the tool without a sandbox | Pass base_dir="/safe/dir", or use read_folder_docs directly for trusted use |
| "[No documents found … matching extensions: …]" | No files matched in folder mode | Check extensions and recursive; remember html also matches .htm |
| "[Skipped: file is N bytes, exceeds max_file_bytes=…]" | File over the size cap | Raise max_file_bytes, or set it to None to disable |
| "[Error reading file: …]" in output | Per-file read failure | Inspect the message; the scan continues for other files |

Pitfalls

  • extensions is ignored for single files. Point path at a file and it's read regardless of suffix (using the matching reader, or an "unsupported extension" note).
  • JSON output isn't always JSON. The not-found and empty-folder cases return plain strings — guard json.loads.
  • parsed HTML can drop content. trafilatura strips boilerplate aggressively; use full or both if you need the raw markup.
  • Compression bombs. The byte cap is a first line of defence, not a guarantee — a tiny PDF can expand massively on extract.
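
If extracted size matters in your setting, one possible mitigation is to cap the text after extraction as well as the bytes before it. The sketch below is a generic pattern, not a lazytools feature: `join_with_cap` and `max_chars` are hypothetical, and the newline separators are not counted toward the cap.

```python
def join_with_cap(chunks, max_chars):
    """Join text chunks with newlines, stopping once max_chars is reached.

    Returns (text, truncated). `chunks` may be a lazy iterator, for example
    one that extracts a PDF page at a time, so later pages are never decoded.
    """
    parts, total, truncated = [], 0, False
    for chunk in chunks:
        remaining = max_chars - total
        if remaining <= 0:
            truncated = True
            break
        if len(chunk) > remaining:
            parts.append(chunk[:remaining])  # keep only what still fits
            truncated = True
            break
        parts.append(chunk)
        total += len(chunk)
    return "\n".join(parts), truncated


print(join_with_cap(["abc", "def", "ghi"], 5))  # ('abc\nde', True)
```

Feeding it a generator rather than a list is what makes the cap effective, because extraction stops as soon as the budget is spent.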

See also

  • Skills — for ranked retrieval over a corpus, build a BM25 skill bundle instead of reading a whole folder per call.
  • Safety — the sandbox/guard philosophy shared across LazyTools.
  • Tools overview — every connector at a glance.