Knowledge Module: Step-by-Step Tutorial
Build retrieval-ready knowledge from local markdown, HTTP docs, and MCP sources, then ask with references.
Step 1: Ingest local markdown (first run)
mosaic --project-state knowledge ingest \
--source local_md \
--path docs \
--namespace knowledge
Expected: ingest summary shows scanned and indexed chunk counts.
Step 2: Enable incremental mode for large corpora
mosaic --project-state knowledge ingest \
--source local_md \
--path docs \
--namespace knowledge \
--incremental \
--stale-after-hours 24 \
--max-files 5000 \
--max-file-size 524288
Use incremental mode for daily updates; it reuses unchanged chunks and removes stale entries.
Step 3: Add HTTP source docs
mosaic --project-state knowledge ingest \
--source http \
--url https://example.com/guide.md \
--namespace knowledge
To ingest a list of URLs with an auth header taken from the environment, retries, and a diagnostic report:
mosaic --project-state knowledge ingest \
--source http \
--url-file .mosaic/http-knowledge-urls.txt \
--header-env "Authorization=MOSAIC_DOC_TOKEN" \
--http-retries 3 \
--http-retry-backoff-ms 200 \
--continue-on-error \
--report-out .mosaic/reports/knowledge-http-ingest.json \
--namespace knowledge
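The batch command above reads its targets from a URL file and takes the header value from an environment variable. The following is a minimal sketch of preparing both before the run. It assumes the URL file holds one URL per line and that `MOSAIC_DOC_TOKEN` carries the full `Authorization` header value; this tutorial documents neither detail, and the second URL and the token value are placeholders.

```shell
# Assumption: one URL per line; the file format is not documented in this tutorial.
mkdir -p .mosaic
cat > .mosaic/http-knowledge-urls.txt <<'EOF'
https://example.com/guide.md
https://example.com/reference.md
EOF

# Placeholder value. Assumption: the variable holds the complete header value.
# In practice, load it from a secret store instead of hard-coding it.
export MOSAIC_DOC_TOKEN="Bearer replace-me"
```

Keeping the token in the environment, rather than on the command line, keeps it out of shell history and process listings.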
Step 4: Add MCP source docs
mosaic --project-state knowledge ingest \
--source mcp \
--mcp-server local-mcp \
--mcp-path docs \
--namespace knowledge
MCP ingestion reads from the configured server's working directory (cwd), optionally narrowed to a subpath with --mcp-path.
Step 5: Validate retrieval quality by search
mosaic --project-state knowledge search "gateway retry" --namespace knowledge --limit 20 --min-score 4
mosaic --project-state --json knowledge search "sandbox policy" --namespace knowledge --limit 10 --min-score 6
Step 6: Ask with retrieval augmentation
mosaic --project-state knowledge ask "How does retry policy work?" --namespace knowledge --top-k 8 --min-score 6
mosaic --project-state --json knowledge ask "What is the sandbox default?" --namespace knowledge --top-k 6 --min-score 6
mosaic --project-state --json knowledge ask "What is the sandbox default?" --namespace knowledge --top-k 6 --min-score 6 --references-only
knowledge ask augments the prompt using top-k snippets, and --min-score removes weak matches before they reach the model.
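To make the interaction between `--min-score` and `--top-k` concrete, here is a conceptual sketch in plain shell. It is not mosaic's implementation: the `filter_hits` helper and the `score<TAB>snippet` input format are hypothetical, and applying the score threshold before the top-k cut is an assumption.

```shell
# Conceptual sketch only; filter_hits is a hypothetical helper, not part of mosaic.
# Reads "score<TAB>snippet" lines on stdin.
filter_hits() {
  min_score="$1"
  top_k="$2"
  tab="$(printf '\t')"
  # Keep hits at or above the threshold, rank by score (highest first), keep top-k.
  awk -F '\t' -v min="$min_score" '$1 + 0 >= min + 0' \
    | sort -t "$tab" -k1,1nr \
    | head -n "$top_k"
}
```

For example, `printf '7\talpha\n3\tbeta\n9\tgamma\n6\tdelta\n' | filter_hits 6 2` keeps only the `gamma` (9) and `alpha` (7) lines: `beta` falls below the threshold, and `delta` passes it but is cut by the top-k limit.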
Step 7: Evaluate retrieval quality in batch
mosaic --project-state --json knowledge evaluate \
--query "gateway retry policy" \
--query "sandbox default profile" \
--query-file .mosaic/knowledge-eval-queries.txt \
--namespace knowledge \
--top-k 8 \
--min-score 6 \
--report-out .mosaic/reports/knowledge-eval.json
knowledge evaluate is retrieval-only: it outputs per-query hit counts and aggregate quality metrics (coverage, avg/p50/p90 top score).
Step 8: Baseline and regression gating
mosaic --project-state --json knowledge evaluate \
--query "gateway retry policy" \
--namespace knowledge \
--history-window 20 \
--update-baseline
mosaic --project-state --json knowledge evaluate \
--query "gateway retry policy" \
--namespace knowledge \
--max-coverage-drop 0.05 \
--max-avg-top-score-drop 1.0 \
--fail-on-regression
By default the baseline is read from .mosaic/data/knowledge-eval-baselines/<namespace>.json; override with --baseline.
Each evaluate run appends trend samples to .mosaic/data/knowledge-eval-history/<namespace>.jsonl; the JSON output includes previous-run deltas and window summaries.
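In a CI job, the regression-gated run can be wrapped so that the build fails when retrieval quality drops. The sketch below uses only flags shown in this tutorial; the `knowledge_gate` function name is hypothetical, and it assumes that `--fail-on-regression` makes `mosaic` exit non-zero when a threshold is exceeded.

```shell
# Hypothetical CI wrapper around the gated evaluate call.
# Assumption: --fail-on-regression yields a non-zero exit status on regression.
knowledge_gate() {
  namespace="$1"
  if mosaic --project-state --json knowledge evaluate \
      --query-file .mosaic/knowledge-eval-queries.txt \
      --namespace "$namespace" \
      --max-coverage-drop 0.05 \
      --max-avg-top-score-drop 1.0 \
      --fail-on-regression \
      --report-out .mosaic/reports/knowledge-eval.json; then
    echo "knowledge eval passed for $namespace"
  else
    echo "knowledge eval regressed for $namespace" >&2
    return 1
  fi
}
```

Call it as `knowledge_gate knowledge` after a baseline has been recorded with `--update-baseline`.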
Step 9: Tuning for huge markdown datasets
mosaic --project-state knowledge ingest \
--source local_md \
--path docs \
--namespace product \
--incremental \
--max-chunk-bytes 4096 \
--chunk-overlap-bytes 384 \
--max-content-bytes 8192 \
--stale-after-hours 12 \
--retain-missing
Recommended pattern: split by namespace, run frequent incremental jobs, and tune chunk/index limits by corpus size.
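The per-namespace pattern can be sketched as a small loop suitable for a scheduled job. The `ingest_all` function name and the `docs/<namespace>` directory layout are assumptions made for illustration; the flags are the ones used in the steps above.

```shell
# Hypothetical helper: one incremental ingest per namespace.
# Assumption: each namespace maps to a docs/<namespace> subdirectory.
ingest_all() {
  for namespace in "$@"; do
    mosaic --project-state knowledge ingest \
      --source local_md \
      --path "docs/$namespace" \
      --namespace "$namespace" \
      --incremental \
      --stale-after-hours 12 || return 1
  done
}
```

For example, `ingest_all product platform support` runs three ingests in sequence and stops at the first failure.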
Step 10: Storage and observability
- Project mode staging path: .mosaic-knowledge/sources/<namespace>/<source>/
- XDG mode staging path: <xdg-root>/knowledge/sources/<namespace>/<source>/
- Search index and runtime stats continue to use memory namespace storage.
knowledge ingest --report-out exports a unified ingest diagnostic report for local_md/http/mcp.
Step 11: Dataset lifecycle cleanup
mosaic --project-state --json knowledge datasets list
mosaic --project-state --json knowledge datasets list --namespace knowledge
mosaic --project-state --json knowledge datasets remove knowledge --dry-run
mosaic --project-state --json knowledge datasets remove knowledge
Use datasets list to audit staged sources, the memory index, and eval baselines and history. Run remove --dry-run before any destructive cleanup.
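The dry-run-first habit can be enforced with a small wrapper that previews the removal and asks for confirmation. This is a sketch: the `remove_dataset` function and its prompt are hypothetical, and only the two `datasets remove` invocations come from the commands above.

```shell
# Hypothetical wrapper: preview with --dry-run, then remove only on explicit "y".
remove_dataset() {
  namespace="$1"
  mosaic --project-state --json knowledge datasets remove "$namespace" --dry-run || return 1
  printf 'Remove dataset "%s"? [y/N] ' "$namespace"
  read -r answer
  case "$answer" in
    y|Y) mosaic --project-state --json knowledge datasets remove "$namespace" ;;
    *) echo "aborted"; return 1 ;;
  esac
}
```

Any answer other than `y` or `Y` aborts without touching the dataset.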