bim

BUVIS InfoMesh — full-featured Zettelkasten manager with query engine, templates, Jira sync, and a web dashboard.

Extra: uv tool install buvis-gems[bim]

Configuration

| Setting | Default | Description |
|---|---|---|
| path_zettelkasten | ~/bim/zettelkasten/ | Root directory for zettels |
| path_archive | ~/bim/reference/40-archives/ | Archive directory |

Env vars: BUVIS_BIM_PATH_ZETTELKASTEN, BUVIS_BIM_PATH_ARCHIVE.

Commands

bim create

Create a new zettel from a template.

# interactive (prompts for template type, title, tags)
bim create

# specify type and title directly
bim create -t project --title "Redesign homepage" --tags "web,design"

# list available templates
bim create -l

# pre-fill template answers
bim create -t meeting -a "attendees=Alice,Bob" -a "date=2025-01-15"

Options:

  • -t, --type TEXT — template type (note, project, etc.)

  • --title TEXT — zettel title

  • --tags TEXT — comma-separated tags

  • -a, --answer TEXT — template answer as key=value (repeatable)

  • -l, --list — list available templates

bim query

Query zettels with a YAML filter/sort/output spec.

# inline query: first 5 zettels
bim query -q '{output: {limit: 5}}'

# filter by type, pick columns
bim query -q '{
  columns: [{field: title}, {field: tags}],
  filter: {field: type, op: eq, value: project},
  output: {format: table}
}'

# load saved query from file
bim query -f my-query

# list saved queries
bim query -l

# pick result with fzf, open in nvim
bim query -q '{filter: {field: type, op: eq, value: note}}' -e

# interactive TUI
bim query -q '{output: {limit: 20}}' --tui

Options:

  • -f, --file TEXT — query name or path to YAML spec

  • -q, --query TEXT — inline YAML query string

  • -e, --edit — pick result with fzf and open in nvim

  • --tui — render output in interactive TUI

  • -l, --list — list available queries

Output formats: table, csv, markdown, json, jsonl, html, pdf, kanban.

See bim-query-examples.md for a comprehensive reference with filter operators, calculated columns, lookups, and more.

bim import

Import a markdown file into the zettelkasten.

bim import ~/Downloads/meeting-notes.md
bim import ~/Downloads/draft.md --tags "imported,review" --force --remove-original

Options:

  • --tags TEXT — comma-separated tags

  • --force — overwrite if target exists

  • --remove-original — delete source file after import

When importing interactively (no flags), if the note has no tags and ollama_model is configured globally (see Configuration), bim suggests tags via ollama. Each suggested tag is presented for confirmation. If ollama is unreachable, tag suggestion is skipped with a warning.

bim edit

Modify zettel metadata in-place.

bim edit ~/bim/zettelkasten/my-note.md --title "Better title"
bim edit ~/bim/zettelkasten/my-note.md --tags "updated,important"
bim edit ~/bim/zettelkasten/my-note.md --processed
bim edit ~/bim/zettelkasten/my-note.md -s "priority=high" -s "reviewer=alice"

Options:

  • --title TEXT — new title

  • --tags TEXT — comma-separated tags

  • --type TEXT — note type

  • --processed / --no-processed — processed flag

  • --publish / --no-publish — publish flag

  • -s, --set TEXT — arbitrary key=value metadata (repeatable)

bim format

Format a note’s metadata and content.

bim format ~/bim/zettelkasten/my-note.md
bim format ~/bim/zettelkasten/my-note.md -d    # show diff
bim format ~/bim/zettelkasten/my-note.md -h    # highlight output
bim format ~/bim/zettelkasten/my-note.md -o formatted.md

Options:

  • -h, --highlight — highlight formatted content

  • -d, --diff — show side-by-side diff if content changed

  • -o, --output FILE — write to file instead of in-place

bim show

Pretty-print a zettel.

bim show ~/bim/zettelkasten/my-note.md

bim archive

Mark zettel(s) as processed and move to archive directory.

bim archive ~/bim/zettelkasten/done-note.md
bim archive ~/bim/zettelkasten/a.md ~/bim/zettelkasten/b.md
bim archive --undo ~/bim/reference/40-archives/done-note.md

Options:

  • --undo — unarchive (move back to zettelkasten)

bim delete

Permanently delete zettel(s).

bim delete ~/bim/zettelkasten/obsolete.md
bim delete --force ~/bim/zettelkasten/a.md ~/bim/zettelkasten/b.md

Options:

  • --force — skip confirmation prompt

bim sync

Synchronize a note with an external system (currently Jira).

bim sync ~/bim/zettelkasten/project-note.md jira

Arguments: PATH_TO_NOTE, TARGET_SYSTEM.

bim serve

Start the web dashboard (SvelteKit frontend).

bim serve
bim serve -p 3000 -H 0.0.0.0
bim serve --no-browser

Options:

  • -p, --port INTEGER — port (default: 8000)

  • -H, --host TEXT — host (default: 127.0.0.1)

  • --no-browser — don’t auto-open browser

bim doc

Document ingestion + triage workflow. Files PDFs into a canonical filesystem layout and indexes each one with a Zettelkasten note, OCR’d and structured.

Extra: uv tool install buvis-gems[doc]

System dependencies (install separately):

  • Tesseract (with Czech language pack): brew install tesseract tesseract-lang

  • OCRmyPDF: brew install ocrmypdf

  • Ollama: brew install ollama, then ollama pull qwen3:30b-a3b (and ollama pull qwen3:14b for the fallback)

Configuration lives under [doc] in the bim config (see DocSettings for the full schema): paths.business_root, paths.vault_root, paths.state_dir, paths.issuers_file, plus ocr, classifier, and zettel blocks.

bim doc ingest

Run the ingest pipeline against a single staged PDF. The eight steps (dedup, OCR, classify, extract, name, write zettel, file, record) run behind mock-friendly boundaries, so dry-run-like behaviour is easy to test.

bim doc ingest ~/Downloads/invoice.pdf
bim doc ingest ~/Downloads/invoice.pdf --source email
bim doc ingest ~/cez-as/inbox/x.pdf --source issuer-inbox --issuer cez-as

Arguments: PDF_PATH (must exist).

Options:

  • --source — where the document entered the system. One of email, scan, download, issuer-inbox, backfill-canonical, backfill-noncanonical. Default: download.

  • --issuer — pre-pin an issuer slug. Honoured when --source issuer-inbox.

  • --strict — exit 1 on pipeline failure (for scripting). Default exit code is 0 even on success=False, matching the rest of the bim CLI. Triaged and duplicate outcomes are not failures and remain exit 0 regardless of this flag.

Outcomes (printed to console and recorded in state.db):

  • filed — PDF moved to <business_root>/<issuer-slug>/<canonical>.pdf and zettel written to <vault_root>/Zettelkasten/documents/<issuer-slug>/<canonical>.md (per-issuer subfolder; the vault layout mirrors the business root).

  • triaged — confidence too low or required field missing. The PDF lands in <business_root>/_triage/ with a .proposed.yml sidecar awaiting human review.

  • duplicate — sha256 already mapped to a filed document. A .duplicate.yml sidecar is written next to the staged input.
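The duplicate outcome keys on the file's sha256. A minimal sketch of computing that digest with the standard library is shown here. The `known_hashes` mapping is a hypothetical stand-in for the lookup against state.db, whose schema is not described in this document.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex sha256 of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(path: Path, known_hashes: dict[str, str]) -> bool:
    """Hypothetical check: known_hashes maps sha256 -> filed document path."""
    return file_sha256(path) in known_hashes
```

The resulting 64-character hex string has the same shape as the file-sha256 frontmatter value.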

Zettel v1 shape

Both bim doc ingest and bim doc promote produce zettels in the v1 shape: kebab-case keys throughout, single issuer field (the human display name), ISO-8601 ingested-at datetime with offset, the source-file link embedded in the file-path frontmatter key as a double-quoted Markdown link, optional LLM-generated summary paragraph, and per-issuer vault subfolder.

---
id: 20210311083422
title: ČEZ a.s. invoice 7102105594
type: document
doc-type: invoice
issuer: ČEZ a.s.
doc-number: 7102105594
doc-date: 2021-03-11
doc-amount: 4218.0
doc-currency: CZK
doc-language: cs
ingested-at: 2026-05-04 14:30:15+02:00
ingest-source: email
file-path: "[Open file](file:///Users/bob/Library/Mobile%20Documents/com~apple~CloudDocs/Business/cez-as/20210311083422-cez-as-7102105594.invoice.pdf)"
file-sha256: 3f4a8c2b91e7d5a6b1c2d3e4f5061728394a5b6c7d8e9f0a1b2c3d4e5f607182
ocr-engine: tesseract
ocr-mean-confidence: 0.91
extraction-method: rule:cez-invoice-2024-template:v1
tags:
  - document/invoice
  - issuer/cez-as
  - year/2021
---

# ČEZ a.s. invoice 7102105594

Vyúčtování za elektřinu za období 1.1.2021 – 28.2.2021. Splatnost 25.3.2021. Variabilní symbol 7102105594.

## OCR text

> [!quote]- Full text
> <full OCR text>

Reserved frontmatter keys:

  • id — bare 14-digit integer (Zettelkasten timestamp); never quoted.

  • title — single-line human-readable label, equals the body H1.

  • type — always document for ingested zettels.

  • doc-number — emitted as a bare integer when the string round-trips (str(int(s)) == s); otherwise quoted to preserve leading zeros.

  • ingested-at — tz-aware ISO 8601 with offset, parses with datetime.fromisoformat.

  • file-path — a double-quoted Markdown link with text Open file wrapping a URL-encoded file:// URL: "[Open file](file://<URL-encoded-absolute-path>)". The URL encoder preserves slashes and tildes (urllib.parse.quote(path, safe="/~")); spaces become %20. Obsidian renders the value as a clickable link in the Properties pane. The body carries no [Open PDF] or [Open file] line — the link is metadata, not prose.
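The doc-number, ingested-at, and file-path rules above can be sketched in a few lines of Python. The helper names `format_doc_number` and `format_file_path` are illustrative assumptions, not the names used in bim. Only `str(int(s)) == s`, `datetime.fromisoformat`, and `urllib.parse.quote(path, safe="/~")` come from the description. The sketch does not escape double quotes inside a value.

```python
from datetime import datetime
from urllib.parse import quote


def format_doc_number(raw: str) -> str:
    """Emit a bare integer if the string round-trips, else a quoted string."""
    try:
        round_trips = str(int(raw)) == raw
    except ValueError:
        round_trips = False  # not numeric at all, e.g. "INV-1"
    return raw if round_trips else f'"{raw}"'


def format_file_path(absolute_path: str) -> str:
    """Build the double-quoted Markdown link stored under file-path."""
    encoded = quote(absolute_path, safe="/~")  # keeps "/" and "~", space -> %20
    return f'"[Open file](file://{encoded})"'


print(format_doc_number("7102105594"))  # 7102105594
print(format_doc_number("0071"))        # "0071"
print(format_file_path("/Users/bob/My Docs/x.pdf"))
# "[Open file](file:///Users/bob/My%20Docs/x.pdf)"

# The ingested-at value parses directly and stays timezone-aware.
stamp = datetime.fromisoformat("2026-05-04 14:30:15+02:00")
print(stamp.utcoffset())  # 2:00:00
```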

bim doc promote

Promote an approved triage proposal into a filed document. Re-derives OCR from the staged PDF and ignores user-edited OCR text in the proposal.

bim doc promote ~/Business/_triage/x.invoice.pdf.proposed.yml

Arguments: YML_PATH — path to a <basename>.pdf.proposed.yml file whose sibling <basename>.pdf exists.

Options:

  • --strict — exit 1 on promote failure (for scripting). Default exit code is 0 on failure to match the rest of the bim CLI.

The proposal must have approved: true and a slug present in the issuer registry (or register_issuer: true to add a new issuer entry under flock).

Retry behaviour

The classifier and extractor stages retry transient HTTP failures up to classifier.max_retries (default 2) times against classifier.primary_model, then fall back once to classifier.fallback_model. Semantic failures (missing required fields, uncoercible values, unparseable model output) and requests.exceptions.Timeout short-circuit to triage immediately without retry or fallback, since retrying with the same input won’t help with a model-output problem.
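The retry schedule can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: the two exception classes are stand-ins (the real code deals with requests exceptions), `call_model` is a hypothetical callable, and the sketch assumes one initial attempt plus up to max_retries retries on the primary model. It also assumes that a transient failure on the fallback ends in triage, which the text above does not state.

```python
from typing import Callable


class TransientError(Exception):
    """Stand-in for a retryable HTTP failure (hypothetical name)."""


class SemanticOrTimeoutError(Exception):
    """Stand-in for a semantic failure or requests.exceptions.Timeout."""


def call_with_retry(
    call_model: Callable[[str], dict],
    primary_model: str,
    fallback_model: str,
    max_retries: int = 2,
) -> dict:
    """Return the model result, or {"outcome": "triaged"} when giving up."""
    # Assumption: one initial attempt plus up to max_retries retries on the
    # primary model, followed by exactly one attempt on the fallback model.
    schedule = [primary_model] * (1 + max_retries) + [fallback_model]
    for model in schedule:
        try:
            return call_model(model)
        except SemanticOrTimeoutError:
            # Short-circuit: no further retries and no fallback.
            return {"outcome": "triaged"}
        except TransientError:
            continue  # move on to the next attempt in the schedule
    # Assumption: the fallback failing transiently also ends in triage.
    return {"outcome": "triaged"}
```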

Issuer registry

The registry lives at ~/.dotfiles/bim/issuers.yml (configurable via paths.issuers_file). Top-level keys: version, doc_types, reserved_slugs, issuers. Each issuer maps a canonical kebab-case slug to display_name and a list of aliases the classifier uses to canonicalise vendor names from OCR text.

The file is treated as plaintext by all bim processes; encryption (e.g. via git-secret) happens at the dotfiles management layer.

Originals retention

Re-OCR keeps the pre-modification copy under <state_dir>/originals/<timestamp>-<sha256>.pdf for originals_retention_days (default 30). A garbage-collection command (bim doc gc-originals) is out of scope for v1; clean these manually if needed.

Rule engine

The pipeline runs a deterministic, declarative rule engine before the LLM classifier and extractor. For documents whose templates are stable (recurring vendor invoices, statements with fixed layouts), rules eliminate LLM calls entirely, making extraction reproducible and auditable.

When no rule matches, behavior is unchanged from LLM-only ingestion.

Why rules exist:

  • Determinism. A rule for CEZ invoices either matches or doesn’t. No probabilistic drift across model versions or sampling.

  • Auditability. A zettel’s extraction-method: rule:cez-invoice-2024-template:v1 records exactly which rule produced its metadata.

  • Cost. No round-trip to Ollama for documents a regex can pin.

Rule schema (under each issuer in issuers.yml):

issuers:
  cez-as:
    display_name: ČEZ a.s.
    aliases: [ČEZ, cez.cz]

    rules:
      - id: cez-invoice-2024-template
        version: 1
        priority: 100
        partial: false
        match:
          ocr_contains: ["IČ: 45274649", "Faktura"]
          ocr_matches: ["Faktura č\\.\\s*(\\d{10})"]
        extract:
          doc_type: invoice
          doc_number:
            from: ocr_match
            pattern: "Faktura č\\.\\s*(\\d{10})"
            group: 1
          doc_date:
            from: ocr_match
            pattern: "Datum vystavení:\\s*(\\d{2}\\.\\d{2}\\.\\d{4})"
            group: 1
            format: "%d.%m.%Y"
            transform: parse_date
          doc_amount:
            from: ocr_match
            pattern: "Celkem k úhradě:\\s*([\\d\\s]+),\\d{2}\\s*Kč"
            group: 1
            transform: strip_whitespace_to_int
          doc_currency: CZK
          doc_language: cs

      - id: cez-fingerprint
        partial: true
        match:
          ocr_contains: ["IČ: 45274649"]
        extract:
          issuer_slug: cez-as
          issuer_display: ČEZ a.s.
          doc_language: cs

A rule with partial: true pins some fields and lets the LLM fill the rest (typical use: fingerprint by IČO, let the LLM resolve the doc_type and specific fields).

Match clauses (v1 set):

| Clause | Behavior |
|---|---|
| ocr_contains | Substring(s) appear in OCR text. Case-folded + ASCII-folded. |
| ocr_matches | Regex(es) match OCR text via re.search. |
| email_from_domain | Sender domain matches .email.yml sidecar’s from. |
| email_subject_contains | Substring(s) appear in email subject. |
| email_subject_matches | Regex match against email subject. |
| original_filename_matches | Regex match against the source file’s original name. |

All clauses within a rule are ANDed. Source-irrelevant clauses (e.g. email_* on a scan) are silently false.
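A minimal sketch of how the two OCR clauses could be evaluated and ANDed is shown here. The function names `fold` and `rule_matches` are hypothetical. Dropping diacritics via NFKD decomposition is one plausible reading of "ASCII-folded", and requiring every listed substring and regex to match is an assumption about how multi-value clauses combine.

```python
import re
import unicodedata


def fold(text: str) -> str:
    """Case-fold, then drop combining marks so that 'Č' compares equal to 'c'."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def rule_matches(match: dict, ocr_text: str) -> bool:
    """AND together the ocr_contains and ocr_matches clauses of one rule."""
    folded = fold(ocr_text)
    # Substrings are compared on folded text; regexes run on the raw text.
    contains_ok = all(fold(s) in folded for s in match.get("ocr_contains", []))
    regex_ok = all(re.search(p, ocr_text) for p in match.get("ocr_matches", []))
    return contains_ok and regex_ok


match = {
    "ocr_contains": ["IČ: 45274649", "Faktura"],
    "ocr_matches": [r"Faktura č\.\s*(\d{10})"],
}
print(rule_matches(match, "Faktura č. 7102105594\nic: 45274649"))  # True
print(rule_matches(match, "Faktura č. 7102105594"))  # False
```

The first call passes even though OCR produced a lowercase `ic` without the caron, which is the kind of noise folding is meant to absorb.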

Transforms (v1 set):

strip_whitespace_to_int, strip_whitespace_to_decimal, parse_date (uses format), lowercase, uppercase, strip, slugify.
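Two of these transforms are easy to picture in Python. The function names match the transform names, but the bodies are assumptions about their behavior rather than the shipped code. The sample OCR line is invented to exercise the doc_amount pattern from the rule schema.

```python
import re
from datetime import date, datetime


def strip_whitespace_to_int(value: str) -> int:
    """Remove every whitespace character (including NBSP), then parse as int."""
    return int(re.sub(r"\s+", "", value))


def parse_date(value: str, fmt: str) -> date:
    """Parse using the rule's format string and return a date."""
    return datetime.strptime(value, fmt).date()


# Invented OCR line; the pattern is the doc_amount pattern after YAML unescaping.
ocr_line = "Celkem k úhradě: 4 218,00 Kč"
m = re.search(r"Celkem k úhradě:\s*([\d\s]+),\d{2}\s*Kč", ocr_line)
print(strip_whitespace_to_int(m.group(1)))  # 4218
print(parse_date("11.03.2021", "%d.%m.%Y"))  # 2021-03-11
```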

Precedence:

  1. Full rules (partial: false) beat partial rules.

  2. Among same partial-ness, higher priority wins.

  3. Ties broken by definition order in issuers.yml.

Conflict: when two matching rules of the same partial-ness and the same priority pin any field to different values, the document goes to triage with the reason rule_conflict: <id1> vs <id2>.
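The precedence and conflict rules can be sketched as a sort plus a same-tier comparison. The `Rule` dataclass and `pick_rule` function are hypothetical, the default priority of 0 is an assumption (the schema example omits priority on one rule), and the sketch only compares the winner against other rules in its tier.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Rule:
    id: str
    partial: bool = False
    priority: int = 0  # assumed default; not stated in the schema
    extract: dict = field(default_factory=dict)


def pick_rule(matched: list[Rule]) -> tuple[Rule | None, str | None]:
    """Return (winner, triage_reason) for matched rules in definition order."""
    if not matched:
        return None, None
    # Full rules sort first (False < True), then higher priority, then
    # definition order via the enumerate() index.
    ranked = sorted(
        enumerate(matched),
        key=lambda pair: (pair[1].partial, -pair[1].priority, pair[0]),
    )
    winner = ranked[0][1]
    for _, other in ranked[1:]:
        if (other.partial, other.priority) != (winner.partial, winner.priority):
            break  # lower tier, so it cannot conflict with the winner
        shared = winner.extract.keys() & other.extract.keys()
        if any(winner.extract[key] != other.extract[key] for key in shared):
            return None, f"rule_conflict: {winner.id} vs {other.id}"
    return winner, None
```

Under this reading, definition order only decides between same-tier rules that agree on every field they both pin.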

bim doc rules subcommands:

bim doc rules list

Print all rules with id, issuer, version, partial, priority, enabled.

bim doc rules validate

Static validation of issuers.yml rule blocks. Catches duplicate rule ids, uncompilable regexes, unknown transforms, reserved-field assignments. Run this after editing rules.

bim doc rules test <rule-id> --pdf <path>

Run one rule against one PDF. Prints clause-by-clause pass/fail and extracted fields. Read-only — no zettel, no file move.

bim doc rules backtest [--rule ID] [--issuer SLUG]

Walk business_root and report per-rule match counts grouped by issuer folder. Read-only. Slow on large archives (OCRs on demand). Run this before deploying any new rule — false positives that file documents under the wrong issuer with confident metadata are the most dangerous failure mode.

Authoring workflow: write rule → rules validate → rules test on a sample → rules backtest to verify no cross-folder hits → deploy.

bim doc audit

Read-only walk of the Business folder; reports drift between filed PDFs and their corresponding zettels. Never moves, deletes, or rewrites any file.

bim doc audit

No flags, no positional arguments. Output is a human-readable summary on stdout plus a structured JSON report at <state_dir>/audit/<iso-timestamp>.json.

What audit checks for each PDF:

| Check | Pass condition |
|---|---|
| Filename canonical | Matches <14digits>-<slug>-<title>.<doctype>.<ext> |
| Issuer registered | Folder name is a key in issuers.yml.issuers |
| Doc type valid | Suffix is in issuers.yml.doc_types |
| Zettel exists | <vault>/<doc-subdir>/<issuer-slug>/<basename>.md exists |
| OCR present | PDF has a text layer (audit uses pdfminer; mean confidence is not computed in v1, so the “low OCR confidence” check fires only when a confidence reader is plugged in) |
| sha256 in state.db | Document is tracked in the doc subsystem’s processed table |
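The "Filename canonical" check can be approximated with a regular expression. Slugs and titles may both contain hyphens, so this sketch takes the issuer slug from the containing folder name rather than guessing the split. That choice, the function name `parse_canonical_name`, and the permissive title pattern are all assumptions.

```python
from __future__ import annotations

import re


def parse_canonical_name(filename: str, issuer_slug: str) -> dict | None:
    """Split <14digits>-<slug>-<title>.<doctype>.<ext>, or return None."""
    pattern = re.compile(
        r"(?P<id>\d{14})-"
        + re.escape(issuer_slug)
        + r"-(?P<title>.+)\.(?P<doc_type>[^.]+)\.(?P<ext>[^.]+)"
    )
    m = pattern.fullmatch(filename)
    return m.groupdict() if m else None


print(parse_canonical_name("20210311083422-cez-as-7102105594.invoice.pdf", "cez-as"))
# {'id': '20210311083422', 'title': '7102105594', 'doc_type': 'invoice', 'ext': 'pdf'}
print(parse_canonical_name("invoice.pdf", "cez-as"))  # None
```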

What audit checks for the rule engine:

| Check | Pass condition |
|---|---|
| Rule file syntax | issuers.yml parses and validates against the schema |
| Rule id uniqueness | No two rules share an id (across all issuers) |
| Regex compiles | All pattern: values compile |
| No conflicts | No two enabled rules with same priority whose match clauses can both apply to the same document. Static-overlap heuristic: rules whose email_from_domain lists are disjoint cannot overlap; all other clause types (substrings, regex, filename regex) are conservatively treated as potentially overlapping because regex/substring disjointness is undecidable in general. When both rules pin the same extract field to statically-different constant values, that disagreement is included in the finding detail to help authors locate the source of the conflict. |
| Rule freshness | Each enabled rule has matched at least one document in the last 90 days (warning only; never fails the audit) |

Reports surfaced in stdout but treated as informational:

  • Per-issuer inbox/ directories with unprocessed PDFs.

  • _triage/ directory awaiting review.

JSON report contract. Top-level fields:

  • walked_pdf_count — every PDF the walker yielded.

  • clean_pdf_count — PDFs with no findings and no legacy zettel.

  • non_clean_pdf_count — PDFs with one or more findings, a legacy zettel, or both. The pair clean_pdf_count / non_clean_pdf_count is a true partition: clean + non_clean == walked. Consumers cannot derive non_clean from len(pdf_findings) + len(legacy_layout_zettels) — one PDF can contribute multiple findings and a legacy entry, so that arithmetic double-counts.

  • ocr_confidence_assessable_count — PDFs for which the OCR-quality reader actually returned a numeric mean confidence (has_text and confidence is not None). Zero means the reader cannot assess confidence at all for this run (the production pdfminer-based reader is one such reader); the stdout reporter uses this signal to replace the misleading 0 low OCR confidence row with an explicit low OCR confidence: not assessed notice.

  • pdf_findings — one entry per finding (not per PDF). A PDF with N findings produces N entries that share the same pdf_path. Each entry carries code (one of the PdfFindingCode literals, including the ocr_check_failed / hash_check_failed adapter failures), issuer_slug, doc_type, and an optional detail.

  • legacy_layout_zettels — absolute paths of zettels found at the v0 flat path <vault>/<doc-subdir>/<basename>.md rather than the v1 per-issuer path <vault>/<doc-subdir>/<issuer-slug>/<basename>.md. This array is the input for a future legacy-zettel migration command.

  • rule_findings — registry-loadability errors, priority conflicts, and stale-rule warnings.

  • issuer_inboxes — per-issuer inbox/ directories with unprocessed PDFs.

  • triage_pending — count of .proposed.yml files in _triage/.

  • n_issuers_walked — distinct folder slugs the walker entered (top-level PDFs contribute the empty slug).

  • total_rules_in_registry / total_issuers_in_registry — registry totals.

  • generated_at — ISO-8601 timestamp.

A PDF whose zettel is at the legacy flat path is reported in legacy_layout_zettels rather than as missing_zettel and counts toward non_clean_pdf_count.
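A consumer of the JSON report might verify the partition invariant and group findings per PDF along these lines. The `check_report` helper and the sample payload are illustrative. Only the field names and the finding codes missing_zettel and ocr_check_failed come from the contract above.

```python
import json
from collections import Counter


def check_report(report: dict) -> Counter:
    """Verify clean + non_clean == walked, then count findings per PDF."""
    walked = report["walked_pdf_count"]
    clean = report["clean_pdf_count"]
    non_clean = report["non_clean_pdf_count"]
    if clean + non_clean != walked:
        raise ValueError(f"partition broken: {clean} + {non_clean} != {walked}")
    # pdf_findings has one entry per finding, so a path can repeat.
    return Counter(entry["pdf_path"] for entry in report["pdf_findings"])


# Invented sample: one PDF with two findings, one PDF with a legacy zettel.
sample = json.loads("""
{
  "walked_pdf_count": 3,
  "clean_pdf_count": 1,
  "non_clean_pdf_count": 2,
  "pdf_findings": [
    {"pdf_path": "/b/cez-as/a.pdf", "code": "missing_zettel"},
    {"pdf_path": "/b/cez-as/a.pdf", "code": "ocr_check_failed"}
  ],
  "legacy_layout_zettels": ["/v/documents/x.md"]
}
""")
per_pdf = check_report(sample)
print(per_pdf["/b/cez-as/a.pdf"])  # 2
```

In the sample, adding the two array lengths gives 3 while non_clean_pdf_count is 2, which is why the count is published rather than derived.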