Gabriel Radureau 1d38f25c23 arcodange-email-ingest V8.1: filter calendar invites + newsletter senders
email-list.sh gains two hard-exclusion filters (applied before the
candidate test, regardless of attachments):

- EXCLUDE_PATTERN matches subjects starting with Invitation: / Updated
  invitation: / Canceled event: / Accepted: / Declined: / Tentative: /
  Maybe: (after stripping Re:/Fwd:/Tr: prefixes). Filters Google Calendar
  events that always carry an .ics attachment.
- EXCLUDE_SENDER matches updates.<domain>, noreply@*calendar, news@,
  newsletter@. Filters newsletter blast traffic.

Effect on --all-folders --candidates-only baseline: 27 noisy → 12
actionable (calendar invites + the staying-ahead.ai newsletter blast
removed). Real supplier docs intact: Darnis F1042 in /Notification, 3 Free
Mobile factures in /Inbox/abonnements, Mistral + Anthropic in /Inbox/books.

The originally-planned --mark-ingested feature is deferred to V8.2:
flag-set requires the Zoho OAuth scope ZohoMail.messages.UPDATE which our
read-only refresh_token doesn't have. Documented in SKILL.md: once the
user opts in to the wider scope, --mark-ingested becomes a one-line flag
on email-inspect.sh and is_candidate() learns to skip flag_info messages.

Captured the new --all-folders baseline at examples/email-list-all-folders.txt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 15:18:31 +02:00


name: arcodange-email-ingest
description: Scrape supplier-invoice emails from the Arcodange Zoho mailbox (gabrielradureau@arcodange.fr + its books@arcodange.fr alias + forwarded Gmail) via the Zoho Mail OAuth API, list candidates matching supplier patterns, download PDF attachments, run pdftotext + heuristic extract, and emit Dolibarr-ready supplier-invoice draft JSON for the operator to paste into the Dolibarr UI. Two workflows: (1) list candidates in a folder (default /Inbox/books where the alias auto-routes mail); (2) inspect one message by id, download + parse PDFs, propose draft entries. Surfaces concrete data: supplier name guess (first PDF line), invoice ref, invoice date, total HT/TVA/TTC, VAT rate. Read-only at every layer (Zoho scopes are READ-only; no write to Dolibarr). Use when the user asks "list pending supplier invoices in mail", "ingest invoices from email", "draft Dolibarr entry from this email", "audit cohort supplier docs from mail". Depends on dolibarr for the shared .env. SKIP for write-side Dolibarr operations (V9 candidate), for non-Zoho mailboxes (use IMAP fallback in a future skill if needed), and for attachments that aren't PDFs (only PDF text extraction is wired today).
requires: bins: ["curl", "jq", "python3", "pdftotext"] auth: true

arcodange-email-ingest — supplier-invoice emails → Dolibarr draft

Close the inbound side of the accounting loop: bills land in books@arcodange.fr, and this skill turns them into Dolibarr-ready draft entries for the operator to validate and create.

Depends on the dolibarr base skill (shared .env).

CLI shortcuts: bin/arcodange email list | inspect | curl

Architecture choice — Zoho API, not IMAP

We chose the Zoho Mail OAuth API over IMAP because:

  • Richer metadata — folder paths, attachment IDs, search operators, threads.
  • One account covers everything: books@arcodange.fr is an alias of gabrielradureau@arcodange.fr. One refresh_token + the /accounts endpoint exposes both, plus all the other aliases (contact@, bonjour@, etc.).
  • Gmail folded in via forwarding: arcodange@gmail.com forwards incoming mail to books@ (configured in the Gmail UI). No Google API setup, no app-passwords, no second OAuth flow.
  • Token-only auth — no app-password fragility, no SCA dance (unlike Wise).

The single canonical inbox path is /Inbox/books: Zoho's auto-filter routes mail addressed to the books@ alias into this sub-folder. Scan it first; widen with --all-folders only if needed.

Prerequisites

  1. Base skill set up (dolibarr/README.md).
  2. Zoho OAuth Self-Client created and a refresh_token generated. The .env extension:
    ZOHO_CLIENT_ID=<from api-console.zoho.com self-client>
    ZOHO_CLIENT_SECRET=<same>
    ZOHO_REFRESH_TOKEN=<exchanged from one-time code>
    ZOHO_DC=eu                # eu | com | in | au
    
    Setup walkthrough is in the V8 prep section of the cohort review notes.
  3. Gmail forwarding to books@arcodange.fr enabled (Gmail Settings → Forwarding and POP/IMAP).
  4. pdftotext (brew install poppler on macOS).

Workflows

1. List candidates

bin/arcodange email list                      # default: /Inbox/books, last 30 msgs, no filter
bin/arcodange email list --candidates-only    # filter to subjects/attachments matching supplier patterns
bin/arcodange email list --folder /Inbox/contact --limit 50
bin/arcodange email list --all-folders --candidates-only   # scan everything (slower, more API calls)

Captured at examples/email-list.txt. The candidate filter matches subjects against facture|invoice|receipt|reçu|payment|paiement|abonnement|subscription|order|commande|bill OR any message with an attachment.
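
The candidate test described above can be sketched in Python as follows. This is a minimal illustration, not the code in email-list.sh: the keyword alternation is copied from the paragraph, but case-insensitive substring matching and the function signature are assumptions.

```python
import re

# Keyword alternation copied from the description above.
# Assumption: matching is case-insensitive and unanchored (substring match).
SUBJECT_PATTERN = re.compile(
    r"facture|invoice|receipt|reçu|payment|paiement|abonnement|"
    r"subscription|order|commande|bill",
    re.IGNORECASE,
)


def is_candidate(subject: str, has_attachment: bool) -> bool:
    """True when the subject hits a supplier keyword OR the message has an attachment."""
    return has_attachment or bool(SUBJECT_PATTERN.search(subject))
```

With unanchored matching, a subject such as "Border control update" would also match through "order". Whether the real script uses word boundaries is not stated here.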

Hard exclusions (V8.1) — applied before the candidate test, regardless of attachments:

  • Subjects starting with Invitation: / Updated invitation: / Canceled event: / Accepted: / Declined: / Tentative: / Maybe: (after stripping Re: / Fwd: / Tr: prefixes) → filters calendar events that always carry an .ics attachment.
  • Senders matching newsletter/marketing patterns (updates.<domain>, noreply@*calendar*, news@, newsletter@, etc.).
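
A rough Python sketch of the two hard exclusions, for readers who want to see the order of operations (strip reply/forward prefixes, then test the subject, then test the sender). The regexes below are guesses reconstructed from the bullet text, not the actual EXCLUDE_PATTERN / EXCLUDE_SENDER values, and the sender is assumed to be a bare address without a display name.

```python
import re

# Reply/forward prefixes stripped before the subject test (Re: / Fwd: / Tr:).
PREFIX = re.compile(r"^\s*(?:(?:re|fwd|tr)\s*:\s*)+", re.IGNORECASE)

# Calendar-event subject prefixes from the first bullet.
EXCLUDE_SUBJECT = re.compile(
    r"^(?:invitation|updated invitation|canceled event|"
    r"accepted|declined|tentative|maybe):",
    re.IGNORECASE,
)

# Sender patterns from the second bullet; the exact shapes are assumptions.
EXCLUDE_SENDER = re.compile(
    r"@updates\.|^noreply@.*calendar|^news@|^newsletter@",
    re.IGNORECASE,
)


def is_excluded(subject: str, sender: str) -> bool:
    """True when a message should be dropped before the candidate test runs."""
    stripped = PREFIX.sub("", subject)
    return bool(EXCLUDE_SUBJECT.match(stripped) or EXCLUDE_SENDER.search(sender))
```

Because the exclusion runs first, an invite with an .ics attachment is never promoted by the "any attachment" rule.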

The [*] column marks candidates; [Y] marks emails with attachments. Compared to V8.0, V8.1 cuts the --all-folders --candidates-only baseline from ~27 noisy entries down to ~12 actionable ones.

2. Inspect one email + draft Dolibarr entry

bin/arcodange email inspect 1775141901205014300
bin/arcodange email inspect 1775141901205014300 --folder /Inbox/books   # default
bin/arcodange email inspect 1775141901205014300 --save-pdf ~/Documents/factures-2026-Q2/
bin/arcodange email inspect 1775141901205014300 --json    # machine-readable

The script:

  1. Fetches the email metadata (subject / from / date) via /messages/view.
  2. Lists attachments via /messages/{mid}/attachmentinfo.
  3. Downloads each attachment via /messages/{mid}/attachments/{attid}, where {attid} is the attachment ID.
  4. For each .pdf, runs pdftotext -layout, applies regex heuristics to extract:
    • Supplier name guess (first non-empty PDF line — often the supplier letterhead).
    • Invoice reference (facture/invoice n° XXX).
    • Invoice date.
    • Total HT / TVA / TTC + VAT rate %.
  5. Emits a draft JSON record per attachment — paste into the Dolibarr UI manually.

Heuristics are intentionally conservative (regex-based, no LLM dependency). For PDF templates where the heuristic fails, the raw pdftotext output is on disk in the work dir; rerun with --save-pdf to grab the PDF for manual entry.
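
To give a feel for what regex-based extraction over pdftotext output looks like, here is a small Python sketch. Everything in it is illustrative: the patterns, the draft field names (supplier_guess, ref, total_ht, ...) and the sample text are hypothetical, not taken from the real script or from a real invoice. It only handles French-style labels (Total HT / TVA / Total TTC) and amounts without thousands separators.

```python
import json
import re
from typing import Optional

# Amount such as "100", "100,00" or "100.00" (no thousands separators).
AMOUNT = r"(\d+(?:[.,]\d{1,2})?)"


def _first(pattern: str, text: str) -> Optional[str]:
    """Return the first capture group of the first match, or None."""
    match = re.search(pattern, text, re.IGNORECASE)
    return match.group(1) if match else None


def _amount(pattern: str, text: str) -> Optional[float]:
    """Like _first, but converts a French or English decimal to float."""
    raw = _first(pattern, text)
    return float(raw.replace(",", ".")) if raw else None


def extract_draft(text: str) -> dict:
    """Build a draft record from pdftotext output; missing fields stay None."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return {
        "supplier_guess": lines[0] if lines else None,  # first non-empty line
        "ref": _first(r"(?:facture|invoice)\s*(?:n[°o]\.?|#)\s*:?\s*(\S+)", text),
        "date": _first(r"\b(\d{2}/\d{2}/\d{4}|\d{4}-\d{2}-\d{2})\b", text),
        "total_ht": _amount(r"total\s+ht\s*:?\s*" + AMOUNT, text),
        "vat_rate": _amount(r"tva\s*" + AMOUNT + r"\s*%", text),
        "total_tva": _amount(r"tva[^\n]*?%\s*" + AMOUNT, text),
        "total_ttc": _amount(r"total\s+ttc\s*:?\s*" + AMOUNT, text),
    }


# Fictitious sample, shaped like `pdftotext -layout` output.
SAMPLE = """
  ACME SARL
Facture n° FA-0001
Date : 12/05/2026
Total HT     100,00
TVA 20 %      20,00
Total TTC    120,00
"""

print(json.dumps(extract_draft(SAMPLE), ensure_ascii=False, indent=2))
```

Fields that do not match come back as null in the JSON, which keeps a failed heuristic visible to the operator instead of silently guessing.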

Captured at examples/email-inspect.txt for the V8 baseline (Mistral AI receipt).

What it doesn't do (V8.0 scope)

  • Does not write to Dolibarr. The supplier invoice is still created manually in the Dolibarr UI from the draft JSON. V9 candidate: automate via /supplierinvoices POST.
  • Does not mark emails as ingested. Each run re-emits the same candidates. Implementing this requires extending the OAuth scope: the current refresh_token only has READ scopes (ZohoMail.messages.READ etc.). The flag-set endpoint (PUT /api/accounts/{aid}/updatemessage) requires ZohoMail.messages.UPDATE, which would force the user to regenerate the refresh_token. V8.2 candidate — once the user opts in to the wider scope, --mark-ingested becomes a one-line flag on email-inspect.sh and is_candidate() in email-list.sh learns to skip messages with flagid == flag_info.
  • No body extraction yet. We only parse PDF attachments. Inline-HTML invoices (rare — most suppliers send PDFs) would need body fetch via /content.
  • Heuristic extraction is best-effort. Different supplier PDF templates yield different field-extraction reliability. The draft JSON is a starting point, not ground truth.

Token cache

zoho-curl.sh caches the OAuth access_token in $TMPDIR/zoho-access-$USER (mode 600, TTL 50 min). Avoids hitting Zoho's OAuth refresh rate-limit on every invocation. On 401, the wrapper auto-refreshes once and retries.
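
The caching behaviour can be sketched in Python as below. This is not the shell implementation in zoho-curl.sh: the function name, the refresh callback (which stands in for the POST /oauth/v2/token call) and the use of file mtime as the TTL clock are all assumptions made for illustration. The 401 retry is left out.

```python
import os
import time
from pathlib import Path
from typing import Callable, Optional

TTL_SECONDS = 50 * 60  # 50 min, as stated above


def cached_token(
    cache: Path, refresh: Callable[[], str], now: Optional[float] = None
) -> str:
    """Return the cached access token, refreshing it when missing or stale."""
    now = time.time() if now is None else now
    try:
        # Assumption: the cache file's mtime marks when the token was issued.
        if now - cache.stat().st_mtime < TTL_SECONDS:
            return cache.read_text().strip()
    except FileNotFoundError:
        pass  # no cache yet: fall through to refresh

    token = refresh()
    # Create with mode 0600 so the token is never group- or world-readable.
    fd = os.open(cache, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as fh:
        fh.write(token)
    os.chmod(cache, 0o600)  # also tighten a pre-existing file
    return token
```

Passing `now` explicitly is only there to make the TTL logic easy to exercise without waiting 50 minutes.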

API endpoints used (Zoho Mail)

| Endpoint | Purpose |
|---|---|
| POST /oauth/v2/token (accounts.zoho.{dc}) | Refresh access_token from refresh_token |
| GET /accounts | Discover accountId + aliases on the account |
| GET /accounts/{aid}/folders | List folders (with paths like /Inbox/books) |
| GET /accounts/{aid}/messages/view?folderId=&limit=&start= | List messages in a folder |
| GET /accounts/{aid}/folders/{fid}/messages/{mid}/attachmentinfo | List attachments metadata |
| GET /accounts/{aid}/folders/{fid}/messages/{mid}/attachments/{attid} | Download attachment bytes ({aid} = account ID, {attid} = attachment ID) |
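
As a reading aid, the endpoint paths above can be assembled like this in Python. The helper names are hypothetical, and two details are assumptions rather than facts from this document: that the Mail API base is https://mail.zoho.{dc}/api, and that `start` is 1-based. The `Zoho-oauthtoken` header prefix is the scheme Zoho documents for OAuth access tokens.

```python
from urllib.parse import urlencode


def api_base(dc: str = "eu") -> str:
    # Assumption: Mail API host follows the same {dc} suffix as accounts.zoho.{dc}.
    return f"https://mail.zoho.{dc}/api"


def auth_header(access_token: str) -> dict:
    """Authorization header for a Zoho OAuth access token."""
    return {"Authorization": f"Zoho-oauthtoken {access_token}"}


def list_messages_url(dc: str, account_id: str, folder_id: str,
                      limit: int = 30, start: int = 1) -> str:
    """URL for listing messages in one folder (default limit mirrors the CLI)."""
    query = urlencode({"folderId": folder_id, "limit": limit, "start": start})
    return f"{api_base(dc)}/accounts/{account_id}/messages/view?{query}"


def attachment_url(dc: str, account_id: str, folder_id: str,
                   message_id: str, attachment_id: str) -> str:
    """URL for downloading one attachment's bytes."""
    return (
        f"{api_base(dc)}/accounts/{account_id}/folders/{folder_id}"
        f"/messages/{message_id}/attachments/{attachment_id}"
    )
```

Keeping the account ID and the attachment ID as separate named parameters avoids confusing the two, since both appear in the same path.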

Out of scope

  • Writing to Dolibarr (V9 candidate — would lift the read-only constraint on the API key, or use a separate write-scoped key).
  • Marking ingested emails (deferred to V8.2; requires regenerating the refresh_token with the ZohoMail.messages.UPDATE scope).
  • Non-PDF attachments (heuristics are PDF-specific).
  • Body-text extraction (would need /content endpoint, deferred).
  • IMAP fallback for non-Zoho mailboxes (deferred — Gmail forwarding to books@ covers the only known external mailbox today).
  • LLM-based extraction (deferred — regex covers the current set of supplier templates well enough).