add arcodange-email-ingest — Zoho Mail → Dolibarr supplier-invoice drafts #9

Merged
arcodange merged 1 commit from claude/arcodange-email-ingest into main 2026-05-31 14:56:59 +02:00
Owner

Summary

V8 — first inbound cross-system skill. Closes the loop from "bill arrives by email" to "ready to enter in Dolibarr UI". 11th skill in the family, 2nd arcodange-* after arcodange-bank-reco.

What ships

  • zoho-curl.sh — read-only OAuth wrapper for the Zoho Mail API. Caches the access_token in $TMPDIR/zoho-access-$USER (mode 600, 50 min TTL) to avoid Zoho's aggressive OAuth refresh throttle. Retries once on 401 with a fresh token.
  • email-list.sh — list candidate supplier-invoice emails. Default scope /Inbox/books (the alias-auto-routed folder). --candidates-only filters subjects matching supplier patterns OR attachments. --all-folders widens the scan.
  • email-inspect.sh — download all attachments for one message id, run pdftotext on each PDF, apply regex heuristics, emit a Dolibarr supplier-invoice draft JSON record per attachment. --save-pdf <DIR> to keep the PDFs for manual fallback when heuristics miss.
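
The "retry once on 401" behaviour described for zoho-curl.sh can be sketched as below. This is a minimal illustration, not the shipped script: get_token, force_token_refresh and do_request are hypothetical helper names, stubbed here so the control flow runs offline. In a real wrapper, do_request would wrap curl and print the HTTP status code.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of "retry once on 401"; not the actual zoho-curl.sh.

# Assumed helpers, stubbed so the sketch runs without network access:
#   get_token            prints the current access token
#   force_token_refresh  obtains a new token (the real one would rewrite the cache)
#   do_request           performs the call with the given token, prints the HTTP status
TOKEN_STATE="stale"
get_token()           { printf '%s\n' "$TOKEN_STATE"; }
force_token_refresh() { TOKEN_STATE="fresh"; }
do_request() {
  # Stub: a stale token yields 401, anything else 200.
  if [ "$1" = "stale" ]; then echo 401; else echo 200; fi
}

request_with_retry() {
  local status
  status="$(do_request "$(get_token)")"
  if [ "$status" = "401" ]; then
    # Exactly one refresh followed by exactly one retry; a second 401 is
    # returned to the caller rather than looping against the OAuth endpoint.
    force_token_refresh
    status="$(do_request "$(get_token)")"
  fi
  printf '%s\n' "$status"
}
```

Capping the retry at one attempt keeps a persistently invalid credential from turning into a burst of refresh calls.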

Architecture choice

  • Zoho API > IMAP : richer metadata (folders, attachments, search), token-only auth, no app-password fragility.
  • books@ is an alias of gabrielradureau@ → one OAuth refresh_token covers everything.
  • Gmail folded in via forwarding: arcodange@gmail.com → books@arcodange.fr. Zero Google API setup.
  • No SCA rabbit hole (Wise lesson learned).

V8.0 baseline findings (in /Inbox/books)

3 candidates currently:

  • Mistral AI facture (Apr 2) — invoice-MSTRL-API-814045-001.pdf, 20 % FR VAT
  • Anthropic Stripe receipt (Apr 12, Fwd from Gmail) — Invoice-9BF... + Receipt-2109-4005.pdf, 180 € autoliquidation 0 %
  • INPI payment receipt (Jan 9, Fwd from Gmail) — text-only, no attachment

With --all-folders --candidates-only the scan widens to 27 candidates including:

  • Darnis Operations F1042 (in /Notification — supplier invoice not in /Inbox/books!)
  • 3 Free Mobile factures (in /Inbox/abonnements)
  • Several KissMetrics-related emails (in /clients/KissMetrics)
  • Calendar invites (noise — V8.1 candidate to filter out)

Rate-limit pitfall documented

Zoho OAuth /token endpoint has an aggressive throttle ("too many requests continuously" within a few seconds of refreshes). The cache file at $TMPDIR/zoho-access-$USER (mode 600, 50 min TTL) prevents this entirely. We hit it during V8 development — documented so the next operator knows.
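
A TTL-bound cache of this kind could look roughly like the sketch below. It is an illustration under assumptions, not the code in zoho-curl.sh: fetch_fresh_token is a hypothetical stand-in for the real OAuth refresh call, and file age is checked with find -mmin because that test is available in both GNU and BSD find.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a TTL-bound token cache; not the actual zoho-curl.sh.

CACHE_FILE="${TMPDIR:-/tmp}/zoho-access-${USER:-$(id -un)}"
TTL_MINUTES=50

# Placeholder for the real OAuth refresh call (assumption: prints a token on stdout).
fetch_fresh_token() {
  echo "dummy-token-$(date +%s)"
}

get_access_token() {
  # find prints the path only if the file was modified less than TTL_MINUTES ago.
  if [ -n "$(find "$CACHE_FILE" -mmin "-$TTL_MINUTES" 2>/dev/null)" ]; then
    cat "$CACHE_FILE"
    return 0
  fi
  local token
  token="$(fetch_fresh_token)" || return 1
  # umask in a subshell so a new file is created with mode 600 from the start;
  # chmod covers the case where an older file already existed with other bits.
  ( umask 077; printf '%s\n' "$token" > "$CACHE_FILE" )
  chmod 600 "$CACHE_FILE"
  printf '%s\n' "$token"
}
```

Calling get_access_token twice within the TTL invokes the refresh helper only once, which is what keeps repeated CLI invocations away from the throttled endpoint.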

V8.1+ candidates (out of scope here)

  • Mark ingested emails (IMAP flag \Flagged or Zoho label ingested) to avoid re-processing on the next run.
  • Filter calendar invites / newsletters out of the candidate set (subject Invitation: / Updated invitation: / sender domain matching newsletter patterns).
  • Body text extraction (inline-HTML invoices — fetch via /content).
  • Per-template parsers (Mistral / Stripe / INPI / URSSAF templates) or LLM-based extraction for higher field-extraction reliability.
  • Auto-create the Dolibarr supplier invoice via API (V9, write-side — would need a write-scoped Dolibarr key).
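
As an illustration of the calendar-invite filter proposed in the list above, the sketch below drops rows whose subject starts with one of the two invitation prefixes. The tab-separated "id, subject" input is an assumed format chosen for the example, not the actual output of email-list.sh, and drop_calendar_invites is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the V8.1 invite filter.
# Assumed input: one message per line, "<id><TAB><subject>".

drop_calendar_invites() {
  # Keep only rows whose subject (field 2) does not start with an invite prefix.
  awk -F '\t' '$2 !~ /^(Invitation:|Updated invitation:)/'
}

# Usage with made-up rows:
#   printf '1\tInvitation: standup\n2\tFacture F1042\n' | drop_calendar_invites
```

Sender-domain matching for newsletters would need a separate rule and is not covered here.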

Heuristic extraction notes

PDF text varies wildly by template. The V8.0 heuristics work for some fields on some templates:

  • VAT rate detection is reliable (both Mistral 20% and Anthropic 0% extracted correctly)
  • Total amounts extract for Anthropic (180€) but miss on Mistral (different layout)
  • Invoice ref extraction picks the wrong token in some cases (on the Anthropic invoice it captures Arcodange's TVA number instead)

The operator uses the draft JSON as a starting point and fills in missing fields from the PDF (saved via --save-pdf).
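
To make the kind of regex heuristic discussed here concrete, the sketch below reads pdftotext-style text on stdin and emits a two-field JSON record. The function names and patterns are assumptions for illustration, not the rules in email-inspect.sh. The field names vat and total_ttc follow the test plan, but the exact output formatting is a guess. The amount pattern ignores thousands separators, so it shares the fragility described above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of regex heuristics over pdftotext output.

# First percentage found on a line that mentions TVA or VAT.
extract_vat_rate() {
  grep -iE '(tva|vat)' \
    | grep -oE '[0-9]+([.,][0-9]+)?[[:space:]]*%' \
    | head -n 1 \
    | tr -d '[:space:]%' \
    | tr ',' '.'
}

# Last two-decimal amount found on a line that mentions "total".
extract_total() {
  grep -iE 'total' \
    | grep -oE '[0-9]+[.,][0-9]{2}' \
    | tail -n 1 \
    | tr ',' '.'
}

# Read the text once, run both heuristics, print a JSON record.
# Fields that did not match are emitted as null for the operator to fill in.
draft_from_text() {
  local text vat total
  text="$(cat)"
  vat="$(printf '%s\n' "$text" | extract_vat_rate)"
  total="$(printf '%s\n' "$text" | extract_total)"
  printf '{"vat": %s, "total_ttc": %s}\n' "${vat:-null}" "${total:-null}"
}

# Usage: pdftotext -layout invoice.pdf - | draft_from_text
```

Emitting null instead of guessing keeps a missed field visible, which fits the workflow where the operator completes the draft from the saved PDF.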

Test plan

  • bin/arcodange email curl /accounts → returns the Arcodange account + alias list
  • bin/arcodange email list --candidates-only → 3 candidates in /Inbox/books
  • bin/arcodange email list --all-folders --candidates-only --limit 50 → 27+ candidates across folders
  • bin/arcodange email inspect 1775141901205014300 → Mistral PDF downloaded (74377 bytes), invoice_ref=API-814045-001, vat=20.0
  • bin/arcodange email inspect 1776017238960014300 → 2 Anthropic PDFs, both with total_ht=180.00 / total_ttc=180.00
  • Cache file at $TMPDIR/zoho-access-$USER mode 600 after first call
  • git diff --cached | grep -F <ZOHO_REFRESH_TOKEN> empty (verified pre-commit)
arcodange added 1 commit 2026-05-31 14:56:50 +02:00
V8 — first inbound-side skill. Closes the loop from "bill arrives by email"
to "ready to enter in Dolibarr UI". Read-only at every layer.

What ships:
- arcodange-email-ingest/scripts/zoho-curl.sh   OAuth wrapper with token cache
                                                (50 min TTL, mode 600) — avoids
                                                hitting Zoho OAuth rate limit on
                                                every invocation.
- arcodange-email-ingest/scripts/email-list.sh   List candidates in /Inbox/books
                                                (where the books@ alias auto-
                                                routes mail). --candidates-only
                                                filter on supplier patterns or
                                                attachments. --all-folders to
                                                scan everything.
- arcodange-email-ingest/scripts/email-inspect.sh   Pull message + attachments,
                                                pdftotext on each PDF, heuristic
                                                extract (supplier, ref, dates,
                                                totals, VAT rate), emit Dolibarr
                                                supplier-invoice draft JSON.

Architecture choice — Zoho API (not IMAP):
- books@arcodange.fr is an alias of gabrielradureau@arcodange.fr → one OAuth
  refresh_token covers everything.
- Gmail folded in via forwarding (arcodange@gmail.com → books@) — no Google
  API setup, no app-passwords, no second OAuth flow.
- Token-based auth, no SCA rabbit hole.

V8.0 baseline (in /Inbox/books):
- 3 candidates: Mistral AI facture, Anthropic Stripe receipt (Fwd Gmail),
  INPI payment receipt (Fwd Gmail).
- Heuristic extraction is best-effort: works on amounts/refs for some
  templates, misses others (Mistral PDF format, Stripe receipt layout).
- --save-pdf <DIR> lets the operator grab the PDFs for manual entry when
  the heuristic falls short.

Rate-limit pitfall documented: Zoho OAuth refresh has an aggressive throttle
("too many requests continuously"). The cache file at $TMPDIR/zoho-access-$USER
(mode 600, 50 min TTL) prevents this; on 401 the wrapper auto-refreshes once
and retries.

V8.1+ ideas (listed as out of scope in SKILL.md):
- mark ingested emails (IMAP flag or Zoho label)
- body text extraction (inline-HTML invoices)
- per-template parsers or LLM-based extraction
- IMAP fallback for non-Zoho mailboxes

CLI: bin/arcodange email {list|inspect|curl} integrated.
Base updates: dolibarr/SKILL.md cross-link, dolibarr/README.md env schema
extended with ZOHO_CLIENT_ID/SECRET/REFRESH_TOKEN/DC.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
arcodange merged commit 794aa18d2a into main 2026-05-31 14:56:59 +02:00
arcodange deleted branch claude/arcodange-email-ingest 2026-05-31 14:57:00 +02:00

Reference: arcodange-org/erp#9