add arcodange-email-ingest — Zoho Mail → Dolibarr supplier-invoice drafts
V8 — first inbound-side skill. Closes the loop from "bill arrives by email"
to "ready to enter in Dolibarr UI". Read-only at every layer.
What ships:
- arcodange-email-ingest/scripts/zoho-curl.sh OAuth wrapper with token cache
(50 min TTL, mode 600) — avoids
hitting Zoho OAuth rate limit on
every invocation.
- arcodange-email-ingest/scripts/email-list.sh List candidates in /Inbox/books
(where the books@ alias auto-
routes mail). --candidates-only
filter on supplier patterns or
attachments. --all-folders to
scan everything.
- arcodange-email-ingest/scripts/email-inspect.sh Pull message + attachments,
pdftotext on each PDF, heuristic
extract (supplier, ref, dates,
totals, VAT rate), emit Dolibarr
supplier-invoice draft JSON.
Architecture choice — Zoho API (not IMAP):
- books@arcodange.fr is an alias of gabrielradureau@arcodange.fr → one OAuth
refresh_token covers everything.
- Gmail folded in via forwarding (arcodange@gmail.com → books@) — no Google
API setup, no app-passwords, no second OAuth flow.
- Token-based auth, no SCA rabbit hole.
V8.0 baseline (in /Inbox/books):
- 3 candidates: Mistral AI facture, Anthropic Stripe receipt (Fwd Gmail),
INPI payment receipt (Fwd Gmail).
- Heuristic extraction is best-effort: works on amounts/refs for some
templates, misses others (Mistral PDF format, Stripe receipt layout).
- --save-pdf <DIR> lets the operator grab the PDFs for manual entry when
the heuristic falls short.
Rate-limit pitfall documented: Zoho OAuth refresh has an aggressive throttle
("too many requests continuously"). The cache file at $TMPDIR/zoho-access-$USER
(mode 600, 50 min TTL) prevents this; on 401 the wrapper auto-refreshes once
and retries.
V8.1+ ideas in SKILL.md out-of-scope:
- mark ingested emails (IMAP flag or Zoho label)
- body text extraction (inline-HTML invoices)
- per-template parsers or LLM-based extraction
- IMAP fallback for non-Zoho mailboxes
CLI: bin/arcodange email {list|inspect|curl} integrated.
Base updates: dolibarr/SKILL.md cross-link, dolibarr/README.md env schema
extended with ZOHO_CLIENT_ID/SECRET/REFRESH_TOKEN/DC.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
107
.claude/skills/arcodange-email-ingest/SKILL.md
Normal file
107
.claude/skills/arcodange-email-ingest/SKILL.md
Normal file
@@ -0,0 +1,107 @@
|
||||
---
|
||||
name: arcodange-email-ingest
|
||||
description: Scrape supplier-invoice emails from the Arcodange Zoho mailbox (`gabrielradureau@arcodange.fr` + its `books@arcodange.fr` alias + forwarded Gmail) via the Zoho Mail OAuth API, list candidates matching supplier patterns, download PDF attachments, run pdftotext + heuristic extract, and emit Dolibarr-ready supplier-invoice draft JSON for the operator to paste into the Dolibarr UI. Two workflows — (1) list candidates in a folder (default `/Inbox/books` where the alias auto-routes mail); (2) inspect one message by id, download + parse PDFs, propose draft entries. Surfaces concrete data: supplier name guess (first PDF line), invoice ref, invoice date, total HT/TVA/TTC, VAT rate. Read-only at every layer (Zoho scopes are READ-only; no write to Dolibarr). Use when the user asks "list pending supplier invoices in mail", "ingest invoices from email", "draft Dolibarr entry from this email", "audit cohort supplier docs from mail". Depends on `dolibarr` for the shared `.env`. SKIP for write-side Dolibarr operations (V9 candidate), for non-Zoho mailboxes (use IMAP fallback in a future skill if needed), and for attachments that aren't PDFs (only PDF text extraction is wired today).
|
||||
requires:
|
||||
bins: ["curl", "jq", "python3", "pdftotext"]
|
||||
auth: true
|
||||
---
|
||||
|
||||
# arcodange-email-ingest — supplier-invoice emails → Dolibarr draft
|
||||
|
||||
Close the inbound side of the accounting loop: bills land in `books@arcodange.fr`, this skill turns them into Dolibarr-ready draft entries for the operator to validate + create.
|
||||
|
||||
Depends on the [dolibarr](../dolibarr/SKILL.md) base skill (shared `.env`).
|
||||
|
||||
**CLI shortcuts:** `bin/arcodange email list | inspect | curl`
|
||||
|
||||
## Architecture choice — Zoho API, not IMAP
|
||||
|
||||
We chose the Zoho Mail OAuth API over IMAP because:
|
||||
- **Richer metadata** — folder paths, attachment IDs, search operators, threads.
|
||||
- **One account covers everything** — `books@arcodange.fr` is an alias of `gabrielradureau@arcodange.fr`. One refresh_token + the `/accounts` endpoint exposes both, plus all the other aliases (`contact@`, `bonjour@`, etc.).
|
||||
- **Gmail folded in via forwarding** — `arcodange@gmail.com` forwards incoming to `books@` (configured in Gmail UI). No Google API setup, no app-passwords, no second OAuth flow.
|
||||
- **Token-only auth** — no app-password fragility, no SCA dance (unlike Wise).
|
||||
|
||||
The single canonical inbox path: **`/Inbox/books`** — Zoho's auto-filter routes incoming mail to the `books@` alias into this sub-folder. Scan it first; widen with `--all-folders` only if needed.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
1. Base skill set up ([dolibarr/README.md](../dolibarr/README.md)).
|
||||
2. Zoho OAuth Self-Client created and a refresh_token generated. The `.env` extension:
|
||||
```
|
||||
ZOHO_CLIENT_ID=<from api-console.zoho.com self-client>
|
||||
ZOHO_CLIENT_SECRET=<same>
|
||||
ZOHO_REFRESH_TOKEN=<exchanged from one-time code>
|
||||
ZOHO_DC=eu # eu | com | in | au
|
||||
```
|
||||
Setup walkthrough is in the V8 prep section of the cohort review notes.
|
||||
3. Gmail forwarding to `books@arcodange.fr` enabled (Gmail Settings → Forwarding and POP/IMAP).
|
||||
4. `pdftotext` (`brew install poppler` on macOS).
|
||||
|
||||
## Workflows
|
||||
|
||||
### 1. List candidates
|
||||
|
||||
```bash
|
||||
bin/arcodange email list # default: /Inbox/books, last 30 msgs, no filter
|
||||
bin/arcodange email list --candidates-only # filter to subjects/attachments matching supplier patterns
|
||||
bin/arcodange email list --folder /Inbox/contact --limit 50
|
||||
bin/arcodange email list --all-folders --candidates-only # scan everything (slower, more API calls)
|
||||
```
|
||||
|
||||
Captured at [examples/email-list.txt](examples/email-list.txt). The candidate filter matches subjects against `facture|invoice|receipt|reçu|payment|paiement|abonnement|subscription|order|commande|bill` OR any message with an attachment. The `[*]` column marks candidates, `[Y]` marks emails with attachments.
|
||||
|
||||
### 2. Inspect one email + draft Dolibarr entry
|
||||
|
||||
```bash
|
||||
bin/arcodange email inspect 1775141901205014300
|
||||
bin/arcodange email inspect 1775141901205014300 --folder /Inbox/books # default
|
||||
bin/arcodange email inspect 1775141901205014300 --save-pdf ~/Documents/factures-2026-Q2/
|
||||
bin/arcodange email inspect 1775141901205014300 --json # machine-readable
|
||||
```
|
||||
|
||||
The script:
|
||||
1. Fetches the email metadata (subject / from / date) via `/messages/view`.
|
||||
2. Lists attachments via `/messages/{mid}/attachmentinfo`.
|
||||
3. Downloads each attachment via `/messages/{mid}/attachments/{aid}`.
|
||||
4. For each `.pdf`, runs `pdftotext -layout`, applies regex heuristics to extract:
|
||||
- Supplier name guess (first non-empty PDF line — often the supplier letterhead).
|
||||
- Invoice reference (`facture/invoice n° XXX`).
|
||||
- Invoice date.
|
||||
- Total HT / TVA / TTC + VAT rate %.
|
||||
5. Emits a draft JSON record per attachment — paste into the Dolibarr UI manually.
|
||||
|
||||
Heuristics are intentionally conservative (regex-based, no LLM dependency). For PDF templates where the heuristic fails, the raw `pdftotext` output is on disk in the work dir; rerun with `--save-pdf` to grab the PDF for manual entry.
|
||||
|
||||
Captured at [examples/email-inspect.txt](examples/email-inspect.txt) for the V8 baseline (Mistral AI receipt).
|
||||
|
||||
## What it doesn't do (V8.0 scope)
|
||||
|
||||
- **Does not write to Dolibarr.** The supplier invoice is still created manually in the Dolibarr UI from the draft JSON. V9 candidate: automate via `/supplierinvoices` POST.
|
||||
- **Does not mark emails as ingested.** Each run re-emits the same candidates. V8.1 candidate: set the IMAP `\Flagged` flag or add a Zoho label `ingested` after the operator confirms.
|
||||
- **No body extraction yet.** We only parse PDF attachments. Inline-HTML invoices (rare — most suppliers send PDFs) would need body fetch via `/content`.
|
||||
- **Heuristic extraction is best-effort.** Different supplier PDF templates yield different field-extraction reliability. The draft JSON is a starting point, not ground truth.
|
||||
|
||||
## Token cache
|
||||
|
||||
`zoho-curl.sh` caches the OAuth access_token in `$TMPDIR/zoho-access-$USER` (mode 600, TTL 50 min). Avoids hitting Zoho's OAuth refresh rate-limit on every invocation. On 401, the wrapper auto-refreshes once and retries.
|
||||
|
||||
## API endpoints used (Zoho Mail)
|
||||
|
||||
| Endpoint | Purpose |
|
||||
|---|---|
|
||||
| `POST /oauth/v2/token` (accounts.zoho.{dc}) | Refresh access_token from refresh_token |
|
||||
| `GET /accounts` | Discover accountId + aliases on the account |
|
||||
| `GET /accounts/{aid}/folders` | List folders (with paths like `/Inbox/books`) |
|
||||
| `GET /accounts/{aid}/messages/view?folderId=&limit=&start=` | List messages in a folder |
|
||||
| `GET /accounts/{aid}/folders/{fid}/messages/{mid}/attachmentinfo` | List attachments metadata |
|
||||
| `GET /accounts/{aid}/folders/{fid}/messages/{mid}/attachments/{aid}` | Download attachment bytes |
|
||||
|
||||
## Out of scope
|
||||
|
||||
- **Writing to Dolibarr** (V9 candidate — would lift the read-only constraint on the API key, or use a separate write-scoped key).
|
||||
- **Marking ingested emails** (V8.1 trivial follow-up).
|
||||
- **Non-PDF attachments** (heuristics are PDF-specific).
|
||||
- **Body-text extraction** (would need `/content` endpoint, deferred).
|
||||
- **IMAP fallback** for non-Zoho mailboxes (deferred — Gmail forwarding to books@ covers the only known external mailbox today).
|
||||
- **LLM-based extraction** (deferred — regex covers the current set of supplier templates well enough).
|
||||
Reference in New Issue
Block a user