arcodange-email-ingest V8.1: filter calendar invites + newsletter senders

email-list.sh gains two hard-exclusion filters (applied before the
candidate test, regardless of attachments):

- EXCLUDE_PATTERN matches subjects starting with Invitation: / Updated
  invitation: / Canceled event: / Accepted: / Declined: / Tentative: /
  Maybe: (after stripping Re:/Fwd:/Tr: prefixes). Filters Google Calendar
  events that always carry an .ics attachment.
- EXCLUDE_SENDER matches updates.<domain>, noreply@*calendar, news@,
  newsletter@. Filters newsletter blast traffic.

Effect on --all-folders --candidates-only baseline: 27 noisy → 12
actionable (calendar invites + the staying-ahead.ai newsletter blast
removed). Real supplier docs intact: Darnis F1042 in /Notification, 3 Free
Mobile factures in /Inbox/abonnements, Mistral + Anthropic in /Inbox/books.

The originally-planned --mark-ingested feature is deferred to V8.2:
flag-set requires the Zoho OAuth scope ZohoMail.messages.UPDATE which our
read-only refresh_token doesn't have. Documented in SKILL.md: once the
user opts in to the wider scope, --mark-ingested becomes a one-line flag
on email-inspect.sh and is_candidate() learns to skip flag_info messages.

Captured the new --all-folders baseline at examples/email-list-all-folders.txt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-31 15:18:31 +02:00
parent 794aa18d2a
commit 1d38f25c23
3 changed files with 45 additions and 3 deletions

View File

@@ -49,7 +49,13 @@ bin/arcodange email list --folder /Inbox/contact --limit 50
bin/arcodange email list --all-folders --candidates-only # scan everything (slower, more API calls)
```
Captured at [examples/email-list.txt](examples/email-list.txt). The candidate filter matches subjects against `facture|invoice|receipt|reçu|payment|paiement|abonnement|subscription|order|commande|bill` OR any message with an attachment. The `[*]` column marks candidates, `[Y]` marks emails with attachments.
Captured at [examples/email-list.txt](examples/email-list.txt). The candidate filter matches subjects against `facture|invoice|receipt|reçu|payment|paiement|abonnement|subscription|order|commande|bill` OR any message with an attachment.
**Hard exclusions** (V8.1) — applied before the candidate test, regardless of attachments:
- Subjects starting with `Invitation:` / `Updated invitation:` / `Canceled event:` / `Accepted:` / `Declined:` / `Tentative:` / `Maybe:` (after stripping `Re:` / `Fwd:` / `Tr:` prefixes) → filters calendar events that always carry an `.ics` attachment.
- Senders matching newsletter/marketing patterns (`updates.<domain>`, `noreply@*calendar*`, `news@`, `newsletter@`, etc.).
The `[*]` column marks candidates, `[Y]` marks emails with attachments. Compared to V8.0, V8.1 cuts the `--all-folders --candidates-only` baseline from ~27 noisy entries down to ~12 actionable ones.
### 2. Inspect one email + draft Dolibarr entry
@@ -78,7 +84,7 @@ Captured at [examples/email-inspect.txt](examples/email-inspect.txt) for the V8
## What it doesn't do (V8.0 scope)
- **Does not write to Dolibarr.** The supplier invoice is still created manually in the Dolibarr UI from the draft JSON. V9 candidate: automate via `/supplierinvoices` POST.
- **Does not mark emails as ingested.** Each run re-emits the same candidates. V8.1 candidate: set the IMAP `\Flagged` flag or add a Zoho label `ingested` after the operator confirms.
- **Does not mark emails as ingested.** Each run re-emits the same candidates. Implementing this requires extending the OAuth scope: the current refresh_token only has READ scopes (`ZohoMail.messages.READ` etc.). The flag-set endpoint (`PUT /api/accounts/{aid}/updatemessage`) requires `ZohoMail.messages.UPDATE`, which would force the user to regenerate the refresh_token. **V8.2 candidate** — once the user opts in to the wider scope, `--mark-ingested` becomes a one-line flag on `email-inspect.sh` and `is_candidate()` in `email-list.sh` learns to skip messages with `flagid == flag_info`.
- **No body extraction yet.** We only parse PDF attachments. Inline-HTML invoices (rare — most suppliers send PDFs) would need body fetch via `/content`.
- **Heuristic extraction is best-effort.** Different supplier PDF templates yield different field-extraction reliability. The draft JSON is a starting point, not ground truth.