arcodange-email-ingest V8.1: filter calendar invites + newsletter senders

email-list.sh gains two hard-exclusion filters (applied before the
candidate test, regardless of attachments):

- EXCLUDE_PATTERN matches subjects starting with Invitation: / Updated
  invitation: / Canceled event: / Accepted: / Declined: / Tentative: /
  Maybe: (after stripping Re:/Fwd:/Tr: prefixes). Filters Google Calendar
  events that always carry an .ics attachment.
- EXCLUDE_SENDER matches updates.<domain>, noreply@*calendar, news@,
  newsletter@. Filters newsletter blast traffic.
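
A minimal sketch of how the two exclusions behave, using the patterns from the email-list.sh hunk of this commit. The `is_excluded` helper and the sample subject/sender pairs are made up for illustration; they are not part of the script or of the real mailbox:

```python
import re

# Patterns copied from the email-list.sh hunk of this commit.
EXCLUDE_PATTERN = re.compile(
    r'^(?:re:\s*|fwd:\s*|tr:\s*)*'
    r'(?:invitation|updated\s+invitation|canceled\s+event|accepted|declined|tentative|maybe)\s*:',
    re.IGNORECASE,
)
EXCLUDE_SENDER = re.compile(
    r'(updates\.|noreply@.*calendar|@calendar\.|news@|newsletter@|@updates\.)',
    re.IGNORECASE,
)

def is_excluded(subject, sender):
    """Hypothetical helper: True when either hard exclusion fires."""
    return bool(EXCLUDE_PATTERN.match(subject.strip()) or EXCLUDE_SENDER.search(sender))

# Illustrative (invented) subject/sender pairs.
samples = [
    ("Invitation: Sync @ Mon 10am", "alice@example.com"),   # calendar subject
    ("Re: Fwd: Accepted: Sync", "bob@example.com"),         # prefixes are skipped first
    ("Weekly digest", "newsletter@example.com"),            # sender pattern
    ("Votre facture est disponible", "billing@example.com"),
]
results = [is_excluded(subject, sender) for subject, sender in samples]
print(results)  # [True, True, True, False]
```

Only the last sample survives to the candidate test, since neither its subject nor its sender hits an exclusion.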

Effect on the --all-folders --candidates-only baseline: 27 noisy entries → 12
actionable ones (calendar invites + the staying-ahead.ai newsletter blast
removed). Real supplier docs intact: Darnis F1042 in /Notification, 3 Free
Mobile invoices (2 in /Inbox/abonnements, 1 in /Inbox/helloworld), Mistral +
Anthropic in /Inbox/books.

The originally-planned --mark-ingested feature is deferred to V8.2:
flag-set requires the Zoho OAuth scope ZohoMail.messages.UPDATE which our
read-only refresh_token doesn't have. Documented in SKILL.md: once the
user opts in to the wider scope, --mark-ingested becomes a one-line flag
on email-inspect.sh and is_candidate() learns to skip flag_info messages.
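
A rough sketch of what the V8.2 skip could look like, assuming the message dict exposes the flag under a `flagid` key with the value `flag_info` for ingested mail (both names are taken from the SKILL.md note). The `skip_ingested` wrapper and the stand-in predicate are hypothetical; nothing here is implemented in V8.1:

```python
INGESTED_FLAG = "flag_info"  # value named in SKILL.md; assumed to mark ingested mail

def skip_ingested(is_candidate):
    """Wrap a candidate predicate so flagged messages are never candidates."""
    def wrapped(m):
        # Assumption: the flag is exposed under the "flagid" key of the message dict.
        if (m.get("flagid") or "") == INGESTED_FLAG:
            return False
        return is_candidate(m)
    return wrapped

# Stand-in predicate for the demo; the real is_candidate() lives in email-list.sh.
def has_attachment(m):
    return str(m.get("hasAttachment", "")) == "1"

check = skip_ingested(has_attachment)
fresh = {"hasAttachment": "1"}                        # never flagged
done = {"hasAttachment": "1", "flagid": "flag_info"}  # already ingested
print(check(fresh), check(done))  # True False
```

Wrapping the predicate keeps the flag check ahead of every inclusion rule, in the same way the V8.1 hard exclusions run first.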

Captured the new --all-folders baseline at examples/email-list-all-folders.txt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 15:18:31 +02:00
parent 794aa18d2a
commit 1d38f25c23
3 changed files with 45 additions and 3 deletions


@@ -49,7 +49,13 @@ bin/arcodange email list --folder /Inbox/contact --limit 50
 bin/arcodange email list --all-folders --candidates-only # scan everything (slower, more API calls)
 ```
-Captured at [examples/email-list.txt](examples/email-list.txt). The candidate filter matches subjects against `facture|invoice|receipt|reçu|payment|paiement|abonnement|subscription|order|commande|bill` OR any message with an attachment. The `[*]` column marks candidates, `[Y]` marks emails with attachments.
+Captured at [examples/email-list.txt](examples/email-list.txt). The candidate filter matches subjects against `facture|invoice|receipt|reçu|payment|paiement|abonnement|subscription|order|commande|bill` OR any message with an attachment.
+**Hard exclusions** (V8.1) — applied before the candidate test, regardless of attachments:
+- Subjects starting with `Invitation:` / `Updated invitation:` / `Canceled event:` / `Accepted:` / `Declined:` / `Tentative:` / `Maybe:` (after stripping `Re:` / `Fwd:` / `Tr:` prefixes) → filters calendar events that always carry an `.ics` attachment.
+- Senders matching newsletter/marketing patterns (`updates.<domain>`, `noreply@*calendar*`, `news@`, `newsletter@`, etc.).
+The `[*]` column marks candidates, `[Y]` marks emails with attachments. Compared to V8.0, V8.1 cuts the `--all-folders --candidates-only` baseline from ~27 noisy entries down to ~12 actionable ones.
 ### 2. Inspect one email + draft Dolibarr entry
@@ -78,7 +84,7 @@ Captured at [examples/email-inspect.txt](examples/email-inspect.txt) for the V8
 ## What it doesn't do (V8.0 scope)
 - **Does not write to Dolibarr.** The supplier invoice is still created manually in the Dolibarr UI from the draft JSON. V9 candidate: automate via `/supplierinvoices` POST.
-- **Does not mark emails as ingested.** Each run re-emits the same candidates. V8.1 candidate: set the IMAP `\Flagged` flag or add a Zoho label `ingested` after the operator confirms.
+- **Does not mark emails as ingested.** Each run re-emits the same candidates. Implementing this requires extending the OAuth scope: the current refresh_token only has READ scopes (`ZohoMail.messages.READ` etc.). The flag-set endpoint (`PUT /api/accounts/{aid}/updatemessage`) requires `ZohoMail.messages.UPDATE`, which would force the user to regenerate the refresh_token. **V8.2 candidate** — once the user opts in to the wider scope, `--mark-ingested` becomes a one-line flag on `email-inspect.sh` and `is_candidate()` in `email-list.sh` learns to skip messages with `flagid == flag_info`.
 - **No body extraction yet.** We only parse PDF attachments. Inline-HTML invoices (rare — most suppliers send PDFs) would need body fetch via `/content`.
 - **Heuristic extraction is best-effort.** Different supplier PDF templates yield different field-extraction reliability. The draft JSON is a starting point, not ground truth.


@@ -0,0 +1,16 @@
date cand att messageId folder from subject
----------------------------------------------------------------------------------------------------------------------------------
2026-05-20 [*] [Y] 1779312401677014300 /clients/KissMetrics rsirvent@digitalocean.com Re: VM not running despite status=active, after volume
2026-05-20 [*] [Y] 1779298419301014300 /clients/KissMetrics tdziuba@kissmetrics.io Re: VM not running despite status=active, after volume
2026-05-20 [*] [Y] 1779285954272004400 /clients/KissMetrics tdziuba@kissmetrics.io Re: VM not running despite status=active, after volume
2026-05-05 [*] [ ] 1777970798248014300 /Inbox/abonnements freemobile@free-mobile.fr Votre facture mobile Free est disponible
2026-04-21 [*] [Y] 1776785469477004300 /Notification noreply@hiway.fr Darnis Operations - Facture F1042
2026-04-12 [*] [Y] 1776017238960014300 /Inbox/books arcodange@gmail.com Fwd: Your receipt from Anthropic Ireland, Limited #2109
2026-04-04 [*] [ ] 1775264759983014300 /Inbox/abonnements freemobile@free-mobile.fr Votre facture mobile Free est disponible
2026-04-02 [*] [Y] 1775141901205014300 /Inbox/books no-reply@mistral.ai Votre facture nº MSTRL-API-814045-001 de Mistral AI SAS
2026-03-05 [*] [ ] 1772689535069004400 /Inbox/helloworld freemobile@free-mobile.fr Votre facture mobile Free est disponible
2026-02-08 [*] [Y] 1770582421208004400 /Inbox/bureaux ne-pas-repondre@portailpro.gouv.fr Valider votre espace personnel sur Portailpro.gouv
2026-01-09 [*] [ ] 1767989744791004400 /Inbox/books gabrielradureau@gmail.com Fwd: INPI - Votre paiement pour la commande Réf. 181876
2026-01-06 [*] [Y] 1767710535894005600 /Inbox gabrielradureau@gmail.com Statuts
----------------------------------------------------------------------------------------------------------------------------------
# 12 message(s) (candidates only)


@@ -91,9 +91,29 @@ CANDIDATE_PATTERN = re.compile(
     re.IGNORECASE,
 )
+# Subjects that look like calendar invites / event updates / generic notifications
+# get filtered out of --candidates-only — they always have a .ics attachment so
+# the "has-attachment" heuristic alone catches them as false positives.
+EXCLUDE_PATTERN = re.compile(
+    r'^(?:re:\s*|fwd:\s*|tr:\s*)*'  # strip Re:/Fwd:/Tr: prefixes
+    r'(?:invitation|updated\s+invitation|canceled\s+event|accepted|declined|tentative|maybe)\s*:',
+    re.IGNORECASE,
+)
+# Senders that are pure noise — newsletter/marketing patterns.
+EXCLUDE_SENDER = re.compile(
+    r'(updates\.|noreply@.*calendar|@calendar\.|news@|newsletter@|@updates\.)',
+    re.IGNORECASE,
+)
 def is_candidate(m):
+    subj = m.get("subject","") or ""
+    sender = m.get("fromAddress","") or m.get("sender","") or ""
+    # Hard exclusions take precedence over inclusions
+    if EXCLUDE_PATTERN.match(subj.strip()): return False
+    if EXCLUDE_SENDER.search(sender): return False
     if str(m.get("hasAttachment","")) == "1": return True
-    if CANDIDATE_PATTERN.search(m.get("subject","") or ""): return True
+    if CANDIDATE_PATTERN.search(subj): return True
     return False
 rows = []