add arcodange-email-ingest — Zoho Mail → Dolibarr supplier-invoice drafts

V8 — first inbound-side skill. Closes the loop from "bill arrives by email" to "ready to enter in Dolibarr UI". Read-only at every layer. What ships: - arcodange-email-ingest/scripts/zoho-curl.sh OAuth wrapper with token cache (50 min TTL, mode 600) — avoids hitting Zoho OAuth rate limit on every invocation. - arcodange-email-ingest/scripts/email-list.sh List candidates in /Inbox/books (where the books@ alias auto- routes mail). --candidates-only filter on supplier patterns or attachments. --all-folders to scan everything. - arcodange-email-ingest/scripts/email-inspect.sh Pull message + attachments, pdftotext on each PDF, heuristic extract (supplier, ref, dates, totals, VAT rate), emit Dolibarr supplier-invoice draft JSON. Architecture choice — Zoho API (not IMAP): - books@arcodange.fr is an alias of gabrielradureau@arcodange.fr → one OAuth refresh_token covers everything. - Gmail folded in via forwarding (arcodange@gmail.com → books@) — no Google API setup, no app-passwords, no second OAuth flow. - Token-based auth, no SCA rabbit hole. V8.0 baseline (in /Inbox/books): - 3 candidates: Mistral AI facture, Anthropic Stripe receipt (Fwd Gmail), INPI payment receipt (Fwd Gmail). - Heuristic extraction is best-effort: works on amounts/refs for some templates, misses others (Mistral PDF format, Stripe receipt layout). - --save-pdf <DIR> lets the operator grab the PDFs for manual entry when the heuristic falls short. Rate-limit pitfall documented: Zoho OAuth refresh has an aggressive throttle ("too many requests continuously"). The cache file at $TMPDIR/zoho-access-$USER (mode 600, 50 min TTL) prevents this; on 401 the wrapper auto-refreshes once and retries. V8.1+ ideas in SKILL.md out-of-scope: - mark ingested emails (IMAP flag or Zoho label) - body text extraction (inline-HTML invoices) - per-template parsers or LLM-based extraction - IMAP fallback for non-Zoho mailboxes CLI: bin/arcodange email {list|inspect|curl} integrated. Base updates: dolibarr/SKILL.md cross-link, dolibarr/README.md env schema extended with ZOHO_CLIENT_ID/SECRET/REFRESH_TOKEN/DC. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 14:56:15 +02:00
parent a1042a483b
commit c2d8479f5e
9 changed files with 685 additions and 1 deletions
--- a/.claude/skills/arcodange-email-ingest/scripts/email-inspect.sh
+++ b/.claude/skills/arcodange-email-ingest/scripts/email-inspect.sh
@@ -0,0 +1,256 @@
+#!/usr/bin/env bash
+# Inspect one email by id and propose a Dolibarr supplier-invoice draft.
+#
+# Usage:
+#   email-inspect.sh <messageId> [--folder PATH]    # default folder: /Inbox/books
+#                                [--save-pdf DIR]   # save PDF attachments under DIR/
+#                                [--json]           # emit a single JSON object on stdout
+#
+# Pipeline (read-only):
+#   1. Find the message (in the given folder, default /Inbox/books).
+#   2. List attachments via /attachmentinfo.
+#   3. For each PDF attachment: download, run pdftotext, extract supplier-side
+#      heuristics (name, totals, dates, ref).
+#   4. Emit a draft "Dolibarr-ready" record per attachment so the operator can
+#      hand-create the supplier invoice in the Dolibarr UI.
+#
+# This skill DOES NOT write to Dolibarr. Auto-creation of supplier invoices is
+# V9 candidate.
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+ZOHO_CURL="${SCRIPT_DIR}/zoho-curl.sh"
+
+if [[ $# -lt 1 ]]; then
+  echo "email-inspect.sh: missing <messageId>" >&2
+  echo "  Hint: bin/arcodange email list  to see candidate ids." >&2
+  exit 2
+fi
+MID="$1"; shift || true
+FOLDER="/Inbox/books"; SAVE_PDF_DIR=""; FMT="text"
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --folder)   FOLDER="$2"; shift 2 ;;
+    --save-pdf) SAVE_PDF_DIR="$2"; shift 2 ;;
+    --json)     FMT="json"; shift ;;
+    -h|--help)  sed -n '2,18p' "$0" | sed 's/^# \{0,1\}//'; exit 0 ;;
+    *) echo "email-inspect.sh: unknown arg: $1" >&2; exit 2 ;;
+  esac
+done
+
+command -v pdftotext >/dev/null || { echo "email-inspect.sh: pdftotext not found (brew install poppler)" >&2; exit 2; }
+
+WORK="$(mktemp -d -t emailinspect.XXXXXX)"
+trap 'rm -rf "${WORK}"' EXIT
+
+# 1. accountId + folderId
+"${ZOHO_CURL}" /accounts > "${WORK}/accounts.json"
+AID=$(python3 -c "import json,sys; d=json.load(open(sys.argv[1])); print((d.get('data') or [{}])[0].get('accountId',''))" "${WORK}/accounts.json")
+"${ZOHO_CURL}" "/accounts/${AID}/folders" > "${WORK}/folders.json"
+FID=$(python3 -c "
+import json, sys
+d = json.load(open(sys.argv[1]))
+target = sys.argv[2]
+for f in (d.get('data') or []):
+    if f.get('path') == target:
+        print(f.get('folderId')); break" "${WORK}/folders.json" "${FOLDER}")
+[[ -z "${FID}" ]] && { echo "email-inspect.sh: folder '${FOLDER}' not found" >&2; exit 2; }
+
+# 2. Find the message in the folder listing (to grab metadata: subject, from, date)
+"${ZOHO_CURL}" "/accounts/${AID}/messages/view?folderId=${FID}&limit=100&sortorder=false&start=1" > "${WORK}/folder_msgs.json"
+python3 - "${WORK}/folder_msgs.json" "${MID}" > "${WORK}/meta.json" <<'PY'
+import json, sys
+d = json.load(open(sys.argv[1]))
+mid = sys.argv[2]
+for m in (d.get("data") or []):
+    if str(m.get("messageId")) == mid:
+        json.dump(m, sys.stdout); sys.exit(0)
+sys.exit(f"messageId {mid} not found in this folder")
+PY
+
+# 3. Attachment metadata
+"${ZOHO_CURL}" "/accounts/${AID}/folders/${FID}/messages/${MID}/attachmentinfo" > "${WORK}/attachinfo.json"
+
+# 4. Download each attachment — needs raw bytes (Accept: */*), not the JSON
+# wrapper's default. We bypass zoho-curl.sh for the attachment download but
+# reuse the cached access_token it wrote.
+set -a; source "${SCRIPT_DIR}/../../dolibarr/.env"; set +a
+: "${ZOHO_DC:=eu}"
+TOKEN_CACHE="${TMPDIR:-/tmp}/zoho-access-$(whoami)"
+if [[ ! -s "${TOKEN_CACHE}" ]]; then
+  echo "email-inspect.sh: missing access token cache — run any zoho-curl call first to populate it" >&2
+  exit 2
+fi
+ACCESS_TOKEN=$(cat "${TOKEN_CACHE}")
+MAIL_BASE="https://mail.zoho.${ZOHO_DC}/api"
+
+mkdir -p "${WORK}/atts" "${WORK}/text"
+ATT_IDS=$(python3 -c "
+import json, sys
+d = json.load(open(sys.argv[1]))
+data = d.get('data') or {}
+for a in (data.get('attachments') or []):
+    print(f\"{a.get('attachmentId')}|{a.get('attachmentName','-')}\")" "${WORK}/attachinfo.json")
+while IFS='|' read -r aid aname; do
+  [[ -z "${aid}" ]] && continue
+  outpath="${WORK}/atts/${aname}"
+  curl -sS \
+    -H "Authorization: Zoho-oauthtoken ${ACCESS_TOKEN}" \
+    -H "Accept: */*" \
+    --max-time 60 \
+    -o "${outpath}" \
+    "${MAIL_BASE}/accounts/${AID}/folders/${FID}/messages/${MID}/attachments/${aid}" || true
+  # If pdf, extract text (bash 3.2 compatible — no ${var,,})
+  aname_lc=$(echo "${aname}" | tr '[:upper:]' '[:lower:]')
+  if [[ "${aname_lc}" == *.pdf ]]; then
+    pdftotext -layout "${outpath}" "${WORK}/text/${aname%.pdf}.txt" 2>/dev/null || true
+  fi
+done <<< "${ATT_IDS}"
+
+# Optional save
+if [[ -n "${SAVE_PDF_DIR}" ]]; then
+  mkdir -p "${SAVE_PDF_DIR}"
+  cp "${WORK}/atts/"*.pdf "${SAVE_PDF_DIR}/" 2>/dev/null || true
+fi
+
+# 5. Heuristic extract + render
+python3 - "${WORK}" "${FMT}" <<'PY'
+import json, sys, os, re, datetime, glob
+work, fmt = sys.argv[1:3]
+
+meta = json.load(open(os.path.join(work,"meta.json")))
+ts = int(meta.get("sentDateInGMT") or meta.get("receivedTime") or 0) // 1000
+mail_date = datetime.datetime.fromtimestamp(ts).strftime("%Y-%m-%d") if ts else None
+mail_from = (meta.get("fromAddress") or meta.get("sender") or "-").replace("&lt;","<").replace("&gt;",">").replace("<","").replace(">","")
+mail_subject = meta.get("subject") or "-"
+
+# Heuristics on PDF text
+def extract(text):
+    out = {}
+    # First non-empty line is often the supplier name (or the address block first line)
+    lines = [l.strip() for l in text.splitlines() if l.strip()]
+    out["pdf_top_line"] = lines[0] if lines else None
+
+    # Total TTC / HT / TVA — try multiple French/English patterns
+    def first_match(*patterns):
+        for p in patterns:
+            for line in lines:
+                m = re.search(p, line, re.IGNORECASE)
+                if m: return m.group(1).replace(",", ".").replace(" ", "")
+        return None
+
+    def parse_amount(s):
+        if not s: return None
+        clean = s.replace(",", ".").replace(" ", "")
+        try:
+            v = float(clean)
+            # Money amounts < 1M EUR; filters out VAT-number false positives (FR12345678901)
+            return v if 0 <= v < 1_000_000 else None
+        except: return None
+
+    def first_amount(*patterns):
+        for p in patterns:
+            for line in lines:
+                m = re.search(p, line, re.IGNORECASE)
+                if m:
+                    v = parse_amount(m.group(1))
+                    if v is not None: return f"{v:.2f}"
+        return None
+
+    out["total_ht"]  = first_amount(r'(?:total\s*ht|montant\s*ht|net\s*amount|subtotal)[^\d-]*([\d \.,]+)')
+    # TVA: require currency suffix to avoid matching VAT-number digits
+    out["total_tva"] = first_amount(r'(?:tva|vat)[^\d-]*([\d \.,]+)\s*(?:€|eur)\b')
+    out["total_ttc"] = first_amount(r'(?:total\s*ttc|amount\s*due|total\s*due|grand\s*total|montant\s*total|amount\s*paid)[^\d-]*([\d \.,]+)')
+
+    # Invoice ref — must contain a digit (filters "umber", "Invoice", etc.)
+    m = re.search(r'(?:facture|invoice|receipt|reçu)\s*(?:n[°o]?|number|#|:)\s*([A-Za-z0-9][\w\d/-]{2,})', text, re.IGNORECASE)
+    if m and any(c.isdigit() for c in m.group(1)):
+        out["invoice_ref"] = m.group(1)
+    else:
+        # Fallback: any reasonable ref-shaped token after "Invoice" / "Facture" header
+        m = re.search(r'\b([A-Z]{2,}[-/]?\d[\w\d/-]{2,})\b', text)
+        out["invoice_ref"] = m.group(1) if m else None
+
+    # Invoice date — try ISO, French DD/MM/YYYY, English MM/DD/YYYY, French long form
+    out["invoice_date_raw"] = None
+    for p in (
+        r'\b(\d{4}-\d{2}-\d{2})\b',
+        r'(?:date|émise\s*le|invoice\s*date|date\s*de\s*facturation)[:\s]*(\d{1,2}[\s/.-]\d{1,2}[\s/.-]\d{2,4})',
+        r'(?:date|émise\s*le|invoice\s*date)[:\s]*(\d{1,2}\s+\w{3,9}\.?\s+\d{4})',
+    ):
+        m = re.search(p, text, re.IGNORECASE)
+        if m: out["invoice_date_raw"] = m.group(1).strip(); break
+
+    # VAT rate (e.g. "20%") — restrict to 0-25% so "100%" / page footers don't match.
+    vrate = None
+    for line in lines:
+        m = re.search(r'\b(\d{1,2}([.,]\d+)?)\s*%', line)
+        if m:
+            v = float(m.group(1).replace(",", "."))
+            if 0 <= v <= 25:
+                vrate = m.group(1).replace(",", "."); break
+    out["vat_rate_pct"] = vrate
+
+    return out
+
+pdfs = []
+for pdf in sorted(glob.glob(os.path.join(work,"atts","*.pdf")) +
+                  glob.glob(os.path.join(work,"atts","*.PDF"))):
+    name = os.path.basename(pdf)
+    txt_path = os.path.join(work,"text", os.path.splitext(name)[0] + ".txt")
+    text = open(txt_path).read() if os.path.isfile(txt_path) else ""
+    h = extract(text)
+    h["attachment_name"] = name
+    h["pdf_size_bytes"] = os.path.getsize(pdf)
+    h["pdf_text_len"] = len(text)
+    pdfs.append(h)
+
+result = {
+    "email": {
+        "messageId": meta.get("messageId"),
+        "subject":   mail_subject,
+        "from":      mail_from,
+        "date":      mail_date,
+        "hasAttachment": str(meta.get("hasAttachment","")) == "1",
+    },
+    "attachments": pdfs,
+    "dolibarr_draft_suggestions": [
+        {
+            "supplier_hint":  p.get("pdf_top_line"),
+            "invoice_ref":    p.get("invoice_ref"),
+            "invoice_date":   p.get("invoice_date_raw"),
+            "total_ht":       p.get("total_ht"),
+            "total_tva":      p.get("total_tva"),
+            "total_ttc":      p.get("total_ttc"),
+            "vat_rate_pct":   p.get("vat_rate_pct"),
+            "source_email":   meta.get("messageId"),
+            "source_attachment": p.get("attachment_name"),
+        } for p in pdfs
+    ]
+}
+
+if fmt == "json":
+    print(json.dumps(result, indent=2, ensure_ascii=False))
+    sys.exit(0)
+
+print("=" * 80)
+print(f" Email {meta.get('messageId')}")
+print("=" * 80)
+print(f"  subject   : {mail_subject}")
+print(f"  from      : {mail_from}")
+print(f"  date      : {mail_date}")
+print(f"  attached  : {result['email']['hasAttachment']}")
+print()
+if not pdfs:
+    print("  (no PDF attachments — try inspecting body or other types)")
+for i, p in enumerate(pdfs, 1):
+    print(f"  -- Attachment {i}: {p['attachment_name']} ({p['pdf_size_bytes']} bytes, {p['pdf_text_len']} chars extracted) --")
+    for k in ("pdf_top_line","invoice_ref","invoice_date_raw","total_ht","total_tva","total_ttc","vat_rate_pct"):
+        v = p.get(k)
+        print(f"    {k:<16} = {v!r}")
+    print()
+
+print("  Suggested Dolibarr supplier-invoice draft entries:")
+print(json.dumps(result["dolibarr_draft_suggestions"], indent=4, ensure_ascii=False))
+PY