Files
erp/.claude/skills/arcodange-email-ingest/scripts/email-inspect.sh
Gabriel Radureau c2d8479f5e add arcodange-email-ingest — Zoho Mail → Dolibarr supplier-invoice drafts
V8 — first inbound-side skill. Closes the loop from "bill arrives by email"
to "ready to enter in Dolibarr UI". Read-only at every layer.

What ships:
- arcodange-email-ingest/scripts/zoho-curl.sh   OAuth wrapper with token cache
                                                (50 min TTL, mode 600) — avoids
                                                hitting Zoho OAuth rate limit on
                                                every invocation.
- arcodange-email-ingest/scripts/email-list.sh   List candidates in /Inbox/books
                                                (where the books@ alias auto-
                                                routes mail). --candidates-only
                                                filter on supplier patterns or
                                                attachments. --all-folders to
                                                scan everything.
- arcodange-email-ingest/scripts/email-inspect.sh   Pull message + attachments,
                                                pdftotext on each PDF, heuristic
                                                extract (supplier, ref, dates,
                                                totals, VAT rate), emit Dolibarr
                                                supplier-invoice draft JSON.

Architecture choice — Zoho API (not IMAP):
- books@arcodange.fr is an alias of gabrielradureau@arcodange.fr → one OAuth
  refresh_token covers everything.
- Gmail folded in via forwarding (arcodange@gmail.com → books@) — no Google
  API setup, no app-passwords, no second OAuth flow.
- Token-based auth, no SCA rabbit hole.

V8.0 baseline (in /Inbox/books):
- 3 candidates: Mistral AI facture, Anthropic Stripe receipt (Fwd Gmail),
  INPI payment receipt (Fwd Gmail).
- Heuristic extraction is best-effort: works on amounts/refs for some
  templates, misses others (Mistral PDF format, Stripe receipt layout).
- --save-pdf <DIR> lets the operator grab the PDFs for manual entry when
  the heuristic falls short.

Rate-limit pitfall documented: Zoho OAuth refresh has an aggressive throttle
("too many requests continuously"). The cache file at $TMPDIR/zoho-access-$USER
(mode 600, 50 min TTL) prevents this; on 401 the wrapper auto-refreshes once
and retries.

V8.1+ ideas in SKILL.md out-of-scope:
- mark ingested emails (IMAP flag or Zoho label)
- body text extraction (inline-HTML invoices)
- per-template parsers or LLM-based extraction
- IMAP fallback for non-Zoho mailboxes

CLI: bin/arcodange email {list|inspect|curl} integrated.
Base updates: dolibarr/SKILL.md cross-link, dolibarr/README.md env schema
extended with ZOHO_CLIENT_ID/SECRET/REFRESH_TOKEN/DC.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 14:56:15 +02:00

257 lines
10 KiB
Bash
Executable File

#!/usr/bin/env bash
# Inspect one email by id and propose a Dolibarr supplier-invoice draft.
#
# Usage:
# email-inspect.sh <messageId> [--folder PATH] # default folder: /Inbox/books
# [--save-pdf DIR] # save PDF attachments under DIR/
# [--json] # emit a single JSON object on stdout
#
# Pipeline (read-only):
# 1. Find the message (in the given folder, default /Inbox/books).
# 2. List attachments via /attachmentinfo.
# 3. For each PDF attachment: download, run pdftotext, extract supplier-side
# heuristics (name, totals, dates, ref).
# 4. Emit a draft "Dolibarr-ready" record per attachment so the operator can
# hand-create the supplier invoice in the Dolibarr UI.
#
# This skill DOES NOT write to Dolibarr. Auto-creation of supplier invoices is
# V9 candidate.
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
ZOHO_CURL="${SCRIPT_DIR}/zoho-curl.sh"
if [[ $# -lt 1 ]]; then
echo "email-inspect.sh: missing <messageId>" >&2
echo " Hint: bin/arcodange email list to see candidate ids." >&2
exit 2
fi
MID="$1"; shift || true
FOLDER="/Inbox/books"; SAVE_PDF_DIR=""; FMT="text"
while [[ $# -gt 0 ]]; do
case "$1" in
--folder) FOLDER="$2"; shift 2 ;;
--save-pdf) SAVE_PDF_DIR="$2"; shift 2 ;;
--json) FMT="json"; shift ;;
-h|--help) sed -n '2,18p' "$0" | sed 's/^# \{0,1\}//'; exit 0 ;;
*) echo "email-inspect.sh: unknown arg: $1" >&2; exit 2 ;;
esac
done
command -v pdftotext >/dev/null || { echo "email-inspect.sh: pdftotext not found (brew install poppler)" >&2; exit 2; }
WORK="$(mktemp -d -t emailinspect.XXXXXX)"
trap 'rm -rf "${WORK}"' EXIT
# 1. accountId + folderId
"${ZOHO_CURL}" /accounts > "${WORK}/accounts.json"
AID=$(python3 -c "import json,sys; d=json.load(open(sys.argv[1])); print((d.get('data') or [{}])[0].get('accountId',''))" "${WORK}/accounts.json")
"${ZOHO_CURL}" "/accounts/${AID}/folders" > "${WORK}/folders.json"
FID=$(python3 -c "
import json, sys
d = json.load(open(sys.argv[1]))
target = sys.argv[2]
for f in (d.get('data') or []):
if f.get('path') == target:
print(f.get('folderId')); break" "${WORK}/folders.json" "${FOLDER}")
[[ -z "${FID}" ]] && { echo "email-inspect.sh: folder '${FOLDER}' not found" >&2; exit 2; }
# 2. Find the message in the folder listing (to grab metadata: subject, from, date)
"${ZOHO_CURL}" "/accounts/${AID}/messages/view?folderId=${FID}&limit=100&sortorder=false&start=1" > "${WORK}/folder_msgs.json"
python3 - "${WORK}/folder_msgs.json" "${MID}" > "${WORK}/meta.json" <<'PY'
import json, sys
d = json.load(open(sys.argv[1]))
mid = sys.argv[2]
for m in (d.get("data") or []):
if str(m.get("messageId")) == mid:
json.dump(m, sys.stdout); sys.exit(0)
sys.exit(f"messageId {mid} not found in this folder")
PY
# 3. Attachment metadata
"${ZOHO_CURL}" "/accounts/${AID}/folders/${FID}/messages/${MID}/attachmentinfo" > "${WORK}/attachinfo.json"
# 4. Download each attachment — needs raw bytes (Accept: */*), not the JSON
# wrapper's default. We bypass zoho-curl.sh for the attachment download but
# reuse the cached access_token it wrote.
set -a; source "${SCRIPT_DIR}/../../dolibarr/.env"; set +a
: "${ZOHO_DC:=eu}"
TOKEN_CACHE="${TMPDIR:-/tmp}/zoho-access-$(whoami)"
if [[ ! -s "${TOKEN_CACHE}" ]]; then
echo "email-inspect.sh: missing access token cache — run any zoho-curl call first to populate it" >&2
exit 2
fi
ACCESS_TOKEN=$(cat "${TOKEN_CACHE}")
MAIL_BASE="https://mail.zoho.${ZOHO_DC}/api"
mkdir -p "${WORK}/atts" "${WORK}/text"
ATT_IDS=$(python3 -c "
import json, sys
d = json.load(open(sys.argv[1]))
data = d.get('data') or {}
for a in (data.get('attachments') or []):
print(f\"{a.get('attachmentId')}|{a.get('attachmentName','-')}\")" "${WORK}/attachinfo.json")
while IFS='|' read -r aid aname; do
[[ -z "${aid}" ]] && continue
outpath="${WORK}/atts/${aname}"
curl -sS \
-H "Authorization: Zoho-oauthtoken ${ACCESS_TOKEN}" \
-H "Accept: */*" \
--max-time 60 \
-o "${outpath}" \
"${MAIL_BASE}/accounts/${AID}/folders/${FID}/messages/${MID}/attachments/${aid}" || true
# If pdf, extract text (bash 3.2 compatible — no ${var,,})
aname_lc=$(echo "${aname}" | tr '[:upper:]' '[:lower:]')
if [[ "${aname_lc}" == *.pdf ]]; then
pdftotext -layout "${outpath}" "${WORK}/text/${aname%.pdf}.txt" 2>/dev/null || true
fi
done <<< "${ATT_IDS}"
# Optional save
if [[ -n "${SAVE_PDF_DIR}" ]]; then
mkdir -p "${SAVE_PDF_DIR}"
cp "${WORK}/atts/"*.pdf "${SAVE_PDF_DIR}/" 2>/dev/null || true
fi
# 5. Heuristic extract + render
python3 - "${WORK}" "${FMT}" <<'PY'
import json, sys, os, re, datetime, glob
work, fmt = sys.argv[1:3]
meta = json.load(open(os.path.join(work,"meta.json")))
ts = int(meta.get("sentDateInGMT") or meta.get("receivedTime") or 0) // 1000
mail_date = datetime.datetime.fromtimestamp(ts).strftime("%Y-%m-%d") if ts else None
mail_from = (meta.get("fromAddress") or meta.get("sender") or "-").replace("&lt;","<").replace("&gt;",">").replace("<","").replace(">","")
mail_subject = meta.get("subject") or "-"
# Heuristics on PDF text
def extract(text):
out = {}
# First non-empty line is often the supplier name (or the address block first line)
lines = [l.strip() for l in text.splitlines() if l.strip()]
out["pdf_top_line"] = lines[0] if lines else None
# Total TTC / HT / TVA — try multiple French/English patterns
def first_match(*patterns):
for p in patterns:
for line in lines:
m = re.search(p, line, re.IGNORECASE)
if m: return m.group(1).replace(",", ".").replace(" ", "")
return None
def parse_amount(s):
if not s: return None
clean = s.replace(",", ".").replace(" ", "")
try:
v = float(clean)
# Money amounts < 1M EUR; filters out VAT-number false positives (FR12345678901)
return v if 0 <= v < 1_000_000 else None
except: return None
def first_amount(*patterns):
for p in patterns:
for line in lines:
m = re.search(p, line, re.IGNORECASE)
if m:
v = parse_amount(m.group(1))
if v is not None: return f"{v:.2f}"
return None
out["total_ht"] = first_amount(r'(?:total\s*ht|montant\s*ht|net\s*amount|subtotal)[^\d-]*([\d \.,]+)')
# TVA: require currency suffix to avoid matching VAT-number digits
out["total_tva"] = first_amount(r'(?:tva|vat)[^\d-]*([\d \.,]+)\s*(?:€|eur)\b')
out["total_ttc"] = first_amount(r'(?:total\s*ttc|amount\s*due|total\s*due|grand\s*total|montant\s*total|amount\s*paid)[^\d-]*([\d \.,]+)')
# Invoice ref — must contain a digit (filters "umber", "Invoice", etc.)
m = re.search(r'(?:facture|invoice|receipt|reçu)\s*(?:n[°o]?|number|#|:)\s*([A-Za-z0-9][\w\d/-]{2,})', text, re.IGNORECASE)
if m and any(c.isdigit() for c in m.group(1)):
out["invoice_ref"] = m.group(1)
else:
# Fallback: any reasonable ref-shaped token after "Invoice" / "Facture" header
m = re.search(r'\b([A-Z]{2,}[-/]?\d[\w\d/-]{2,})\b', text)
out["invoice_ref"] = m.group(1) if m else None
# Invoice date — try ISO, French DD/MM/YYYY, English MM/DD/YYYY, French long form
out["invoice_date_raw"] = None
for p in (
r'\b(\d{4}-\d{2}-\d{2})\b',
r'(?:date|émise\s*le|invoice\s*date|date\s*de\s*facturation)[:\s]*(\d{1,2}[\s/.-]\d{1,2}[\s/.-]\d{2,4})',
r'(?:date|émise\s*le|invoice\s*date)[:\s]*(\d{1,2}\s+\w{3,9}\.?\s+\d{4})',
):
m = re.search(p, text, re.IGNORECASE)
if m: out["invoice_date_raw"] = m.group(1).strip(); break
# VAT rate (e.g. "20%") — restrict to 0-25% so "100%" / page footers don't match.
vrate = None
for line in lines:
m = re.search(r'\b(\d{1,2}([.,]\d+)?)\s*%', line)
if m:
v = float(m.group(1).replace(",", "."))
if 0 <= v <= 25:
vrate = m.group(1).replace(",", "."); break
out["vat_rate_pct"] = vrate
return out
pdfs = []
for pdf in sorted(glob.glob(os.path.join(work,"atts","*.pdf")) +
glob.glob(os.path.join(work,"atts","*.PDF"))):
name = os.path.basename(pdf)
txt_path = os.path.join(work,"text", os.path.splitext(name)[0] + ".txt")
text = open(txt_path).read() if os.path.isfile(txt_path) else ""
h = extract(text)
h["attachment_name"] = name
h["pdf_size_bytes"] = os.path.getsize(pdf)
h["pdf_text_len"] = len(text)
pdfs.append(h)
result = {
"email": {
"messageId": meta.get("messageId"),
"subject": mail_subject,
"from": mail_from,
"date": mail_date,
"hasAttachment": str(meta.get("hasAttachment","")) == "1",
},
"attachments": pdfs,
"dolibarr_draft_suggestions": [
{
"supplier_hint": p.get("pdf_top_line"),
"invoice_ref": p.get("invoice_ref"),
"invoice_date": p.get("invoice_date_raw"),
"total_ht": p.get("total_ht"),
"total_tva": p.get("total_tva"),
"total_ttc": p.get("total_ttc"),
"vat_rate_pct": p.get("vat_rate_pct"),
"source_email": meta.get("messageId"),
"source_attachment": p.get("attachment_name"),
} for p in pdfs
]
}
if fmt == "json":
print(json.dumps(result, indent=2, ensure_ascii=False))
sys.exit(0)
print("=" * 80)
print(f" Email {meta.get('messageId')}")
print("=" * 80)
print(f" subject : {mail_subject}")
print(f" from : {mail_from}")
print(f" date : {mail_date}")
print(f" attached : {result['email']['hasAttachment']}")
print()
if not pdfs:
print(" (no PDF attachments — try inspecting body or other types)")
for i, p in enumerate(pdfs, 1):
print(f" -- Attachment {i}: {p['attachment_name']} ({p['pdf_size_bytes']} bytes, {p['pdf_text_len']} chars extracted) --")
for k in ("pdf_top_line","invoice_ref","invoice_date_raw","total_ht","total_tva","total_ttc","vat_rate_pct"):
v = p.get(k)
print(f" {k:<16} = {v!r}")
print()
print(" Suggested Dolibarr supplier-invoice draft entries:")
print(json.dumps(result["dolibarr_draft_suggestions"], indent=4, ensure_ascii=False))
PY