Hermes-Maintained Trilium Sitemap: Design

1. Pipeline Architecture

┌──────────────────────────────────────────────────────────────┐
│                    Google Search Console                     │
│  → reads sitemap.xml → finds 514 URLs → crawls/indexes       │
└───────────────────────────┬──────────────────────────────────┘
                            │ HTTPS
                            ▼
┌──────────────────────────────────────────────────────────────┐
│         Nginx (reverse proxy for trilium.atibm.com)          │
│                                                              │
│  location = /sitemap.xml {                                   │
│      proxy_pass https://trilium.atibm.com/share/api/notes/   │
│                  387btDIOMxHp/download;                      │
│  }                                                           │
│                                                              │
│  → proxies /sitemap.xml to the download URL of the Trilium   │
│    file-type note                                            │
└───────────────────────────┬──────────────────────────────────┘
                            │ Docker internal network
                            ▼
┌──────────────────────────────────────────────────────────────┐
│           Trilium container (zadam/trilium:0.63.7)           │
│                                                              │
│  Note 387btDIOMxHp (sitemap.xml)                             │
│  ├─ type: file                                               │
│  ├─ mime: application/xml                                    │
│  ├─ shareAlias: sitemap.xml                                  │
│  └─ content: sitemap with 514 URLs                           │
└───────────────────────────┬──────────────────────────────────┘
                            │ ETAPI (REST)
                            ▼
┌──────────────────────────────────────────────────────────────┐
│           Hermes Agent (cron: sitemap-auto-update)           │
│                                                              │
│  Runs every 6 hours (0 */6 * * *)                            │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ ~/.hermes/scripts/trilium-crawl-sitemap.py             │  │
│  │                                                        │  │
│  │  1. curl fetches the /share/hermes root page (~12s)    │  │
│  │     - subprocess + curl instead of urllib              │  │
│  │     - urllib hits SSL timeouts; curl: steady 0.5s/page │  │
│  │                                                        │  │
│  │  2. Regex extracts every href="./..." note link        │  │
│  │     - is_note_path filter: 8-15 char ID / alias / CVE- │  │
│  │     - excludes static assets (.js/.css/.png, etc.)     │  │
│  │                                                        │  │
│  │  3. Generates sitemap.xml (514 URLs)                   │  │
│  │     - root: daily / priority 1.0                       │  │
│  │     - others: weekly / priority 0.6                    │  │
│  │                                                        │  │
│  │  4. ETAPI PUT → Trilium note 387btDIOMxHp              │  │
│  │     - token: $TRILIUM_ETAPI_TOKEN (env variable)       │  │
│  │                                                        │  │
│  │  5. stdout → cron deliver → Telegram notification      │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘

📌 Note: BFS is not used, because the root page /share/hermes already links to every shared note
(each sub-page also links to all 650+ notes, so no deep crawl is needed)

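2. Sitemap Lifecycle

2.1 Triggers

Scheduled: runs automatically every 6 hours (0 */6 * * *)
Manual: run hermes cron run sitemap-auto-update
After adding notes: no manual step is needed; the next cron run discovers them, because the root page will include links to the new notes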

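2.2 Data Flow

Single-page fetch: curl retrieves /share/hermes (~12s, 123KB of HTML)
Link extraction: a regex extracts the note path from every href="./xxx"
Filtering: the is_note_path function keeps valid note paths (8-15 character ID / alias / CVE-)
XML generation: entries are assembled per the sitemap protocol (loc / lastmod / changefreq / priority)
Storage: the result is written to /tmp/sitemap-latest.xml
Sync: a curl PUT to the Trilium ETAPI updates the file-type note
Publishing: the Nginx proxy_pass rule serves the note publicly at /sitemap.xml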
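
The extraction and filtering steps above can be sketched as a small pure function. This is a simplified illustration, not the production script: it keeps only the ID and CVE- rules, and the sample HTML (including the alias CVE-2024-0001) is made up for the example.

```python
import re

BASE = "https://trilium.atibm.com"
ALIAS = "hermes"
NOTE_ID = re.compile(r"^[A-Za-z0-9]{8,15}$")  # 8-15 character note ID

def extract_note_urls(html, base=BASE, alias=ALIAS):
    """Return the sorted share URLs linked from one page of HTML."""
    urls = {f"{base}/share/{alias}"}  # the root page is always included
    for match in re.finditer(r'href="\./([^"]+)"', html):
        path = match.group(1)
        # Simplified filter: note IDs and CVE- aliases only (assumption).
        if NOTE_ID.match(path) or path.startswith("CVE-"):
            urls.add(f"{base}/share/{path}")
    return sorted(urls)

# Hypothetical sample: one note ID, one static asset, one CVE- alias.
sample = (
    '<a href="./387btDIOMxHp">a</a>'
    '<link href="./style.css">'
    '<a href="./CVE-2024-0001">c</a>'
)
for url in extract_note_urls(sample):
    print(url)
```

The static asset is dropped because its path contains a dot, which the ID pattern does not allow.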

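2.3 Why Not BFS

Trilium share pages have a "navigation page" structure: every page contains links to all other shared notes. The root page /share/hermes already contains all 514 links. BFS would only rediscover the same links, adding run time and server load for nothing.

Earlier versions tried concurrent BFS (5 to 50 workers) and hit two pitfalls:

Higher concurrency was slower: the Trilium container has a limited connection pool, so 20+ concurrent requests queued up and slowed down
Every page had the same content: each sub-page returned the same 650+ links, so BFS produced no new information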
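
One way to confirm the second pitfall on a given deployment is to compare the link set of a sub-page with that of the root page. The sketch below assumes both pages have already been fetched as HTML strings; the function names and sample markup are illustrative only.

```python
import re

def share_links(html):
    """Collect the relative share paths linked from a page."""
    return set(re.findall(r'href="\./([^"]+)"', html))

def new_links(root_html, child_html):
    """Links on the child page that the root page does not already list."""
    return share_links(child_html) - share_links(root_html)

# Hypothetical pages that link to the same two notes in a different order.
root = '<a href="./aaaaaaaa1">x</a><a href="./bbbbbbbb2">y</a>'
child = '<a href="./bbbbbbbb2">y</a><a href="./aaaaaaaa1">x</a>'
print(new_links(root, child))  # set()
```

An empty result for every sampled sub-page supports the single-page approach; a non-empty result would mean the root page is not a complete index.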

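2.4 Search Engine Consumption

Googlebot periodically polls https://trilium.atibm.com/sitemap.xml
It discovers the 514 shared-note URLs
It crawls each page's HTML and extracts the title, body text, and metadata
It builds the index, and the pages appear in Google search results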

3. Google Search Console Configuration

3.1 Submitting the Sitemap

Search Console → Property (trilium.atibm.com) → Sitemaps
Submitted URL: https://trilium.atibm.com/sitemap.xml
Verification: returns 200 + valid XML + every URL reachable (HTTP 200)
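
The "valid XML" part of that check can be approximated locally before submitting. The following is a minimal sketch using only the standard library; it verifies well-formedness and the root element, not full schema validity, and the sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def sitemap_locs(xml_bytes):
    """Parse a sitemap and return its <loc> values.

    Raises ET.ParseError if the XML is malformed and ValueError if the
    root element is not a sitemap <urlset>.
    """
    root = ET.fromstring(xml_bytes)
    if root.tag != f"{{{SITEMAP_NS}}}urlset":
        raise ValueError(f"unexpected root element: {root.tag}")
    locs = root.findall("sm:url/sm:loc", {"sm": SITEMAP_NS})
    return [el.text.strip() for el in locs if el.text]

# Hypothetical one-entry sitemap.
sample = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b'<url><loc>https://trilium.atibm.com/share/hermes</loc></url>'
    b'</urlset>'
)
print(sitemap_locs(sample))
```

Reachability of each URL still has to be checked separately, for example with one curl request per entry.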

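3.2 Verification Files

The Nginx site for Trilium already serves the following files (the first is used for Google Search Console ownership verification):

/google3eaf0c02c5ae17be.html → Google ownership verification
/ads.txt → Google AdSense
/robots.txt → crawler directives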

3.3 robots.txt Issue

The Sitemap line in the current robots.txt points to https://trilium.atibm.com/share (a page that answers with a 302 redirect). It should be changed to:

User-agent: *
Sitemap: https://trilium.atibm.com/sitemap.xml

This requires editing /usr/share/nginx/html/trilium/robots.txt on the Docker host.
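
After editing the file, the Sitemap directive can be checked with Python's standard robots.txt parser. This sketch works on the file's text, so it assumes the content has already been read from disk or fetched; RobotFileParser.site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

def sitemaps_in_robots(robots_text):
    """Return the Sitemap URLs declared in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.site_maps() or []  # site_maps() is None when none are declared

fixed = "User-agent: *\nSitemap: https://trilium.atibm.com/sitemap.xml\n"
print(sitemaps_in_robots(fixed))
```

If the list still contains https://trilium.atibm.com/share, the old file is being served.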

4. Script Code

File path: /opt/data/home/.hermes/scripts/trilium-crawl-sitemap.py

#!/usr/bin/env python3
"""Trilium sitemap crawler: single root-page fetch.

The root page /share/hermes links directly to ALL shared notes.
Fetches it once (~12s), extracts 500+ URLs, builds the sitemap, and
updates the Trilium note via ETAPI.

Cron-compatible: completes in ~15s, well within the 120s limit.
"""
import subprocess, sys, re, os
from xml.dom import minidom
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

BASE = "https://trilium.atibm.com"
ALIAS = "hermes"
OUTPUT_FILE = "/tmp/sitemap-latest.xml"
TIMEOUT = 20
SKIP_SUFFIXES = (".js", ".css", ".png", ".jpg", ".jpeg",
                 ".gif", ".svg", ".ico", ".woff", ".woff2", ".ttf", ".eot")

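def fetch(url):
    # Key point: shell out to curl via subprocess instead of using urllib.
    # Python's urllib hits SSL timeouts in this environment.
    # --fail makes curl exit non-zero on HTTP 4xx/5xx, so an error page is
    # never parsed as if it were the root page.
    try:
        r = subprocess.run(["curl", "-s", "--fail", "--max-time", str(TIMEOUT), url],
            capture_output=True, text=True, timeout=TIMEOUT+2)
        return r.stdout if r.returncode == 0 and r.stdout else None
    except Exception:
        return None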

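def is_note_path(p):
    # Accept the root alias, CVE-* aliases, 8-15 character note IDs, and
    # short alphanumeric aliases; reject static assets.
    if p.startswith("share/"):
        p = p[len("share/"):]
    if p.lower().endswith(SKIP_SUFFIXES):
        return False
    if p == ALIAS or p.startswith("CVE-"):
        return True
    if re.match(r'^[A-Za-z0-9]{8,15}$', p):
        return True
    if re.match(r'^[a-zA-Z][a-zA-Z0-9]{4,30}$', p):
        return True
    return False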

now = datetime.now(timezone.utc).strftime("%Y-%m-%d")

html = fetch(f"{BASE}/share/{ALIAS}")
if not html:
    sys.stderr.write("ERROR: root page fetch failed\n")
    sys.exit(1)

urls = {f"{BASE}/share/{ALIAS}"}
for m in re.finditer(r'href="\./([^"]+)"', html):
    c = m.group(1)
    if c != ALIAS and is_note_path(c):
        urls.add(f"{BASE}/share/{c}")

# Build XML
urlset = ET.Element("urlset",
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
el = ET.SubElement(urlset, "url")
ET.SubElement(el, "loc").text = f"{BASE}/share/{ALIAS}"
ET.SubElement(el, "lastmod").text = now
ET.SubElement(el, "changefreq").text = "daily"
ET.SubElement(el, "priority").text = "1.0"
for u in sorted(urls - {f"{BASE}/share/{ALIAS}"}):
    el = ET.SubElement(urlset, "url")
    ET.SubElement(el, "loc").text = u
    ET.SubElement(el, "lastmod").text = now
    ET.SubElement(el, "changefreq").text = "weekly"
    ET.SubElement(el, "priority").text = "0.6"

xml_bytes = minidom.parseString(
    ET.tostring(urlset, encoding="unicode")
).toprettyxml(indent="  ", encoding="utf-8")

with open(OUTPUT_FILE, "wb") as f: f.write(xml_bytes)
sys.stdout.buffer.write(xml_bytes)

# ETAPI update
token = os.environ.get("TRILIUM_ETAPI_TOKEN")
if not token:
    for ep in ["/opt/data/.env", os.path.expanduser("~/.env")]:
        if os.path.exists(ep):
            with open(ep) as f:
                for line in f:
                    if line.startswith("TRILIUM_ETAPI_TOKEN="):
                        # Strip the trailing newline first, then any quotes.
                        token = line.split("=", 1)[1].strip().strip("\"'")
                        break
            if token: break

nid = os.environ.get("TRILIUM_SITEMAP_NOTE_ID", "387btDIOMxHp")
if token:
    # --fail makes the exit code reflect HTTP errors such as 401.
    r = subprocess.run(["curl", "-s", "--fail", "-X", "PUT",
        f"https://trilium.atibm.com/etapi/notes/{nid}/content",
        "-H", f"Authorization: {token}",
        "-H", "Content-Type: text/plain",
        "--data-binary", "@" + OUTPUT_FILE],
        capture_output=True, text=True, timeout=20)
    sys.stderr.write(f"ETAPI: exit={r.returncode}\n")

sys.stderr.write(f"DONE: {len(urls)} URLs -> {OUTPUT_FILE}\n")

5. Cron Job Configuration

hermes cron list
────────────────────────────────────────────────
sitemap-auto-update
  schedule: 0 */6 * * *           (daily at 00:00, 06:00, 12:00, 18:00)
  no_agent: true                  (runs the script directly, bypassing the LLM)
  script:   trilium-crawl-sitemap.py
  deliver:  origin                (results are sent back to the Telegram group)
  state:    scheduled
  next:     2026-05-15 18:00
────────────────────────────────────────────────
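
The firing times implied by the schedule can be reproduced with a few lines of arithmetic. This is a sketch for schedules of the form 0 */N * * * only, where N divides 24; it is not a general cron parser and is unrelated to how Hermes computes the next field.

```python
from datetime import datetime, timedelta

def next_run(now, step_hours=6):
    """Next firing time of '0 */N * * *' strictly after `now`."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # Round the current hour down to a slot, then move one slot forward.
    next_hour = (now.hour // step_hours + 1) * step_hours
    return midnight + timedelta(hours=next_hour)

print(next_run(datetime(2026, 5, 15, 13, 30)))  # 2026-05-15 18:00:00
```

A run that starts exactly on a slot boundary, such as 18:00, yields the following slot (00:00 the next day).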

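6. Operational Metrics

Current sitemap URL count: 514 (all HTTP 200 ✅)
Crawl method: single root-page curl (no BFS needed)
Crawl duration: ~15 seconds (root page 12s + ETAPI update 3s)
Failure rate: 0% (a single request, so no concurrency issues)
Update frequency: every 6 hours (meets Google's recommended frequency)
Sitemap file size: about 90KB (far below the 50MB limit)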
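
The size headroom can be checked mechanically. The sketch below uses the sitemap protocol's per-file limits (50MB uncompressed and 50,000 URLs); counting <loc> tags as a byte search is a shortcut assumed here for brevity, not a full parse.

```python
MAX_BYTES = 50 * 1024 * 1024  # 50MB uncompressed, per the sitemap protocol
MAX_URLS = 50_000             # maximum URLs in a single sitemap file

def within_sitemap_limits(xml_bytes):
    """Return True if the sitemap fits in a single file under both limits."""
    url_count = xml_bytes.count(b"<loc>")
    return len(xml_bytes) <= MAX_BYTES and url_count <= MAX_URLS

# Hypothetical two-entry body.
sample = b"<urlset><url><loc>a</loc></url><url><loc>b</loc></url></urlset>"
print(within_sitemap_limits(sample))  # True
```

At 514 URLs and about 90KB, the current file is far from either limit, so splitting into a sitemap index is not needed.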

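7. Troubleshooting

Symptom	Likely cause	Fix
sitemap returns 404	Nginx proxy_pass is misconfigured, or the note was deleted	Check the proxy_pass URL in /etc/nginx/conf.d/trilium.conf
ETAPI 401	Token expired	Regenerate the ETAPI token in Trilium's settings and update .env
cron timeout	no_agent has a default 120s limit and the root page load timed out	Check that curl can reach /share/hermes; increase --max-time
URL count drops	New notes do not have the #share label configured, or notes were deleted	Check whether the note has the share label set
Python urllib timeout	urllib SSL is unstable in this environment	The script already uses a curl subprocess instead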
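
The first rows of the table can be turned into a quick status probe. This is a hedged sketch: http_status and likely_cause are hypothetical helper names, the hint strings paraphrase the table, and the probe uses curl's -o, -w and --max-time options to read only the status code.

```python
import subprocess

def http_status(url, timeout=20):
    """Return the HTTP status code curl reports, or None if there was no response."""
    try:
        r = subprocess.run(
            ["curl", "-s", "-o", "/dev/null", "-w", "%{http_code}",
             "--max-time", str(timeout), url],
            capture_output=True, text=True, timeout=timeout + 2)
    except (subprocess.TimeoutExpired, OSError):
        return None
    code = r.stdout.strip()
    # curl prints 000 when no HTTP response was received.
    return int(code) if code.isdigit() and code != "000" else None

def likely_cause(status):
    """Map a status code to the matching row of the troubleshooting table."""
    if status is None:
        return "no response: check connectivity to the host and --max-time"
    hints = {
        200: "OK",
        401: "ETAPI token rejected: regenerate it and update .env",
        404: "Nginx proxy_pass misconfigured, or the note was deleted",
    }
    return hints.get(status, f"HTTP {status}: not covered by the table")

print(likely_cause(404))
```

A typical call would be likely_cause(http_status("https://trilium.atibm.com/sitemap.xml")).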

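8. Development History (Previous Attempts)

Sequential BFS: urllib crawled 637 URLs page by page in 3-5 minutes. It worked but exceeded the cron timeout.
Concurrent BFS (50 workers): urllib SSL timeouts, with many FAILs.
Concurrent BFS + HEAD pre-check (20 workers): urllib still timed out, and so did the pre-checks.
Low-concurrency BFS (5 workers): worked but was too slow (513 sub-pages × 2s ≈ 17 minutes).
curl subprocess + single root page: ✅ the final approach. curl is a steady 0.5s/page, and the root page is the full index.

Key findings:

Python urllib has unstable SSL communication in this environment (intermittent timeouts), while curl is stable
Trilium share pages have a "navigation page" structure in which every page contains the full set of links, so BFS is unnecessary
Concurrent requests saturate the Trilium container's connection pool and reduce total throughput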